Python 2.7 - Combine stripping white space and HTML tags
I'm looking for a way to strip both HTML tags and white space from parsed text using Beautiful Soup. The problem is that I can't combine the two.
Here is the whole script:
    # -*- coding: utf-8 -*-
    from urllib2 import urlopen
    from bs4 import BeautifulSoup as bs

    word = "drop"
    url = ('http://civil.ge/eng/category.php?id=10')
    soup = bs(urlopen(url).read())
    titz = soup.find("div", {"class": "archtype_category_block"})
    for t in titz.find_all('div', {'class': 'archive_type_article_title'}):
        if word in t.encode('utf-8').strip():
            print t.prettify()
The result of prettify()
is:
<div class="archive_type_article_title"> prosecutors drop objection release of ex-mod officials pretrial detention </div>
and get_text()
gives clean text, but with lots of white space before and after it. Are there any solutions for this?
Thanks!
I used Python 3 and wasn't able to reproduce the spacing problem, but maybe this can still be an answer!
I would change print t.prettify()
to print ' '.join(t.get_text().split())
and see if that fixes the problem.
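The split-and-join idea can be tried on its own, without fetching the page. This is only a minimal sketch: the sample string and the helper name collapse_whitespace are made up for illustration, and the string merely imitates what get_text() might return for one title. Calling split() with no arguments splits on any run of whitespace and discards leading and trailing whitespace, so joining the pieces with a single space both trims the text and squeezes internal gaps.

```python
# Hypothetical sample: text as get_text() might return it, padded with
# newlines and spaces left over from the surrounding markup.
raw = u"\n     prosecutors drop objection   release of ex-mod officials\n pretrial detention \n   "


def collapse_whitespace(text):
    """Trim both ends and reduce every internal whitespace run to one space."""
    # split() with no argument splits on any whitespace and drops empty pieces.
    return u" ".join(text.split())


clean = collapse_whitespace(raw)
print(clean)
```

If only the leading and trailing white space matters, raw.strip() would be enough; the join/split form also normalizes line breaks inside the title.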
Also, your code only gets the first archtype_category_block
which may be what you want. If you want all of them, you have to change titz = soup.find("div", {"class": "archtype_category_block"})
to for titz in soup.find_all("div", {"class": "archtype_category_block"}):
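Here is a rough sketch of looping over every block with find_all and printing cleaned titles. The inline HTML is invented for the example (it only reuses the two class names from the question), and the explicit "html.parser" argument and the matches list are my own choices rather than part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page: two category blocks.
html = """
<div class="archtype_category_block">
  <div class="archive_type_article_title">
     prosecutors drop objection
  </div>
</div>
<div class="archtype_category_block">
  <div class="archive_type_article_title">
     parliament session
  </div>
  <div class="archive_type_article_title">
     court may   drop case
  </div>
</div>
"""

word = "drop"
soup = BeautifulSoup(html, "html.parser")

matches = []
# find_all returns every matching block, not just the first one.
for block in soup.find_all("div", {"class": "archtype_category_block"}):
    for t in block.find_all("div", {"class": "archive_type_article_title"}):
        # get_text() removes the tags; split/join removes the extra white space.
        title = " ".join(t.get_text().split())
        if word in title:
            matches.append(title)

for title in matches:
    print(title)
```

Testing the word against the cleaned title, instead of the encoded tag, avoids accidental matches inside tag names or attributes.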