I'm using BeautifulSoup in Python to retrieve information from the Indeed website.
I'm trying to retrieve information on the "location" of job postings, which can be found at one of two levels of nested HTML.
Sometimes, the text I want is within the tags (name="span", attrs={"class": "location"}).
Other times, the text I want is in a tag (name='span', attrs={"itemprop": "addressLocality"}) that's nested within the first tag above.
I'm trying to write a loop that checks whether the location text I want (e.g., "New York, NY") is within the first tag and, if not, retrieves it from the second tag.
Currently, the best I can come up with is:
    for item in soup.find_all(name='span', attrs={"class": "location"}):
        print(item.renderContents())
However, this gives me the undesirable output of:
    New York, NY 10001
    New York, NY
    New York, NY 10154
    <span itemprop="addressLocality">New York, NY</span>
    <span itemprop="addressLocality">New York, NY</span>
    <span itemprop="addressLocality">New York, NY</span>
    <span itemprop="addressLocality">New York, NY</span>
    <span itemprop="addressLocality">New York, NY</span>
    <span itemprop="addressLocality">New York, NY 10016 <span style="font-size: smaller">(Gramercy area)</span></span>
    <span itemprop="addressLocality">New York, NY</span>
    <span itemprop="addressLocality">Manhattan, NY</span>
    <span itemprop="addressLocality">New York, NY</span>
    <span itemprop="addressLocality">New York, NY 10016 <span style="font-size: smaller">(Gramercy area)</span></span>
    New York, NY
    New York, NY 10154
Ideally, I would like all of the plain text to stay how it is, and to strip out the <span itemprop="addressLocality"> markup, etc. from the other results. I've tried writing a few try/except statements to accomplish this, but haven't gotten them to work.
I could save the entire contents to a list and write separate code to strip out the additional burdensome text, but I would appreciate a more elegant way of accomplishing this within the initial retrieval.
Could anyone help me with this? Thank you for your consideration!
If you can select the span class=location elements (and assuming these are the only items you want in the document) then, whether nested or not, their .text contains the same text.
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup('<span class="location" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress"><span itemprop="addressLocality">New York, NY</span></span>', 'lxml')
    >>> soup.text
    'New York, NY'
    >>> soup = BeautifulSoup('<span class=location>New York, NY</span>', 'lxml')
    >>> soup.text
    'New York, NY'
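To make the difference concrete, here is a minimal sketch that contrasts the rendered markup of each outer span with its text. The sample HTML is made up for illustration (it is not a real Indeed page), and the stdlib "html.parser" backend is used instead of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup covering the flat and nested cases from the question.
SAMPLE_HTML = """
<div>
  <span class="location">New York, NY 10001</span>
  <span class="location"><span itemprop="addressLocality">New York, NY</span></span>
  <span class="location"><span itemprop="addressLocality">New York, NY 10016 <span style="font-size: smaller">(Gramercy area)</span></span></span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
outer_spans = soup.find_all("span", attrs={"class": "location"})

# decode_contents() renders the children as markup, so nested tags show up.
markup = [span.decode_contents() for span in outer_spans]

# get_text() (what .text calls) joins only the strings, however deeply nested.
texts = [span.get_text() for span in outer_spans]

for rendered, text in zip(markup, texts):
    print(repr(rendered), "->", repr(text))
```

Only the outer spans carry class=location, so the nested spans are never matched on their own and each posting yields exactly one string.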
Edit: Getting the whole list.
    >>> import requests
    >>> from bs4 import BeautifulSoup
    >>> url = 'https://www.indeed.com/jobs?q=data%20scientist%20$20,000&l=new%20york&start=10/'
    >>> page = requests.get(url).text
    >>> soup = BeautifulSoup(page, 'lxml')
    >>> spans = soup.find_all('span', attrs={'class': 'location'})
    >>> for span in spans:
    ...     span.text
    ...
    'New York, NY 10154'
    'New York, NY 10003'
    'New York, NY'
    'New York, NY'
    'New York, NY'
    'New York, NY'
    'New York, NY'
    'New York, NY'
    'New York, NY'
    'New York, NY'
    'New York, NY'
    'New York, NY'
    'New York, NY 10018 (Clinton area)'
    'New York, NY'
    'New York, NY 10001'
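If you would rather keep the explicit "check the outer tag, then fall back to the nested one" logic from the question, something like the following might work. It is only a sketch: location_text is a hypothetical helper name, the sample HTML is invented so the example runs offline, and the itemprop value is assumed to be spelled addressLocality on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; a live page may differ.
SAMPLE_HTML = """
<span class="location">  New York, NY 10154 </span>
<span class="location"><span itemprop="addressLocality">Manhattan, NY</span></span>
<span class="location"><span itemprop="addressLocality">New York, NY 10016 <span style="font-size: smaller">(Gramercy area)</span></span></span>
"""


def location_text(span):
    """Return the location text, preferring the nested itemprop span if present."""
    inner = span.find("span", attrs={"itemprop": "addressLocality"})
    target = inner if inner is not None else span
    # strip=True trims each string; " " joins the pieces with single spaces.
    return target.get_text(" ", strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
results = [location_text(span)
           for span in soup.find_all("span", attrs={"class": "location"})]
print(results)
```

Because find() returns None when nothing matches, no try/except is needed; the get_text(" ", strip=True) call also tidies stray whitespace around the text.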