I am trying to extract data from this page. I need to extract the text between two strings (Item 1A Risk Factors and Item 1B Unresolved Staff Comments). I am having difficulty coming up with the right regular expression for that.
    # Python 2
    import re
    import urllib
    import html2text

    url = "https://www.sec.gov/archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm"
    html = urllib.urlopen(url).read()
    text = html2text.html2text(html)

    # Look for everything between the two section headings
    regex = '(?<=item 1a risk factors)(.*)(?=item 1b unresolved)'
    match = re.search(regex, text, flags=re.IGNORECASE)
    print match
The code above prints None. Any suggestions?
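If you want to use a regex, you may use the code below, which runs in Python 3.5.2. Try printing "text" and you will see that the actual value of "Item 1A" is different from what you see on the web page: the source contains Item&#160;1A, with a non-breaking space entity rather than a plain space. Hope this helps.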
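    import urllib.request
    from urllib.error import URLError, HTTPError
    import re
    import contextlib

    mainpage = "https://www.sec.gov/archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm"

    try:
        # Download the raw HTML and make sure the connection is closed afterwards
        with contextlib.closing(urllib.request.urlopen(mainpage)) as url:
            htmltext = url.read().decode('utf-8')
            #print(htmltext)
    except HTTPError as e:
        print("HTTPError")
    except URLError as e:
        print("URLError")
    else:
        # The headings contain the entity &#160; instead of a plain space.
        # re.IGNORECASE makes the match independent of how the headings are capitalized.
        results = re.findall(
            r'(?=item\&\#160\;1a\.(.*)(risk factors))(.*)(?=item\&\#160\;1b\.(.*)(unresolved))',
            htmltext,
            flags=re.IGNORECASE)
        print(results)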
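As an alternative to matching the entity literally, one could decode it first and then search with plain spaces. The following is only a minimal sketch under stated assumptions: the function name extract_between and the sample fragment are hypothetical, the fragment is not taken from the real filing, and the crude tag stripping stands in for a proper HTML parser.

```python
import html
import re


def extract_between(raw_html, start_heading, end_heading):
    """Return the text between two headings, or None if either is missing.

    Hypothetical helper for illustration; not part of any library.
    """
    # Decode entities such as &#160; and turn non-breaking spaces into plain spaces.
    text = html.unescape(raw_html).replace("\xa0", " ")
    # Crudely drop tags and collapse whitespace so headings match across line breaks.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text)
    # Non-greedy capture between the two literal headings, ignoring case.
    pattern = re.escape(start_heading) + r"(.*?)" + re.escape(end_heading)
    match = re.search(pattern, text, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# Made-up fragment that mimics the non-breaking space in the headings.
sample = (
    "<p><b>Item&#160;1A. Risk Factors</b></p>"
    "<p>The risks described below could affect our business.</p>"
    "<p><b>Item&#160;1B. Unresolved Staff Comments</b></p>"
)

section = extract_between(
    sample, "Item 1A. Risk Factors", "Item 1B. Unresolved Staff Comments"
)
print(section)
```

On a real filing the same headings may also appear in a table of contents, in which case the first match could be that short span rather than the section body, so the result should be checked before relying on it.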