regex - Extracting data from an HTML page (Python)


I am trying to extract data from this page. I need to extract the text between two strings (Item 1A Risk Factors and Item 1B Unresolved Staff Comments), but I am finding it difficult to come up with the right regular expression for that.

    import re
    import urllib
    import html2text

    url = "https://www.sec.gov/archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm"
    html = urllib.urlopen(url).read()

    text = html2text.html2text(html)

    regex = '(?<=item 1a risk factors)(.*)(?=item 1b unresolved)'

    match = re.search(regex, text, flags=re.IGNORECASE)

    print match

The above code prints None. Any suggestions?

If you want to use regex, you may use the code below, which runs in Python 3.5.2. Try printing "text" and you will see that the actual value of "Item 1A" is different from what you see on the web page: in the raw HTML it is Item&#160;1A. Hope this helps.

    import urllib.request
    from urllib.error import URLError, HTTPError
    import re
    import contextlib

    mainpage = "https://www.sec.gov/archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm"

    try:
        # Download the raw HTML and make sure the connection is closed afterwards.
        with contextlib.closing(urllib.request.urlopen(mainpage)) as url:
            htmltext = url.read().decode('utf-8')
            #print(htmltext)
    except HTTPError as e:
        print("HTTPError")
    except URLError as e:
        print("URLError")
    else:
        # The pattern is written in lower case, so match case-insensitively.
        results = re.findall(r'(?=item\&\#160\;1a\.(.*)(risk factors))(.*)(?=item\&\#160\;1b\.(.*)(unresolved))', htmltext, flags=re.IGNORECASE)
        print(results)
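An alternative to matching the &#160; entity literally is to decode the entities and normalize whitespace first, then search for the plain headings. The following is only a minimal sketch of that idea, not a tested solution for the real filing: SAMPLE_HTML and the helper name extract_between are made up for illustration, and the tag stripping is deliberately crude.

```python
import html
import re

# Hypothetical stand-in for the downloaded filing; the real page is much larger.
SAMPLE_HTML = (
    "<p>Item&#160;1A. Risk Factors</p>\n"
    "<p>Our business is subject to\nvarious risks.</p>\n"
    "<p>Item&#160;1B. Unresolved Staff Comments</p>"
)


def extract_between(raw_html, start, end):
    """Return the text between the `start` and `end` patterns, or None."""
    # Decode entities such as &#160; into real characters (here U+00A0).
    text = html.unescape(raw_html)
    # Drop tags crudely; a real HTML parser would be more robust.
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse every whitespace run (including non-breaking spaces) to one space.
    text = re.sub(r"\s+", " ", text)
    # Non-greedy capture so the match stops at the first end marker.
    match = re.search(start + r"(.*?)" + end, text, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


section = extract_between(
    SAMPLE_HTML,
    r"item 1a\.? risk factors",
    r"item 1b\.? unresolved staff comments",
)
print(section)
```

Because the whitespace is normalized before searching, the same patterns work whether the page uses a regular space, a non-breaking space, or a line break between "Item" and "1A". On the real filing the headings may also appear in the table of contents, so the first match is not necessarily the section body.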
