I thought I had this, but then it all fell apart. I'm starting a scraper that pulls data from a Chinese website. When I isolate and print the elements I am looking for, everything works fine ("print element" and "print text"). However, when I add those elements to a dictionary and then print the dictionary (print holder), everything goes all "\x85\xe6\xb0" on me. Trying to .encode('utf-8') as part of the appending process just throws up new errors. This may not ultimately matter because it is just going to be dumped into a CSV, but it makes troubleshooting really hard. What am I doing when I add the element to the dictionary that messes up the encoding?
Thanks!
from bs4 import BeautifulSoup
import urllib
# csv is for the csv writer
import csv

# intended data structure is a list of dictionaries
# holder = [{'headline': theheadline, 'url': theurl, 'date1': date1, 'date2': date2, 'date3': date3}, {'headline': theheadline, 'url': theurl, 'date1': date1, 'date2': date2, 'date3': date3}]

# initiates the list that holds the output
holder = []

txt_contents = "http://sousuo.gov.cn/s.htm?q=&n=80&p=&t=paper&advance=true&title=&content=&puborg=&pcodejiguan=%e5%9b%bd%e5%8f%91&pcodeyear=2016&pcodenum=&childtype=&subchildtype=&filetype=&timetype=timeqb&mintime=&maxtime=&sort=pubtime&nocorrect=&sorttype=1"

# opens the output doc
output_txt = open("output.txt", "w")

def headliner(url):
    # opens the url for read access
    this_url = urllib.urlopen(url).read()
    # creates a new BS holder based on the url
    soup = BeautifulSoup(this_url, 'lxml')

    # creates the headline section
    headline_text = ''
    # this bundles all of the headlines
    headline = soup.find_all('h3')
    # for each individual headline....
    for element in headline:
        headline_text += ''.join(element.findAll(text=True)).encode('utf-8').strip()
        # this is necessary to turn the findAll output into text
        print element
        text = element.text.encode('utf-8')
        # prints each headline
        print text
        print "*******"
        # creates the dictionary for just that headline
        temp_dict = {}
        # puts the headline in the dictionary
        temp_dict['headline'] = text
        # appends the temp_dict to the main list
        holder.append(temp_dict)
        output_txt.write(str(text))
        # output_txt.write(holder)

headliner(txt_contents)
print holder
output_txt.close()
The encoding isn't being messed up. These are just different ways of representing the same thing:
>>> s = '漢字'
>>> s
'\xe6\xbc\xa2\xe5\xad\x97'
>>> print(s)
漢字
>>> s.__repr__()
"'\\xe6\\xbc\\xa2\\xe5\\xad\\x97'"
>>> s.__str__()
'\xe6\xbc\xa2\xe5\xad\x97'
>>> print(s.__repr__())
'\xe6\xbc\xa2\xe5\xad\x97'
>>> print(s.__str__())
漢字
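One way to convince yourself that the escaped form and the characters are the same data is to round-trip them. This is only a minimal sketch; the names `raw` and `decoded` are made up for illustration, and it is written with explicit byte and unicode literals so that it should behave the same on Python 2 and Python 3:

```python
from __future__ import print_function

# The escaped bytes from the session above, written as an explicit byte string.
raw = b'\xe6\xbc\xa2\xe5\xad\x97'

# Decoding them as UTF-8 gives back the two characters.
decoded = raw.decode('utf-8')

# Six bytes on disk, two characters of text.
print(len(raw), len(decoded))

# Encoding the characters again gives the identical byte sequence,
# so nothing was lost or corrupted along the way.
assert decoded == u'\u6f22\u5b57'
assert decoded.encode('utf-8') == raw
```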
The last piece of the puzzle is to know that when you put an object in a container, printing the container uses the repr of each object inside it to build the container's own representation:
>>> ls = [s]
>>> print(ls)
['\xe6\xbc\xa2\xe5\xad\x97']
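If the goal is just readable output while troubleshooting, one option is to look at the items rather than the container. The sketch below uses a hypothetical `holder` with a single made-up entry standing in for the scraper's list of dictionaries, and a helper name, `readable_headlines`, that is not from the question:

```python
from __future__ import print_function

# Hypothetical stand-in for the scraper's output: a list of dictionaries
# whose values are UTF-8 encoded bytes, as in the question.
holder = [{'headline': u'\u6f22\u5b57'.encode('utf-8')}]

# The container's repr shows escapes for the bytes inside it.
container_view = repr(holder)


def readable_headlines(rows):
    """Return the headlines decoded from UTF-8 bytes into text."""
    return [row['headline'].decode('utf-8') for row in rows]


headlines = readable_headlines(holder)
```

Printing each entry of `headlines` on its own should then show the characters instead of the escapes, provided the terminal can display them.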
Perhaps it will become clearer if we define our own custom object:
>>> class A(object):
...     def __str__(self):
...         return "str"
...     def __repr__(self):
...         return "repr"
...
>>> A()
repr
>>> print(A())
str
>>> ayes = [A() for _ in range(5)]
>>> ayes
[repr, repr, repr, repr, repr]
>>> print(ayes[0])
str
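Following the same idea, if you want a list displayed with the str of each element rather than the repr, you can build the string yourself. This is a rough sketch; `format_with_str` is a made-up helper name, and the class simply mirrors the one in the session above:

```python
from __future__ import print_function


class A(object):
    def __str__(self):
        return "str"

    def __repr__(self):
        return "repr"


def format_with_str(items):
    """Build a list-like display that calls str() on each element."""
    return "[" + ", ".join(str(item) for item in items) + "]"


ayes = [A() for _ in range(3)]
print(ayes)                   # the list calls repr() on each element
print(format_with_str(ayes))  # the helper calls str() instead
```

The first print shows `[repr, repr, repr]` and the second shows `[str, str, str]`.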