Python - Adding element to dictionary breaks encoding

I thought I had this, but then it all fell apart. I'm starting a scraper that pulls data from a Chinese website. When I isolate and print the elements I am looking for, everything works fine ("print element" and "print text"). However, when I add those elements to a dictionary and then print the dictionary (print holder), everything goes all "\x85\xe6\xb0" on me. Trying to .encode('utf-8') as part of the appending process just throws up new errors. This may not ultimately matter because the data is just going to be dumped into a CSV, but it makes troubleshooting really hard. What am I doing when I add the element to the dictionary that messes up the encoding?

Thanks!

    from bs4 import BeautifulSoup
    import urllib
    # csv is for the csv writer
    import csv

    # Intended data structure is a list of dictionaries:
    # holder = [{'headline': theheadline, 'url': theurl, 'date1': date1, 'date2': date2, 'date3': date3}, {'headline': theheadline, 'url': theurl, 'date1': date1, 'date2': date2, 'date3': date3}]

    # Initiates the list that holds the output
    holder = []

    txt_contents = "http://sousuo.gov.cn/s.htm?q=&n=80&p=&t=paper&advance=true&title=&content=&puborg=&pcodejiguan=%e5%9b%bd%e5%8f%91&pcodeyear=2016&pcodenum=&childtype=&subchildtype=&filetype=&timetype=timeqb&mintime=&maxtime=&sort=pubtime&nocorrect=&sorttype=1"

    # Opens the output doc
    output_txt = open("output.txt", "w")

    def headliner(url):
        # Opens the url for read access
        this_url = urllib.urlopen(url).read()
        # Creates a new BS holder based on the url
        soup = BeautifulSoup(this_url, 'lxml')

        # Creates the headline section
        headline_text = ''
        # This bundles all of the headlines
        headline = soup.find_all('h3')
        # For each individual headline...
        for element in headline:
            # The join is necessary to turn the findAll output into text
            headline_text += ''.join(element.findAll(text=True)).encode('utf-8').strip()
            print element
            text = element.text.encode('utf-8')
            # Prints each headline
            print text
            print "*******"
            # Creates a dictionary for the headline
            temp_dict = {}
            # Puts the headline in the dictionary
            temp_dict['headline'] = text

            # Appends temp_dict to the main list
            holder.append(temp_dict)

            output_txt.write(str(text))
            #output_txt.write(holder)

    headliner(txt_contents)
    print holder

    output_txt.close()

The encoding isn't being messed up. These are just different ways of representing the same thing:

    >>> s = '漢字'
    >>> s
    '\xe6\xbc\xa2\xe5\xad\x97'
    >>> print(s)
    漢字
    >>> s.__repr__()
    "'\\xe6\\xbc\\xa2\\xe5\\xad\\x97'"
    >>> s.__str__()
    '\xe6\xbc\xa2\xe5\xad\x97'
    >>> print(s.__repr__())
    '\xe6\xbc\xa2\xe5\xad\x97'
    >>> print(s.__str__())
    漢字

The last piece of the puzzle is to know that when you put an object in a container, the container's own representation uses the repr of each object inside it, even when you print the container:

    >>> ls = [s]
    >>> print(ls)
    ['\xe6\xbc\xa2\xe5\xad\x97']
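Applied to the question's `holder` list, a minimal sketch could look like the following. It assumes the stored values are UTF-8 byte strings, as produced by `element.text.encode('utf-8')`. The sample headline is made up for illustration, and the names `container_view` and `readable` are hypothetical. On Python 3 the escaped form carries an extra `b` prefix, but the idea is the same.

```python
# -*- coding: utf-8 -*-

# Stand-in for one scraped headline (illustrative value, not real scraper output).
text = u"漢字".encode("utf-8")

# Same shape as the question's data structure: a list of dictionaries.
holder = [{"headline": text}]

# Converting the whole container to a string uses repr() on each item,
# which is where the \x.. escapes come from.
container_view = str(holder)
print(container_view)

# The stored bytes are intact: decoding them recovers the original text.
readable = [entry["headline"].decode("utf-8") for entry in holder]
```

For troubleshooting, looping over `holder` and printing each value on its own (rather than printing the container) shows the readable characters.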

Perhaps it will become clearer if we define our own custom object:

    >>> class a(object):
    ...     def __str__(self):
    ...         return "str"
    ...     def __repr__(self):
    ...         return "repr"
    ...
    >>> a()
    repr
    >>> print(a())
    str
    >>> ayes = [a() for _ in range(5)]
    >>> ayes
    [repr, repr, repr, repr, repr]
    >>> print(ayes[0])
    str
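Since the question mentions that the data will eventually be dumped to a CSV, here is a rough sketch of that step, showing that the escapes are only a display artifact and do not end up in the output. It assumes Python 3, where the `csv` module works with text strings directly (the question's code is Python 2, where this would need byte strings instead). The in-memory `buffer`, the sample headlines, and the single `headline` column are illustrative assumptions, not part of the original scraper.

```python
import csv
import io

# Illustrative rows in the same shape as the question's `holder` list.
holder = [{"headline": "漢字"}, {"headline": "国发"}]

# Write to an in-memory buffer instead of a real file to keep the sketch self-contained.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["headline"])
writer.writeheader()
writer.writerows(holder)

# Read the CSV back to confirm the text round-trips unchanged.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
headlines = [row["headline"] for row in rows]
```

When writing to a real file on Python 3, opening it with `open(path, "w", encoding="utf-8", newline="")` would be the usual choice.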
