Python Beautiful Soup gb2312 Windows-1252 Garbled Text Problem


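Back then, Python's encodings tormented me half to death, and I kept meaning to write an article summarizing them. For now I am posting the two articles that helped me most; I will revise and organize this over time.
Article 1:
Beautiful Soup gb2312 garbled text problem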

http://groups.google.com/group/python-cn/browse_thread/thread/cb418ce811563524

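Note that a page labeled gb2312 is often not really gb2312: wherever gb2312 is declared, use gb18030 instead.
Microsoft maps gb2312 and gbk to gb18030, which is convenient for some people and confusing for others.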
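
The superset relationship is easy to check from Python itself. The following is a minimal sketch in Python 3 (the posts quoted here use Python 2); the helper name `try_decodings` and the sample string are illustrative assumptions. It encodes a name containing 镕, a character that exists in GBK and GB18030 but not in GB2312, and tries each codec in turn.

```python
def try_decodings(data, encodings=("gb2312", "gbk", "gb18030")):
    """Return {encoding: decoded text, or None if strict decoding fails}."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # The byte sequence is not valid in this (stricter) encoding.
            results[name] = None
    return results


sample = "朱镕基"            # 镕 is outside the GB2312 character set
raw = sample.encode("gbk")   # bytes as a "gb2312"-labeled page might really send them

results = try_decodings(raw)
for name, text in results.items():
    print(name, "->", text if text is not None else "decode failed")
```

The strict gb2312 codec rejects the bytes, while gbk and gb18030 both recover the original text, which is why substituting gb18030 for a declared gb2312 is the safer choice.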

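In other words, the page is actually encoded in GB18030. So, following the method in this post:

"Solved the garbled web page parsing problem this morning"

http://blog.csdn.net/fanfan19881119/article/details/6789366

(original source: http://leeon.me/a/beautifulsoup-chinese-page-resolve)

passing GB18030 as fromEncoding is what makes it work: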

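    import urllib2  # Python 2 standard library
    from BeautifulSoup import BeautifulSoup  # assuming Beautiful Soup 3, which uses fromEncoding
    page = urllib2.build_opener().open(req).read()  # req: a URL string or urllib2.Request
    soup = BeautifulSoup(page, fromEncoding="GB18030")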
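
The same idea can be applied before any parser sees the data. Below is a rough Python 3 sketch using only the standard library; the names `GB_ALIASES`, `declared_charset`, and `decode_page` are hypothetical, and the regular expression is a simplification rather than a full HTML charset sniffer. It reads the charset the page declares, upgrades gb2312 or gbk to gb18030, and decodes the bytes. (In Beautiful Soup 4 the corresponding keyword is spelled `from_encoding`.)

```python
import re

# Labels this sketch treats as "really GB18030" (an assumption based on the advice above).
GB_ALIASES = {"gb2312", "gbk"}

# Simplified: finds the first charset=... in the first bytes of the page.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_-]+)', re.IGNORECASE)


def declared_charset(raw, default="utf-8"):
    """Return the lowercase charset label declared in the page, or a default."""
    match = CHARSET_RE.search(raw[:2048])
    return match.group(1).decode("ascii").lower() if match else default


def decode_page(raw):
    """Decode page bytes, substituting gb18030 for gb2312/gbk labels."""
    charset = declared_charset(raw)
    if charset in GB_ALIASES:
        charset = "gb18030"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown label: fall back to UTF-8 rather than crash.
        return raw.decode("utf-8", errors="replace")


# A page that claims gb2312 but contains a GBK-only character.
html = '<html><head><meta charset="gb2312"></head><body>朱镕基</body></html>'.encode("gbk")
text = decode_page(html)
print(text)
```

The resulting `text` is an ordinary `str`, so it can be handed to whichever HTML parser is in use without further encoding hints.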

That post also explains why a page whose HTML declares GBK ends up being parsed as windows-1252:

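Recently I needed to write an RSS fetching and parsing program in Python, and I used feed parser. For the Baidu News RSS feed, however, the declared encoding is gb2312, yet feed parser detected windows-1252, so all the Chinese content came out garbled. The problem is not that feedparser cannot recognize gb2312. Rather, people in China often treat gb2312 and gbk as the same thing, and some content that already uses characters from gbk still declares itself as gb2312. feedparser holds gb2312 strictly to the gb2312 character set, and when it finds a character outside that range it falls back to windows-1252. Baidu's RSS presumably uses gbk and contains characters beyond the gb2312 range, so feedparser decided on its own to fall back to windows-1252, which produced the garbled Chinese.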
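
The fallback described above can be modeled in a few lines. This is a simplified Python 3 sketch of that behavior, not feedparser's actual code; the function names `decode_with_fallback` and `repair_mojibake` are hypothetical. It shows how a strict gb2312 decode fails on GBK bytes, how the windows-1252 fallback yields mojibake, and how the text can be repaired when no bytes were lost along the way.

```python
def decode_with_fallback(data, declared="gb2312", fallback="windows-1252"):
    """Try the declared encoding strictly; on failure, fall back.

    Returns (text, encoding_used). A model of the described behavior only.
    """
    try:
        return data.decode(declared), declared
    except UnicodeDecodeError:
        return data.decode(fallback, errors="replace"), fallback


def repair_mojibake(text, wrong="windows-1252", right="gb18030"):
    """Undo a wrong decode by re-encoding and decoding with the right codec.

    Works only if the wrong decode did not drop or replace any bytes.
    """
    return text.encode(wrong).decode(right)


original = "朱镕基"              # 镕 is in GBK but not in GB2312
raw = original.encode("gbk")     # what a "gb2312"-labeled feed may actually contain

garbled, used = decode_with_fallback(raw)
print(used, repr(garbled))       # the fallback encoding and the mojibake it produces

repaired = repair_mojibake(garbled)
print(repaired)
```

Repairing after the fact is fragile; decoding with gb18030 in the first place avoids the problem entirely.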

 

Article 2:

[Compilation] In Python, you have actually obtained the correct Unicode (or characters in some other encoding), but they look garbled when viewed or printed




If you want to cite this article, please cite this article as:

Lachlan Chen, "Python Beautiful Soup gb2312 Windows-1252 乱码问题," in EarnFromScratch, May 25, 2016, https://www.earnfs.com/zh/html/1271.htm.

or

@misc{lachlanchen2020tutorial,
title={Python Beautiful Soup gb2312 Windows-1252 乱码问题},
author={Chen, Lachlan},
year={2016},
note={Published May 25, 2016}
}



