Python Crawler Encoding Problem: Converting GB2312 to UTF-8


Posted by Ryen on Thursday, March 4, 2021
Last updated on December 26, 2023


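While writing an image crawler recently, I ran into a problem: the fetched page came back as shown below (excerpt), with all of the Chinese text garbled.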

<!DOCTYPE html>
<html lang='zh-CN'>
<head>
<meta charset='gb2312'>
<meta content='IE=edge' http-equiv='X-UA-Compatible'>
<title>2017Äê11ÔÂ10ÈÕÃâ·Ñ´úÀíip µÚ1Ò³</title>
<meta name="keywords" content="´úÀíip£¬´úÀíip¼ì²â£¬´úÀíipÑéÖ¤£¬¿ÉÓôúÀíip£¬×îдúÀíip£¬½ñÈÕ¿ÉÓôúÀíip£¬Ãâ·Ñ´úÀíip">
<meta name="description" content="ip181ÊÇÒ»¼ÒרΪ´úÀíipʹÓÃÕß´òÔìµÄ´úÀíip¼ì²âƽ̨£¬ÕâÀï²»½öÌṩרҵµÄ´úÀíipÑéÖ¤·þÎñ£¬»¹ÎªÄúÌṩ×îеÄÃâ·Ñ´
úÀíip£¬ÊµÊ±¸üдúÀíip¡£">
<link href="/ip181.css" media="all" rel="stylesheet" />
</head>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-cn">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=GB2312" />
<meta http-equiv="X-UA-Compatible" content="IE=8">
<meta http-equiv="Cache-Control" content="no-transform" />
<meta http-equiv="Cache-Control" content="no-siteapp" />
<title>XiuRenÐãÈËÍøµÚ3044ÆÚÎľ²¶ùдÕæ,Îľ²¶ù,Îľ²¶ùÌ×ͼ - - XiuRen</title>
<meta name="keywords" content="[XiuRenÐãÈËÍø]No.3044,ÄÛÄ£Îľ²¶ù,°Å±ÈÍÞÍÞ,ÌìʹÃæÈÝ,·ÛÉ«·þÊÎ,»ëÔ²ÇÌÍÎ,ÁÃÈËÓÕ»óдÕæ"/>
<meta name="description" content="[XiuRenÐãÈËÍø]No.3044_ÄÛÄ£Îľ²¶ù°Å±ÈÍÞÍÞÌìʹÃæÈÝ·ÛÉ«·þÊζ»ëÔ²ÇÌÍÎÁÃÈËÓÕ»óдÕæ57P"/>

Checking the encoding that requests picked for the response:

print(response.encoding)

The output is:

ISO-8859-1
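The `ISO-8859-1` value is a fallback rather than a detection result: when the server's Content-Type header names a text type but no charset, requests assumes ISO-8859-1, so each GB2312 byte is decoded as its own character. A minimal sketch of how that produces the garbling, using a sample string instead of a live response (the variable names are only for illustration):

```python
# Sample text standing in for part of the page title; not fetched from any site.
title = "秀人网"

# What the server actually sends: GB2312 bytes, two per Chinese character.
raw = title.encode("gb2312")

# What a client produces when it assumes ISO-8859-1: one character per byte.
garbled = raw.decode("iso-8859-1")
print(garbled)

# ISO-8859-1 maps every byte value to a character, so nothing is lost:
# encoding the garbled text again gives back the original bytes.
restored = garbled.encode("iso-8859-1").decode("gb2312")
print(restored == title)
```

Because the wrong decode is lossless, the garbled text can always be turned back into the original bytes and decoded again with the right charset.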

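The fix I used is to encode the text back to bytes with ISO-8859-1, then decode those bytes with the charset declared in the page itself:

print(response.text.encode('ISO-8859-1').decode(requests.utils.get_encodings_from_content(response.text)[0]))

Result: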
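The one-liner above can be wrapped in a small helper. This is only a sketch under assumptions: `redecode` and `_CHARSET_RE` are hypothetical names, and a simple regular expression stands in for `requests.utils.get_encodings_from_content` so that the example runs without a network request.

```python
import re

# Matches declarations such as <meta charset='gb2312'> or
# <meta http-equiv="Content-Type" content="text/html; charset=GB2312" />
_CHARSET_RE = re.compile(r"<meta[^>]*?charset=[\"']?([\w-]+)", re.IGNORECASE)


def redecode(text, wrong_encoding="iso-8859-1", default="utf-8"):
    """Undo a wrong decode: recover the bytes, then use the declared charset."""
    match = _CHARSET_RE.search(text)
    declared = match.group(1) if match else default
    # Step 1: turn the garbled characters back into the original bytes.
    raw = text.encode(wrong_encoding)
    # Step 2: decode with the charset the page declares for itself.
    return raw.decode(declared, errors="replace")


# Simulate a GB2312 page that was wrongly decoded as ISO-8859-1.
page = '<meta charset="gb2312"><title>代理ip</title>'
garbled = page.encode("gb2312").decode("iso-8859-1")
fixed = redecode(garbled)
print(fixed)
```

The ASCII parts of the page, including the `<meta>` tag, survive the wrong decode unchanged, which is why the charset can still be read from the garbled text.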

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-cn">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=GB2312" />
<meta http-equiv="X-UA-Compatible" content="IE=8">
<meta http-equiv="Cache-Control" content="no-transform" />
<meta http-equiv="Cache-Control" content="no-siteapp" />
<title>XiuRen秀人网第3044期文静儿写真,文静儿,文静儿套图 - - XiuRen</title>
<meta name="keywords" content="[XiuRen秀人网]No.3044,嫩模文静儿,芭比娃娃,天使面容,粉色服饰,浑圆翘臀,撩人诱惑写真"/>
<meta name="description" content="[XiuRen秀人网]No.3044_嫩模文静儿芭比娃娃天使面容粉色服饰露浑圆翘臀撩人诱惑写真57P"/>

Great, that did the trick, hahaha. The photo sets are pretty nice too (worth a look through the API).
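A related approach is to skip the round trip and decode the raw bytes directly. The sketch below assumes the bytes are already in hand (with requests they would come from `response.content`); `decode_body` is a hypothetical helper, and decoding pages labelled `gb2312` with `gb18030` is my own choice here, made because GB18030 is a superset of GB2312 and GBK.

```python
def decode_body(raw, declared=None):
    """Decode raw page bytes, trying the declared charset first."""
    candidates = []
    if declared:
        # Pages labelled gb2312 or gbk are decoded with the gb18030 superset.
        if declared.lower() in ("gb2312", "gbk"):
            candidates.append("gb18030")
        else:
            candidates.append(declared)
    # Fallbacks when no charset is declared or the declared one fails.
    candidates += ["utf-8", "gb18030"]
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: keep whatever decodes and mark the rest as replacements.
    return raw.decode("utf-8", errors="replace")


# Sample bytes standing in for a downloaded page body.
body = "XiuRen秀人网".encode("gb2312")
print(decode_body(body, "GB2312"))
```

With requests itself, setting `response.encoding` to the right charset before reading `response.text` has the same effect, since `response.text` is decoded using that attribute.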
