Python crawler text is all garbled: adaptive encoding detection to fix mojibake in web pages

Published: 2024/1/23 python 29 豆豆

This article, collected by 生活随笔, shows how a Python crawler can adapt to the encoding of each page so that the fetched text is not garbled. It is shared here for reference.

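# coding=utf-8

import re
import ssl
import urllib.parse
import urllib.request

import chardet  # character-set detection (third-party package)

# Skip SSL certificate verification (insecure; only suitable for testing).
ssl._create_default_https_context = ssl._create_unverified_context

# Matches charset=..., charset: ... or charset="..." in a header or a <meta> tag.
rr = re.compile(r"\bcharset[=:\"\s]{1,3}([-_A-Z0-9]+)", re.I)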

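def getCode(string):
    """Return the charset name found in string, or '' if there is none."""
    p = rr.findall(string or '')  # tolerate a missing (None) header value
    if len(p) > 0:
        print('Encoding: ' + p[0])
        return p[0]
    print('No encoding found')
    return ''

#getCode(r'iiifjjd charset:" utf_8iidi-oo">')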
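
As a quick check of what the pattern picks up, the sketch below runs it against a typical Content-Type header value and two `<meta>` tags. `find_charset` is a hypothetical, print-free stand-in for `getCode`, and the sample strings are made up for illustration.

```python
import re

# Same pattern as rr in the script.
CHARSET_RE = re.compile(r"\bcharset[=:\"\s]{1,3}([-_A-Z0-9]+)", re.I)


def find_charset(text):
    """Return the first charset name found in text, or '' (hypothetical helper)."""
    match = CHARSET_RE.search(text)
    return match.group(1) if match else ''


samples = [
    'text/html; charset=GBK',
    '<meta charset="utf-8">',
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">',
    'text/html',
]
for sample in samples:
    print(repr(sample), '->', repr(find_charset(sample)))
```

The pattern is deliberately loose: it accepts any run of letters, digits, hyphens and underscores after `charset`, so the result is not guaranteed to be a codec name that Python knows. `codecs.lookup` raises `LookupError` for unknown names and can be used to validate the result before decoding.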

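def getHtml(url):
    headers = {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
        'Referer': url,
    }
    values = {
        'name': 'hao_hao',
        'ie': 'utf-8',
    }
    data = urllib.parse.urlencode(values)
    req = urllib.request.Request(url=url + '?' + data, headers=headers)
    #req = urllib.request.Request(url + '?' + data)
    response = urllib.request.urlopen(req)

    # Read the body once. A response can only be consumed a single time, so
    # calling readlines() and then read() would leave the page empty.
    the_page = response.read()

    # 1. Look for the encoding in the response headers.
    page = getCode(response.headers.get('Content-Type') or '')

    # 2. Look for the encoding in the page source.
    if page == '':
        for line in the_page.splitlines():
            # Charset declarations are ASCII, so other bytes can be dropped here.
            page = getCode(line.decode('ascii', 'ignore'))
            if page != '':
                break

    # 3. Fall back to chardet content analysis only when the first two steps
    #    fail. It is less reliable: for
    #    https://mm.taobao.com/search_tstar_model.html it reports GB2312
    #    although the page is GBK.
    if page == '':
        chardit1 = chardet.detect(the_page)
        page = chardit1['encoding'] or ''  # chardet may return None
        print('chardet detection\r\nEncoding: ' + page)

    # Print the response headers.
    print(response.headers)

    # Close the connection if needed.
    #response.close()

    # No encoding could be determined.
    if page == '':
        return ''

    return the_page.decode(page)  # decode the bytes to str
    #return the_page.decode(page).encode('utf-8')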
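
The three-step lookup in `getHtml` can be tried without a network connection. The sketch below works on a header string and raw bytes instead of a live response. `detect_encoding`, `is_known_codec`, the parameter names and the candidate list are all assumptions for illustration, and step 3 swaps chardet for a simple trial decode so that the example needs only the standard library.

```python
import codecs
import re

CHARSET_RE = re.compile(r"\bcharset[=:\"\s]{1,3}([-_A-Z0-9]+)", re.I)


def is_known_codec(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def detect_encoding(content_type, body, candidates=('utf-8', 'gb18030')):
    """Return an encoding name for body, or '' if nothing fits.

    Hypothetical helper: header first, then page source, then trial decoding.
    """
    # 1. Charset declared in the Content-Type header.
    match = CHARSET_RE.search(content_type or '')
    if match and is_known_codec(match.group(1)):
        return match.group(1)

    # 2. Charset declared near the top of the page source. The declaration
    #    itself is ASCII, so non-ASCII bytes are dropped for the search.
    head = body[:4096].decode('ascii', 'ignore')
    match = CHARSET_RE.search(head)
    if match and is_known_codec(match.group(1)):
        return match.group(1)

    # 3. Stand-in for chardet: the first candidate that decodes without error.
    for name in candidates:
        try:
            body.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return ''


gbk_page = '<html><head><meta charset="gbk"></head><body>编码</body></html>'.encode('gbk')
print(detect_encoding('text/html', gbk_page))                     # declared in <meta>
print(detect_encoding('text/html; charset=utf-8', b'<p>hi</p>'))  # declared in header
print(detect_encoding(None, '编码'.encode('utf-8')))               # trial decode
```

Trial decoding only rules out candidates that fail. It cannot choose between two encodings that both decode the bytes without error, which is where a statistical detector such as chardet is still useful.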

print('===============================================')

# GBK page
html = getHtml("https://mm.taobao.com/search_tstar_model.html")
print(html)

print('===============================================')

# UTF-8 page
html = getHtml("http://kyfw.12306.cn/otn/leftTicket/init")
print(html)

print('===============================================')
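
The remark in step 3 about GBK being reported as GB2312 matters because GBK is a superset of GB2312: decoding GBK bytes with the narrower `gb2312` codec fails as soon as a character outside GB2312 appears. One possible workaround, sketched below, is to promote a detected or declared `gb2312` to a superset codec before decoding. `widen_encoding`, `can_decode` and the mapping table are assumptions for illustration and are not part of the original script.

```python
# Map narrow encoding labels to a superset codec (hypothetical table).
SUPERSETS = {
    'gb2312': 'gb18030',
    'gbk': 'gb18030',
}


def widen_encoding(name):
    """Return a superset codec for name when one is known, else name unchanged."""
    return SUPERSETS.get(name.lower(), name)


def can_decode(data, name):
    """Return True if data decodes under the codec name without error."""
    try:
        data.decode(name)
        return True
    except UnicodeDecodeError:
        return False


# The character '镕' exists in GBK but not in GB2312.
raw = '朱镕基'.encode('gbk')

print('gb2312 decodes:', can_decode(raw, 'gb2312'))
print('gbk decodes:', can_decode(raw, 'gbk'))
print(raw.decode(widen_encoding('GB2312')))
```

Applied to the script, this would mean calling something like `widen_encoding(page)` just before `the_page.decode(page)`, at the cost of no longer reporting the exact label the server declared.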
