
Python Web Crawler Basics (3)

Published: 2023/12/14 python 22 豆豆


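The response headers of a web page describe a lot of useful information about that page, including its character encoding and other details that matter to a crawler. To make use of this information, we need specific functions or methods to retrieve it.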

The code for retrieving a page's header information is:

    import urllib  # import the module (Python 2)

    url = "http://www.iplaypython.com/"
    html = urllib.urlopen(url)
    head_info = html.info()

head_info now holds the site's header information. Calling type() on it shows that it is an instance. Calling dir(head_info) then lists the methods that this instance exposes:

    >>> import urllib
    >>> dir(urllib.urlopen('http://www.163.com').info())
    ['__contains__', '__delitem__', '__doc__', '__getitem__', '__init__', '__iter__', '__len__', '__module__', '__setitem__', '__str__', 'addcontinue', 'addheader', 'dict', 'encodingheader', 'fp', 'get', 'getaddr', 'getaddrlist', 'getallmatchingheaders', 'getdate', 'getdate_tz', 'getencoding', 'getfirstmatchingheader', 'getheader', 'getheaders', 'getmaintype', 'getparam', 'getparamnames', 'getplist', 'getrawheader', 'getsubtype', 'gettype', 'has_key', 'headers', 'iscomment', 'isheader', 'islast', 'items', 'keys', 'maintype', 'parseplist', 'parsetype', 'plist', 'plisttext', 'readheaders', 'rewindbody', 'seekable', 'setdefault', 'startofbody', 'startofheaders', 'status', 'subtype', 'type', 'typeheader', 'unixfrom', 'values']
    >>>

Among these, the getparam method can be used to get the site's character encoding. It does not scan the whole page on every call to work out the encoding; it reads the value from the page's header information.

    Expires: Thu, 10 Apr 2014 15:51:37 GMT
    Date: Thu, 10 Apr 2014 15:50:17 GMT
    Server: nginx
    Content-Type: text/html; charset=GBK
    Vary: Accept-Encoding,User-Agent,Accept
    Cache-Control: max-age=80
    X-Via: 1.1 sicyd210:8110 (Cdn Cache Server V2.0), 1.1 hnyd196:4 (Cdn Cache Server V2.0)
    Connection: close

Above are the response headers of the NetEase (163.com) home page. The page's character encoding appears in the charset parameter of Content-Type. So, to get the character encoding from the headers, call getparam and pass it the argument 'charset'; it returns the value stored in charset, which is the character encoding. If a site does not put the encoding in its headers, there is nothing to read and getparam returns None. Note that these headers are not scraped from the page content; they are sent by the site's web server.
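The charset lookup described above can be tried offline. The following is a minimal sketch, assuming Python 3 rather than the Python 2 used in this article: there, urllib.request.urlopen(url).info() returns an email.message.Message subclass, and get_param plays the role of getparam. The header text is taken from the NetEase example; RAW_HEADERS, NO_CHARSET_HEADERS and header_charset are hypothetical names introduced only for illustration.

```python
from email import message_from_string

# Header text taken from the NetEase example above (abbreviated).
RAW_HEADERS = (
    "Server: nginx\n"
    "Content-Type: text/html; charset=GBK\n"
    "Connection: close\n"
    "\n"
)

# Hypothetical response whose Content-Type carries no charset parameter.
NO_CHARSET_HEADERS = "Content-Type: text/html\n\n"


def header_charset(raw_headers):
    """Return the charset parameter of Content-Type, or None if absent."""
    headers = message_from_string(raw_headers)
    # get_param looks inside the Content-Type header by default.
    return headers.get_param("charset")


print(header_charset(RAW_HEADERS))         # GBK
print(header_charset(NO_CHARSET_HEADERS))  # None
```

The second call shows the case the text warns about: when the server does not declare a charset, the lookup yields None and another detection method is needed.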

The chardet character set detection module:

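Some sites are not written to standard, so getparam cannot find the page's character encoding and returns an empty value. In that case we use the detect method of the chardet module to detect the page's character encoding.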

    #coding=utf-8
    import chardet
    import urllib  # import the modules (Python 2)

    url = "http://www.iplaypython.com/"
    html = urllib.urlopen(url).read()
    print chardet.detect(html)

Pass the page's content to the detect method and it detects the character set the page uses.

{'confidence': 0.99, 'encoding': 'utf-8'}
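The result is a dictionary: the first item is the confidence of the result, and the second is the detected character encoding of the page. The method examines the full content of the page and bases its answer on that; it does not trust any encoding declared inside the page. It can also detect the encoding of any string passed to it as an argument.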

The detection logic above can be wrapped in a function and used as a module for detecting a page's character encoding.
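One possible shape for such a wrapper is sketched below. It is only an illustration, again assuming Python 3 (get_param instead of getparam): it tries the header charset first and falls back to chardet.detect when the header has none. The name detect_page_charset, its (charset, source) return value, and the detector parameter are assumptions made for this example, not part of the original article or of chardet.

```python
from email import message_from_string


def detect_page_charset(headers, body, detector=None):
    """Return (charset, source) for a fetched page.

    headers  -- object with get_param(), e.g. the result of
                urllib.request.urlopen(url).info() in Python 3
    body     -- raw page bytes
    detector -- callable taking bytes and returning a dict with an
                'encoding' key; defaults to chardet.detect
    """
    charset = headers.get_param("charset")
    if charset:
        return charset, "header"
    if detector is None:
        import chardet  # third-party; imported only when the fallback is needed
        detector = chardet.detect
    return detector(body)["encoding"], "detected"


# Offline demonstration: headers parsed from text instead of a live request.
demo_headers = message_from_string("Content-Type: text/html; charset=GBK\n\n")
print(detect_page_charset(demo_headers, b"<html></html>"))  # ('GBK', 'header')
```

With a live page, the same function would receive the header object and the bytes read from the response. Returning the source alongside the charset makes it easy to see when the value was only a statistical guess.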
