《python3网络爬虫开发实战》(Python 3 Web Crawler Development in Practice): Using the Basic Libraries

Published: 2025/7/14 · python · 豆豆

1. urllib:

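request: the most basic HTTP request module, used to simulate sending a request. Just as you would type a URL into a browser and press Enter, you only need to pass the URL and any extra parameters to the library's methods to simulate the process.

error: the exception-handling module. It defines the exceptions raised by request (URLError and its subclass HTTPError), so a failed request can be caught and handled.

parse: a utility module that provides many URL-handling methods, such as splitting, parsing, and joining.

robotparser: mainly used to read a site's robots.txt file and decide which pages may be crawled and which may not. In practice it is used relatively rarely.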
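The following is a minimal sketch of how request, error, and parse fit together, not a definitive implementation. The helper name fetch and the example URLs are assumptions made for illustration. A data: URL is used so the snippet runs without network access; an http:// URL is passed in the same way.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def fetch(url, timeout=5):
    """Return the decoded body of url, or None if the request fails.

    fetch is a hypothetical helper, not part of urllib.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first
        print('HTTP error:', e.code, e.reason)
    except urllib.error.URLError as e:
        print('URL error:', e.reason)
    return None


# request: a data: URL is handled locally, so no network is needed
body = fetch('data:text/plain,hello')
print(body)

# error: a missing local file surfaces as URLError and fetch returns None
missing = fetch('file:///no/such/file/xyz')
print(missing)

# parse: split a URL into its components
parts = urlparse('http://www.baidu.com/index.html?wd=python#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)
```

Catching HTTPError before URLError matters because the more specific exception would otherwise never be reached.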

2. Handler classes:

When you need more advanced features, such as cookie handling, use a Handler.

    import http.cookiejar
    import urllib.request

    filename = 'cookies.txt'
    # Alternative cookie jars:
    # cookie = http.cookiejar.CookieJar()
    # cookie = http.cookiejar.MozillaCookieJar(filename)
    cookie = http.cookiejar.LWPCookieJar(filename)
    # load() raises FileNotFoundError unless cookies.txt was written by an
    # earlier run (see the commented-out save() call below)
    cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
    handle = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handle)
    response = opener.open('http://www.baidu.com')
    # for item in cookie:
    #     print(item.name + "=" + item.value)

    # cookie.save(ignore_discard=True, ignore_expires=True)
    print(response.read().decode('utf-8'))
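Because load() in the snippet above needs a cookie file that already exists, the save and load steps can be tried offline first. The sketch below is only an illustration under stated assumptions: make_cookie is a hypothetical helper, and the cookie name, value, and domain (www.example.com) are made up. It writes one cookie to a temporary file, loads it into a fresh LWPCookieJar, and attaches that jar to an opener.

```python
import http.cookiejar
import os
import tempfile
import urllib.request


def make_cookie(name, value, domain):
    """Build a simple session cookie by hand (hypothetical helper)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# Use a temporary directory so the example does not touch real files
path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')

# Save: session cookies are only written when ignore_discard=True
jar = http.cookiejar.LWPCookieJar(path)
jar.set_cookie(make_cookie('umber', '123456789', 'www.example.com'))
jar.save(ignore_discard=True, ignore_expires=True)

# Load: a fresh jar reads the file back, then backs an opener
loaded = http.cookiejar.LWPCookieJar(path)
loaded.load(ignore_discard=True, ignore_expires=True)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(loaded))

for item in loaded:
    print(item.name + '=' + item.value)
```

Requests sent with opener.open(...) would then carry any loaded cookies that match the target domain.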

3. urljoin

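We pass a base_url (the base link) as the first argument and the new link as the second. The method analyzes the scheme, netloc, and path of base_url and uses them to fill in whatever is missing from the new link, then returns the result.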
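A short sketch of that completion rule follows. The paths and the www.example.com link are placeholders chosen for illustration.

```python
from urllib.parse import urljoin

base_url = 'http://www.baidu.com/a/b.html'

# A relative link replaces the last path segment of base_url
relative = urljoin(base_url, 'c.html')
print(relative)   # http://www.baidu.com/a/c.html

# A link starting with "/" keeps only the scheme and netloc of base_url
rooted = urljoin(base_url, '/c.html')
print(rooted)     # http://www.baidu.com/c.html

# A complete link is returned unchanged; nothing is taken from base_url
absolute = urljoin(base_url, 'https://www.example.com/FAQ.html')
print(absolute)   # https://www.example.com/FAQ.html

# A base_url without a path still supplies the scheme and netloc
bare = urljoin('http://www.baidu.com', 'FAQ.html')
print(bare)       # http://www.baidu.com/FAQ.html
```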

4. urlencode()

    from urllib.parse import urlencode

    # Serialize a dict into GET query parameters
    params = {
        'name': 'germey',
        'age': '23'
    }
    base_url = 'http://www.baidu.com?'
    url = base_url + urlencode(params)
    print(url)

5. parse_qs

Deserialization: converts the parameters of a GET request back into a dictionary.

    from urllib.parse import parse_qs

    query = 'name=germey&age=22'
    print(parse_qs(query))

parse_qsl: converts the parameters into a list of tuples.

    from urllib.parse import parse_qsl

    query = 'name=germey&age=22'
    print(parse_qsl(query))
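urlencode and parse_qs / parse_qsl work in opposite directions, and a round trip makes the difference in their return types visible. This is a small sketch; the variable names are illustrative only.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlparse

params = {'name': 'germey', 'age': '23'}

# Serialize the dict into a query string and append it to the base URL
url = 'http://www.baidu.com?' + urlencode(params)
print(url)        # http://www.baidu.com?name=germey&age=23

# Pull the query string back out of the URL
query = urlparse(url).query

# parse_qs maps each key to a list, because a key may repeat in a query
as_dict = parse_qs(query)
print(as_dict)    # {'name': ['germey'], 'age': ['23']}

# parse_qsl returns (key, value) tuples; dict() restores the original
as_pairs = parse_qsl(query)
restored = dict(as_pairs)
print(restored)   # {'name': 'germey', 'age': '23'}
```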

6. quote

Converts content into URL-encoded (percent-encoded) form.
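A brief sketch of quote and its inverse unquote follows. The keyword and the search URL are illustrative assumptions, chosen because non-ASCII text in a query is the usual reason to reach for quote.

```python
from urllib.parse import quote, unquote

# Non-ASCII text is encoded as UTF-8 bytes, each written as %XX
keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)            # https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

# unquote reverses the encoding
decoded = unquote(url)
print(decoded)        # https://www.baidu.com/s?wd=壁纸

# Reserved ASCII characters such as space and & are escaped as well
escaped = quote('a b&c')
print(escaped)        # a%20b%26c
```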

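7. Analyzing the Robots protocol

1. The Robots protocol

The Robots protocol, also called the crawler protocol or robot protocol, is formally named the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt, generally placed in the site's root directory.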
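As a rough illustration of such a file, the sketch below parses a made-up robots.txt with the standard library's RobotFileParser and asks whether two URLs may be fetched. The rules and the www.example.com URLs are assumptions chosen for illustration, not any real site's policy. Feeding the lines in directly keeps the example offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block /search for every crawler,
# allow article pages under /p/
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /p/
"""

rp = RobotFileParser()
# Parse the lines directly instead of downloading a file
rp.parse(ROBOTS_TXT.splitlines())

article_ok = rp.can_fetch('*', 'http://www.example.com/p/b67554025d7d')
search_ok = rp.can_fetch('*', 'http://www.example.com/search?q=python')
print(article_ok)   # True
print(search_ok)    # False
```

Rules are matched by path prefix, so /search also covers /search?q=python.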

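2. robotparser

set_url: sets the link to the robots.txt file. If the link was passed in when the RobotFileParser object was created, there is no need to call this method.

read: fetches the robots.txt file and parses it. If this method is not called, every subsequent check returns False, so be sure to call it. It returns nothing, but it does perform the fetch.

parse: parses robots.txt content. The argument is a list of lines from robots.txt, which it analyzes according to the robots.txt syntax rules.

can_fetch: takes two arguments, the User-agent and the URL to fetch, and returns True or False depending on whether that user agent may fetch the URL.

mtime: returns the time robots.txt was last fetched and parsed. This matters for long-running crawlers, which may need to check periodically and fetch the latest robots.txt.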

modified: likewise useful for long-running crawlers; it sets the time robots.txt was last fetched and parsed to the current time.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url('http://www.jianshu.com/robots.txt')
    # read() must run before can_fetch(), otherwise every check is False
    rp.read()
    print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
    print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))

8. requests

1. get:

    import re
    import requests

    # Browser identification: without a User-Agent header the site refuses the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }
    r = requests.get("http://www.zhihu.com/explore", headers=headers)
    pattern = re.compile('explore-feed.*?question.*?>(.*?)</a>', re.S)
    titles = re.findall(pattern, r.text)
    print(titles)

    # Binary content: download an icon and write the raw bytes to disk
    r = requests.get("http://github.com/favicon.ico")
    with open('favicon.ico', 'wb') as f:
        f.write(r.content)

2. post:

    import requests

    data = {
        'name': 'name',
        'age': '22'
    }
    r = requests.post("http://httpbin.org/post", data=data)
    print(r.text)

    r = requests.get('http://www.zhihu.com')
    print(type(r.status_code), r.status_code)  # status code
    print(type(r.headers), r.headers)          # response headers
    print(type(r.cookies), r.cookies)          # cookies
    print(type(r.url), r.url)                  # URL
    print(type(r.history), r.history)          # request history

9. Advanced usage of requests:

1. File upload:

    import requests

    # Upload a local file, here the favicon.ico saved in the get example
    with open('favicon.ico', 'rb') as f:
        files = {'file': f}
        r = requests.post("http://httpbin.org/post", files=files)
    print(r.text)

2. cookies:

    import requests

    # Read the cookies set by a response
    r = requests.get("http://www.baidu.com")
    print(r.cookies)
    for key, value in r.cookies.items():
        print(key + '=' + value)

    # Send cookies copied from a logged-in browser session.
    # The request header is named Cookie, not Cookies.
    headers = {
        'Cookie': 'tst=r; __utma=51854390.2112264675.1539419567.1539419567.1539433913.2; __utmb=51854390.0.10.1539433913; __utmc=51854390; __utmv=51854390.100--|2=registration_date=20160218=1^3=entry_date=20160218=1; __utmz=51854390.1539433913.2.2.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/; tgw_l7_route=e0a07617c1a38385364125951b19eef8; q_c1=d3c7341e344d460ead79171d4fd56f6f|1539419563000|1516290905000; _xsrf=713s0UsLfr6m5Weplwb4offGhSqnugCy; z_c0="2|1:0|10:1533128251|4:z_c0|92:Mi4xS2VDaEFnQUFBQUFBZ09DVGo1ZUtEU1lBQUFCZ0FsVk5PX3hPWEFEVXNtMXhSbmhjbG5NSjlHQU9naEpLbkwxYlpB|e71c25127cfb23241089a277f5d7c909165085f901f9d58cf93c5d7ec7420217"; d_c0="AIDgk4-Xig2PTlryga7LwT30h_-3DUHnGbc=|1525419053"; __DAYU_PP=zYA2JUmBnVe2bBjq7qav2ac8d8025bbd; _zap=d299f20c-20cc-4202-a007-5dd6863ccce9',
        'Host': 'www.zhihu.com',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    }
    r = requests.get("http://www.zhihu.com", headers=headers)
    print(r.text)

3. Session persistence:

    import requests

    # Two independent requests do not share cookies
    requests.get("http://httpbin.org/cookies/set/umber/123456789")
    r = requests.get("http://httpbin.org/cookies")
    print(r.text)

    # A Session keeps cookies across requests
    s = requests.Session()
    s.get("http://httpbin.org/cookies/set/umber/123456789")
    r = s.get('http://httpbin.org/cookies')
    print(r.text)

Output:

    {
      "cookies": {}
    }

    {
      "cookies": {
        "umber": "123456789"
      }
    }

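4. SSL certificate verification

requests also provides certificate verification. When it sends an HTTPS request it checks the SSL certificate, and the verify parameter controls whether that check is performed. If verify is omitted it defaults to True, so the certificate is verified automatically.

    import logging
    import requests
    # from requests.packages import urllib3

    # Route the insecure-request warning into logging instead of printing it
    logging.captureWarnings(True)
    # Alternative: urllib3.disable_warnings()
    response = requests.get('https://www.12306.cn', verify=False)
    print(response.status_code)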

5. Proxy settings

6. Timeout settings

    import requests

    # Raise an exception if no response arrives within 1 second
    r = requests.get('http://www.taobao.com', timeout=1)

7. Authentication

    import requests
    from requests.auth import HTTPBasicAuth

    # HTTP Basic authentication against a local test server
    r = requests.get('http://localhost:5000', auth=HTTPBasicAuth('username', 'password'))
    print(r.status_code)

8. Prepared Request: representing a request as a data structure

    from requests import Request, Session

    url = 'http://httpbin.org/post'
    data = {'name': 'germey'}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }
    s = Session()
    # Build the request as an object, prepare it, then send it
    req = Request('POST', url, data=data, headers=headers)
    prepped = s.prepare_request(req)
    r = s.send(prepped)
    print(r.text)

10. Regular expressions:

https://www.cnblogs.com/chengchengaqin/p/9708044.html

Reposted from: https://www.cnblogs.com/chengchengaqin/p/9784229.html
