Python3 Web Crawler Study Notes C01 【Using the Basic Library urllib】

Published: 2023/12/10

Contents

【1.1】 Introduction to urllib
【1.2】 urllib.request: sending requests
- 【1.2.1】 urllib.request.urlopen()
  - 【1.2.1.1】 Basic usage
  - 【1.2.1.2】 Adding parameters
    - ● The data parameter
    - ● The timeout parameter
- 【1.2.2】 urllib.request.Request()
- 【1.2.3】 urllib.request.urlretrieve()
【1.3】 urllib.error: exception handling
- 【1.3.1】 URLError
- 【1.3.2】 HTTPError
- 【1.3.3】 Advanced usage
【1.4】 urllib.parse: parsing URLs
- 【1.4.1】 urllib.parse.urlencode()
- 【1.4.2】 urllib.parse.parse_qs()
- 【1.4.3】 urllib.parse.parse_qsl()
- 【1.4.4】 urllib.parse.urlparse()
- 【1.4.5】 urllib.parse.urlunparse()
- 【1.4.6】 urllib.parse.urlsplit()
- 【1.4.7】 urllib.parse.urlunsplit()
- 【1.4.8】 urllib.parse.urljoin()
- 【1.4.9】 urllib.parse.quote()
- 【1.4.10】 urllib.parse.unquote()
【1.5】 urllib.robotparser: checking crawl permissions
- 【1.5.1】 About the Robots protocol
- 【1.5.2】 Common methods of the RobotFileParser class

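【1.1】 Introduction to urllib

Python offers two common ways to send HTTP requests: the built-in urllib library and the third-party requests library.

urllib is Python's built-in HTTP request library, so it needs no separate installation. Python 2 split request sending across two libraries, urllib and urllib2; Python 3 merges them into a single urllib package. Official documentation: https://docs.python.org/3/library/urllib.html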

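Commonly used modules in urllib:

urllib.request: sends requests;
urllib.error: the exception module, used to catch errors raised while requesting;
urllib.parse: parses, splits, and joins URLs;
urllib.robotparser: reads a site's robots.txt file and determines which content may be crawled.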

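Commonly used functions and classes in urllib.request:

urllib.request.urlopen(): opens a URL, which can be given as a string or as a Request object;
urllib.request.Request(): builds a request so that extra information such as headers can be sent with it;
urllib.request.urlretrieve(): writes the content fetched from a URL to a local file.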

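The two exceptions in urllib.error:

URLError: inherits from OSError and is the base class of the error module. Errors that the request module raises for a failed request can generally be handled by catching this class.
HTTPError: a subclass of URLError dedicated to HTTP request errors, such as a failed authentication request.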

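Commonly used functions in urllib.parse:

urllib.parse.urlencode(): serializes a dict of parameters into GET query parameters;
urllib.parse.parse_qs(): deserializes GET query parameters back into a dict;
urllib.parse.parse_qsl(): converts query parameters into a list of tuples;
urllib.parse.urlparse(): splits a URL into components (returns 6 parts);
urllib.parse.urlunparse(): assembles a URL from components (the argument must have length 6);
urllib.parse.urlsplit(): splits a URL into components (does not parse the params part separately, returns 5 parts);
urllib.parse.urlunsplit(): assembles a URL from components (the argument must have length 5);
urllib.parse.urljoin(): combines two URLs (no length restriction; given a base URL and a new link, it analyses the scheme, netloc and path of the base, fills in whatever the new link is missing, and returns the result);
urllib.parse.quote(): converts content to URL (percent) encoding;
urllib.parse.unquote(): decodes a URL-encoded string.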

The class in urllib.robotparser:

RobotFileParser: uses a site's robots.txt file to decide whether a crawler is permitted to fetch a given page.

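【1.2】 urllib.request: sending requests

【1.2.1】 urllib.request.urlopen()

【1.2.1.1】 Basic usage

The API of the urlopen() function:

urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)

This is the signature as documented for Python 3.6. The cafile, capath and cadefault parameters were deprecated in 3.6 and removed in Python 3.13; on current versions, pass an ssl.SSLContext through context instead.

Basic usage: running the following code fetches the page source of https://www.itrhx.com/: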

import urllib.request

response = urllib.request.urlopen('https://www.itrhx.com/')
print(response.read().decode('utf-8'))

Print the type and attributes of the response object:

import urllib.request

response = urllib.request.urlopen('https://www.itrhx.com/')
print(type(response))                # type of the response object
print(response.status)               # status code; 200 means the request succeeded
print(response.getheaders())         # all response headers
print(response.getheader('Server'))  # value of the Server response header

Output:

<class 'http.client.HTTPResponse'>
200
[('Content-Type', 'text/html; charset=utf-8'), ('Server', 'GitHub.com'), ('Last-Modified', 'Sat, 17 Aug 2019 12:16:48 GMT'), ('ETag', '"5d57f030-10863"'), ('Access-Control-Allow-Origin', '*'), ('Expires', 'Sat, 17 Aug 2019 19:41:25 GMT'), ('Cache-Control', 'max-age=600'), ('X-Proxy-Cache', 'MISS'), ('X-GitHub-Request-Id', 'C748:735D:5B7461:619B95:5D58560B'), ('Content-Length', '67683'), ('Accept-Ranges', 'bytes'), ('Date', 'Sun, 18 Aug 2019 13:28:44 GMT'), ('Via', '1.1 varnish'), ('Age', '228'), ('Connection', 'close'), ('X-Served-By', 'cache-tyo19931-TYO'), ('X-Cache', 'HIT'), ('X-Cache-Hits', '1'), ('X-Timer', 'S1566134924.190474,VS0,VE0'), ('Vary', 'Accept-Encoding'), ('X-Fastly-Request-ID', '25a69f8130fc9cae412d28990a724543d7d05e8b')]
GitHub.com
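In longer scripts it helps to close the response explicitly and to take the charset from the response headers rather than hard-coding utf-8. The following is a minimal sketch of that idea. fetch_text is a hypothetical helper name (not part of urllib), and falling back to utf-8 when the server declares no charset is an assumption.

```python
import urllib.request


def fetch_text(url, timeout=10, default_charset='utf-8'):
    """Fetch a URL and decode the body to str (hypothetical helper)."""
    # The with block closes the response even if read() fails.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # get_content_charset() reads the charset from Content-Type,
        # or returns None when the header does not declare one.
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)
```

Calling fetch_text('https://www.itrhx.com/') would then return the page source as a string.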

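【1.2.1.2】 Adding parameters

As the urlopen() API shows, besides the basic url argument we can pass other things, such as data (additional data to send) and timeout. The data and timeout parameters are illustrated below.

● The data parameter

To pass data, the parameters must first be converted to the bytes type, for example with bytes(). Also, once data is supplied, the request is no longer sent as GET but as POST. Example: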

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

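The httpbin.org site provides HTTP request testing, and http://httpbin.org/post is its endpoint for testing POST requests. The example sends a parameter named word with the value hello. bytes() converts it to the bytes type: its first argument must be a str, so urlencode() from the urllib.parse module is used to turn the parameter dict into a string, and its second argument sets the encoding to utf8. Output: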

b'{"args": {},"data": "", "files": {},"form": {"word": "hello"},"headers": {"Accept-Encoding": "identity", "Content-Length": "10","Content-Type": "application/x-www-form-urlencoded","Host": "httpbin.org","User-Agent": "Python-urllib/3.6"}, "json": null, "origin": "171.115.101.10, 171.115.101.10", "url": "https://httpbin.org/post" }'
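What urlopen() would send in this example can be inspected without touching the network by building Request objects and looking at them. A small sketch: when no method is given, Request infers GET or POST from whether data is present.

```python
import urllib.parse
import urllib.request

# Encode the form fields to a str, then turn the str into bytes.
body = urllib.parse.urlencode({'word': 'hello'}).encode('utf-8')
print(body)  # b'word=hello'

# No method is given, so it is inferred from the presence of data.
get_req = urllib.request.Request('http://httpbin.org/get')
post_req = urllib.request.Request('http://httpbin.org/post', data=body)
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(len(body))              # 10
```

The body is 10 bytes long, which matches the Content-Length of "10" that httpbin.org echoes back.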

● The timeout parameter

Example:

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
print(response.read())

Output:

...
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Lenovo/Desktop/1.py", line 2, in <module>
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
...
urllib.error.URLError: <urlopen error timed out>

timeout is given in seconds. Here it is set to 0.1, so when the server has not responded after 0.1 seconds a URLError exception is raised.
Going further: use a try/except statement to catch the exception.
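One way to write that try/except is sketched below. A detail worth knowing: a timeout while connecting is wrapped in URLError (with a socket.timeout as its reason), but a timeout while waiting for or reading the response can surface as a bare socket.timeout, so the sketch checks for both. is_timeout and fetch_or_none are hypothetical helper names, not part of urllib.

```python
import socket
import urllib.error
import urllib.request


def is_timeout(exc):
    """Return True if exc represents a timeout (hypothetical helper)."""
    if isinstance(exc, urllib.error.URLError):
        # URLError keeps the underlying error in its reason attribute.
        return isinstance(exc.reason, socket.timeout)
    return isinstance(exc, socket.timeout)


def fetch_or_none(url, timeout=0.1):
    """Return the response body, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as e:
        if is_timeout(e):
            print('TIME OUT')
            return None
        raise  # not a timeout: let the caller see the error
```

With this in place, fetch_or_none('http://httpbin.org/get', timeout=0.1) prints TIME OUT and returns None instead of ending the script with a traceback.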

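【1.2.2】 urllib.request.Request()

A Request object lets you attach information such as data and headers to a request.
The Request constructor:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

The constructor's parameters:

url: the URL to request. This is the only required parameter; all the others are optional.
data: if supplied, it must be bytes, not str. If the data is a dict, encode it first with urlencode() from the urllib.parse module.
headers: a dict holding the request headers. Headers can be passed through the headers argument when the request is constructed, or added afterwards by calling the request instance's add_header() method. The most common use is to change User-Agent to imitate a browser; the default User-Agent identifies the client as Python-urllib.
origin_req_host: the host name or IP address of the requester.
unverifiable: indicates whether the request is unverifiable; the default is False. An unverifiable request is one whose URL the user had no opportunity to approve. For example, if an image in an HTML document is fetched automatically without the user having approved it, unverifiable should be True.
method: a string naming the HTTP method to use, such as GET, POST or PUT.

A simple example: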

import urllib.request
import urllib.parse

url = 'http://www.baidu.com/'
# Custom headers used to imitate a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
# Build the request object
request = urllib.request.Request(url=url, headers=headers)
# Send the request
response = urllib.request.urlopen(request)
print(response.read().decode())
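The other two ways of shaping a request, add_header() and the method parameter, can be tried without sending anything. A short sketch: the http://httpbin.org/put endpoint and the User-Agent string are illustrative assumptions, and the request is only built here, never sent.

```python
import json
import urllib.request

# Headers can be passed to the constructor...
req = urllib.request.Request(
    'http://httpbin.org/put',
    data=json.dumps({'word': 'hello'}).encode('utf-8'),
    headers={'User-Agent': 'Mozilla/5.0 (compatible; demo)'},
    method='PUT',
)
# ...or added afterwards with add_header().
req.add_header('Content-Type', 'application/json')

print(req.get_method())                # PUT
# Request stores header names via str.capitalize(), hence 'User-agent'.
print(req.get_header('User-agent'))    # Mozilla/5.0 (compatible; demo)
print(req.has_header('Content-type'))  # True
```

Passing req to urllib.request.urlopen() would then send it as a PUT request with both headers attached.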

【1.2.3】 urllib.request.urlretrieve()

Saves the content fetched from a URL to a file, here in the current folder. A simple example:

import urllib.request

url = 'https://www.itrhx.com/images/trhx.png'

# Equivalent manual approach:
# response = urllib.request.urlopen(url)
# with open('trhx.png', 'wb') as fp:
#     fp.write(response.read())

urllib.request.urlretrieve(url, 'trhx.png')
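The Python documentation lists urlretrieve() under its legacy interface and notes that it might become deprecated at some point. As an alternative, the sketch below streams the response to disk in chunks with urlopen(). download is a hypothetical helper name, and the 30-second default timeout is an arbitrary choice.

```python
import shutil
import urllib.request


def download(url, filename, timeout=30):
    """Stream the body at url into filename (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(filename, 'wb') as fp:
        # copyfileobj copies in chunks, so large files are not held in memory.
        shutil.copyfileobj(response, fp)
    return filename
```

For the image above, the call would be download('https://www.itrhx.com/images/trhx.png', 'trhx.png').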

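【1.3】 urllib.error: exception handling

【1.3.1】 URLError

Opening a page that does not exist raises an error that can be caught as URLError (for an HTTP 404 the concrete exception is its subclass HTTPError). The exception has a reason attribute that gives the cause of the error. A simple example:

from urllib import request, error

try:
    response = request.urlopen('https://www.itrhx.com/index/')
except error.URLError as e:
    print(e.reason)

Output: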

Not Found

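【1.3.2】 HTTPError

A subclass of URLError dedicated to HTTP request errors, such as a failed authentication request. It has the following three attributes:

code: the HTTP status code, for example 404 when the page does not exist or 500 for an internal server error.
reason: as in the parent class, the cause of the error.
headers: the headers of the HTTP response that triggered the error.

A simple example:

from urllib import request, error

try:
    response = request.urlopen('https://www.itrhx.com/index/')
except error.HTTPError as e:
    print(e.code, e.reason, e.headers)

Output: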

404 Not Found Content-Type: text/html; charset=utf-8
Server: GitHub.com
ETag: "5d57f030-7f2"
Access-Control-Allow-Origin: *
X-Proxy-Cache: MISS
X-GitHub-Request-Id: 4B46:2F5D:6DE0F1:755BB2:5D5964C5
Content-Length: 2034
Accept-Ranges: bytes
Date: Sun, 18 Aug 2019 14:50:41 GMT
Via: 1.1 varnish
Age: 252
Connection: close
X-Served-By: cache-tyo19951-TYO
X-Cache: HIT
X-Cache-Hits: 1
X-Timer: S1566139842.563134,VS0,VE0
Vary: Accept-Encoding
X-Fastly-Request-ID: e9eb0a507be66a866bfaa7c5cc2e1c53b1f7ccab

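【1.3.3】 Advanced usage

Because URLError is the parent class of HTTPError, the except clause for the subclass must come first and the one for the parent class second; otherwise the URLError clause would also catch every HTTPError. The previous code, improved:

from urllib import request, error

try:
    response = request.urlopen('https://www.itrhx.com/index/')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')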
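The same subclass-first ordering can be wrapped in a function so that callers receive a result instead of an exception. This is a sketch only: fetch_or_explain is a hypothetical name, and returning a (body, message) pair is just one possible design.

```python
from urllib import error, request


def fetch_or_explain(url, timeout=10):
    """Return (body, None) on success or (None, message) on failure."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read(), None
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return None, 'HTTPError {}: {}'.format(e.code, e.reason)
    except error.URLError as e:
        # No usable answer: DNS failure, refused connection, missing file, ...
        return None, 'URLError: {}'.format(e.reason)
```

For the missing page used above, the message would start with "HTTPError 404", while an unreachable host would produce a message starting with "URLError".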

【1.4】 urllib.parse: parsing URLs

【1.4.1】 urllib.parse.urlencode()

Serializes a dict of parameters into GET query parameters. Example:

from urllib.parse import urlencode

data = {
    'ie': 'utf-8',
    'wd': 'TRHX',
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(data)
print(url)

Output:

http://www.baidu.com?ie=utf-8&wd=TRHX
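Two urlencode() options come up often in crawlers: quote_via, which controls how spaces are encoded, and doseq, which expands list values into repeated keys. A brief sketch with made-up parameter values:

```python
from urllib.parse import quote, urlencode

# By default spaces become '+', because urlencode() uses quote_plus().
plus = urlencode({'wd': 'hello world'})
print(plus)   # wd=hello+world

# quote_via=quote percent-encodes spaces instead.
pct = urlencode({'wd': 'hello world'}, quote_via=quote)
print(pct)    # wd=hello%20world

# doseq=True expands a sequence into one key=value pair per element.
multi = urlencode({'tag': ['a', 'b']}, doseq=True)
print(multi)  # tag=a&tag=b
```

Without doseq=True, a list value is converted with str() and encoded as a single value, which is rarely what a server expects.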

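【1.4.2】 urllib.parse.parse_qs()

The opposite of urlencode(): deserializes GET query parameters back into a dict. Each value in the dict is a list, because a key may appear more than once in a query string. Example:

from urllib.parse import parse_qs

query = 'name=TRHX&age=20'
print(parse_qs(query))

Output: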

{'name': ['TRHX'], 'age': ['20']}

【1.4.3】 urllib.parse.parse_qsl()

Converts query parameters into a list of (name, value) tuples. Example:

from urllib.parse import parse_qsl

query = 'name=TRHX&age=20'
print(parse_qsl(query))

Output:

[('name', 'TRHX'), ('age', '20')]
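The difference between the two functions shows up with repeated keys and empty values. The sketch below uses a made-up query string; keep_blank_values is a real parameter of both functions and defaults to False.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = 'tag=a&tag=b&name=TRHX&empty='

as_dict = parse_qs(query)
print(as_dict)
# {'tag': ['a', 'b'], 'name': ['TRHX']}

with_blank = parse_qs(query, keep_blank_values=True)
print(with_blank)
# {'tag': ['a', 'b'], 'name': ['TRHX'], 'empty': ['']}

as_pairs = parse_qsl(query)
print(as_pairs)
# [('tag', 'a'), ('tag', 'b'), ('name', 'TRHX')]

# The list of pairs feeds straight back into urlencode().
round_trip = urlencode(parse_qsl(query, keep_blank_values=True))
print(round_trip == query)  # True
```

parse_qs() groups repeated keys into one list, while parse_qsl() preserves the original order of the pairs.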

【1.4.4】 urllib.parse.urlparse()

Splits a URL into components and returns 6 parts. Example:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

Output:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

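The result is a ParseResult object with 6 parts: scheme, netloc, path, params, query and fragment, which stand for the protocol, the domain, the path, the parameters, the query string and the anchor respectively.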
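ParseResult is a named tuple, so its fields can be read by name or by index, and it offers a few derived attributes as well. A short sketch; the port number 8080 is added to the URL purely for illustration.

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com:8080/index.html;user?id=5#comment')

# Fields are available by name or by index.
print(result.scheme, result[0])      # http http
print(result.netloc)                 # www.baidu.com:8080
# hostname and port are derived from netloc.
print(result.hostname, result.port)  # www.baidu.com 8080

# Without a scheme or a leading //, the host is treated as part of the path.
bare = urlparse('www.baidu.com/index.html')
print(repr(bare.netloc), bare.path)  # '' www.baidu.com/index.html

# _replace() returns a modified copy; geturl() reassembles it.
secure = result._replace(scheme='https', fragment='').geturl()
print(secure)  # https://www.baidu.com:8080/index.html;user?id=5
```

The scheme-less case is a common source of surprises when crawling links copied from page text rather than from href attributes.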

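【1.4.5】 urllib.parse.urlunparse()

The opposite of urlparse(): assembles a URL from components. The argument is an iterable whose length must be exactly 6; otherwise an error about too few or too many values is raised. Example:

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

Output: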

http://www.baidu.com/index.html;user?a=6#comment

【1.4.6】 urllib.parse.urlsplit()

Similar to urlparse(), but it does not parse the params part separately and returns only 5 parts; params stays inside path. Example:

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)

Output:

SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

【1.4.7】 urllib.parse.urlunsplit()

Similar to urlunparse(): assembles a URL from components. The argument is again an iterable, and its length must be exactly 5. Example:

from urllib.parse import urlunsplit

data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))

Output:

http://www.baidu.com/index.html?a=6#comment
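urlunsplit() combines naturally with urlencode() when a URL has to be built from a host, a path and a dict of parameters. A minimal sketch; build_url is a hypothetical helper and https as the default scheme is an assumption.

```python
from urllib.parse import urlencode, urlunsplit


def build_url(host, path, params, scheme='https'):
    """Assemble a URL from its parts (hypothetical helper)."""
    query = urlencode(params)
    # The order is fixed: scheme, netloc, path, query, fragment.
    return urlunsplit((scheme, host, path, query, ''))


search_url = build_url('www.baidu.com', '/s', {'wd': 'TRHX', 'ie': 'utf-8'})
print(search_url)  # https://www.baidu.com/s?wd=TRHX&ie=utf-8
```

Compared with string concatenation, this keeps the separators (://, ? and #) out of the calling code.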

【1.4.8】 urllib.parse.urljoin()

Combines two URLs. The first argument is the base URL and the second is the new link; urljoin() analyses the scheme, netloc and path of the base URL, fills in whatever the new link is missing, and returns the result. Example:

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'friends.html'))
print(urljoin('http://www.baidu.com', 'https://www.itrhx.com/friends.html'))
print(urljoin('http://www.baidu.com/friends.html', 'https://www.itrhx.com/friends.html'))
print(urljoin('http://www.baidu.com/friends.html', 'https://www.itrhx.com/friends.html?id=2'))
print(urljoin('http://www.baidu.com?wd=trhx', 'https://www.itrhx.com/index.html'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))

Output:

http://www.baidu.com/friends.html
https://www.itrhx.com/friends.html
https://www.itrhx.com/friends.html
https://www.itrhx.com/friends.html?id=2
https://www.itrhx.com/index.html
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2
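In a crawler, urljoin() is typically used to turn the href values found on a page into absolute URLs. The sketch below resolves a few kinds of relative links against a page URL; www.example.com and the href values are made up for illustration.

```python
from urllib.parse import urljoin

page = 'http://www.example.com/a/b/index.html'
hrefs = ['next.html', '../up.html', '/root.html', '//cdn.example.com/x.js', '#top']

# Resolve every link relative to the page it was found on.
resolved = [urljoin(page, href) for href in hrefs]
for url in resolved:
    print(url)
# http://www.example.com/a/b/next.html
# http://www.example.com/a/up.html
# http://www.example.com/root.html
# http://cdn.example.com/x.js
# http://www.example.com/a/b/index.html#top
```

A link starting with // keeps only the scheme of the page, and a link starting with / keeps the scheme and the host.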

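【1.4.9】 urllib.parse.quote()

Converts content to URL (percent) encoding. When a URL carries Chinese parameters, the Chinese characters can be converted to URL encoding this way. Example:

from urllib.parse import quote

keyword = '中国'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

Output: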

https://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD

【1.4.10】 urllib.parse.unquote()

The opposite of quote(): decodes a URL-encoded string. Example:

from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD'
print(unquote(url))

Output:

https://www.baidu.com/s?wd=中国
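quote() and unquote() each have a _plus variant that treats spaces as +, which is the convention for form data in query strings. A short sketch of the differences, using made-up input strings:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# '/' is considered safe by default and is left alone.
path_style = quote('a b/c')
print(path_style)    # a%20b/c

# safe='' encodes '/' as well.
strict = quote('a b/c', safe='')
print(strict)        # a%20b%2Fc

# quote_plus() turns spaces into '+' and encodes '/'.
form_style = quote_plus('a b/c')
print(form_style)    # a+b%2Fc

# unquote() leaves '+' alone; unquote_plus() turns it back into a space.
print(unquote('a+b%20c'))       # a+b c
print(unquote_plus('a+b%20c'))  # a b c
```

As a rule of thumb, quote() suits path segments and quote_plus() suits query string values.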

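【1.5】 urllib.robotparser: checking crawl permissions

【1.5.1】 About the Robots protocol

The Robots protocol, also called the crawler protocol, tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt, normally placed in the root directory of the site.

Basic format of robots.txt:

User-agent:
Disallow:
Allow:

User-agent is the name of the crawler; setting it to * makes the rules apply to every crawler;
Disallow specifies a directory that must not be crawled; setting it to / means no page may be crawled;
Allow specifies a directory that may be crawled. It is normally used together with Disallow rather than on its own, to carve exceptions out of a restriction.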

Some common search crawlers and the sites they belong to:

Crawler name	Site	URL

BaiduSpider	Baidu	www.baidu.com
Googlebot	Google	www.google.com
360Spider	360	www.so.com
Sogouspider	Sogou	www.sogou.com
YodaoBot	Youdao	www.youdao.com
Bingbot	Bing	www.bing.com
Yahoo! Slurp	Yahoo	www.yahoo.com
ia_archiver	Alexa	www.alexa.cn
Scooter	AltaVista	www.altavista.com

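【1.5.2】 Common methods of the RobotFileParser class

Declaration of the RobotFileParser class:

urllib.robotparser.RobotFileParser(url='')

Common methods:

set_url: sets the URL of the robots.txt file. If the URL was passed in when the RobotFileParser object was created, this method is not needed.
read: fetches the robots.txt file and parses it. The method returns nothing, but it performs the actual read. If neither read nor parse is called, every later check returns False.
parse: parses robots.txt content. The argument is a list of lines from a robots.txt file, which are analysed according to the robots.txt syntax rules.
can_fetch: takes two arguments, the User-agent and the URL to fetch, and returns True or False depending on whether that crawler may fetch the URL.
mtime: returns the time at which robots.txt was last fetched and parsed. Long-running crawlers can use it to decide when to fetch the latest robots.txt again.
modified: sets the time at which robots.txt was last fetched and parsed to the current time.

Taking Jianshu as an example:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.jianshu.com/p/6d9527300b4c'))
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=1&type=collections"))

Output: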

False
False
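The parse() method makes it possible to experiment with rules without fetching anything. The sketch below feeds a made-up robots.txt (not the real Jianshu file) to the parser; www.example.com and the MyCrawler user agent are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /p/
"""

rp = RobotFileParser()
# parse() takes the lines of the file, so no network access is needed.
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch('*', 'https://www.example.com/p/123'))            # True
print(rp.can_fetch('*', 'https://www.example.com/search?q=python'))  # False
# A crawler with no entry of its own falls back to the * rules.
print(rp.can_fetch('MyCrawler', 'https://www.example.com/about'))    # True
```

Paths that match no rule, such as /about here, are allowed by default.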
