Learning Python, Day 14: Some Useful Modules (urllib, requests), Network Programming Basics on the Way to Web Scraping

Published: 2024/3/12 python 60 豆豆

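The fake_useragent library
- Installing fake_useragent
- Usage
The urllib library
- urllib.request
  - Basic use of request
    - Parameters of urllib.request.urlopen
  - Getting the status code of a request
  - Setting request headers with urllib.request
  - Sending a request with data
  - HTTPS requests
  - Using a proxy server
  - Downloading data files
- urllib.error
- urllib.parse
- urllib.robotparser
The requests library
- Sending a simple request
- Closing the connection
- Handling encoding
- Setting request headers
  - Inspecting request headers
- Sending a POST request
- Sending a GET request with extra parameters
- Generic requests
- Using cookies
- Dealing with anti-scraping measures
- HTTPS requests
- Using a proxy
- Common attributes and methods of the response object
- Downloading binary files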

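One of Python's big advantages is how easy it makes writing web scrapers. Before writing one, though, you need some network programming basics, since a scraper is a network program.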

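The fake_useragent library

fake_useragent is a third-party library whose only job is to replace the User-Agent in the request headers, so that a request looks like it comes from a browser. It is convenient and applies to every request we send, so it comes first.

Installing fake_useragent

Install it with pip install fake-useragent, then import it with from fake_useragent import UserAgent.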

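Usage

The attributes of a UserAgent object generate User-Agent strings, for example:

    from fake_useragent import UserAgent

    useragent_chrome = UserAgent().chrome  # User-Agent string that imitates Chrome
    useragent_random = UserAgent().random  # User-Agent string for a randomly chosen browser

One thing to keep in mind: try to keep the fake_useragent library up to date.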

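The urllib library

For network programming, Python ships with a built-in library: urllib. It is split into the following modules:

urllib.request: opens and reads URLs
urllib.error: contains the exceptions raised by urllib.request
urllib.parse: parses URLs
urllib.robotparser: parses robots.txt files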

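urllib.request

urllib.request opens and reads URLs and is the most important and most frequently used module in urllib. It defines functions and classes for opening URLs, covering authentication, redirects, browser cookies and more, and it can simulate the way a browser issues a request.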

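Basic use of request

The simplest use of request is to send a request the way a browser would and read the response.

    from urllib.request import urlopen

    url = 'https://www.baidu.com'  # address to request
    resp = urlopen(url)  # send the request and get the response
    html = resp.read().decode('utf-8')  # read() returns bytes, so decode them as UTF-8
    print(html)  # show the response body

This gives us the response. The output is HTML text, which can also be written to a file and opened in a browser.

    with open('mybaidu.html', 'w', encoding='utf-8') as file:
        file.write(html)  # reuse the decoded text: the response body can only be read once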

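Parameters of urllib.request.urlopen

The full signature of urlopen is:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url: the URL to open; this is the only required parameter
data: additional data to send to the server, as bytes; defaults to None
timeout: how long to wait before giving up, in seconds
cafile and capath: file name and directory of the CA certificates used for HTTPS (deprecated since Python 3.6 and removed in Python 3.13; use context instead)
context: an ssl.SSLContext instance that specifies the SSL settings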

Getting the status code of a request

Whenever the server answers a request, the response carries a status code that describes the outcome, for example 200 for success or 404 for page not found.

    import urllib.error
    import urllib.request

    myURL1 = urllib.request.urlopen("https://www.runoob.com/")
    print(myURL1.getcode())  # 200

    try:
        myURL2 = urllib.request.urlopen("https://www.runoob.com/no.html")
    except urllib.error.HTTPError as e:
        if e.code == 404:
            print(404)  # 404

For the individual status codes and their meanings, see:

HTTP request status codes
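
As a quick offline reference, the standard library already knows the standard status codes. The sketch below uses http.HTTPStatus to look up the reason phrase for a code and adds the usual category by hundreds digit. The helper name describe_status is made up for this illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return 'code phrase (category)' for a standard HTTP status code.

    describe_status is a hypothetical helper name used only in this sketch.
    """
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    status = HTTPStatus(code)  # raises ValueError for codes the module does not define
    return f"{status.value} {status.phrase} ({categories[code // 100]})"


for code in (200, 301, 404, 503):
    print(describe_status(code))
```

Codes outside the standard set raise ValueError here, which is usually what you want in a quick lookup tool.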

Setting request headers with urllib.request

When fetching pages we usually need to imitate a browser's headers. For that we use the urllib.request.Request class:
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url: the URL
data: additional data to send with the request
headers: the request headers, as a dictionary
origin_req_host: host name or IP address of the original request
unverifiable: rarely used; indicates whether the request is unverifiable, that is, one whose URL the user had no option to approve
method: the request method, such as GET, POST, DELETE or PUT

    import urllib.parse
    import urllib.request

    url = 'https://www.runoob.com/?s='  # runoob.com (菜鳥教程) search page
    keyword = 'Python 教程'
    key_code = urllib.parse.quote(keyword)  # percent-encode the search keyword
    url_all = url + key_code
    header = {
        'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }  # header information
    request = urllib.request.Request(url_all, headers=header)
    response = urllib.request.urlopen(request)
    print(response.read().decode('utf-8'))

The example above passes the headers as an argument. You can also add headers with the Request.add_header() method.

    import urllib.request

    url = 'http://www.baidu.com/'
    req = urllib.request.Request(url)
    req.add_header("User-Agent", 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')
    html = urllib.request.urlopen(req).read()
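
To check that both ways of setting headers lead to the same request, a Request object can be inspected without sending anything. The following is a minimal sketch; the URL and the User-Agent string are placeholders chosen for this illustration.

```python
import urllib.request

# Placeholder URL and User-Agent string, chosen only for this sketch
url = "http://www.example.com/"
ua = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"

req_a = urllib.request.Request(url, headers={"User-Agent": ua})  # headers passed as an argument

req_b = urllib.request.Request(url)
req_b.add_header("User-Agent", ua)  # header added afterwards

# urllib stores header names capitalized, so the key becomes "User-agent"
print(req_a.get_header("User-agent"))
print(req_a.header_items() == req_b.header_items())
print(req_a.get_method())  # no data attached, so the method defaults to GET
```

Note the spelling "User-agent" when reading the header back: looking it up as "User-Agent" on the Request object finds nothing.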

Sending a request with data

The data passed to the Request class must be bytes, so it has to be encoded with urllib.parse first.

    import urllib.parse
    import urllib.request

    url = 'https://www.runoob.com/try/py3/py3_urllib_test.php'  # page the form is submitted to
    data = {'name': 'RUNOOB', 'tag': '菜鳥教程'}  # data to submit
    header = {
        'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }  # header information
    data = urllib.parse.urlencode(data).encode('utf8')  # encode the parameters; urllib.parse.parse_qs reverses this
    request = urllib.request.Request(url, data, header)  # build the request
    response = urllib.request.urlopen(request).read()  # read the result
    print(response.decode('utf-8'))
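
The encoding step can be tried on its own, without contacting any server. This sketch reuses the form fields from the example above, builds (but never sends) a Request against a placeholder example.com address, and then decodes the body again with urllib.parse.parse_qs.

```python
import urllib.parse
import urllib.request

form = {"name": "RUNOOB", "tag": "菜鳥教程"}  # same form fields as the example above

encoded = urllib.parse.urlencode(form)  # str: percent-encoded key=value pairs
body = encoded.encode("utf-8")          # bytes: what Request expects as data

# example.com is a placeholder; the request is only built here, never sent
req = urllib.request.Request("https://www.example.com/form", data=body)
print(req.get_method())  # data is present, so urllib will use POST

decoded = urllib.parse.parse_qs(body.decode("utf-8"))  # reverse the encoding
print(decoded)
```

Attaching data is what switches the request from GET to POST, unless a method is given explicitly.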

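HTTPS requests

    import ssl
    import urllib.request

    # Skip certificate verification (insecure, so only do this for sites you trust)
    context = ssl._create_unverified_context()
    # Pass it through the context parameter; request is a Request object such as the one built in the previous example
    resp = urllib.request.urlopen(request, context=context)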
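
The function ssl._create_unverified_context() is a private helper. A comparable context can be assembled from the public ssl API, as in this sketch, which only builds and inspects two contexts and opens no connection. The variable names strict and relaxed are illustrative.

```python
import ssl

# Default context: certificates and host names are verified
strict = ssl.create_default_context()

# Relaxed context built from public API; check_hostname must be switched off
# before verify_mode may be set to CERT_NONE
relaxed = ssl.create_default_context()
relaxed.check_hostname = False
relaxed.verify_mode = ssl.CERT_NONE

print(strict.verify_mode == ssl.CERT_REQUIRED, strict.check_hostname)
print(relaxed.verify_mode == ssl.CERT_NONE, relaxed.check_hostname)
```

Either context can be handed to urlopen through its context parameter. The relaxed one gives up protection against forged certificates, so it is best kept to testing.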

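Using a proxy server

Using a proxy server takes five steps: configure the proxy handler, build an opener from it, install the opener globally, create a Request object, and send the request:

    import urllib.request

    url = 'https://www.baidu.com'  # placeholder address to request
    # configure the proxy (daili.com:8888 is a placeholder proxy address)
    httpproxy_handler = urllib.request.ProxyHandler({'https': 'daili.com:8888'})
    # build an opener that uses the proxy
    opener = urllib.request.build_opener(httpproxy_handler)
    # install the opener globally so that urlopen uses it
    urllib.request.install_opener(opener)
    # create the Request object
    request = urllib.request.Request(url)
    # send the request
    html = urllib.request.urlopen(request).read()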
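
If the proxy should apply to some requests only, the opener can be used directly instead of being installed globally. A minimal sketch, reusing the placeholder proxy address from the example above and sending nothing:

```python
import urllib.request

# 'daili.com:8888' is the placeholder proxy address from the example above
proxy_handler = urllib.request.ProxyHandler({"https": "daili.com:8888"})
opener = urllib.request.build_opener(proxy_handler)

# Only requests made with opener.open(...) would go through the proxy;
# urllib.request.urlopen keeps its default behaviour because nothing was installed.
uses_proxy = any(isinstance(h, urllib.request.ProxyHandler) for h in opener.handlers)
print(uses_proxy, proxy_handler.proxies)
```

Calling opener.open(request) then plays the role that urllib.request.urlopen(request) plays after install_opener.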

Downloading data files

The urllib.request.urlretrieve() function downloads data files such as images and saves them under the given file name.

    urllib.request.urlretrieve(url, filename)

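urllib.error

The urllib.error module defines the exception classes raised by urllib.request. The base exception class is URLError, and the two classes you will meet most often are URLError and HTTPError.

URLError is a subclass of OSError. Handlers raise this exception (or one derived from it) when they run into a problem. Its reason attribute holds the cause.

HTTPError is a subclass of URLError, used for HTTP-specific errors such as authentication requests. Its code attribute is the HTTP status code, reason is the cause, and headers holds the HTTP response headers of the request that caused the HTTPError.

For example, fetching a page that does not exist and handling the exception:

    import urllib.error
    import urllib.request

    myURL1 = urllib.request.urlopen("https://www.runoob.com/")
    print(myURL1.getcode())  # 200

    try:
        myURL2 = urllib.request.urlopen("https://www.runoob.com/no.html")
    except urllib.error.HTTPError as e:
        if e.code == 404:
            print(404)  # 404

A download function that retries on server errors:

    import urllib.error
    import urllib.request

    def download_with_retry(url, num_retries):
        print('Downloading', url)
        try:
            # download the page and get its HTML content
            html = urllib.request.urlopen(url).read()
        except urllib.error.URLError as e:
            html = None  # nothing was downloaded
            if hasattr(e, 'reason'):
                print('Download failed')
                print('Reason:', e.reason)
            if hasattr(e, 'code'):
                print('The server could not fulfil the request')
                print('Error code:', e.code)
                # retry only on 5xx server errors
                if num_retries > 0 and 500 <= e.code < 600:
                    return download_with_retry(url, num_retries - 1)
        return html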
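
The retry logic can be exercised without a network by injecting the download function. The sketch below is one possible variant, not the code above: fetch_with_retry and flaky_fetch are hypothetical names, and flaky_fetch is a stand-in that fails twice with a 503 before succeeding. It also shows why HTTPError has to be caught before URLError.

```python
import io
import urllib.error


def fetch_with_retry(fetch, url, num_retries=2):
    """Call fetch(url), retrying only when the server answers with a 5xx error.

    fetch is any callable taking a URL; in real use it could be
    lambda u: urllib.request.urlopen(u).read().
    """
    try:
        return fetch(url)
    except urllib.error.HTTPError as e:
        # HTTPError must be caught before URLError because it is a subclass of it
        if num_retries > 0 and 500 <= e.code < 600:
            return fetch_with_retry(fetch, url, num_retries - 1)
        raise
    except urllib.error.URLError as e:
        # No HTTP status here (for example a DNS failure), so nothing to retry on
        print("failed:", e.reason)
        raise


attempts = []


def flaky_fetch(url):
    """Stand-in for a real download: fails twice with 503, then succeeds."""
    attempts.append(url)
    if len(attempts) < 3:
        raise urllib.error.HTTPError(url, 503, "Service Unavailable", None, io.BytesIO())
    return b"<html>ok</html>"


result = fetch_with_retry(flaky_fetch, "https://www.example.com/")
print(result, len(attempts))
```

Re-raising instead of returning None leaves it to the caller to decide what a failed download means.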

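urllib.parse

urllib.parse parses URLs. The signature is:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

urlstring: the URL as a string
scheme: the protocol to assume when the URL does not specify one
allow_fragments: if false, fragment identifiers are not recognized. Instead they are parsed as part of the path, parameters or query component, and fragment is set to the empty string in the return value.

    from urllib.parse import urlparse

    o = urlparse("https://www.runoob.com/?s=python+%E6%95%99%E7%A8%8B")
    print(o)
    # Output:
    # ParseResult(scheme='https', netloc='www.runoob.com', path='/', params='', query='s=python+%E6%95%99%E7%A8%8B', fragment='')

You can also use urllib.parse.urlencode() to encode data (including Chinese text), as in the earlier example that sends a request with data through a urllib.request.Request object.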
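
The parsing and encoding helpers fit together as in the sketch below, which takes the URL from the example above apart and then encodes the search keyword in both directions. Only standard-library functions are used; the variable names are illustrative.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode, urlparse

parts = urlparse("https://www.runoob.com/?s=python+%E6%95%99%E7%A8%8B")
print(parts.scheme, parts.netloc, parts.path)

query = parse_qs(parts.query)  # decodes %XX escapes and turns '+' into a space
print(query)

keyword = "Python 教程"
escaped = quote(keyword)       # percent-encode a single value
print(escaped, unquote(escaped) == keyword)

print(urlencode({"s": keyword}))  # encode a whole dict as a query string
```

quote() writes a space as %20 while urlencode() writes it as +. Both forms decode back to the same text.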

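urllib.robotparser

urllib.robotparser parses robots.txt files. robots.txt (always lowercase) is a file stored in a site's root directory that implements the robots protocol. It tells search engines and other crawlers which parts of the site they may fetch.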
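
A minimal sketch of how the module can be used: the rules below are a hypothetical robots.txt, inlined so that the example runs without network access, and the crawler name MyCrawler and the example.com URLs are likewise made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # for a live site: parser.set_url(...) followed by parser.read()

print(parser.can_fetch("MyCrawler", "https://www.example.com/index.html"))
print(parser.can_fetch("MyCrawler", "https://www.example.com/private/data.html"))
```

A polite scraper would call can_fetch() before each download and skip the URLs that are disallowed.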

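The requests library

requests is another Python library for network programming. It is a third-party library, so it has to be installed first, for example with pip install requests.

Reference documentation for the requests library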

Sending a simple request

    import requests

    url = 'https://www.sogou.com/web?query=周杰倫'
    resp = requests.get(url)
    print(resp)  # result: <Response [200]>
    print(resp.text)  # result: the page content

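Closing the connection

The request headers can carry the setting Connection: keep-alive. If the connection is not closed explicitly, the TCP connection stays occupied for a long time. One option is to send the request with Connection: close. In practice, though, you can simply close the response object directly.

    resp.close()  # close the connection held by resp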

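Handling encoding

The response headers or the fetched HTML document usually carry a declaration such as charset=utf-8 that states which encoding the document uses. If the text looks garbled when you view it directly, set the encoding and then view it again.

    resp.encoding = 'utf-8'  # decode as UTF-8
    resp.encoding = resp.apparent_encoding  # use the encoding detected from the content; handy when the real encoding is unknown, and usually accurate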
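
Why the text comes out garbled can be shown with plain strings and bytes. The requests documentation states that it assumes ISO-8859-1 for text responses whose headers declare no charset, so a UTF-8 body gets decoded with the wrong table. The sketch below imitates that situation with the standard library only and is not requests' own code.

```python
text = "周杰倫"                     # the search keyword used in the example above
body = text.encode("utf-8")        # bytes as a server might send them

wrong = body.decode("iso-8859-1")  # decoding with the wrong charset gives mojibake
right = body.decode("utf-8")       # decoding with the real charset restores the text

print(repr(wrong))
print(right == text)
```

Setting resp.encoding before reading resp.text corresponds to choosing the right decode() call here.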

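Setting request headers

In the previous example, https://www.sogou.com/web?query=周杰倫 opens fine in a browser but not from the Python program. The fix is to set request headers so that the request looks like it comes from an ordinary browser.

With requests, the request headers go in a dictionary that is passed as an argument when sending the request.

    import requests

    url = 'https://www.sogou.com/web?query=周杰倫'
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
    }
    resp = requests.get(url, headers=header)
    print(resp.text)

The User-Agent value used here can be copied from the browser's developer tools.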

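Inspecting request headers

When testing a scraper, if you get blocked, the cause is very often the User-Agent in the request headers. So check first which headers were actually sent.

    import requests

    url = 'https://www.sogou.com/web?query=周杰倫'
    resp = requests.get(url)
    print(resp.request.headers)  # print the headers that were sent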

Sending a POST request

Use requests.post() to send a POST request. The data is passed in as a dictionary.

    import requests

    url = 'https://fanyi.baidu.com/sug'
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
    }
    d = {'kw': 'dog'}
    resp = requests.post(url, headers=header, data=d)
    print(resp.json())  # parse the JSON response body into Python objects and print them

Sending a GET request with extra parameters

The parameters of a GET request can be written directly in the URL, or passed separately:

    import requests

    url = 'https://movie.douban.com/j/chart/top_list'
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
    }
    param = {
        'type': '24',
        'interval_id': '100:90',
        'action': '',
        'start': 0,
        'limit': 20
    }
    resp = requests.get(url, headers=header, params=param)
    print(resp.request.url)  # the final URL, including the query string
    print(resp.json())

Generic requests

The requests.request() function sends a request of any type.

    resp = requests.request(method, url, **kwargs)

method and url are required. **kwargs is optional and can carry params, data, json, headers, cookies, files, proxies and so on.

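Using cookies

Some pages require cookie information. Normally the browser stores cookies locally and sends them along within the same session. With requests, each call such as requests.get() is an independent request, so cookies are not carried over. To keep them, create a session and log in through it. The login information, including the cookies, is then saved in the session.

    import requests

    # create a session
    session = requests.Session()
    # login URL
    url = 'https://passport.17k.com/ck/user/login'
    # log in through the session (no return value is needed: the login information is in the cookies, and the cookies are stored in the session)
    session.post(url, data={'loginName': 'xxxxxxxxx', 'password': 'xxxxxx'})
    # fetch the bookshelf data (requests made through the session use the stored cookies)
    resp = session.get('https://user.17k.com/ck/author/shelf?page=1&appKey=xxxxxxxxx')
    print(resp.json())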

Alternatively, send the request directly and put the cookie in the headers:

    import requests

    # request directly, with the cookie copied from the browser added to the headers
    cookie = "GUID=xxxxxxxxxxxxxxxxxxxxxxxxxxx; sajssdk_2015_cross_new_user=1; c_channel=0; c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%25xxxxxxxxxxxxxxxxxx33034.jpg-88x8xxxxxxxxxxxxxxxxxid%3D97233034%26nickname%3D%25Exxxxxxxxxxxx25E5%25A4%25A7%25E4%25BA%25A8%26e%3D1672538270%26s%3Dd0bfe4952e8fe4e2; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2297233034%2xxxxxxxxxxxxxxxxxxx181cbf9d99daa9-009b458c304258-57b1a33-2073600-181cbf9d99e67c%22%2C%22prxxxxxxxxxxxxxxxe%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9xxxxxxxxxxxxxxerrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%xxxxxxxxxxxxxxxxxx9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%xxxxxxxxxxxxxxx%22%7D%2C%22first_id%22%xxxxxxxxa388384-bf3dxxxxxxxxxxxxxxxxxx56%22%7D"
    resp = requests.get('https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919', headers={"Cookie": cookie})
    print(resp.json())
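
A cookie string copied from the browser can also be split into a dictionary, which requests accepts through its cookies parameter. The helper below is a hypothetical sketch written for this illustration, and the cookie string is a shortened, made-up one in the same shape as above.

```python
def cookie_header_to_dict(header_value):
    """Split a 'name=value; name2=value2' Cookie header into a dict.

    cookie_header_to_dict is a hypothetical helper written for this sketch.
    """
    cookies = {}
    for pair in header_value.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:  # skip fragments that contain no '='
            cookies[name] = value
    return cookies


# Shortened, made-up cookie string in the same shape as the one above
raw = "GUID=abc123; c_channel=0; c_csc=web"
print(cookie_header_to_dict(raw))
```

The resulting dictionary could then be passed as requests.get(url, cookies=...), which keeps individual cookies easy to read and change.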

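Dealing with anti-scraping measures

Common anti-scraping measures are User-Agent checks, cookie checks, hotlink protection and URL obfuscation. For the checks, add the corresponding information to the request headers. For hotlink protection, add a Referer header. URL obfuscation has to be analyzed case by case.

Here we use downloading a video from Pear Video (pearvideo.com) as an example. Analyzing the page and its source shows that the video URL has gone through a substitution, so once we have the response we substitute it back to get the correct video address.

    # 1. Get contId
    # 2. Get the JSON returned by videoStatus -> srcUrl
    # 3. Fix up the content of srcUrl
    # 4. Download the video
    import os

    import requests

    # video page URL
    url = 'https://pearvideo.com/video_1759848'
    contId = url.split('_')[1]
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36",
        "Referer": url  # hotlink protection: the page this request is supposed to come from
    }
    # URL that returns the video information
    videoStatusUrl = f'https://pearvideo.com/videoStatus.jsp?contId={contId}&mrd=0.5621134865680251'
    # get the corresponding JSON
    resp = requests.get(videoStatusUrl, headers=headers)
    # extract the correct video URL from the JSON
    dic = resp.json()
    srcUrl = dic['videoInfo']['videos']['srcUrl']
    systemTime = dic['systemTime']
    srcUrl = srcUrl.replace(systemTime, f'cont-{contId}')
    # download the video
    os.makedirs('./download', exist_ok=True)  # make sure the target folder exists
    filename = './download/' + srcUrl.split('/')[-1]
    with open(filename, mode='wb') as f:
        f.write(requests.get(srcUrl).content)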
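
The URL repair in step 3 is pure string handling, so it can be tried offline. In the sketch below the dictionary is a made-up response in the shape the example reads (the systemTime value and the video.example.com address are assumptions, not real data from the site).

```python
# Made-up response in the shape the example above reads; real values will differ
dic = {
    "systemTime": "1700000000000",
    "videoInfo": {
        "videos": {
            "srcUrl": "https://video.example.com/mp4/1700000000000-12345-hd.mp4"
        }
    },
}

page_url = "https://pearvideo.com/video_1759848"
cont_id = page_url.split("_")[1]  # the part after the underscore

src_url = dic["videoInfo"]["videos"]["srcUrl"]
system_time = dic["systemTime"]
real_url = src_url.replace(system_time, f"cont-{cont_id}")  # swap the timestamp for cont-<id>

print(real_url)
print(real_url.split("/")[-1])  # file name used when saving
```

Working on a captured sample like this makes it easier to get the substitution right before spending real requests on the site.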

HTTPS requests

    resp = requests.get(url, verify=False)  # skip certificate verification

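Using a proxy

Using a proxy with requests is similar to urllib: pass a proxy dictionary (protocol mapped to address and port) to the proxies parameter of the request.

    proxies = {
        "http": "39.130.150.44:80"  # just an example
    }
    # url and header are defined as in the earlier examples
    resp = requests.get(url, headers=header, proxies=proxies)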

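Common attributes and methods of the response object

Whether the request was sent with get or post, the call returns a Response object. Its attributes and methods give access to the data we want:
r = requests.get('https://www.baidu.com')

r.status_code: the status code
r.reason: the textual reason for the status (OK when the request succeeded)
r.cookies: the cookies sent back by the server
r.headers: the response headers
r.encoding: gets or sets the encoding of the response content
r.content: the response content as bytes
r.text: the response content as text (roughly r.content.decode(r.encoding))
r.json(): parses the response content as JSON and returns the result
r.request: the request that was sent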

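Downloading binary files

r.content gives the binary content directly:

    img_resp = requests.get(srcUrl)  # request the image
    with open('img.jpg', mode='wb') as f:
        f.write(img_resp.content)  # write the binary data

The response can also be streamed instead of being loaded all at once. Reading it in chunks copes with large files, because each piece is written to disk as soon as it arrives. When streaming a download, this is the preferred approach.

    chunk_size = 8192  # size of each chunk in bytes
    img_resp = requests.get(srcUrl, stream=True)
    with open('img.jpg', mode='wb') as f:
        for chunk in img_resp.iter_content(chunk_size):
            f.write(chunk)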
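
The chunked pattern itself does not depend on requests. The sketch below imitates it with in-memory file objects from the standard library; iter_chunks is a hypothetical helper, and the BytesIO objects stand in for the streamed response and the output file.

```python
import io


def iter_chunks(stream, chunk_size):
    """Yield successive blocks of at most chunk_size bytes from a file-like object."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


source = io.BytesIO(b"x" * 10_000)  # stands in for a streamed response body
target = io.BytesIO()               # stands in for the output file

sizes = []
for chunk in iter_chunks(source, 4096):
    target.write(chunk)
    sizes.append(len(chunk))

print(sizes, len(target.getvalue()))
```

Only one chunk is held in memory at a time, which is why the same loop shape works for files far larger than the available RAM.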
