A detailed guide to urllib in Python 3 (headers, proxies, timeouts, authentication, exception handling)

Published: 2023/12/9

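This article covers how to use the urllib library in Python 3. urllib is the package in the Python standard library for making network requests. It has four modules: urllib.request, urllib.error, urllib.parse, and urllib.robotparser. Of these, urllib.request and urllib.error are the two used most often in crawlers.

urllib is Python's tool for fetching URLs (Uniform Resource Locators). We can use it to fetch remote data and save it. The notes below cover headers, proxies, timeouts, authentication, and exception handling with urllib.

N ways to fetch web resources in Python 3

1.1 The simplest approach is to call urlopen() from the urllib.request module to fetch the page. The page data is of type bytes and must be decoded with decode() to get a str.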

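The object returned by urlopen provides these methods:

- read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data
- info(): returns an HTTPMessage object holding the headers sent back by the remote server
- getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the URL was not found
- geturl(): returns the URL of the resource that was actually retrieved, which can differ from the requested URL after a redirect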

    import urllib.request

    response = urllib.request.urlopen('http://python.org/')
    html = response.read()       # fetch the page source as bytes
    print(html.decode('utf-8'))  # decode the bytes as UTF-8
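Decoding with a hard-coded 'utf-8' only works when the page really is UTF-8. The following is a minimal sketch of reading the charset from the Content-Type header instead. It assumes a throwaway local test server; the GbkPage handler and the sample text are made up for illustration and are not part of the original article.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

TEXT = "你好, urllib"  # sample page text, purely illustrative


class GbkPage(BaseHTTPRequestHandler):
    """Hypothetical test handler that serves a GBK-encoded page."""

    def do_GET(self):
        body = TEXT.encode("gbk")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=gbk")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass


server = HTTPServer(("127.0.0.1", 0), GbkPage)  # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

with urllib.request.urlopen(url, timeout=5) as response:
    raw = response.read()  # bytes
    # Use the charset the server declared, falling back to UTF-8 if none is given.
    charset = response.headers.get_content_charset() or "utf-8"
    html = raw.decode(charset)  # str

server.shutdown()
server.server_close()
print(charset, html)
```

Falling back to UTF-8 is a design choice for this sketch; pages that declare no charset at all may need a different default.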

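1.2 Setting a request timeout

Some requests never get a response because of network problems, so we can set a timeout by hand. When a request times out we can take further action, such as dropping the request or sending it again.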

    import urllib.request

    url = "http://tieba.baidu.com"
    response = urllib.request.urlopen(url, timeout=1)
    print(response.read().decode('utf-8'))

Another example:

    #!/usr/bin/env python3
    import socket
    import urllib.request

    # timeout in seconds
    timeout = 2
    socket.setdefaulttimeout(timeout)

    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.python.org/')
    a = urllib.request.urlopen(req).read()
    print(a)
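The "drop it or try again" idea can be sketched as follows. This is only an illustration: fetch_with_retry is a hypothetical helper, and the "server" is a local socket that accepts connections but never replies, so every attempt is expected to time out.

```python
import socket
import urllib.error
import urllib.request

# A listening socket that never answers: the operating system completes the
# connection, but no HTTP response is ever sent, so the read times out.
silent = socket.socket()
silent.bind(("127.0.0.1", 0))
silent.listen(5)
url = "http://127.0.0.1:%d/" % silent.getsockname()[1]


def fetch_with_retry(url, timeout=0.3, retries=2):
    """Hypothetical helper: return the body, or None once every attempt failed."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, socket.timeout) as exc:
            # A timeout while connecting arrives wrapped in URLError; a timeout
            # while waiting for the response surfaces as socket.timeout.
            print("attempt %d failed: %r" % (attempt, exc))
    return None


result = fetch_with_retry(url)
silent.close()
print(result)
```

Catching both exception types is deliberate, because which one is raised depends on the phase in which the timeout happens.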

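2. Using Request

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url is the request URL. It is the only required parameter; all the others are optional.
data is used the same way as the data parameter of urlopen().
headers specifies the headers of the HTTP request and is a dictionary. Besides passing it to Request, you can also add headers by calling the add_header() method of the Request instance.
origin_req_host is the host name or IP address of the requesting side.
unverifiable says whether the request is unverifiable; the default is False. An unverifiable request is one whose URL the user had no chance to approve. For example, if the request is for an image inside an HTML document and the user had no option to approve the automatic fetching of that image, unverifiable should be True.
method is the HTTP method of the request, such as GET, POST, DELETE, or PUT.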
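How these parameters interact can be checked without sending anything. The sketch below only builds Request objects and inspects them; the URL is a placeholder and the header values are made up.

```python
import urllib.request

url = "http://example.com/api"  # placeholder URL; nothing is sent

get_req = urllib.request.Request(url, headers={"User-Agent": "demo-agent/1.0"})
get_req.add_header("Referer", "http://example.com/")

post_req = urllib.request.Request(url, data=b"key=value")
put_req = urllib.request.Request(url, data=b"key=value", method="PUT")

print(get_req.get_method())   # GET: no data and no explicit method
print(post_req.get_method())  # POST: data is present and no method was given
print(put_req.get_method())   # PUT: an explicit method takes precedence

# Header names are stored capitalized ("User-agent"), however they were passed in.
print(get_req.get_header("User-agent"))
print(get_req.has_header("Referer"))
```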

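2.1 Basic use of Request

Wrap the request in Request(), then fetch the page with urlopen().

Request lets us present the HTTP request as coming from a browser. If the User-Agent header is not set, the default is Python-urllib/x.y, where x.y is the Python version (for example Python-urllib/3.5). Some sites block such requests, so we disguise the request as a browser request. The User-Agent used here is from Chrome.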

    import urllib.request

    url = "http://tieba.baidu.com/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    print(response.read().decode('utf-8'))

2.2 Sending data

    #!/usr/bin/env python3
    import urllib.parse
    import urllib.request

    url = 'http://localhost/login.php'
    values = {
        'act': 'login',
        'login[email]': 'abc@abc.com',
        'login[password]': '123456'
    }
    # POST data must be bytes in Python 3, so encode the urlencoded string
    data = urllib.parse.urlencode(values).encode('utf-8')
    req = urllib.request.Request(url, data)
    req.add_header('Referer', 'http://www.python.org/')
    response = urllib.request.urlopen(req)
    the_page = response.read()
    print(the_page.decode("utf8"))

Another example:

    from urllib import request, parse

    url = r'http://www.lagou.com/jobs/positionAjax.json?'
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
        'Connection': 'keep-alive'
    }
    data = {
        'first': 'true',
        'pn': 1,
        'kd': 'Python'
    }
    data = parse.urlencode(data).encode('utf-8')
    req = request.Request(url, headers=headers, data=data)
    page = request.urlopen(req).read()
    page = page.decode('utf-8')

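data has to be converted to bytes, but here it starts out as a dictionary. We first use urllib.parse.urlencode() to turn the dictionary into a string, then turn that string into bytes with encode() (bytes() works as well). Finally we send the request with urlopen(); because data is present, the request imitates submitting form data by POST.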
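The two conversion steps can be looked at in isolation, with no request involved. This is a small sketch; the form values are illustrative.

```python
from urllib import parse

form = {"first": "true", "pn": 1, "kd": "Python 3"}  # illustrative form fields

query = parse.urlencode(form)  # dict -> percent-encoded str
data = query.encode("utf-8")   # str -> bytes, the type Request(data=...) expects

print(query)  # first=true&pn=1&kd=Python+3
print(type(data).__name__)  # bytes

# Decoding reverses the steps; every value comes back as a list of strings.
decoded = parse.parse_qs(data.decode("utf-8"))
print(decoded)
```

Note that the integer 1 is sent as the text "1", and the space in "Python 3" becomes "+".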

2.3 Sending data and headers

    #!/usr/bin/env python3
    import urllib.parse
    import urllib.request

    url = 'http://localhost/login.php'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {
        'act': 'login',
        'login[email]': 'abc@abc.com',
        'login[password]': '123456'
    }
    headers = {'User-Agent': user_agent}
    # POST data must be bytes in Python 3, so encode the urlencoded string
    data = urllib.parse.urlencode(values).encode('utf-8')
    req = urllib.request.Request(url, data, headers)
    response = urllib.request.urlopen(req)
    the_page = response.read()
    print(the_page.decode("utf8"))

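3. Advanced use of Request

To add a proxy to a request or to handle its cookies, we need Handler and OpenerDirector.

1) Handler
A handler takes care of one aspect of a request (HTTP, HTTPS, FTP, and so on). The concrete class is urllib.request.BaseHandler. It is the base class of all handlers and provides the most basic handler methods, such as default_open() and protocol_request().
Many classes inherit from BaseHandler. Here are some of the common ones, together with one helper class they rely on:

ProxyHandler: sets a proxy for the request.
HTTPCookieProcessor: handles the cookies of HTTP requests.
HTTPDefaultErrorHandler: handles HTTP error responses.
HTTPRedirectHandler: handles HTTP redirects.
HTTPPasswordMgr: manages passwords by keeping a table of user names and passwords. It is a helper class rather than a BaseHandler subclass.
HTTPBasicAuthHandler: handles login authentication, usually together with HTTPPasswordMgr.

2) OpenerDirector
An OpenerDirector can simply be called an opener. The urlopen() function we used earlier is in fact a ready-made opener that urllib provides. How do openers relate to handlers? An opener object is created by calling build_opener(handler). To make urlopen() use a custom opener, call install_opener(opener). Note that install_opener() makes the opener the global default OpenerDirector; calling opener.open() directly does not require installing it.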
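The difference between using an opener directly and installing it globally can be observed with a small experiment. This is a sketch that assumes a throwaway local server; the EchoAgent handler and the "demo-opener/1.0" agent string are made up for illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoAgent(BaseHTTPRequestHandler):
    """Hypothetical test handler: replies with the User-Agent it received."""

    def do_GET(self):
        body = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass


server = HTTPServer(("127.0.0.1", 0), EchoAgent)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

opener = urllib.request.build_opener()  # default handlers only
opener.addheaders = [("User-Agent", "demo-opener/1.0")]  # made-up agent string

# 1) Use the opener directly: only this call is affected.
via_opener = opener.open(url, timeout=5).read().decode("utf-8")

# 2) Plain urlopen() still uses the library default at this point.
before_install = urllib.request.urlopen(url, timeout=5).read().decode("utf-8")

# 3) After install_opener(), urlopen() goes through our opener as well.
urllib.request.install_opener(opener)
after_install = urllib.request.urlopen(url, timeout=5).read().decode("utf-8")

server.shutdown()
server.server_close()
print(via_opener, before_install, after_install)
```

Because install_opener() changes process-wide state, calling opener.open() directly is often the safer choice in library code.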

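3.1 Using a proxy

Now that we know about openers and handlers, let's work through some examples. The first example sets a proxy for HTTP requests.
Some sites limit how often they can be visited. If we send requests too frequently, the site bans our IP address and refuses further access, so we use a proxy to get past this restriction.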

    import urllib.request

    url = "http://tieba.baidu.com/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
    }

    proxy_handler = urllib.request.ProxyHandler({
        'http': 'web-proxy.oa.com:8080',
        'https': 'web-proxy.oa.com:8080'
    })
    opener = urllib.request.build_opener(proxy_handler)
    urllib.request.install_opener(opener)

    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    print(response.read().decode('utf-8'))

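Another example:

    #!/usr/bin/env python3
    import urllib.request

    # The dictionary keys are the URL schemes to be proxied ('http', 'https').
    # The original key 'sock5' matches no request scheme, so it would be ignored;
    # urllib has no built-in SOCKS support, so an HTTP proxy is assumed here.
    proxy_support = urllib.request.ProxyHandler({'http': 'localhost:1080'})
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)

    a = urllib.request.urlopen("http://www.python.org/").read().decode("utf8")
    print(a)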
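To see that a request really is handed to the proxy, a local stand-in can be used. The sketch below is not a real proxy: FakeProxy is a hypothetical handler that only reports the URL it was asked to fetch, and example.invalid is a host name that never resolves.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakeProxy(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a proxy: it reports the URL it was asked for."""

    def do_GET(self):
        # For a proxied HTTP request the request line carries the full URL.
        body = self.path.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass


proxy = HTTPServer(("127.0.0.1", 0), FakeProxy)
threading.Thread(target=proxy.serve_forever, daemon=True).start()
proxy_url = "http://127.0.0.1:%d" % proxy.server_port

handler = urllib.request.ProxyHandler({"http": proxy_url})
opener = urllib.request.build_opener(handler)

# example.invalid never resolves; the request only succeeds because it is
# handed to the proxy instead of being sent to that host directly.
seen_by_proxy = opener.open("http://example.invalid/page", timeout=5).read().decode("utf-8")

proxy.shutdown()
proxy.server_close()
print(seen_by_proxy)
```

The opener is used directly here rather than installed, so other code that calls urlopen() is not rerouted.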

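3.2 Authenticated login

Some sites require a user name and password before they let you browse further. For such sites we need authentication. First create a password manager object with HTTPPasswordMgrWithDefaultRealm(), then add the user name and password with add_password(), then build a handler with HTTPBasicAuthHandler(), then get an opener from build_opener(), and finally send the request with the opener's open() function.

The second example requests Baidu Tieba with a user name and password: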

    import urllib.request

    url = "http://tieba.baidu.com/"
    user = 'user'
    password = 'password'

    pwdmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    pwdmgr.add_password(None, url, user, password)
    auth_handler = urllib.request.HTTPBasicAuthHandler(pwdmgr)
    opener = urllib.request.build_opener(auth_handler)
    response = opener.open(url)
    print(response.read().decode('utf-8'))

HTTP authentication

    #!/usr/bin/env python3
    import urllib.request

    # create a password manager
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of None.
    top_level_url = "https://www.python.org/"
    password_mgr.add_password(None, top_level_url, 'rekfan', 'xxxxxx')
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib.request.build_opener(handler)

    # use the opener to fetch a URL
    a_url = "https://www.python.org/"
    x = opener.open(a_url)
    print(x.read())

    # Install the opener.
    # Now all calls to urllib.request.urlopen use our opener.
    urllib.request.install_opener(opener)
    a = urllib.request.urlopen(a_url).read().decode('utf8')
    print(a)
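The examples above target public sites that do not actually ask for Basic authentication, so the handler never has to do anything. The sketch below makes the 401 challenge visible by using a local test server; the Protected handler, the "demo" realm, and the credentials are assumptions made for illustration.

```python
import base64
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

USER, PASSWORD = "user", "password"  # demo credentials only
EXPECTED = "Basic " + base64.b64encode(("%s:%s" % (USER, PASSWORD)).encode()).decode()


class Protected(BaseHTTPRequestHandler):
    """Hypothetical test handler guarding every path with HTTP Basic auth."""

    def do_GET(self):
        if self.headers.get("Authorization") == EXPECTED:
            body, status = b"welcome", 200
        else:
            body, status = b"login required", 401
        self.send_response(status)
        if status == 401:
            self.send_header("WWW-Authenticate", 'Basic realm="demo"')
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass


server = HTTPServer(("127.0.0.1", 0), Protected)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# Without credentials the server answers 401, which urlopen raises as HTTPError.
try:
    urllib.request.urlopen(url, timeout=5)
    anonymous_status = 200
except urllib.error.HTTPError as e:
    anonymous_status = e.code

# With the auth handler, the 401 challenge is answered automatically.
pwdmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
pwdmgr.add_password(None, url, USER, PASSWORD)
opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(pwdmgr))
page = opener.open(url, timeout=5).read().decode("utf-8")

server.shutdown()
server.server_close()
print(anonymous_status, page)
```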

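3.3 Working with cookies

If the requested page asks for authentication every time, we can use cookies to log in automatically and skip the repeated login step. To capture cookies, create a cookie container with http.cookiejar.CookieJar(), build a handler object from it with urllib.request.HTTPCookieProcessor, and then call the opener's open() function.

The third example fetches the cookies of a request to Baidu Tieba and saves them to a file: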

    import http.cookiejar
    import urllib.request

    url = "http://tieba.baidu.com/"
    fileName = 'cookie.txt'

    cookie = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open(url)

    # append each cookie to the file as "name = value"
    with open(fileName, 'a') as f:
        for item in cookie:
            f.write(item.name + " = " + item.value + '\n')
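The "automatic login" effect comes from the opener sending stored cookies back on later requests. Here is a minimal sketch of that round trip, assuming a local test server; the CookieDemo handler and the session=abc123 cookie are made up for illustration.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieDemo(BaseHTTPRequestHandler):
    """Hypothetical test handler: sets a cookie and echoes the one it receives."""

    def do_GET(self):
        body = self.headers.get("Cookie", "no cookie").encode("utf-8")
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass


server = HTTPServer(("127.0.0.1", 0), CookieDemo)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

first = opener.open(url, timeout=5).read().decode("utf-8")   # the jar is still empty
second = opener.open(url, timeout=5).read().decode("utf-8")  # the jar sends the cookie back

server.shutdown()
server.server_close()
stored = {cookie.name: cookie.value for cookie in jar}
print(first, second, stored)
```

The same jar has to be reused for every request that should share the session; a fresh CookieJar starts empty.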

4. HTTPResponse

As the examples above show, urllib.request.urlopen() and opener.open(url) both return an http.client.HTTPResponse object. It has attributes such as msg, version, status, reason, debuglevel, and closed, and functions such as read(), readinto(), getheader(name), getheaders(), and fileno().
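A few of these attributes and functions are read in the sketch below. It assumes a local test server; the Hello handler and the X-Demo header are made up for illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Hello(BaseHTTPRequestHandler):
    """Hypothetical test handler returning a short body and a custom header."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("X-Demo", "1")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass


server = HTTPServer(("127.0.0.1", 0), Hello)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

with urllib.request.urlopen(url, timeout=5) as response:
    status = response.status                    # numeric status code
    reason = response.reason                    # reason phrase sent by the server
    demo_header = response.getheader("X-Demo")  # a single header by name
    header_names = [name for name, _ in response.getheaders()]
    body = response.read()
closed_after = response.closed                  # the with block closed the response

server.shutdown()
server.server_close()
print(status, reason, demo_header, header_names, body, closed_after)
```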

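5. Handling errors

Requests inevitably run into exceptions of various kinds. Handling them makes the program friendlier to use.
Two classes do most of the work: urllib.error.URLError and urllib.error.HTTPError.

URLError
URLError is the base class of the exceptions in urllib.error and can catch the exceptions raised by urllib.request.
It has one attribute, reason, which gives the cause of the error.
Sample code that catches URL errors: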

    from urllib import request, error

    url = "http://www.google.com"
    try:
        response = request.urlopen(url)
    except error.URLError as e:
        print(e.reason)

Author: 猴哥Yuri. Link: https://www.jianshu.com/p/2e190438bd9c. Source: Jianshu (簡書). Copyright belongs to the author; for any form of reprint, contact the author for authorization and credit the source.

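HTTPError
HTTPError is a subclass of URLError that deals specifically with errors of HTTP and HTTPS requests. It has three attributes:
1) code: the status code returned by the HTTP request.
2) reason: used as in the parent class; the cause of the error.
3) headers: the response headers returned by the HTTP request.

Sample code that catches HTTP errors and prints the status code, the reason, and the server's response headers: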

    from urllib import request, error

    url = "http://www.google.com"
    try:
        response = request.urlopen(url)
    except error.HTTPError as e:
        # e.code is an int and e.headers is a message object, so convert before joining
        print('code: ' + str(e.code) + '\n')
        print('reason: ' + str(e.reason) + '\n')
        print('headers: ' + str(e.headers) + '\n')

HTTP errors

    #!/usr/bin/env python3
    import urllib.error
    import urllib.request

    req = urllib.request.Request('http://python.org/')
    try:
        urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        print(e.code)
        print(e.read().decode("utf8"))
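The snippets above only print something if the site happens to return an error. To see the three attributes with a guaranteed failure, here is a sketch against a local test server that always answers 404; the Missing handler and its response text are made up for illustration.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Missing(BaseHTTPRequestHandler):
    """Hypothetical test handler that answers every request with 404."""

    def do_GET(self):
        body = b"nothing here"
        self.send_response(404)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass


server = HTTPServer(("127.0.0.1", 0), Missing)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

try:
    urllib.request.urlopen(url, timeout=5)
    outcome = None
except urllib.error.HTTPError as e:
    # The exception doubles as a response object, so the error page can be read.
    outcome = (e.code, e.reason, e.headers.get("Content-Type"), e.read().decode("utf-8"))

server.shutdown()
server.server_close()
print(outcome)
```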

6. Exception handling, example 1

    #!/usr/bin/env python3
    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError

    req = Request('http://www.python.org/')
    try:
        response = urlopen(req)
    except HTTPError as e:
        print("The (www.python.org) server couldn't fulfill the request.")
        print('Error code: ', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    else:
        print("good!")
        print(response.read().decode("utf8"))

Exception handling, example 2

    #!/usr/bin/env python3
    from urllib.request import Request, urlopen
    from urllib.error import URLError

    req = Request("http://www.python.org/")
    try:
        response = urlopen(req)
    except URLError as e:
        # HTTPError has both 'code' and 'reason', so test for 'code' first
        if hasattr(e, 'code'):
            print("The server couldn't fulfill the request.")
            print('Error code: ', e.code)
        elif hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
    else:
        print("good!")
        print(response.read().decode("utf8"))
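Both examples depend on HTTPError being a subclass of URLError, which is why the more specific class has to be tested first. The sketch below wraps that ordering in a hypothetical helper, classify, and tries it on a local port where nothing is listening, so the URLError branch is the one expected to run.

```python
import socket
import urllib.error
import urllib.request


def classify(url):
    """Hypothetical helper: describe how a request to `url` ended."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return "ok %d" % response.status
    except urllib.error.HTTPError as e:  # the subclass comes first
        return "http error %d" % e.code
    except urllib.error.URLError as e:   # then the base class
        return "unreachable: %s" % e.reason


# Reserve a free port, then close it so that nothing is listening there.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
dead_url = "http://127.0.0.1:%d/" % probe.getsockname()[1]
probe.close()

is_subclass = issubclass(urllib.error.HTTPError, urllib.error.URLError)
verdict = classify(dead_url)
print(is_subclass)
print(verdict)
```

If the two except clauses were swapped, the URLError clause would also catch every HTTPError and the status code would never be reported.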
