Using the Basic Crawler Libraries: the urllib Library

Published: 2025/5/22


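Using urllib, Python's built-in HTTP request library
- urllib contains four modules: request, error, parse, and robotparser
  - Sending requests
    - Use urllib's request module to send a request and receive the response
    - urlopen()
      - Send a request with the urlopen() function from urllib.request
      - Input:

            import urllib.request

            # Send a request to the given URL; the server's response is returned as a file-like object
            response = urllib.request.urlopen('https://www.python.org')
            # read() returns the entire response body as bytes
            html = response.read()
            # decode() turns those bytes into a str using the given encoding
            print(html.decode('utf-8'))

        This prints the HTML source of the page.
        Method used: decode() returns the decoded string; its counterpart for encoding is encode().
        Note: when called with a bare URL string, urlopen() cannot customize the HTTP request. No headers can be added, so the request cannot imitate a real browser.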
        
        type() shows the type of the response:

            import urllib.request

            response = urllib.request.urlopen('https://www.python.org')
            print(type(response))

            # Output:
            # <class 'http.client.HTTPResponse'>

        The response is an HTTPResponse object. Its main methods are read(), readinto(), getheader(name), getheaders(), and fileno(); its main attributes are msg, version, status, reason, debuglevel, and closed.
      - Example (some of the methods and attributes):

            import urllib.request

            response = urllib.request.urlopen('https://www.python.org')
            print(response.status)               # status: the response status code; 200 means success
            print(response.getheaders())         # getheaders(): all response headers
            print(response.getheader('Server'))  # getheader(name): the value of one response header

            # Output:
            # 200
            # [('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'),
            #  ('Via', '1.1 vegur'), ('Via', '1.1 varnish'), ('Content-Length', '49425'), ('Accept-Ranges', 'bytes'),
            #  ('Date', 'Fri, 14 Jun 2019 04:36:05 GMT'), ('Via', '1.1 varnish'), ('Age', '569'), ('Connection', 'close'),
            #  ('X-Served-By', 'cache-iad2125-IAD, cache-hnd18748-HND'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '3, 736'),
            #  ('X-Timer', 'S1560486966.523393,VS0,VE0'), ('Vary', 'Cookie'),
            #  ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
            # nginx
      - urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)
        
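        Important parameters:
        url: either the URL to request or a Request object;
        data: extra data sent to the server with the request (for example a user name and password);
        timeout: how long to wait, in seconds, before an error is raised;

        The data parameter
        data must be a bytes object, so the form fields have to be converted to bytes first, for example with bytes():

            import urllib.parse
            import urllib.request

            data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
            response = urllib.request.urlopen('http://httpbin.org/post', data=data)
            print(response.read())

        The site requested here, httpbin.org, provides endpoints for testing HTTP requests.
        The URL in this example, http://httpbin.org/post, is for testing POST requests:
        it echoes back information about the request, including the data we sent.
        Other functions used in the code:

        urlencode() in the urllib.parse module converts a dictionary of parameters into a query string.
        bytes() returns a new immutable sequence of bytes; every element must be in the range 0 to 255. It behaves like bytearray(), except that the result cannot be modified.

            # bytes([source[, encoding[, errors]]])
            # When encoding is given, source must be a str
            # encoding names the codec used to encode the string
            # With no arguments, the result is an empty bytes object
            byte = bytes('LiYihua', encoding='utf-8')
            print(byte)

            # Output:
            # b'LiYihua'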
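
The conversion that the data parameter needs can be inspected without sending anything. The following is a minimal sketch rather than part of the original example: `encode_form` is a hypothetical helper name, and the Request objects are only constructed, never sent.

```python
from urllib import parse, request


def encode_form(fields):
    """Turn a dict of form fields into the bytes that the data parameter expects."""
    # urlencode() builds the 'key=value&key=value' string; encode() makes it bytes
    return parse.urlencode(fields).encode('utf-8')


data = encode_form({'word': 'hello'})
print(data)  # b'word=hello'

# Attaching data switches the default method from GET to POST
with_data = request.Request('http://httpbin.org/post', data=data)
without_data = request.Request('http://httpbin.org/get')
print(with_data.get_method(), without_data.get_method())  # POST GET
```

Calling `str.encode('utf-8')` is equivalent to the `bytes(..., encoding='utf-8')` form used above; either one satisfies the requirement that data be bytes.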
        The timeout parameter
        timeout sets the timeout in seconds: if the request takes longer than this without a response, an exception is raised.
        Example 1:
        After 0.1 s with no response from the server, the program raises a URLError (the cause is the timeout).

        Example 2:

            import socket
            import urllib.request
            import urllib.error

            try:
                response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)  # 0.1 s timeout
            except urllib.error.URLError as e:
                # e.reason holds the cause of the error; socket.timeout means the request timed out
                if isinstance(e.reason, socket.timeout):
                    print('TIME OUT')

            # Output:
            # TIME OUT

        In these handlers:
        e is the exception object that was caught
        e.code is the HTTP status code (available on HTTPError only)
        e.reason is the cause of the error
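
The timeout handling can be tried without depending on an external site. The sketch below is an illustration under stated assumptions: `fetch_status` is a hypothetical helper, and the "slow server" is simulated by a local socket that accepts connections but never replies. In CPython a timeout during connection setup arrives wrapped in URLError, while a timeout while waiting for the response may be raised as socket.timeout directly, so the helper catches both.

```python
import socket
import urllib.error
import urllib.request


def fetch_status(url, timeout):
    """Return the status code, or 'TIME OUT' if the server is too slow."""
    # An empty ProxyHandler keeps the request off any proxy configured in the environment
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.URLError as e:
        # Timeouts while connecting arrive wrapped in URLError
        if isinstance(e.reason, socket.timeout):
            return 'TIME OUT'
        raise
    except socket.timeout:
        # Timeouts while waiting for the response can be raised unwrapped
        return 'TIME OUT'


# A local socket that accepts connections but never answers (stand-in for a slow server)
silent_server = socket.socket()
silent_server.bind(('127.0.0.1', 0))
silent_server.listen(1)
port = silent_server.getsockname()[1]

result = fetch_status(f'http://127.0.0.1:{port}/', timeout=0.2)
print(result)
silent_server.close()
```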
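        Other parameters
        context: must be an ssl.SSLContext object; it specifies the SSL settings
        cafile and capath: a CA certificate file and a directory of CA certificates, respectively; useful when requesting HTTPS URLs (deprecated since Python 3.6 in favor of context)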
      - Request

        urlopen() can send the most basic requests; a Request object gives more control over the request.
        Request example:

            import urllib.request

            request = urllib.request.Request('https://python.org')  # wrap the request in its own object
            response = urllib.request.urlopen(request)               # still sent with urlopen()

            print(response.read().decode('utf-8'))

            # Output (abridged):
            # <!doctype html>
            # <html class="no-js" lang="en" dir="ltr">
            # <head>
            # <meta charset="utf-8">
            # ...
            # </body>
            # </html>
        
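        class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
        url: the URL to request
        data: the data to submit with POST; defaults to None. When data is not None, urlopen() sends the request as a POST
        headers: the request headers. They can be passed when the Request is constructed or added afterwards with add_header()
        The most common use of request headers is changing User-Agent to imitate a browser (for example, to imitate Firefox:
        Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
        )
        origin_req_host: the host name or IP address of the party making the request

        unverifiable: whether the request is unverifiable; defaults to False. An unverifiable request is one whose URL the user had no option to approve. For example, when an image embedded in an HTML document is fetched automatically and the user never approved fetching it, unverifiable should be True.

        method: a string naming the HTTP method to use (GET, POST, PUT, and so on)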
        Example:

            from urllib import request, parse

            url = 'http://httpbin.org/post'  # the URL to request

            headers = {
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
                'Host': 'httpbin.org'
            }  # request headers: User-Agent and Host

            form = {
                'name': 'Germey'
            }  # the data to submit
            # urlencode() turns the dict into a str, and bytes() encodes that str as bytes
            data = bytes(parse.urlencode(form), encoding='utf-8')

            # Build the request from four arguments
            req = request.Request(url=url, data=data, headers=headers, method='POST')

            response = request.urlopen(req)         # send the request
            print(response.read().decode('utf-8'))  # decode the response body as UTF-8

            # Output:
            # {
            #   "args": {},
            #   "data": "",
            #   "files": {},
            #   "form": {
            #     "name": "Germey"
            #   },
            #   "headers": {
            #     "Accept-Encoding": "identity",
            #     "Content-Length": "11",
            #     "Content-Type": "application/x-www-form-urlencoded",
            #     "Host": "httpbin.org",
            #     "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
            #   },
            #   "json": null,
            #   "origin": "219.224.169.11",
            #   "url": "http://httpbin.org/post"
            # }

        Adding headers with add_header():

            req = request.Request(url=url, data=data, method='POST')
            req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
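
Headers attached to a Request can be checked before anything is sent. The sketch below only constructs the object; the URL and form body are reused from the example for illustration. One detail worth knowing is that Request stores header names in capitalized form, so the lookup key is `'User-agent'`.

```python
from urllib import request

req = request.Request('http://httpbin.org/post', data=b'name=Germey', method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

# Request normalizes header names with str.capitalize(), so the key is stored as 'User-agent'
print(req.has_header('User-agent'))
print(req.get_header('User-agent'))
print(req.header_items())
```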
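      - Advanced usage

        Request can construct requests, but more advanced operations (handling cookies, setting proxies, and so on) need a more powerful tool: Handlers.

        BaseHandler

        The Handler subclasses all inherit from BaseHandler.
        Some of the classes involved:

        HTTPDefaultErrorHandler: handles HTTP error responses; errors are raised as HTTPError exceptions.

        HTTPRedirectHandler: handles redirects.

        HTTPCookieProcessor: handles cookies.

        ProxyHandler: sets proxies; when no mapping is given, proxy settings are read from the environment.

        HTTPPasswordMgr: manages passwords; it keeps a table of user names and passwords. It is a helper used by the authentication handlers rather than a handler itself.

        HTTPBasicAuthHandler: manages authentication; if opening a URL requires authentication, this handler can supply it.
        Official documentation for the Handler classes: https://docs.python.org/3/library/urllib.request.html#urllib.request.BaseHandler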
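
Which of these classes are real handlers can be checked directly with issubclass(). This is a small illustrative check, not something the original article includes; the `is_handler` dictionary name is made up for the example.

```python
from urllib.request import (
    BaseHandler, HTTPBasicAuthHandler, HTTPCookieProcessor,
    HTTPDefaultErrorHandler, HTTPPasswordMgr, HTTPRedirectHandler, ProxyHandler,
)

classes = [HTTPDefaultErrorHandler, HTTPRedirectHandler, HTTPCookieProcessor,
           ProxyHandler, HTTPPasswordMgr, HTTPBasicAuthHandler]

# Map each class name to whether it inherits from BaseHandler
is_handler = {cls.__name__: issubclass(cls, BaseHandler) for cls in classes}
for name, flag in is_handler.items():
    print(f'{name}: {flag}')
```

Only HTTPPasswordMgr reports False: it is passed to an authentication handler, while the others can be handed to build_opener() themselves.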
        
        Authentication
        Some sites ask for a user name and password before a page can be viewed. HTTPBasicAuthHandler can handle this:

            from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
            from urllib.error import URLError

            username = 'username'
            password = 'password'
            url = 'http://localhost:5000/'

            p = HTTPPasswordMgrWithDefaultRealm()          # stores user names and passwords for HTTP requests
            p.add_password(None, url, username, password)  # register the URL, user name, and password
            auth_handler = HTTPBasicAuthHandler(p)         # handles HTTP basic authentication
            opener = build_opener(auth_handler)            # build an opener with build_opener()

            try:
                result = opener.open(url)             # open the URL with the opener, which performs the authentication
                html = result.read().decode('utf-8')  # read and decode the response
                print(html)
            except URLError as e:
                print(e.reason)                       # the cause of the error

        Change username, password, and url to crawl the site you want.
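
What the password manager stores can be inspected offline. The sketch below assumes the same placeholder URL and credentials as the example; `'Some Realm'` and `http://example.com/` are made-up values standing in for whatever realm and URL a server might present.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

url = 'http://localhost:5000/'
manager = HTTPPasswordMgrWithDefaultRealm()
manager.add_password(None, url, 'username', 'password')  # None = default realm

# The auth handler performs this kind of lookup when a server asks for credentials
same_url = manager.find_user_password('Some Realm', url)
sub_path = manager.find_user_password('Some Realm', url + 'admin/page')
other_host = manager.find_user_password('Some Realm', 'http://example.com/')

print(same_url)
print(sub_path)
print(other_host)
```

Because the entry was registered under the default realm, it matches any realm name for that URL and for paths beneath it, but not for other hosts.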
        
        Proxies

            from urllib.error import URLError
            from urllib.request import ProxyHandler, build_opener

            # ProxyHandler takes a dict: each key is a protocol (such as http or https)
            # and each value is a proxy URL; several proxies can be listed
            proxy_handler = ProxyHandler({
                'http': 'http://127.0.0.1:9743',
                'https': 'https://127.0.0.1:9743'
            })
            opener = build_opener(proxy_handler)  # build an opener with build_opener()

            try:
                response = opener.open('https://www.baidu.com')  # send the request
                print(response.read().decode('utf-8'))
            except URLError as e:
                print(e.reason)
        
        Cookies

        Crawling sites that require a login calls for the cookie-related modules.

        http.cookiejar.CookieJar()

            import http.cookiejar
            import urllib.request

            # http.cookiejar.CookieJar
            # 1. Stores cookies and adds them to outgoing HTTP requests
            # 2. Keeps cookies in memory; they disappear once the CookieJar instance is garbage-collected

            cookie = http.cookiejar.CookieJar()                   # create a CookieJar instance
            handler = urllib.request.HTTPCookieProcessor(cookie)  # build a cookie handler around it
            opener = urllib.request.build_opener(handler)
            response = opener.open('http://www.baidu.com')
            for item in cookie:
                print(item.name + "=" + item.value)

            # Output:
            # BAIDUID=FB2B1F3E51F9DD2626C586989E016F7B:FG=1
            # BIDUPSID=FB2B1F3E51F9DD2626C586989E016F7B
            # H_PS_PSSID=29272_1443_21084_29135_29238_28519_29098_29369_28839_29221_20718
            # PSTM=1560654641
            # delPer=0
            # BDSVRTM=0
            # BD_HOME=0
        
        http.cookiejar.MozillaCookieJar()

        This class is used when writing cookies to a file. It handles cookie-related file operations such as loading and saving, and stores cookies in the cookies.txt format used by Mozilla browsers.

            import http.cookiejar
            import urllib.request

            # http.cookiejar.MozillaCookieJar
            # 1. A subclass of FileCookieJar
            # 2. Compatible with the cookies.txt format of Mozilla browsers

            file_name = 'cookies.txt'
            cookie = http.cookiejar.MozillaCookieJar(file_name)    # create the cookie jar
            handler = urllib.request.HTTPCookieProcessor(cookie)   # build a cookie handler around it
            opener = urllib.request.build_opener(handler)
            response = opener.open('http://www.baidu.com')
            cookie.save(ignore_discard=True, ignore_expires=True)  # write the cookies to the file

            # Running this creates cookies.txt, which holds the saved cookies
        
        http.cookiejar.LWPCookieJar()
        LWPCookieJar can also save cookies, writing them as a libwww-perl (LWP) format cookie file.
        LWPCookieJar:
        is a subclass of FileCookieJar
        is compatible with the libwww-perl format
        To use it, change one line of the previous example from

            cookie = http.cookiejar.MozillaCookieJar(file_name)

        to

            cookie = http.cookiejar.LWPCookieJar(file_name)

        Running it again creates cookies.txt, this time in LWP format.
        
        Loading and using a saved cookies file

        For example, to open a file in LWP format:

            import http.cookiejar
            import urllib.request

            cookie = http.cookiejar.LWPCookieJar()                                # create the cookie jar
            cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)  # load() reads the local cookies file
            handler = urllib.request.HTTPCookieProcessor(cookie)                  # build a cookie handler around it
            opener = urllib.request.build_opener(handler)                         # build an opener with build_opener()
            response = opener.open('http://www.baidu.com')                        # send the request with the opener
            print(response.read().decode('utf-8'))                                # read and decode the response

        If everything works, this prints the source of the Baidu home page.
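
The save and load steps can be exercised end to end without an external site. The sketch below is an assumption-laden illustration: `CookieHandler` is a hypothetical local test server that sets a single cookie named `session`, and the cookie file lives in a temporary directory rather than in `cookies.txt`.

```python
import http.cookiejar
import os
import tempfile
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: every response sets one session cookie."""

    def do_GET(self):
        body = b'ok'
        self.send_response(200)
        self.send_header('Set-Cookie', 'session=abc123; Path=/')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), CookieHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f'http://127.0.0.1:{server.server_port}/'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cookies.txt')

    # 1. Collect cookies from the response and save them in LWP format
    jar = http.cookiejar.LWPCookieJar(path)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # bypass any proxy set in the environment
        urllib.request.HTTPCookieProcessor(jar),
    )
    opener.open(url).read()
    # A cookie without an expiry is a session cookie; ignore_discard=True saves it anyway
    jar.save(ignore_discard=True, ignore_expires=True)

    # 2. Load the file into a fresh jar, as a later run of the crawler would
    loaded = http.cookiejar.LWPCookieJar()
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    cookies = {c.name: c.value for c in loaded}

server.shutdown()
server.server_close()
print(cookies)
```

Without `ignore_discard=True` the session cookie would be skipped on save, which is why the examples above pass it to both save() and load().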
  - Handling exceptions
    - URLError

          from urllib import request, error

          try:
              response = request.urlopen('https://www.bucunzai_tan90.com/index.htm')
              print(response.read().decode('utf8'))
          except error.URLError as e:
              print(e.reason)

          # For a page that does not exist, the output is: [Errno 8] nodename nor servname provided, or not known
          # For a page that exists, the output is the page source
    - HTTPError
      - A subclass of URLError dedicated to HTTP request errors, such as a failed authentication request

        code: the HTTP status code, for example 404 (page not found) or 500 (internal server error).

        reason: as in the parent class, the cause of the error.

        headers: the response headers.

            from urllib import request, error

            try:
                response = request.urlopen('https://cuiqingcai.com/index.htm')
                print(response.read().decode('utf8'))
            except error.HTTPError as e:
                # sep sets the separator printed between the arguments
                print(e.reason, e.code, e.headers, sep='\n\n')

            # Output:
            # Not Found
            #
            # 404
            #
            # Server: nginx/1.10.3 (Ubuntu)
            # Date: Sun, 16 Jun 2019 10:53:09 GMT
            # Content-Type: text/html; charset=UTF-8
            # Transfer-Encoding: chunked
            # Connection: close
            # Set-Cookie: PHPSESSID=vrvrfqq88eck9speankj0ogus0; path=/
            # Pragma: no-cache
            # Vary: Cookie
            # Expires: Wed, 11 Jan 1984 05:00:00 GMT
            # Cache-Control: no-cache, must-revalidate, max-age=0
            # Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"

        index.html is usually a site's home page (also called the navigation page): the page that carries the site's basic links.

            # Better: handle the subclass first, then the parent class, then the normal case

            from urllib import request, error

            try:
                response = request.urlopen('https://cuiqingcai.com/index.htm')
                # print(response.read().decode('utf8'))
            except error.HTTPError as e:  # the HTTPError subclass
                print(e.reason, e.code, e.headers, sep='\n\n')
            except error.URLError as e:   # the URLError parent class
                print(e.reason)
            else:                         # the normal case
                print('Request Successful')

        Note that reason is not always a string; it can also be an object, such as an instance of socket.timeout.
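
The order of the except clauses can be checked without a network connection by raising the exceptions by hand. In this sketch `describe_failure`, `raise_404`, and `raise_unreachable` are hypothetical names, and the URL is a placeholder that is never requested.

```python
import io
from urllib import error


def describe_failure(action):
    """Run action() and report which branch of the try statement handled it."""
    try:
        action()
    except error.HTTPError as e:  # subclass first
        return f'HTTP error {e.code}: {e.reason}'
    except error.URLError as e:   # then the parent class
        return f'URL error: {e.reason}'
    else:
        return 'Request successful'


def raise_404():
    # HTTPError(url, code, msg, hdrs, fp); the values here are made up for the demo
    raise error.HTTPError('http://example.invalid/missing', 404, 'Not Found', {}, io.BytesIO())


def raise_unreachable():
    raise error.URLError('host not found')


print(describe_failure(raise_404))
print(describe_failure(raise_unreachable))
print(describe_failure(lambda: None))
```

If the two except clauses were swapped, the URLError branch would also catch every HTTPError and the status code would never be reported.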
    - Parsing URLs
      - urllib.parse defines a standard interface for working with URLs
      - It handles URLs for schemes including file, ftp, hdl, https, imap, mms, news, prospero, and telnet
      - urlparse()

        Identifies a URL and splits it into its parts:

            from urllib.parse import urlparse

            # Split a URL into its components
            result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
            print(type(result), result, sep='\n')  # result is a ParseResult, a named tuple

            # Output:
            # <class 'urllib.parse.ParseResult'>
            # ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

            # scheme: protocol, netloc: domain, path: access path, params: parameters,
            # query: query string (after ?), fragment: anchor (after #)

        Standard URL format: scheme://netloc/path;params?query#fragment

        urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

        urlstring: the URL to parse.
        scheme: the default scheme, used only when the URL itself has none; a scheme present in the URL takes precedence.

        allow_fragments: whether to recognize the fragment. When False, the fragment is not split off and stays part of the preceding component (path, params, or query).
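
The effect of the scheme and allow_fragments arguments is easiest to see side by side. A minimal sketch, reusing the article's sample URL; the variable names are chosen only for this illustration.

```python
from urllib.parse import urlparse

# No scheme in the URL: the scheme argument supplies the default.
# Without a leading '//', the host name ends up in path and netloc stays empty.
no_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(no_scheme.scheme, repr(no_scheme.netloc), no_scheme.path)

# Scheme present in the URL: the scheme argument is ignored
has_scheme = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(has_scheme.scheme)

# allow_fragments=False: '#comment' is left inside the query
no_fragment = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(no_fragment.query, repr(no_fragment.fragment))
```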
      - urlunparse()

        Builds a URL from its parts:

            from urllib.parse import urlunparse

            # urlunparse() takes an iterable that must have exactly 6 elements.
            # Lists are used here; a tuple or another suitable sequence also works.
            data1 = ['http', 'www.baidu.com', '/index.html', 'user', 'id=5', 'comment']
            data2 = ['', 'www.baidu.com', '/index.html', 'user', 'id=5', 'comment']
            data3 = ['http', '', '/index.html', 'user', 'id=5', 'comment']
            data4 = ['http', 'www.baidu.com', '', 'user', 'id=5', 'comment']
            data5 = ['http', 'www.baidu.com', '/index.html', '', 'id=5', 'comment']
            data6 = ['http', 'www.baidu.com', '/index.html', 'user', '', 'comment']
            data7 = ['http', 'www.baidu.com', '/index.html', 'user', 'id=5', '']
            print("Missing scheme:\t" + urlunparse(data2), "Missing domain:\t" + urlunparse(data3),
                  "Missing path:\t" + urlunparse(data4), "Missing params:\t" + urlunparse(data5),
                  "Missing query:\t" + urlunparse(data6), "Missing fragment:\t" + urlunparse(data7),
                  "Complete URL:\t" + urlunparse(data1), sep='\n\n')

        Running this prints the seven results for comparison.
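
urlunparse() is the inverse of urlparse(), which a round trip makes concrete. The sketch below reuses the sample URL; `_replace()` is the standard named-tuple method, used here to blank out one component before rebuilding.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'

# Parsing and rebuilding a well-formed URL gives the original string back
parts = urlparse(url)
rebuilt = urlunparse(parts)
print(rebuilt == url)

# ParseResult is a named tuple, so _replace() swaps one component before rebuilding
without_fragment = urlunparse(parts._replace(fragment=''))
without_params = urlunparse(parts._replace(params=''))
print(without_fragment)
print(without_params)
```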
      - urlsplit()

        Identifies a URL and splits it into its parts:

            from urllib.parse import urlsplit

            result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
            print(result, result.scheme, result[4], sep='\n')

            # Output:
            # SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
            # http
            # comment

        urlsplit() is very similar to urlparse(). The difference is that urlsplit() leaves params inside path, whereas urlparse() returns path and params separately.
      - urlunsplit()

        Builds a URL from its parts:

            from urllib.parse import urlunsplit

            # urlunsplit() is like urlunparse(): it also takes an iterable,
            # but with 5 elements, because params is kept inside path
            data = ('http', 'www.baidu.com', 'index.html;user', 'id=5', 'comment')
            print(urlunsplit(data))

            # Output:
            # http://www.baidu.com/index.html;user?id=5#comment
      - urljoin()

        Joins URLs:

            from urllib.parse import urljoin

            # urljoin(base, url) resolves url against base. Unlike urlunparse() and urlunsplit(),
            # it does not need a fixed-length sequence of clearly separated components.
            print(urljoin('http://www.baidu.com', 'FAQ.html'))
            print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
            print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
            print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
            print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
            print(urljoin('http://www.baidu.com', '?category=2#comment'))
            print(urljoin('www.baidu.com', '?category=2#comment'))
            print(urljoin('www.baidu.com#comment', '?category=2'))

            # Output:
            # http://www.baidu.com/FAQ.html
            # https://cuiqingcai.com/FAQ.html
            # https://cuiqingcai.com/FAQ.html
            # https://cuiqingcai.com/FAQ.html?question=2
            # https://cuiqingcai.com/index.php
            # http://www.baidu.com?category=2#comment
            # www.baidu.com?category=2#comment
            # www.baidu.com?category=2
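
For a crawler, the most common use of urljoin() is turning relative links found in a page into absolute URLs. A small sketch with made-up example.com addresses; `base` and `links` are illustrative names.

```python
from urllib.parse import urljoin

# The page on which the links were found (hypothetical URL)
base = 'http://www.example.com/docs/guide/index.html'

links = [
    'faq.html',                  # sibling of index.html
    '../api/index.html',         # one directory up
    '/about.html',               # absolute path on the same host
    '//cdn.example.com/app.js',  # new host, scheme taken from base
]

resolved = [urljoin(base, link) for link in links]
for link, absolute in zip(links, resolved):
    print(f'{link} -> {absolute}')
```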
      - urlencode()

        urlencode() converts key-value pairs into the format a URL needs, returning a string such as a=1&b=2:

            from urllib.parse import urlencode

            params = {}
            params['name'] = 'Tom'
            params['age'] = 21

            base_url = 'http://www.baidu.com?'
            url = base_url + urlencode(params)
            print(url)

            # Output:
            # http://www.baidu.com?name=Tom&age=21
      - parse_qs()
        If urlencode() serializes, parse_qs() deserializes:

            from urllib.parse import parse_qs

            query = 'name=Tom&age=21'
            print(parse_qs(query))

            # Output:
            # {'name': ['Tom'], 'age': ['21']}
      - parse_qsl()

        parse_qsl() is very similar to parse_qs(): parse_qsl() returns a list in which each element is a tuple, while parse_qs() returns a dictionary:

            from urllib.parse import parse_qsl

            query = 'name=Tom&age=21'
            print(parse_qsl(query))

            # Output:
            # [('name', 'Tom'), ('age', '21')]
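
The reason parse_qs() wraps every value in a list becomes clear once a key repeats. A minimal sketch, with a made-up `tag` parameter added to the article's `name` example:

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

params = {'name': 'Tom', 'tag': ['python', 'crawler']}

# doseq=True writes one key=value pair per list element
query = urlencode(params, doseq=True)
print(query)

# parse_qs() groups repeated keys, which is why every value is a list
as_dict = parse_qs(query)
print(as_dict)

# parse_qsl() keeps the pairs in order, one tuple per pair
as_pairs = parse_qsl(query)
print(as_pairs)
```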
      - quote()

        Converts text into URL-encoded (percent-encoded) form:

            from urllib.parse import quote

            keyword = '壁纸'
            url = 'https://www.baidu.com/s?wd=' + quote(keyword)
            print(url)

            # Output:
            # https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8
      - unquote()

        Decodes a URL-encoded string:

            from urllib.parse import unquote

            url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
            print(unquote(url))

            # Output:
            # https://www.baidu.com/s?wd=壁纸
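
quote() has a few options that matter when building query strings. The sketch below uses a made-up sample string to compare the default behaviour, the `safe` argument, and the related quote_plus() function.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = 'a b/c'

default_quoted = quote(text)         # '/' is treated as safe by default
strict_quoted = quote(text, safe='') # encode '/' as well
plus_quoted = quote_plus(text)       # form style: a space becomes '+'

print(default_quoted)
print(strict_quoted)
print(plus_quoted)

# Each encoder has a matching decoder
print(unquote(strict_quoted) == text, unquote_plus(plus_quoted) == text)
```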
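    - Analyzing the Robots protocol
      - The Robots protocol (also called the crawler protocol or robot protocol) is formally the Robots Exclusion Protocol
      - When a crawler visits a site, it first checks whether a robots.txt file exists in the site's root directory; if it does, the crawler restricts itself to the scope defined there.
      - robots.txt sample:
        User-agent: Baiduspider means the rules that follow apply to Baidu's crawler
      - Common crawler names: Baiduspider, Googlebot, 360Spider, and many others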
      - robotparser

        urllib.robotparser.RobotFileParser(url='') uses a site's robots.txt file to decide whether a crawler is allowed to fetch a given page
        set_url(): sets the URL of the robots.txt file
        read(): fetches the robots.txt file and parses it
        parse(): parses robots.txt content; its argument is a list of lines from the file
        can_fetch(useragent, url): returns True or False depending on whether that crawler may fetch the URL
        mtime(): returns the time robots.txt was last fetched and parsed
        modified(): sets the time robots.txt was last fetched and parsed to the current time

            from urllib.robotparser import RobotFileParser

            rp = RobotFileParser()
            rp.set_url('http://www.jianshu.com/robots.txt')  # set the URL of the robots.txt file
            rp.read()                                        # fetch and parse robots.txt
            print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))  # may this crawler fetch the URL?
            print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))

            # Output:
            # False
            # False

            # False means the crawler is not allowed to fetch that URL

        The same check using parse() instead of read():

            from urllib.robotparser import RobotFileParser
            from urllib.request import urlopen

            rp = RobotFileParser()
            rp.parse(urlopen('http://www.jianshu.com/robots.txt').read().decode('utf-8').split('\n'))
            print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
            print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))

            # The output is the same as in the previous example
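
Because parse() accepts lines directly, the rule matching can be explored without downloading anything. The robots.txt content below is invented for illustration, as are the example.com URLs and the `MyCrawler` agent name; it is not the file of any real site.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: Baiduspider is banned entirely, everyone else only from /search
robots_txt = """\
User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: /search
Allow: /p/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

article = 'http://www.example.com/p/b67554025d7d'
search = 'http://www.example.com/search?q=python'

results = {
    ('MyCrawler', 'article'): rp.can_fetch('MyCrawler', article),
    ('MyCrawler', 'search'): rp.can_fetch('MyCrawler', search),
    ('Baiduspider', 'article'): rp.can_fetch('Baiduspider', article),
}
for key, allowed in results.items():
    print(key, allowed)
```

A crawler that has no section of its own falls back to the `User-agent: *` rules, which is why `MyCrawler` may fetch the article page but not the search page.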

Reposted from: https://www.cnblogs.com/liyihua/p/11017209.html
