Python Crawler Lesson 2: URL Parsing

Published: 2025/3/15

Example URL: http://www.baidu.com/index.html;user?id=5#comment
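A URL is split into its components at specific delimiters:
Everything before the colon ':' is the scheme, i.e. the protocol. The part between '//' and the next '/' is the netloc, i.e. the domain name. What follows is the path, i.e. the access path (which includes the '/' itself). After the semicolon ';' come the params (parameters). After the question mark '?' comes the query, typically used in GET-style URLs. After the hash '#' comes the fragment (anchor), used to jump directly to a position inside the page.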
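The mapping from delimiters to components can be checked directly. The snippet below is a minimal sketch that parses the example URL and prints each named field; the second half illustrates that `urlparse` only recognizes a netloc when it is introduced by `//`. The scheme-less URLs are made up for illustration.

```python
from urllib.parse import urlparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(url)

# ParseResult is a named tuple, so _asdict() lists every field by name.
for name, value in parts._asdict().items():
    print(f'{name:8} -> {value}')

# Without a leading '//', nothing is treated as netloc:
# the whole string ends up in path.
bare = urlparse('www.baidu.com/index.html')
print(bare.netloc == '', bare.path)

# With '//' in front, the host is recognized as netloc.
with_slashes = urlparse('//www.baidu.com/index.html')
print(with_slashes.netloc, with_slashes.path)
```

This is why a URL typed without `http://` often needs to be normalized before parsing.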

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
scheme: the default protocol (for example http or https). It is used only when the URL itself carries no scheme.
allow_fragments: whether the fragment is recognized. If set to False, the fragment is not split out; it is parsed as part of the path, params, or query, and the fragment field is left empty.
The return value is a tuple (a ParseResult named tuple), so its fields can be read either by index or by attribute name.
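A small sketch of the two parameters and the two access styles described above. The URLs are illustrative variations on the example URL, not taken from the original code.

```python
from urllib.parse import urlparse

# scheme= is only a fallback: it applies when the URL has no scheme of its own.
no_scheme = urlparse('//www.baidu.com/index.html#comment', scheme='https')
has_scheme = urlparse('http://www.baidu.com/index.html', scheme='https')
print(no_scheme.scheme)   # https
print(has_scheme.scheme)  # http

# Index access and attribute access read the same field.
print(no_scheme[1], no_scheme.netloc)

# With allow_fragments=False the '#comment' part stays attached to the
# preceding component (here the path) and fragment is empty.
kept = urlparse('http://www.baidu.com/index.html#comment',
                allow_fragments=False)
print(repr(kept.path), repr(kept.fragment))
```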

Example code:

from urllib.parse import (parse_qs, parse_qsl, quote, unquote, urlencode,
                          urljoin, urlparse, urlsplit, urlunparse, urlunsplit)

url = 'http://www.baidu.com/index.html;user?id=5#comment'
result1 = urlparse(url)
print(type(result1))
print('result1 : ', result1)
result1 = urlparse(url, allow_fragments=False)
print(result1)

# urlsplit(): very similar to urlparse(), except that it does not parse
# params separately, so it returns only 5 fields. The result is also a tuple.
result2 = urlsplit(url)
print('result2 : ', result2)

# urlunparse(): takes an iterable of exactly 6 items and assembles a URL.
# (Passing a dict iterates over its keys, not its values.)
# Note: the fragment below already contains '#', and urlunparse() adds its
# own, so the result shows '##'. Pass 'comment' to get a single '#'.
data1 = ['http', 'www.baidu.com', '/index.html', 'user', 'id=S', '#comment']
print(urlunparse(data1))

# urlunsplit(): like urlunparse(), it takes an iterable; the only
# difference is that the length must be 5.
data2 = ['http', 'www.baidu.com', '/index.html;user', 'id=S', '#comment']
print(urlunsplit(data2))

# urljoin(): takes a base_url as the first argument and a new link as the
# second. It analyzes the scheme, netloc and path of base_url, fills in
# whatever the new link is missing, and returns the result.
print(urljoin('http://www.baidu.com', '?category=2#comment'))

# urlencode(): serializes a dict into GET request parameters.
params = {'scheme': 'http', 'netloc': 'www.baidu.com', 'path': '/index.html',
          'name': 'germey', 'age': 22}
print(urlencode(params))
params = {'name': 'germey', 'age': 22}
print('http://www.baidu.com?' + urlencode(params))
# So urlencode() only builds the query string; it does not interpret keys
# such as 'scheme' or 'netloc' as URL components.

# parse_qs(): converts a query string into a dict.
# parse_qsl(): converts a query string into a list of tuples.
print(parse_qs('name=germey&age=22'))
print(parse_qsl('name=germey&age=22'))

# quote(): Chinese characters in a URL can sometimes lead to garbled text;
# this function converts them to URL (percent) encoding.
# unquote(): decodes a URL-encoded string.
keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)
print(unquote(url))

Program output:

<class 'urllib.parse.ParseResult'>
result1 :  ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')
result2 :  SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
http://www.baidu.com/index.html;user?id=S##comment
http://www.baidu.com/index.html;user?id=S##comment
http://www.baidu.com?category=2#comment
scheme=http&netloc=www.baidu.com&path=%2Findex.html&name=germey&age=22
http://www.baidu.com?name=germey&age=22
{'name': ['germey'], 'age': ['22']}
[('name', 'germey'), ('age', '22')]
https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8
https://www.baidu.com/s?wd=壁纸
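The doubled '##' in the output comes from passing a fragment that already contains '#'. As a sketch, the snippet below passes the fragment without the '#', checks that parsing and unparsing round-trip, and then swaps in a new query string. The `_replace` call is the standard named-tuple method; the replacement parameters are reused from the urlencode example.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

# Pass the fragment without '#': urlunparse() inserts the delimiter itself.
parts = ['http', 'www.baidu.com', '/index.html', 'user', 'id=5', 'comment']
url = urlunparse(parts)
print(url)  # http://www.baidu.com/index.html;user?id=5#comment

# Parsing and unparsing the same URL gives it back unchanged.
parsed = urlparse(url)
print(urlunparse(parsed) == url)  # True

# Replace only the query, leaving the other five components alone.
new_query = urlencode({'name': 'germey', 'age': 22})
updated = urlunparse(parsed._replace(query=new_query))
print(updated)

# parse_qs() recovers the parameters; values come back as lists of strings.
print(parse_qs(urlparse(updated).query))
```

Working on the parsed tuple rather than concatenating strings avoids having to handle the '?' and '#' delimiters by hand.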
