Urllib Library Functions, Proxies, and Scraping Examples

Published: 2023/11/27

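The urllib library: using the urlopen function

from urllib import request
# urlopen sends a GET request and returns a response object; read() returns the body as bytes
resp=request.urlopen('http://www.baidu.com')
print(resp.read())

Partial output: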

b'<!DOCTYPE html><!--STATUS OK-->\n\n\n    <html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="referrer"><meta name="theme-color" content="#2932e1"><meta name="description"
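
Because read() returns bytes (note the b'...' prefix above), the body usually has to be decoded before it can be processed as text. The following is a minimal sketch of one way to do that, not the only one. The helper name fetch_text and the data: URL are assumptions introduced here so the example runs without network access; in practice the URL would be a real page such as http://www.baidu.com.

```python
from urllib import request


def fetch_text(url, fallback_charset="utf-8"):
    """Open url and return the response body decoded to str (hypothetical helper)."""
    # The with-block closes the response even if read() raises
    with request.urlopen(url) as resp:
        raw = resp.read()  # bytes
        # Use the charset declared in Content-Type, or fall back to UTF-8
        charset = resp.headers.get_content_charset() or fallback_charset
    return raw.decode(charset, errors="replace")


# Assumption: a data: URL stands in for a real page so the sketch runs offline
demo_url = "data:text/html;charset=utf-8,%3Ch1%3Ehello%3C%2Fh1%3E"
text = fetch_text(demo_url)
print(text)
```

Calling fetch_text('http://www.baidu.com') should work the same way when a network connection is available.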

The urllib library: the urlretrieve function

# Download the Baidu homepage and save it as an HTML file
request.urlretrieve('http://www.baidu.com/','baidu.html')
# Download an image and save it to the given path
request.urlretrieve('https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fwww.huaduhn.com%2Fuserfiles%2F322%2F201407%2F11%2F201407110004.jpg&refer=http%3A%2F%2Fwww.huaduhn.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=jpeg?sec=1627958798&t=b7a5c2084792e07337bf68ca56d4f58b','dangqi.jpg')
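
urlretrieve also accepts an optional reporthook callback, which can be used to track download progress. The sketch below illustrates this under a few assumptions: the file names, the report function, and the use of a local file:// URL in place of a remote one are all introduced here so the example can run offline.

```python
import tempfile
from pathlib import Path
from urllib import request

progress = []  # (approximate bytes transferred, total size) per callback


def report(block_num, block_size, total_size):
    # Called once before the first block and once after each block read
    progress.append((block_num * block_size, total_size))


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"
    source.write_bytes(b"x" * 20000)
    target = Path(tmp) / "copy.bin"
    # Assumption: a file:// URL replaces the remote URL so no network is needed
    saved_path, headers = request.urlretrieve(
        source.as_uri(), str(target), reporthook=report
    )
    copied = target.read_bytes()

print(saved_path, len(copied), progress[-1])
```

urlretrieve returns the local file name together with the response headers, so the caller can check where the data was saved.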

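The urlencode function

When a browser sends a request, it automatically encodes any Chinese or other special characters in the URL. When the request is sent from code, the encoding has to be done manually, and urlencode is the function for that. urlencode converts a dictionary into URL-encoded data.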

from urllib import parse
params={'name':'张三',"age":18,'greet':'hello world'}
result=parse.urlencode(params)
print(result)
'''
Output:
name=%E5%BC%A0%E4%B8%89&age=18&greet=hello+world
'''

url='http://www.baidu.com/s'
params1={"wd":"刘德华"}
qs=parse.urlencode(params1)
print(qs)
# Output: wd=%E5%88%98%E5%BE%B7%E5%8D%8E
# Append the query string to get a URL of the form /s?wd=..., e.g. https://www.baidu.com/s?wd=%E7%99%BE%E5%BA%A6
url=url+"?"+qs
resp=request.urlopen(url)
print(resp.read())
'''
Partial output:
[b'<!DOCTYPE html><!--STATUS OK-->\n']
[b'\n', b'\n', b'    <html><head><meta http-equiv="Content-Type" 
content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" 
content="IE=edge,chrome=1"><meta 
content="always" name="referrer"><meta 
name="theme-color" content="#2932e1"><meta name="description" 
content="\xe5\x85\xa8\xe7\x90\x83\xe9\xa2
\x86\xe5\x85\x88\xe7\x9a\x84\xe4\xb8\xad'''
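
urlencode works on a whole dictionary; urllib.parse also provides quote, quote_plus, and unquote for single values. The sketch below compares them. The variable names and the tag example are assumptions added for illustration.

```python
from urllib import parse

# quote() escapes a single value; a space becomes %20
path_style = parse.quote("hello world")
# quote_plus() is meant for query strings; a space becomes +
query_style = parse.quote_plus("hello world")
# urlencode() encodes every key and value of a mapping in the quote_plus style
single = parse.urlencode({"wd": "刘德华"})
# doseq=True expands a list value into repeated key=value pairs
multi = parse.urlencode({"tag": ["python", "urllib"]}, doseq=True)
# unquote() reverses the percent-encoding
decoded = parse.unquote("%E5%88%98%E5%BE%B7%E5%8D%8E")

print(path_style, query_style, single, multi, decoded, sep="\n")
```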

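The parse_qs function:

parse_qs decodes URL-encoded parameters back into a dictionary. Every value comes back as a list of strings, because a key may appear more than once in a query string.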

params={'name':'张三',"age":18,'greet':'hello world'}
result=parse.urlencode(params)
result1=parse.parse_qs(result)
print(result)
print(result1)
'''
Output:
name=%E5%BC%A0%E4%B8%89&age=18&greet=hello+world
{'name': ['张三'], 'age': ['18'], 'greet': ['hello world']}
'''
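
Two details of parse_qs are easy to miss: repeated keys are collected into one list, and keys with empty values are dropped unless keep_blank_values=True is passed. The sketch below shows both, along with parse_qsl, which returns a list of pairs instead of a dictionary. The query string used here is an assumption made up for the example.

```python
from urllib import parse

# Assumption: a made-up query string with a repeated key and an empty value
qs = "name=%E5%BC%A0%E4%B8%89&tag=a&tag=b&note="

as_dict = parse.parse_qs(qs)                              # blank 'note' is dropped
with_blanks = parse.parse_qs(qs, keep_blank_values=True)  # blank 'note' is kept
as_pairs = parse.parse_qsl(qs)                            # ordered (key, value) pairs

# Flatten to one value per key when duplicates are not expected
flat = {key: values[0] for key, values in as_dict.items()}

print(as_dict, with_blanks, as_pairs, flat, sep="\n")
```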

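The urllib library: using the urlparse and urlsplit functions

Both functions split a URL into its components. The only difference is that the result of urlparse has an extra params field, which urlsplit does not provide.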

from urllib import request,parse
url='http://www.baidu.com/s?username=zhiliao'
result=parse.urlsplit(url)
result1=parse.urlparse(url)
print(result)
print(result1)
print('scheme:',result.scheme)
print('netloc:',result.netloc)
print('path:',result.path)
print('query:',result.query)
'''
Output:
SplitResult(scheme='http', netloc='www.baidu.com', path='/s', query='username=zhiliao', fragment='')
ParseResult(scheme='http', netloc='www.baidu.com', path='/s', params='', query='username=zhiliao', fragment='')
scheme: http
netloc: www.baidu.com
path: /s
query: username=zhiliao
'''
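
The params field only matters when the last path segment carries a ";" suffix. The sketch below uses a hypothetical variant of the URL above (the ";type=a" and "#top" parts are assumptions added for illustration) to show where each function puts that suffix, and how a split result can be modified and turned back into a URL.

```python
from urllib import parse

# Assumption: ';type=a' and '#top' are added only to make the difference visible
url = "http://www.baidu.com/s;type=a?username=zhiliao#top"

split = parse.urlsplit(url)
parsed = parse.urlparse(url)

print(split.path)                  # urlsplit keeps ';type=a' inside path
print(parsed.path, parsed.params)  # urlparse moves it into params

# The results are named tuples: swap the query, drop the fragment, rebuild the URL
new_query = parse.urlencode({"wd": "python"})
rebuilt = split._replace(query=new_query, fragment="").geturl()
print(rebuilt)
```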

Hands-on: scraping Lagou job listings with Request

from urllib import request
url='https://www.lagou.com/'
# Send a browser-like User-Agent; the default urllib one may be rejected by the site
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.64'}
req=request.Request(url,headers=headers)
resp=request.urlopen(req)
print(resp.read())
'''
Partial output:
b'<!DOCTYPE html>\n<html>\n<head>\n\t<!-- meta -->\n\t<meta charset="UTF-8">\n
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />\n
<meta name="renderer" content="webkit">\n<meta property="qc:admins" content="23635710066417756375" />\n
<meta name="baidu-site-verification" content="QIQ6KC1oZ6" />\n
<meta content="\xe6\x8b\x89\xe5\x8b\xbe\xe6\x8b\x9b\xe8\x81\x98,
'''
# Request the Python job listings for Beijing, stating the HTTP method explicitly
url='https://www.lagou.com/beijing-zhaopin/Python/?labelWords=label'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.64'}
req=request.Request(url,headers=headers,method='GET')
resp=request.urlopen(req)
print(resp.read())
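
A Request object can be inspected before it is sent, which helps when checking which headers and method will actually go out. The sketch below builds two requests without sending them. The shortened User-Agent string and the form fields kd and pn are hypothetical values chosen for illustration, not parameters confirmed for this site.

```python
from urllib import parse, request

url = "https://www.lagou.com/"
# Assumption: a shortened User-Agent string, for readability only
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Without data, the request is sent as GET
get_req = request.Request(url, headers=headers)

# Assumption: hypothetical form fields; data must be bytes, and supplying it
# makes the request a POST unless method= says otherwise
form = parse.urlencode({"kd": "python", "pn": 1}).encode("utf-8")
post_req = request.Request(url, data=form, headers=headers)

print(get_req.get_method())
print(post_req.get_method())
# Request stores header names capitalized, hence 'User-agent'
print(get_req.get_header("User-agent"))
```

Neither request touches the network until it is passed to request.urlopen.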

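ProxyHandler (proxy settings)

Use the ProxyHandler class from urllib.request to route requests through a proxy server. By switching to a different proxy at regular intervals, the crawl can continue from another IP address when one gets banned.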

from urllib import request
# Without a proxy
'''
url='http://httpbin.org/ip'
resp=request.urlopen(url)
print(resp.read())
'''
# b'{\n  "origin": "111.79.225.123"\n}\n'
# With a proxy
url='http://httpbin.org/ip'
# 1. Build a handler by passing the proxy to ProxyHandler
handler=request.ProxyHandler({"http":"112.250.107.37:53281"})
# 2. Build an opener from that handler
opener=request.build_opener(handler)
# 3. Send the request through the opener
resp=opener.open(url)
print(resp.read())
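
The idea of switching proxies can be sketched as a simple round-robin rotation. This is one possible approach under stated assumptions, not a complete solution: the PROXIES list, the next_opener helper, and the second address (taken from a documentation-only IP range) are placeholders, and the actual request is left as a comment so the example runs offline.

```python
import itertools
from urllib import request

# Assumption: placeholder proxy addresses; replace with proxies that actually work
PROXIES = ["112.250.107.37:53281", "203.0.113.10:8080"]
proxy_cycle = itertools.cycle(PROXIES)


def next_opener():
    """Return (proxy, opener) for the next proxy in the rotation (hypothetical helper)."""
    proxy = next(proxy_cycle)
    handler = request.ProxyHandler({"http": proxy})
    return proxy, request.build_opener(handler)


used = []
for _ in range(3):
    proxy, opener = next_opener()
    used.append(proxy)
    # opener.open('http://httpbin.org/ip', timeout=5) would send the request
    # through this proxy; on failure, call next_opener() again and retry

print(used)
```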
