Python scraping failures: patching missing scraped data and smoothing out the potholes of Python scraping

Published: 2023/12/2 · python · 豆豆

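A rank amateur's notes on patching up missing scraped data: a few thoughts on smoothing out the potholes of scraping with Python, some personal lessons learned, and a worked example of filling in missing data. A few pointers, offered for your amusement; if any of it looks familiar, I must have copied it!

When scraping data with Python, especially from your own computer, you will often run into network latency, or a part-time network admin unplugging and restarting the network. This happens to me all the time. Running the scraper on a server is still the recommended option.

These are common and manageable scraping failures, and there are quite a few ways to handle them. Patching the holes they leave in the scraped data is the focus here.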

Fix 1: set timeout=x

When fetching pages with requests, always set the timeout parameter. timeout=5 is typical, and 5 s or more is advisable. If your network is poor or the target server is slow to respond (for example, when reaching an overseas server from within China), set it to 10 s or more!

Why set timeout=x?

It keeps network delays from hanging the program. Without it, the program can freeze in the middle of a page request without even raising an error. This is especially common in programs packaged into an exe with pyinstaller!

Timeouts

To guard against a server that does not respond in time, most requests to external servers should carry a timeout parameter.

By default, requests does not time out unless a timeout value is set explicitly.

Without a timeout, your code may hang for several minutes or longer.

The connect timeout is the number of seconds Requests will wait for your client to establish a connection to a port on the remote machine (this corresponds to the connect() call).

It is good practice to set the connect timeout slightly larger than a multiple of 3, because 3 seconds is the default TCP packet retransmission window.

When scraping through proxies we often hit request timeouts: the code just hangs at that point, with no error and no response from requests.

The usual remedy is to add a timeout to the requests.get() call to limit how long the request may take:

    req = requests.get(url, headers=headers, proxies=proxies, timeout=5)

If long hangs persist even with timeout=5, the timeout can be specified more finely.

After the following change the problem went away:

    req = requests.get(url, headers=headers, proxies=proxies, timeout=(3,7))

timeout limits how long requests waits, and that wait has two parts: the connect timeout and the read timeout. timeout=(3,7) sets a connect timeout of 3 s and a read timeout of 7 s. A single value is applied to the connect timeout and the read timeout separately; it is not a limit on their total.

Source: CSDN blogger 「明天依舊可好」
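To make the connect/read distinction concrete, here is a minimal sketch (not taken from the quoted sources) that passes a (connect, read) tuple and reports which of the two timeouts fired. The function name fetch, the status strings, the 3.05 s connect value and the injectable get parameter are all assumptions made for illustration.

```python
import requests


def fetch(url, timeout=(3.05, 7), get=requests.get):
    """Fetch url and return (status, text).

    timeout is a (connect, read) tuple in seconds; a single number would be
    applied to both phases. `get` is a hypothetical hook so the call can be
    replaced in tests; it defaults to requests.get.
    """
    try:
        resp = get(url, timeout=timeout)
        resp.raise_for_status()  # treat HTTP 4xx/5xx as failures too
        return "ok", resp.text
    except requests.exceptions.ConnectTimeout:
        # The TCP connection was not established within the connect timeout
        return "connect-timeout", None
    except requests.exceptions.ReadTimeout:
        # Connected, but the server sent no data within the read timeout
        return "read-timeout", None
    except requests.exceptions.RequestException:
        # Any other requests failure (DNS error, refused connection, bad status)
        return "error", None
```

Knowing which phase timed out helps decide what to tune: repeated connect timeouts point at the network or the proxy, while read timeouts point at a slow server.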

Fix 2: retrying requests on timeout

Configure retries for your requests calls. The all-too-familiar error message in this situation is the read timeout.

Retrying does not eliminate read timeout errors entirely, but it greatly increases the amount of data you actually collect: an occasional network timeout no longer costs you the data, and you avoid a large backfill job later.

Usually we do not give up as soon as a timeout occurs; instead we retry up to three times.

    import requests

    def gethtml(url):
        i = 0
        while i < 3:
            try:
                html = requests.get(url, timeout=5).text
                return html
            except requests.exceptions.RequestException:
                i += 1
        # All three attempts failed
        return None

In fact, requests already packages this up for us (though the code seems to get longer...).

    import time

    import requests
    from requests.adapters import HTTPAdapter

    s = requests.Session()
    # Allow up to 3 retries for both http and https URLs
    s.mount('http://', HTTPAdapter(max_retries=3))
    s.mount('https://', HTTPAdapter(max_retries=3))

    print(time.strftime('%Y-%m-%d %H:%M:%S'))
    try:
        r = s.get('http://www.google.com.hk', timeout=5)
        print(r.text)
    except requests.exceptions.RequestException as e:
        print(e)
    print(time.strftime('%Y-%m-%d %H:%M:%S'))

max_retries is the maximum number of retries. Three retries plus the initial request make four attempts in total, so the code above takes 20 seconds rather than 15:

    2020-01-11 15:34:03
    HTTPConnectionPool(host='www.google.com.hk', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(, 'Connection to www.google.com.hk timed out. (connect timeout=5)'))
    2020-01-11 15:34:23

Source: 大齡碼農的Python之路
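According to the requests documentation, an integer max_retries only covers failed DNS lookups, socket connections and connection timeouts. For finer control, HTTPAdapter also accepts a urllib3 Retry object. The following is a minimal sketch under assumed settings: the function name make_session, the backoff factor and the list of status codes are illustrative choices, not values from the sources above.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total=3, backoff_factor=0.5):
    """Build a Session that retries with a growing pause between attempts."""
    retry = Retry(
        total=total,                            # retries after the first attempt
        backoff_factor=backoff_factor,          # pause grows with each retry
        status_forcelist=(500, 502, 503, 504),  # also retry on these HTTP statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

A timeout still has to be passed on each call, for example make_session().get(url, timeout=(3.05, 7)); the retry policy does not set one.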

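Fix 3: downloading images with urlretrieve()

Fixing incomplete urlretrieve downloads without excessive delays

Downloading a file can raise urllib.error.ContentTooShortError, and re-downloading the file can take far too long: it often takes several attempts, sometimes more than ten, and occasionally falls into an infinite loop, which is far from ideal. To address this, the author uses the socket module to shorten each retry and to avoid the infinite loop, which improves efficiency.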

The code is as follows:

    import socket
    import urllib.request

    # Set the timeout to 30 s
    socket.setdefaulttimeout(30)

    # Handle incomplete downloads while avoiding an infinite loop.
    # url and image_name must be defined by the surrounding code.
    try:
        urllib.request.urlretrieve(url, image_name)
    except socket.timeout:
        count = 1
        while count <= 5:
            try:
                urllib.request.urlretrieve(url, image_name)
                break
            except socket.timeout:
                err_info = 'Reloading for %d time' % count if count == 1 else 'Reloading for %d times' % count
                print(err_info)
                count += 1
        if count > 5:
            print("downloading picture failed!")

Source: CSDN blogger 「山陰少年」
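The same idea can be written as one bounded loop that also catches ContentTooShortError, which the snippet above does not handle. This is a sketch rather than the quoted author's code: the function name download, its parameters and the injectable retrieve hook are assumptions for illustration.

```python
import socket
import urllib.request
from urllib.error import ContentTooShortError


def download(url, filename, attempts=5, timeout=30,
             retrieve=urllib.request.urlretrieve):
    """Try to download url to filename; return True on success.

    `retrieve` is a hypothetical hook (defaulting to urlretrieve) so the
    network call can be swapped out in tests.
    """
    # Note: this sets the default timeout for all new sockets in the process
    socket.setdefaulttimeout(timeout)
    for attempt in range(1, attempts + 1):
        try:
            retrieve(url, filename)
            return True
        except (socket.timeout, ContentTooShortError) as exc:
            # Timed out or truncated: report and try again
            print(f"attempt {attempt}/{attempts} failed: {exc}")
    return False
```

Because the loop is bounded by attempts, it cannot spin forever, and the caller can log the URL when the function returns False.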

Fix 4: using time.sleep

Python's time.sleep() function suspends execution of the calling thread; its secs argument gives the number of seconds to suspend for.

Some sites return no data when requests arrive too quickly: unless you add a delay of 1-2 s between requests, you will not get any data!

Such cases are fairly rare, though.
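A small sketch of such a delay, with a random component so that requests are not perfectly regular. The helper name polite_pause, its default values and the injectable sleep parameter are assumptions for illustration.

```python
import random
import time


def polite_pause(base=1.0, jitter=1.0, sleep=time.sleep):
    """Sleep for base plus a random 0..jitter seconds; return the delay used.

    `sleep` is a hypothetical hook (defaulting to time.sleep) so the pause
    can be skipped in tests.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay
```

Calling polite_pause() after each page request gives a delay of between 1 and 2 seconds with the defaults.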

Whatever method you use to collect data smoothly, the goal is the same: record the last known state. In other words, your scraping log files must be thorough!

Appendix:

A complete data backfill example:

Source code for handling and recording exceptions:

    # img_url, img_name, path and ua() are defined in the full source at the end
    s = requests.session()
    s.mount('http://', HTTPAdapter(max_retries=3))
    s.mount('https://', HTTPAdapter(max_retries=3))
    try:
        print(f">>> Downloading image {img_name} ...")
        r = s.get(img_url, headers=ua(), timeout=15)
        with open(f'{path}/{img_name}', 'wb') as f:
            f.write(r.content)
        print(f">>> Image {img_name} downloaded successfully!")
        time.sleep(2)
    except requests.exceptions.RequestException as e:
        print(f"Image {img_name} ({img_url}) failed to download!")
        # Record the failure so the image can be re-downloaded later
        with open(f'{path}/imgspider.txt', 'a+', encoding='utf-8') as f:
            f.write(f'{img_url},{img_name},{path}-download failed, error: {e}!\n')

Image download errors, as recorded in the error log file:

https://www.red-dot.org/index.php?f=65894&token=2aa10bf1c4ad54ea3b55f0f35f57abb4ba22cc76&eID=tx_solr_image&size=large&usage=overview,1_1_KRELL Automotive.jpg,2019Communication Design/Film & Animation-download failed, error: HTTPSConnectionPool(host='www.red-dot.org', port=443): Max retries exceeded with url: /index.php?f=65894&token=2aa10bf1c4ad54ea3b55f0f35f57abb4ba22cc76&eID=tx_solr_image&size=large&usage=overview (Caused by ReadTimeoutError("HTTPSConnectionPool(host='www.red-dot.org', port=443): Read timed out. (read timeout=15)"))!

https://www.red-dot.org/index.php?f=65913&token=8cf9f213e28d0e923e1d7c3ea856210502f57df3&eID=tx_solr_image&size=large&usage=overview,1_2_OLX – Free Delivery.jpg,2019Communication Design/Film & Animation-download failed, error: HTTPSConnectionPool(host='www.red-dot.org', port=443): Read timed out.!

https://www.red-dot.org/index.php?f=65908&token=426484d233356d6a1d4b8044f4994e1d7f8c141a&eID=tx_solr_image&size=large&usage=overview,1_3_Dentsu Aegis Network’s Data Training – Data Foundation.jpg,2019Communication Design/Film & Animation-download failed, error: HTTPSConnectionPool(host='www.red-dot.org', port=443): Max retries exceeded with url: /index.php?f=65908&token=426484d233356d6a1d4b8044f4994e1d7f8c141a&eID=tx_solr_image&size=large&usage=overview (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 11004] getaddrinfo failed'))!

Backfill approach:

Step 1: search for the error log file and get its path

Step 2: open the file and read the relevant records

Step 3: re-download the images to fill in the missing data

Key points:

1. Search for the error log file, which here is imgspider.txt

    import os

    # Search for files
    def search(path, key):
        """
        Search a directory for the wanted files and return their paths.
        :param path: directory to search
        :param key: keyword to look for in file names
        :return: list of matching file paths
        """
        key_paths = []
        # List the entries of the current directory (including folders)
        allfilelist = os.listdir(path)
        print(allfilelist)
        for filelist in allfilelist:
            filespath = os.path.join(path, filelist)
            if os.path.isdir(filespath):
                files = os.listdir(filespath)
                print(files)
                for file in files:
                    filepath = os.path.join(filespath, file)
                    if os.path.isdir(filepath):
                        for file_name in os.listdir(filepath):
                            if key in file_name:
                                key_path = os.path.join(filepath, file_name)
                                print(f'Found file at {key_path}')
                                key_paths.append(key_path)
            elif key in filelist:
                print(f'Found file at {filespath}')
                key_paths.append(filespath)
        return key_paths

This only descends two directory levels. It could be rewritten as a recursive function and combined with a GUI to make a simple file search tool!
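One way to remove the depth limit is the standard library's os.walk, which visits every subdirectory for you. The sketch below is an alternative to the function above, not the author's code; the name search_recursive is an assumption.

```python
import os


def search_recursive(root, key):
    """Return the paths of all files under root whose name contains key."""
    matches = []
    # os.walk yields (directory, subdirectories, files) for every level
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if key in name:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

For example, search_recursive('2019Communication Design', 'imgspider') would collect the error logs of every category folder in one call, however deeply they are nested.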

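2. Processing the image records

The string splitting function split

Three pieces of information need to be extracted from each error record:

1. img_url: the image download URL

2. img_name: the image name

3. path: the directory the image is stored in

    for data in datas:
        # Keep only the part before the failure marker, then split the fields
        img_data = data.split('-download failed')[0]
        img_url = img_data.split(',')[0]
        img_name = img_data.split(',')[1]
        path = img_data.split(',')[2]
        print(img_name, img_url, path)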
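Splitting on every comma breaks if an image name itself contains a comma. A slightly more defensive sketch takes the URL from the left and the path from the right, so whatever remains in the middle is the name. The names FailedDownload and parse_record, and the marker text '-download failed', are assumptions for illustration; the marker must match whatever the logging code actually writes. The sketch also assumes that URLs and paths contain no commas.

```python
from typing import NamedTuple, Optional


class FailedDownload(NamedTuple):
    img_url: str
    img_name: str
    path: str


def parse_record(line, marker="-download failed") -> Optional[FailedDownload]:
    """Parse 'url,name,path<marker>...' into its three fields, or None."""
    head = line.split(marker, 1)[0]
    try:
        img_url, rest = head.split(",", 1)    # URL ends at the first comma
        img_name, path = rest.rsplit(",", 1)  # path starts after the last comma
    except ValueError:
        return None  # fewer than three fields: not a valid record
    return FailedDownload(img_url, img_name, path)
```

Returning None for malformed lines lets the caller skip blank or damaged records instead of crashing halfway through the log.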

Full source code:

    # -*- coding: utf-8 -*-
    # python3.7
    # 20200111 by WeChat: huguo00289
    import os
    import time

    import requests
    from fake_useragent import UserAgent
    from requests.adapters import HTTPAdapter  # import the HTTPAdapter class

    # Build the request headers
    def ua():
        user_agent = UserAgent()
        headers = {"User-Agent": user_agent.random}
        return headers

    # Search for files
    def search(path, key):
        """
        Search a directory for the wanted files and return their paths.
        :param path: directory to search
        :param key: keyword to look for in file names
        :return: list of matching file paths
        """
        key_paths = []
        # List the entries of the current directory (including folders)
        allfilelist = os.listdir(path)
        print(allfilelist)
        for filelist in allfilelist:
            filespath = os.path.join(path, filelist)
            if os.path.isdir(filespath):
                files = os.listdir(filespath)
                print(files)
                for file in files:
                    filepath = os.path.join(filespath, file)
                    if os.path.isdir(filepath):
                        for file_name in os.listdir(filepath):
                            if key in file_name:
                                key_path = os.path.join(filepath, file_name)
                                print(f'Found file at {key_path}')
                                key_paths.append(key_path)
            elif key in filelist:
                print(f'Found file at {filespath}')
                key_paths.append(filespath)
        return key_paths

    # Collect the paths of the log files that record failed image downloads
    def get_pmimgspider():
        img_paths = []
        key = "imgspider"
        categorys = [
            "Advertising", "Annual Reports", "Apps", "Brand Design & Identity", "Brands", "Corporate Design & Identity",
            "Fair Stands", "Film & Animation", "Illustrations", "Interface & User Experience Design",
            "Online", "Packaging Design", "Posters", "Publishing & Print Media", "Retail Design", "Sound Design",
            "Spatial Communication", "Typography", "Red Dot_Junior Award",
        ]
        for category in categorys:
            path = f'2019Communication Design/{category}'
            key_paths = search(path, key)
            img_paths.extend(key_paths)
        print(img_paths)
        return img_paths

    # Download an image
    def get_img(img_name, img_url, path):
        s = requests.session()
        s.mount('http://', HTTPAdapter(max_retries=3))
        s.mount('https://', HTTPAdapter(max_retries=3))
        try:
            print(f">>> Downloading image {img_name} ...")
            r = s.get(img_url, headers=ua(), timeout=15)
            with open(f'{path}/{img_name}', 'wb') as f:
                f.write(r.content)
            print(f">>> Image {img_name} downloaded successfully!")
            time.sleep(2)
        except requests.exceptions.RequestException as e:
            print(f"Image {img_name} ({img_url}) failed to download!")
            # Record the failure so the image can be re-downloaded later
            with open(f'{path}/imgspider.txt', 'a+', encoding='utf-8') as f:
                f.write(f'{img_url},{img_name},{path}-download failed, error: {e}!\n')

    def main2():
        img_paths = get_pmimgspider()
        for img_path in img_paths:
            print(img_path)
            with open(img_path, encoding='utf-8') as f:
                datas = f.readlines()
            print(datas)
            for data in datas:
                # Keep only the part before the failure marker, then split the fields
                img_data = data.split('-download failed')[0]
                img_url = img_data.split(',')[0]
                img_name = img_data.split(',')[1]
                path = img_data.split(',')[2]
                print(img_name, img_url, path)
                get_img(img_name, img_url, path)

    if __name__ == "__main__":
        main2()
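One gap in the script above is that records for images recovered on a later run stay in imgspider.txt, so every run retries the whole list. A possible refinement, sketched under assumptions: retry_failed is a hypothetical helper, download is a hypothetical callable that returns True on success and does not write to the log itself (unlike get_img), and the marker text must match what the logging code writes.

```python
import os


def retry_failed(log_path, download, marker="-download failed"):
    """Retry each record in log_path; keep only those that still fail.

    Returns the number of records recovered.
    """
    with open(log_path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]

    still_failing = []
    for line in lines:
        head = line.split(marker, 1)[0]
        img_url, rest = head.split(",", 1)
        img_name, path = rest.rsplit(",", 1)
        if not download(img_name, img_url, path):
            still_failing.append(line)

    # Write to a temporary file first, then swap it in, so an interrupted
    # run cannot leave a half-written log behind
    tmp_path = log_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.writelines(still_failing)
    os.replace(tmp_path, log_path)
    return len(lines) - len(still_failing)
```

With this shape, running the backfill repeatedly shrinks the log until it is empty, which makes it easy to see when every image has been recovered.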
