Python Exception Data Handling: Improving Python Crawlers Through Exception Handling

Published: 2024/7/5 python 33 豆豆

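Exception Handling in Python Crawler Frameworks

Any request that fetches data from a server needs exception handling, and a crawler needs it even more because it has to cope with many kinds of failure. Only then does the crawler become robust, and a sufficiently robust crawler can keep running for months without stopping.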

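We cover the following topics:

1: Handling exceptions in a code block with try/except

2: Timeout handling for ordinary request functions

3: Timeout handling for selenium+chrome | phantomjs

4: Deadlock or timeout handling for custom functions

5: Deadlock or timeout handling for custom threads

6: Designing a program that restarts itself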

1: Basic try/except exception handling

A try/except statement lets the program carry on past exceptions that may occur in the code.

import time

try:
    pass  # the statement that might fail goes here
except Exception as e:
    # keep the failed URL so it can be rerun next time
    print(e)
finally:
    # runs whether or not an exception was handled
    print(time.ctime())
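The comment about keeping the failed URL for a later rerun can be sketched as follows. This is only an illustration under stated assumptions: `crawl` and its `fetch` argument are hypothetical names that do not appear in the original article, and `fetch` stands for whatever function actually downloads one page.

```python
import time


def crawl(urls, fetch):
    """Fetch every URL and return (results, failed_urls).

    `fetch` is a hypothetical callable that downloads one URL and may raise.
    """
    results = {}
    failed = []
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as e:
            # Keep the failing URL so it can be retried on the next run.
            failed.append(url)
            print("error for %s: %s" % (url, e))
        finally:
            # Runs whether or not an exception was raised.
            print(time.ctime())
    return results, failed
```

The returned `failed` list could be written to a file or pushed back onto a queue, so that one bad page does not stop the whole crawl.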

2: Timeout handling for request functions

2.1: Ordinary requests:

2.1.1 Single request:

import requests

# url is the page to fetch; give up if the server does not respond within 60 seconds
requests.get(url, timeout=60)
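With `timeout` set, requests raises an exception (`requests.exceptions.Timeout`) instead of waiting forever, and a crawler usually wants to try again a few times before giving up. The sketch below is one possible way to do that; `retry` is a hypothetical helper, not part of requests, and it is written against plain callables so that it runs without a network.

```python
import time


def retry(func, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call func() until it succeeds or `attempts` calls have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay * attempt)  # wait a little longer each time


# Possible use with requests (assumes `url` is defined):
# import requests
# response = retry(lambda: requests.get(url, timeout=60),
#                  exceptions=(requests.exceptions.RequestException,))
```

Restricting `exceptions` to network errors keeps programming mistakes such as a `NameError` from being retried silently.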

2.1.2 Session-persistent requests:

import requesocks

session = requesocks.session()

response = session.get(URL,headers=headers,timeout=10)

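3: Timeout handling for selenium+chrome | phantomjs

2.2.1: Timeout settings for selenium+chrome

Original documentation: http://selenium-python.readthedocs.io/waits.html

Explicit wait: wait for a certain condition to occur before continuing with the code.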

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(  # change the timeout (seconds) here
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
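Conceptually, an explicit wait is a polling loop with a deadline. The pure-Python sketch below mimics that idea without a browser. `wait_until` is a hypothetical helper, not Selenium's implementation, and it raises the built-in `TimeoutError`, whereas Selenium's `WebDriverWait` raises its own `TimeoutException`.

```python
import time


def wait_until(condition, timeout=10, poll=0.5):
    """Poll condition() until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value  # the condition was met: hand back its result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll)  # wait before checking again
```

In a crawler, catching the timeout exception around the wait lets the program log the slow page and move on instead of crashing.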

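Implicit wait: tells WebDriver to poll the DOM for a certain amount of time when trying to find one or more elements that are not immediately available. The default is 0. Once set, the implicit wait applies for the lifetime of the WebDriver object instance.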

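from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.implicitly_wait(10)  # seconds
driver.get("http://somedomain/url_that_delays_loading")
# find_element_by_id was removed in Selenium 4; use find_element(By.ID, ...)
myDynamicElement = driver.find_element(By.ID, "myDynamicElement")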

2.2.2: Timeout settings for phantomjs

Here phantomjs is used without selenium, so the script has to be written in JavaScript. The key setting is:

page.settings.resourceTimeout = 5000; // wait up to 5 seconds (value in milliseconds)

The complete script (JavaScript):

var system = require('system');
var args = system.args;
var url = args[1];
var page = require('webpage').create();

page.settings.resourceTimeout = 5000; // wait up to 5 seconds

page.onResourceTimeout = function(e) {
    console.log(e.errorCode);   // print the error code
    console.log(e.errorString); // print the error message
    console.log(e.url);         // print the URL that timed out
    phantom.exit(1);
};

page.open(url, function(status) {
    if (status === 'success') {
        var html = page.evaluate(function() {
            return document.documentElement.outerHTML;
        });
        console.log(html);
    }
    phantom.exit();
});

// $ phantomjs xx.js http://bbs.pcbaby.com.cn/topic-2149414.html

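4: Deadlock or timeout handling for custom functions

This one is very important!

Python executes statements in order, so what happens if the next statement may hang (for example a while(1) loop)? How do we force it to time out? If the call has no timeout setting of its own, we have to handle it ourselves with signals (import signal). Note that signal.SIGALRM is available only on Unix, and signal handlers can only be installed from the main thread.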

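# coding: utf-8
import time
import signal


def test(i):
    time.sleep(0.999)  # simulate a slow call
    print("%d within time" % i)
    return i


def fuc_time(time_out, i):
    # Timeout control for a function: replace test() below with the
    # function that may hang for unknown reasons.
    def handler(signum, frame):
        raise AssertionError

    try:
        signal.signal(signal.SIGALRM, handler)
        signal.alarm(time_out)  # time_out is the timeout in seconds
        temp = test(i)  # returns the data normally if the call finishes in time
        return temp
    except AssertionError:
        print("%d timeout" % i)  # report the timeout
    finally:
        signal.alarm(0)  # cancel the pending alarm so it cannot fire later


if __name__ == '__main__':
    for i in range(1, 10):
        fuc_time(1, i)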
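The same signal-based idea can be wrapped in a context manager so that any block of code gets a time limit. This is a minimal sketch, assuming a Unix system and the main thread; `time_limit` is a hypothetical name, and `signal.setitimer` is used here in place of `signal.alarm` so that fractional seconds are allowed.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def time_limit(seconds):
    """Raise TimeoutError if the with-block runs longer than `seconds`."""
    def handler(signum, frame):
        raise TimeoutError("timed out after %s seconds" % seconds)

    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)  # start the countdown
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
        signal.signal(signal.SIGALRM, old_handler)  # restore the old handler


if __name__ == "__main__":
    try:
        with time_limit(0.5):
            time.sleep(2)  # stands in for a call that hangs
    except TimeoutError as e:
        print(e)
```

Restoring the previous handler in `finally` keeps the helper from interfering with other code that also uses `SIGALRM`.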

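5: Deadlock or timeout handling for custom threads

In one of my programs the selenium+phantomjs approach was not suitable (the required functionality was hard to implement that way), so only native phantomjs could be used. But phantomjs itself can also hang in extreme cases, for example when some error occurs before the timeout setting takes effect.

The best solution is then to have Python start a separate thread (process) that calls native phantomjs, and to apply timeout control to that thread or process.

Here the ping command is used as a first test: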

import subprocess
from threading import Timer
import time

kill = lambda process: process.kill()
cmd = ["ping", "www.google.com"]
ping = subprocess.Popen(
    cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
my_timer = Timer(5, kill, [ping])  # set the timeout (seconds) and the process to kill here
try:
    my_timer.start()  # start the timer
    stdout, stderr = ping.communicate()  # collect the output
    # print(stderr)
    print(time.ctime())
finally:
    print(time.ctime())
    my_timer.cancel()
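On Python 3 the standard library can do the same job without a separate `Timer`: `subprocess.run` accepts a `timeout` argument, kills the child when it expires, and raises `subprocess.TimeoutExpired`. The sketch below shows that variant; `run_with_timeout` is a hypothetical wrapper name, not something from the original article.

```python
import subprocess


def run_with_timeout(cmd, timeout):
    """Run cmd; return (stdout, stderr), or None if it exceeded the timeout."""
    try:
        completed = subprocess.run(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return None
    return completed.stdout, completed.stderr


if __name__ == "__main__":
    # ping may never finish on its own, so cap it at 5 seconds.
    print(run_with_timeout(["ping", "www.google.com"], 5))
```

The same wrapper could be pointed at a native phantomjs command line, with the failed URL recorded whenever `None` comes back.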

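6: Automatic program restart

Suppose the program fails repeatedly under some condition. Restarting it once that condition is met solves most problems. Of course this only treats the symptom, not the root cause, but if restarting does no harm (for example, a program that reads from a queue), a self-restart is one of the least effortful approaches.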

import time
import sys
import os


def restart_program():
    # replace the current process with a fresh run of the same script
    python = sys.executable
    os.execl(python, python, *sys.argv)


if __name__ == "__main__":
    print('start...')
    print("The program will exit in 3 seconds...")
    time.sleep(3)
    restart_program()
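The "restart once the errors pile up" condition described above could be wired in roughly like this. It is a sketch under stated assumptions: `run_with_restart`, `task`, and `max_errors` are hypothetical names, and the threshold of three consecutive failures is an arbitrary choice for illustration.

```python
import os
import sys


def restart_program():
    """Replace the current process with a fresh copy of the same script."""
    python = sys.executable
    os.execl(python, python, *sys.argv)


def run_with_restart(task, max_errors=3, restart=restart_program):
    """Call task() in a loop; after max_errors consecutive failures, restart."""
    errors = 0
    while True:
        try:
            task()
            errors = 0  # a success resets the counter
        except Exception as e:
            errors += 1
            print("error %d/%d: %s" % (errors, max_errors, e))
            if errors >= max_errors:
                # With the default restart_program this call never returns.
                return restart()
```

Passing `restart` as a parameter makes the loop easy to try out with a harmless stand-in before letting it call `os.execl` for real.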
