Using timeout limits in web crawlers

Published: 2025/4/16

Setting a timeout

There are many ways to set a timeout; the examples below use urllib.
The page being fetched is the Python website.

Without a timeout

>>> import urllib.request
>>> response = urllib.request.urlopen('http://www.python.org')
>>>

With a timeout

Case 1: timeout = 0.1

>>> response = urllib.request.urlopen('http://www.python.org', timeout=0.1)
Traceback (most recent call last):
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1254, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1107, in request
    self._send_request(method, url, body, headers)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1152, in _send_request
    self.endheaders(body)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1103, in endheaders
    self._send_output(message_body)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 934, in _send_output
    self.send(msg)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 877, in send
    self.connect()
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 849, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\socket.py", line 712, in create_connection
    raise err
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\socket.py", line 703, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 466, in open
    response = self._open(req, data)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 484, in _open
    '_open', req)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 444, in _call_chain
    result = func(*args)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1282, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1256, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error timed out>

Case 2: timeout = 0.5

>>> response = urllib.request.urlopen('http://www.python.org', timeout=0.5)
Traceback (most recent call last):
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1254, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1107, in request
    self._send_request(method, url, body, headers)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1152, in _send_request
    self.endheaders(body)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1103, in endheaders
    self._send_output(message_body)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 934, in _send_output
    self.send(msg)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 877, in send
    self.connect()
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1261, in connect
    server_hostname=server_hostname)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\ssl.py", line 385, in wrap_socket
    _context=self)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\ssl.py", line 760, in __init__
    self.do_handshake()
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\ssl.py", line 996, in do_handshake
    self._sslobj.do_handshake()
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\ssl.py", line 641, in do_handshake
    self._sslobj.do_handshake()
socket.timeout: _ssl.c:703: The handshake operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 472, in open
    response = meth(req, response)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 504, in error
    result = self._call_chain(*args)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 444, in _call_chain
    result = func(*args)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 696, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 466, in open
    response = self._open(req, data)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 484, in _open
    '_open', req)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 444, in _call_chain
    result = func(*args)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1297, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1256, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error _ssl.c:703: The handshake operation timed out>

Case 3: timeout = 1

>>> response = urllib.request.urlopen('http://www.python.org', timeout=1)
>>>

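Analysis

Once a timeout is set, a request that exceeds it raises an error and the task ends there. In exchange, the time each task can spend waiting is bounded. In the runs above, 0.1 seconds was not enough to open the connection, 0.5 seconds was enough to connect but the request was then redirected (http_error_302 in the traceback) to HTTPS and the SSL handshake timed out, and 1 second was enough for the whole request to succeed. Note that urllib applies the timeout to each blocking operation (connecting, handshaking, reading) rather than to the request as a whole.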
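Because a timeout surfaces as an exception, a crawler normally wraps the call so that one slow page does not abort the whole run. The following is a minimal sketch of that idea, not a definitive implementation; the helper name `fetch` and its "return None on failure" convention are assumptions made for illustration.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout):
    """Return the response body as bytes, or None if the request fails.

    `fetch` is a hypothetical helper, not part of urllib.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as err:
        # Raised when connecting (or the SSL handshake) fails or times out;
        # err.reason carries the underlying cause.
        print("failed:", url, err.reason)
    except socket.timeout:
        # A timeout while waiting for or reading the response can be raised
        # directly, without being wrapped in URLError.
        print("timed out while reading:", url)
    return None
```

A caller can then treat `None` as "try again later" instead of handling tracebacks at every call site.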

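Application

Suppose we build a concurrent crawler (for example with coroutines or threads). If no timeout is set, a worker that is stuck waiting for a response holds up the job, and the other workers end up waiting along with it. (Other threads or coroutines can run in the meantime, but switching between them also costs time.) If the timeout is set to a small but reasonable value, each thread (or coroutine) finishes its attempt after only a short time. When an attempt fails, record it first and process the failed items afterwards.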
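The idea above can be sketched with a thread pool. This is only an illustration under stated assumptions: the names `crawl`, `fetch`, `fetch_fn` and `workers`, and the default values, are hypothetical and do not come from the original article.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def fetch(url, timeout):
    """Fetch one page; raises an OSError subclass on failure or timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, timeout=1.0, workers=8, fetch_fn=fetch):
    """Fetch all URLs concurrently and return (results, failed)."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {url: pool.submit(fetch_fn, url, timeout) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except OSError:
                # urllib.error.URLError and socket.timeout are both
                # subclasses of OSError; record the URL for later handling.
                failed.append(url)
    return results, failed
```

With a short timeout, no worker stays blocked for long, and the `failed` list holds the URLs to process afterwards.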

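Tiered timeouts can also be used: each time a request fails, it is moved to a queue with a longer timeout. The crawl tasks are then scheduled in the manner of a multilevel feedback queue (MLFQ).

This approach works well when network quality is unstable, because most requests do not need to wait that long.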
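As a rough sketch of the tiered scheme, the code below keeps one queue per timeout level and moves a URL to the next, slower queue whenever it fails. It is sequential for brevity and is not a full MLFQ scheduler; the function name `crawl_tiered`, the `fetch_fn` parameter and the timeout values are assumptions chosen for illustration.

```python
from collections import deque


def crawl_tiered(urls, fetch_fn, timeouts=(0.5, 2.0, 5.0)):
    """Try each URL with increasingly long timeouts.

    fetch_fn(url, timeout) must return the page or raise an OSError
    subclass (for example socket.timeout) on failure.
    Returns (results, gave_up).
    """
    queues = [deque() for _ in timeouts]
    queues[0].extend(urls)  # every URL starts in the shortest-timeout queue
    results, gave_up = {}, []
    for level, timeout in enumerate(timeouts):
        while queues[level]:
            url = queues[level].popleft()
            try:
                results[url] = fetch_fn(url, timeout)
            except OSError:
                if level + 1 < len(timeouts):
                    # Failed at this level: retry later with a longer timeout.
                    queues[level + 1].append(url)
                else:
                    gave_up.append(url)
    return results, gave_up
```

Every URL gets a quick first attempt, so responsive pages are collected early and only the slow ones pay for the longer waits.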
