Python multithreaded crawler: a problem filling the queue (Queue)

Published: 2023/12/15 python 30 豆豆


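The idea is to first build a list of URLs, all_url, and then do

for i in range(0, len(all_url)):
    urlqueue.put(all_url[i])

so that each thread can then call get and take one URL at a time.

The problem is that the upper bound of range cannot be written as the length of the list. Doing so raises

IndexError: list index out of range

The list itself is fine and is not empty. If the list has 2000 entries, the loop only runs without errors up to range(0, 1000), which is very inconvenient.

The code is below.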

import requests
from lxml import html
import time
import threading
from queue import Queue

class Spider(threading.Thread):
    def __init__(self, name, urlqueue):
        super().__init__()
        self.name = name
        self.urlqueue = urlqueue

    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.4094.1 Safari/537.36'
        }
        print('Thread: ' + self.name + ' started')
        while not self.urlqueue.empty():
            try:
                url = self.urlqueue.get()
                rep = requests.get(url, headers=headers, timeout=5)
                time.sleep(1)
                if rep.status_code == 200:
                    print("Connected")
                    self.parse(rep)
                    print(url + " crawled")
            except Exception as e:
                print("Page: " + url + " failed to connect, reason: ", e)
        print('Thread: ' + self.name + ' finished')

    def parse(self, rep):
        con = rep.content
        sel = html.fromstring(con)
        title = sel.xpath('//div[@class="titmain"]/h1/text()')
        title = str(title).replace(']', '').replace('[', '').replace("'", '').replace(",", '').replace(r"\r\n", "").replace('"', '').replace(' ', '').replace(r'\xa0', '').replace('?', '').replace('/', '').replace(r'\u3000', ' ')
        date = sel.xpath('//div[@class="texttit_m1"]/p/text()')
        date = str(date).replace(']', '').replace('[', '').replace("'", '').replace(r'\u3000', ' ')
        if len(date) > 20:
            file_name = title + ".txt"
            # the with block closes the file (the original a.close never called close)
            with open(file_name, "w+", encoding='utf-8') as a:
                a.write('\n' + str(title) + '\n' + '\n' + str(date))
            print(file_name + ' saved')

if __name__ == '__main__':
    with open('未爬取url.txt') as f:
        # read the line of data
        data = f.readline()
    # convert the data into a list
    james = data.strip().split(',')
    all_url = []
    for jame in james:
        # strip the quotes around each url (done without eval)
        a = jame.strip().strip('\'"')
        all_url.append(a)
    print(len(all_url))
    start = time.time()
    urlqueue = Queue()
    threadNum = 3  # number of threads
    for i in range(0, 1468):
        urlqueue.put(all_url[i])  # the problem is here
        del all_url[i]
    threads = []
    for i in range(1, threadNum + 1):
        thread = Spider("thread" + str(i), urlqueue)
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    with open('未爬取url.txt', 'w+') as b:
        b.write('\n'.join([str(all_url)]))
        b.write('\n' + '=' * 50 + '\n')
    print('Uncrawled URLs saved')
    end = time.time()
    print("-------------------------------")
    print("Download finished. Took {} seconds".format(end - start))
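A likely cause of the IndexError, judging from the fill loop in the code above: `del all_url[i]` removes one element on every pass, so the list gets shorter while `i` keeps growing. Halfway through, `i` is already past the end of what is left, which matches the observation that a 2000-entry list only survives `range(0, 1000)`. The pattern also skips every second URL. The following is a minimal sketch that reproduces this on a small list; `fill_with_delete`, `fill_safely`, `drain` and the `demo` values are hypothetical names used only for illustration.

```python
from queue import Queue


def fill_with_delete(urls, upper):
    """Reproduce the loop from the post: index forward while deleting."""
    q = Queue()
    for i in range(upper):
        q.put(urls[i])
        del urls[i]  # the list shrinks by one, but i still grows by one
    return q


def fill_safely(urls):
    """Put every URL on the queue without modifying the list."""
    q = Queue()
    for url in urls:
        q.put(url)
    return q


def drain(q):
    """Collect the queue contents into a list (single-threaded helper)."""
    items = []
    while not q.empty():
        items.append(q.get())
    return items


demo = ["u0", "u1", "u2", "u3"]  # stand-in for all_url

# The full length fails once i reaches half the original length.
try:
    fill_with_delete(list(demo), len(demo))
    full_length_ok = True
except IndexError:
    full_length_ok = False

# Half the length "works", but only every second URL is queued.
half_copy = list(demo)
half_items = drain(fill_with_delete(half_copy, len(demo) // 2))

# Iterating without deleting queues everything.
safe_items = drain(fill_safely(demo))

print(full_length_ok, half_items, half_copy, safe_items)
```

Under this reading, dropping the `del` and iterating over the list directly lets the loop cover the whole list; the queue already holds its own copy of each URL.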

Also, the URLs are read from a txt file and I don't know how to upload it here; the all_url list that gets built in the end is definitely fine.
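Since the list itself is fine, one way to keep track of the URLs that were not crawled, without deleting from `all_url` while filling the queue, is to let the workers record the URLs that failed. The sketch below is only an illustration under stated assumptions: `worker`, `crawl` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the `requests.get` plus `parse` step so the example runs without network access. It also uses `get_nowait()` with `queue.Empty` instead of checking `empty()` before a blocking `get()`, because another thread can take the last item between the check and the `get()`.

```python
import threading
from queue import Queue, Empty


def worker(url_queue, fetch, failed, lock):
    """Take URLs until the queue is empty; remember the ones that fail."""
    while True:
        try:
            url = url_queue.get_nowait()  # never blocks on an empty queue
        except Empty:
            return
        try:
            fetch(url)
        except Exception:
            with lock:
                failed.append(url)


def crawl(all_url, fetch, thread_num=3):
    """Queue every URL, run the workers, and return the URLs that failed."""
    url_queue = Queue()
    for url in all_url:  # no index, no del: the list is left untouched
        url_queue.put(url)

    failed = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(url_queue, fetch, failed, lock))
        for _ in range(thread_num)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failed


def fake_fetch(url):
    """Stand-in for the real download; fails for URLs containing 'bad'."""
    if "bad" in url:
        raise ValueError("simulated failure")


urls = ["a", "bad1", "b", "bad2", "c"]  # hypothetical input
leftover = sorted(crawl(urls, fake_fetch))
print(leftover)
```

The returned list could then be written back to the file of uncrawled URLs in place of the shrunken `all_url`.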
