

【spider】Multithreaded Crawler


How multithreading works

A pool of worker threads runs concurrently: each thread pulls a task (for example, a page number) from a shared queue, fetches the corresponding page, and hands the result off to be parsed.

[Figure: multithreading diagram]

Queue (the queue object)

queue is part of Python's standard library, so Queue can be imported directly with from queue import Queue. A queue is the most common way for threads to exchange data.


Thoughts on multithreading in Python

Locking shared resources is an essential step. Queue is thread-safe, so whenever a queue fits the use case, prefer it over managing locks by hand.
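To illustrate, here is a minimal sketch (standard library only; the names are illustrative, not from the article) in which several threads push results into a shared Queue with no explicit Lock, because put() and get() take the queue's internal lock:

import threading
from queue import Queue

results = Queue()  # thread-safe: no explicit Lock needed

def worker(n):
    # put() acquires the queue's internal lock, so concurrent
    # calls from several threads cannot corrupt the queue
    results.put(n * n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results.get() for _ in range(results.qsize())))  # [0, 1, 4, 9, 16]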


Create a Queue object

pageQueue = Queue(10)



Put values into the queue

for page in range(1, 11):
    pageQueue.put(page)



Take a value out of the queue

pageQueue.get()


The Queue class

Queue is thread-safe
    queue is part of Python's standard library and can be imported directly with from queue import Queue; a queue is the most common way for threads to exchange data
Create a "Queue" object
Common queue methods (exercised in the sketch below)
    put()
    get(block)
    empty()
    full()
    qsize()
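A minimal sketch exercising these methods (standard library only):

from queue import Queue, Empty

q = Queue(3)                # bounded queue holding at most 3 items

q.put('a')                  # put() blocks if the queue is full
q.put('b')
print(q.qsize())            # -> 2 (approximate size)
print(q.empty(), q.full())  # -> False False

print(q.get())              # -> 'a'; get() blocks if the queue is empty
print(q.get(False))         # -> 'b'; get(block=False) returns immediately
try:
    q.get(block=False)      # nothing left, so queue.Empty is raised
except Empty:
    print('queue drained')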

Queue join and thread join

import threading
from queue import Queue, Empty

dataQueue = Queue(100)
exitFlag = False

class MyThread(threading.Thread):
    def __init__(self, q):
        super().__init__()
        self.queue = q

    def run(self):
        super().run()
        global exitFlag
        while True:
            if exitFlag:
                print('++++++++++++++++++++++++++ exit')
                break
            try:
                # non-blocking get: raises queue.Empty when nothing is available
                print('------------------------', self.queue.get(False))
                self.queue.task_done()
            except Empty:
                pass

def main():
    for i in range(100):
        dataQueue.put(i)

    threads = []
    for i in range(5):
        thread = MyThread(dataQueue)
        threads.append(thread)
        thread.start()

    # queue join: would block until every item has been marked task_done()
    # dataQueue.join()

    global exitFlag
    exitFlag = True
    print('exit ------------------------------------------------')

    # thread join: block until each worker thread has finished
    for t in threads:
        t.join()

if __name__ == '__main__':
    main()
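Note the two different synchronization points: dataQueue.join() (commented out) would block until every item put into the queue has been matched by a task_done() call, whereas t.join() blocks until the worker thread itself has finished running. As written, exitFlag is raised immediately after the workers start, so each worker exits as soon as it notices the flag; uncommenting dataQueue.join() would instead wait for the queue to be fully drained before signalling the exit.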

Another example: crawling the dushu.com book site

import requests
from bs4 import BeautifulSoup
from queue import Queue
import threading
from threading import Lock

url = 'https://www.dushu.com/book/1175_%d.html'
task_queue = Queue(100)
parse_queue = Queue(100)
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': 'Hm_lvt_8008bbd51b8bc504162e1a61c3741a9d=1572418328; Hm_lpvt_8008bbd51b8bc504162e1a61c3741a9d=1572418390',
    'Host': 'www.dushu.com',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
}
# flag telling the parse threads to exit
exit_flag = False

# works like a small, hand-rolled thread pool
class CrawlThread(threading.Thread):
    def __init__(self, q_task: Queue, q_parse: Queue) -> None:
        super().__init__()
        self.q_task = q_task
        self.q_parse = q_parse

    def run(self) -> None:
        super().run()
        self.spider()

    # keep working until the task queue is empty
    def spider(self):
        while True:
            if self.q_task.empty():
                print('+++++++ crawl thread %s finished +++++++' % threading.current_thread().name)
                break
            taskId = self.q_task.get()
            response = requests.get(url % taskId, headers=headers)
            response.encoding = 'utf-8'
            html = response.text
            self.q_parse.put((html, taskId))
            self.q_task.task_done()
            print('------ crawl thread %s finished task %d ------' % (threading.current_thread().name, taskId))

# crawling only
def crawl():
    for i in range(1, 101):
        task_queue.put(i)
    for i in range(5):
        t = CrawlThread(task_queue, parse_queue)
        t.start()

class ParseThread(threading.Thread):
    def __init__(self, q_parse: Queue, lock: Lock, fp):
        super().__init__()
        self.q_parse = q_parse
        self.lock = lock
        self.fp = fp

    def run(self):
        super().run()
        self.parse()

    def parse(self):
        while True:
            if exit_flag:
                print('----------- parse thread %s done, exiting -----------' % threading.current_thread().name)
                break
            try:
                html, taskId = self.q_parse.get(block=False)
                soup = BeautifulSoup(html, 'lxml')
                books = soup.select('div[class="bookslist"] > ul > li')
                print('----------------', len(books))
                for book in books:
                    self.lock.acquire()  # serialize writes to the shared file
                    book_url = book.find('img').attrs['src']
                    book_title = book.select('h3 a')[0]['title']
                    book_author = book.select('p')[0].get_text()
                    book_describe = book.select('p')[1].get_text()
                    self.fp.write('%s\t%s\t%s\t%s\n' % (book_url, book_title, book_author, book_describe))
                    self.lock.release()
                self.q_parse.task_done()
                print('********** parse thread %s parsed page %d **********' % (threading.current_thread().name, taskId))
            except Exception:
                pass

# parsing and saving only
def parse(fp):
    lock = Lock()
    for i in range(5):
        t = ParseThread(parse_queue, lock, fp)
        t.start()

if __name__ == '__main__':
    crawl()
    fp = open('./book.txt', 'a', encoding='utf-8')
    parse(fp)
    # queue join: every queued task must be marked done before execution continues
    task_queue.join()
    parse_queue.join()
    exit_flag = True
    fp.close()
    print('reached the end of main !!!!!!!!!!!!!!')
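The structure is a two-stage producer-consumer pipeline: crawl threads consume page numbers from task_queue and produce (html, taskId) pairs into parse_queue, while parse threads consume those pairs and append one tab-separated line per book to a shared file. The Lock serializes the file writes, since interleaved writes from five parse threads would otherwise corrupt the output, and the two queue join() calls let the main thread wait until both stages have drained before raising exit_flag and closing the file.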

多線程實(shí)現(xiàn)
?? ?讀書http://www.qwsy.com/shuku.aspx?&page=1
?? ?導(dǎo)包
?? ?定義變量
?? ?創(chuàng)建爬蟲線程并啟動(dòng)
?? ??? ?爬蟲線程
?? ?創(chuàng)建解析線程并啟動(dòng)
?? ??? ?解析線程
?? ??? ??? ?Queue.get(block = True/False)
?? ?join()鎖定線程,確保線程全部執(zhí)行完畢
?? ?結(jié)束任務(wù)
