Implementing Concurrent Crawling in a Web Crawler
Single-threaded, multi-threaded, and coroutine-based crawlers
- Threads
  - Single-threaded implementation
  - Multi-threaded implementation workflow
- Coroutines
  - Coroutine crawler workflow
  - gevent
  - Coroutine implementation workflow
This article speeds up a crawler with multiple threads and with coroutines, compares the strengths and weaknesses of each approach, and shows how to choose between them based on the workload.
Target site: https://wz.sun0769.com/political/index/politicsNewest?id=1&page=1
Threads
A thread is the smallest unit of execution that an operating system can schedule. A thread lives inside a process and is the actual unit of work within it.
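Before the crawler versions below, here is a minimal sketch of the standard-library threading pattern they build on. The fetch function and page numbers are illustrative stand-ins, not part of the crawler:

import threading

def fetch(page):
    # a real worker would issue an HTTP request here
    print(f"fetching page {page}")

threads = []
for page in range(1, 4):
    t = threading.Thread(target=fetch, args=(page,))
    t.start()          # each thread runs fetch(page) concurrently
    threads.append(t)
for t in threads:
    t.join()           # wait for all workers to finish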
Single-threaded implementation
from lxml import etree
import requests
import json
import time

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
local_file = open('duanzi.json', 'a', encoding='utf-8')

def parse_html(html):
    """Extract the id and state of every item on one listing page."""
    text = etree.HTML(html)
    node_list = text.xpath('/html/body/div[2]/div[3]/ul[2]/li')
    for node in node_list:
        try:
            id = node.xpath('./span[1]/text()')[0]
            state = node.xpath('./span[2]/text()')[0].strip()
            items = {'id': id, 'state': state}
            local_file.write(json.dumps(items) + '\n')
        except:
            pass  # skip malformed list items

def main():
    # Fetch pages 1-19 one after another; each request blocks until it finishes.
    for i in range(1, 20):
        url = f'https://wz.sun0769.com/political/index/politicsNewest?id=1&page={i}'
        html = requests.get(url=url, headers=header).text
        parse_html(html)

if __name__ == '__main__':
    t1 = time.time()
    main()
    print(time.time() - t1)  # total elapsed time, for comparison with the versions below
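One easy refinement, not in the original listing: since every request goes to the same host, a single requests.Session can reuse one TCP connection instead of re-handshaking per page. A minimal sketch, using the header dict defined above:

import requests

session = requests.Session()       # keeps the underlying connection alive
session.headers.update(header)     # 'header' is the dict from the listing above
html = session.get('https://wz.sun0769.com/political/index/politicsNewest?id=1&page=1').text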
Multi-threaded implementation workflow
- Keep the page numbers to visit in a pageQueue. Start several crawl threads; each one takes a page number from pageQueue, builds the URL, fetches the page, then goes back to the queue for the next page number, until every page has been visited. The crawl threads are kept in the list threadCrawls.
- Use a dataQueue to hold the raw page source; every crawl thread puts the HTML it fetches onto this queue.
- Start several parse threads; each one takes a page source from dataQueue, parses out the target data, then takes the next, until everything has been parsed. The parse threads are kept in the list threadParses.
- Write the parsed JSON records to a local file.
import json
import threading
from queue import Queue
from lxml import etree
import time
import random
import requests

crawl = False  # flag the main thread flips to tell crawl threads to stop

class ThreadCrawl(threading.Thread):
    """Takes page numbers from pageQueue, fetches the pages, puts the HTML on dataQueue."""
    def __init__(self, threadName, pageQueue, dataQueue):
        threading.Thread.__init__(self)
        self.threadName = threadName
        self.pageQueue = pageQueue
        self.dataQueue = dataQueue
        self.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}

    def run(self):
        print("Starting " + self.threadName)
        while not crawl:
            try:
                page = self.pageQueue.get(False)  # non-blocking; raises Empty when drained
                url = f'https://wz.sun0769.com/political/index/politicsNewest?id=1&page={page}'
                time.sleep(random.uniform(1, 3))  # be polite to the server
                content = requests.get(url, headers=self.headers).text
                self.dataQueue.put(content)
            except:
                pass
        print("Exiting " + self.threadName)

PARSE_EXIT = False  # flag the main thread flips to tell parse threads to stop

class ThreadParse(threading.Thread):
    """Takes raw HTML from dataQueue, extracts records, writes them to the shared file."""
    def __init__(self, threadName, dataQueue, localFile, lock):
        super(ThreadParse, self).__init__()
        self.threadName = threadName
        self.dataQueue = dataQueue
        self.localFile = localFile
        self.lock = lock

    def run(self):
        print("Starting " + self.threadName)
        while not PARSE_EXIT:
            try:
                html = self.dataQueue.get(False)
                self.parse(html)
            except:
                pass
        print("Exiting " + self.threadName)

    def parse(self, html):
        text = etree.HTML(html)
        node_list = text.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for node in node_list:
            try:
                id = node.xpath('./span[1]/text()')[0]
                state = node.xpath('./span[2]/text()')[0].strip()
                items = {'id': id, 'state': state}
                with self.lock:  # only one thread may write to the shared file at a time
                    print(json.dumps(items))
                    self.localFile.write(json.dumps(items) + '\n')
            except:
                pass

def main():
    pageQueue = Queue(20)
    for i in range(1, 21):
        pageQueue.put(i)
    dataQueue = Queue()
    localFile = open('multithread.json', 'a')
    lock = threading.Lock()

    # start three crawl threads
    crawlList = ['CrawlThread-1', 'CrawlThread-2', 'CrawlThread-3']
    threadCrawls = []
    for threadName in crawlList:
        thread = ThreadCrawl(threadName, pageQueue, dataQueue)
        thread.start()
        threadCrawls.append(thread)

    # start three parse threads
    parseList = ['ParseThread-1', 'ParseThread-2', 'ParseThread-3']
    threadParses = []
    for threadName in parseList:
        thread = ThreadParse(threadName, dataQueue, localFile, lock)
        thread.start()
        threadParses.append(thread)

    # busy-wait until every page number has been taken, then stop the crawl threads
    while not pageQueue.empty():
        pass
    global crawl
    crawl = True
    print("pageQueue is empty")
    for thread in threadCrawls:
        thread.join()

    # busy-wait until every page source has been parsed, then stop the parse threads
    while not dataQueue.empty():
        pass
    print('dataQueue is empty')
    global PARSE_EXIT
    PARSE_EXIT = True
    for thread in threadParses:
        thread.join()

    with lock:
        localFile.close()

if __name__ == '__main__':
    t1 = time.time()
    main()
    print(time.time() - t1)
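The busy-wait loops above (while not pageQueue.empty(): pass) burn CPU while polling. A gentler alternative, sketched here as a suggestion rather than the article's code, is the queue's built-in accounting: each worker calls task_done() after handling an item, and the main thread blocks in queue.join() until every item has been processed.

import threading
from queue import Queue, Empty

def worker(q):
    while True:
        try:
            page = q.get(timeout=1)   # wait briefly instead of spinning
        except Empty:
            break                     # queue drained: exit the worker
        try:
            print(f"handling page {page}")   # crawl/parse would happen here
        finally:
            q.task_done()             # mark this item as processed

q = Queue()
for page in range(1, 21):
    q.put(page)
for _ in range(3):
    threading.Thread(target=worker, args=(q,)).start()
q.join()   # blocks until task_done() has been called for every item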
Coroutines
- A coroutine is a unit of execution smaller than a thread, also called a micro-thread (a user-space thread). One thread can host many coroutines, but only one of them runs at a time; when the running coroutine blocks, execution switches to the next task. This raises CPU utilization and suits IO-bound workloads, avoiding the cost of creating many threads and of switching between them.
Coroutine crawler workflow
Because switching between coroutines does not go through the thread scheduler, it costs far less than thread switching, so the number of coroutines need not be limited as strictly.
- Keep the URLs to crawl in a list; since each URL gets a coroutine of its own, a list of pending URLs is prepared up front.
- Create and start one coroutine per URL. The coroutines execute in turn, and if one blocks on the network or hits another exception, execution switches to the next coroutine immediately. Since the switch does not involve switching threads, the cost is small and the number of coroutines need not be strictly capped (within reason).
- Each coroutine fetches its page and parses the target data out of it.
- Store the extracted records in a list, then iterate over that list and write the records to a local file.
gevent
gevent is a third-party, coroutine-based networking library for Python.
Install:
pip install gevent
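A minimal usage sketch, separate from the crawler below (example.com stands in for a real target): gevent.spawn creates a greenlet, gevent.joinall waits for a batch of them, and monkey.patch_all() makes blocking standard-library I/O cooperative so the greenlets actually overlap.

from gevent import monkey
monkey.patch_all()   # must run before other imports that use sockets

import gevent
import requests

def fetch(url):
    # requests.get blocks, but the patched socket yields to other greenlets
    print(url, len(requests.get(url).text))

jobs = [gevent.spawn(fetch, f'https://example.com/?page={i}') for i in range(3)]
gevent.joinall(jobs)   # wait for all greenlets to finish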
Coroutine implementation workflow
- Define a Spider class that is responsible for all of the crawling work.
- Use a dataQueue to hold all of the extracted records.
- Create several coroutine tasks; each one builds a complete URL from a page number, fetches the page, extracts the useful data, and puts it on the data queue.
- Finally, write the contents of dataQueue to a local file.
from gevent import monkey
monkey.patch_all()  # NOTE: not in the original listing; without it, requests' blocking
                    # sockets never yield and the greenlets run one after another

import time
import gevent
from lxml import etree
import requests
from queue import Queue

class Spider(object):
    def __init__(self):
        self.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
        self.url = 'https://wz.sun0769.com/political/index/politicsNewest?id=1&page='
        self.dataQueue = Queue()
        self.count = 0  # number of records extracted

    def send_request(self, url):
        print("Crawling " + url)
        # the original passed the headers positionally, which requests.get would
        # have treated as query parameters; they must be passed as headers=
        html = requests.get(url, headers=self.headers).text
        time.sleep(1)
        self.parse_page(html)

    def parse_page(self, html):
        text = etree.HTML(html)
        node_list = text.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for node in node_list:
            try:
                id = node.xpath('./span[1]/text()')[0]
                state = node.xpath('./span[2]/text()')[0].strip()
                items = {'id': id, 'state': state}
                self.count += 1
                self.dataQueue.put(items)
            except:
                pass

    def start_work(self):
        # one greenlet per page; gevent switches between them whenever one blocks
        arr = []
        for page in range(1, 20):
            url = self.url + str(page)
            job = gevent.spawn(self.send_request, url)
            arr.append(job)
        gevent.joinall(arr)  # wait until every greenlet has finished
        # drain the queue into a local file
        local_file = open("coroutine.json", 'wb+')
        while not self.dataQueue.empty():
            content = self.dataQueue.get()
            result = str(content).encode('utf-8')  # note: writes the dict's repr, not JSON
            local_file.write(result + b"\n")
        local_file.close()
        print(self.count)

if __name__ == '__main__':
    t1 = time.time()
    spider = Spider()
    spider.start_work()
    print(time.time() - t1)
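A note on the monkey.patch_all() call added at the top: the listing as originally published omitted it, and without it gevent never gets a chance to switch greenlets while requests.get blocks on the socket, so the greenlets would in effect run one after another and the timing would look no better than the single-threaded version.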
Summary
The single-threaded, multi-threaded, and coroutine versions above all crawl the same pages; comparing their timing output shows how much concurrency helps on this IO-bound task, and which approach to choose depends on the workload. Hopefully this walkthrough helps you solve the problems you ran into.