Python crawler comparison: normal crawler vs. multithreaded crawler vs. framework (scrapy) crawler
Preface
The text and images in this article come from the internet and are for learning and exchange only; they have no commercial use. If there is any problem, please contact us promptly so we can handle it.
基本開發(fā)環(huán)境
Python 3.6
皮查姆
目標(biāo)網(wǎng)頁分析
網(wǎng)站就選擇發(fā)表情這個網(wǎng)站吧
網(wǎng)站是靜態(tài)網(wǎng)頁,所有的數(shù)據(jù)都保存在div標(biāo)簽中,爬取的難度不大。
根據(jù)標(biāo)簽提取其中的表情包url地址以及標(biāo)題就可以了。
Normal crawler implementation
import requests
import parsel
import re


def change_title(title):
    """Replace characters that are illegal in filenames with underscores."""
    pattern = re.compile(r'[\/\\:\*\?"<>\|]')  # / \ : * ? " < > |
    new_title = re.sub(pattern, "_", title)
    return new_title


for page in range(0, 201):
    url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    selector = parsel.Selector(response.text)
    divs = selector.css('.tagbqppdiv')
    for div in divs:
        img_url = div.css('a img::attr(data-original)').get()
        title_ = img_url.split('.')[-1]  # file extension, e.g. 'jpg'
        title = div.css('a img::attr(title)').get()
        new_title = change_title(title) + '.' + title_
        img_content = requests.get(url=img_url, headers=headers).content
        path = 'img\\' + new_title
        with open(path, mode='wb') as f:
            f.write(img_content)
        print(title)
A brief walkthrough of the code:
1. Title sanitizing: some image titles contain special characters that are not allowed in file names, so a regex replaces any such character with an underscore before the file is created.
divs = selector.css('.tagbqppdiv')
for div in divs:
    img_url = div.css('a img::attr(data-original)').get()
    title_ = img_url.split('.')[-1]
    title = div.css('a img::attr(title)').get()
    new_title = change_title(title) + '.' + title_
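As a quick check of the sanitizer, every illegal character becomes an underscore (the sample title below is made up for illustration):

```python
import re

def change_title(title):
    # Replace characters that are illegal in Windows filenames: / \ : * ? " < > |
    pattern = re.compile(r'[\/\\:\*\?"<>\|]')
    return re.sub(pattern, "_", title)

print(change_title('doge/cat:what?'))  # -> doge_cat_what_
```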
2. Paginated crawling and requesting pages like a browser
for page in range(0, 201):
    url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
Click through a few pages and watch how the URL changes; the pattern is easy to spot. The site answers plain GET requests, so requests.get is enough. Adding a user-agent request header disguises the program as a browser; without it, some sites can tell the request comes from a Python crawler. For this particular site it makes little difference either way.
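Since the page number is the only part of the URL that changes, the full crawl list can be generated up front. A minimal sketch of the pattern described above:

```python
# Build the list of list-page URLs; only the page number varies.
urls = [
    f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
    for page in range(0, 201)
]
print(len(urls))  # 201
print(urls[0])    # https://www.fabiaoqing.com/biaoqing/lists/page/0.html
```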
3,解析數(shù)據(jù)提取想要的數(shù)據(jù)
selector = parsel.Selector(response.text)
divs = selector.css('.tagbqppdiv')
for div in divs:
    img_url = div.css('a img::attr(data-original)').get()
    title = div.css('a img::attr(title)').get()
Here the parsel parsing library is used with CSS selectors: data is extracted according to tag names and attributes.
4,保存數(shù)據(jù)
img_content = requests.get(url=img_url, headers=headers).content
path = 'img\\' + new_title
with open(path, mode='wb') as f:
    f.write(img_content)
print(title)
Requesting the image URL and reading .content returns the body as binary data; images, video, and other files are all saved as binary (a text response would use .text instead).
path is where the file is saved; because the data is binary, the file is opened in 'wb' mode.
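A small round-trip shows why mode 'wb' matters for binary data. This sketch writes to a temporary directory rather than the article's img folder, and the four sample bytes just stand in for downloaded image content:

```python
import os
import tempfile

data = bytes([0x89, 0x50, 0x4E, 0x47])  # sample binary data (the first bytes of a PNG header)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.png')
    with open(path, mode='wb') as f:   # 'wb': write binary, no text encoding applied
        f.write(data)
    with open(path, mode='rb') as f:   # read back in binary mode
        assert f.read() == data
print('binary round-trip ok')
```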
Multithreaded crawler implementation
import requests
import parsel
import re
import concurrent.futures


def get_response(html_url):
    """Request a URL with browser-like headers and return the response."""
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def change_title(title):
    """Replace characters that are illegal in filenames with underscores."""
    pattern = re.compile(r'[\/\\:\*\?"<>\|]')  # / \ : * ? " < > |
    new_title = re.sub(pattern, "_", title)
    return new_title


def save(img_url, title):
    """Download one image and save it locally."""
    img_content = get_response(img_url).content
    path = 'img\\' + title
    with open(path, mode='wb') as f:
        f.write(img_content)
    print(title)


def main(html_url):
    """Fetch one list page, extract every image URL and title, and save each image."""
    response = get_response(html_url)
    selector = parsel.Selector(response.text)
    divs = selector.css('.tagbqppdiv')
    for div in divs:
        img_url = div.css('a img::attr(data-original)').get()
        title_ = img_url.split('.')[-1]  # file extension
        title = div.css('a img::attr(title)').get()
        new_title = change_title(title) + '.' + title_
        save(img_url, new_title)


if __name__ == '__main__':
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
    for page in range(0, 201):
        url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
        executor.submit(main, url)
    executor.shutdown()
A brief note on the code:
The groundwork was laid above: the multithreaded version simply wraps each step in its own function so every block of code has one job, then drives those functions through a thread pool.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
max_workers sets the maximum number of threads.
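The submit/shutdown flow above can be tried on a harmless dummy task; here crawl_page is just a stand-in for main(html_url), so the sketch runs without any network access:

```python
import concurrent.futures

def crawl_page(page):
    # Stand-in for main(html_url): pretend to process one page.
    return page * 2

executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
futures = [executor.submit(crawl_page, page) for page in range(10)]
executor.shutdown()  # blocks until all submitted tasks finish

results = sorted(f.result() for f in futures)
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```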
Scrapy framework crawler implementation
關(guān)于scrapy框架項目的創(chuàng)建這里只是不過多講了,之前文章有詳細講解過,scrapy框架項目的創(chuàng)建,可以點擊下方鏈接查看
簡單使用scrapy爬蟲框架批量采集網(wǎng)站數(shù)據(jù)
items.py
The item defines the two fields the spider yields for every image:
import scrapy


class BiaoqingbaoItem(scrapy.Item):
    img_url = scrapy.Field()
    title = scrapy.Field()
middlewares.py
BOT_NAME = 'biaoqingbao'
SPIDER_MODULES = ['biaoqingbao.spiders']
NEWSPIDER_MODULE = 'biaoqingbao.spiders'
DOWNLOADER_MIDDLEWARES = {
'biaoqingbao.middlewares.BiaoqingbaoDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
'biaoqingbao.pipelines.DownloadPicturePipeline': 300,
}
IMAGES_STORE = './images'
pipelines.py
The settings register biaoqingbao.pipelines.DownloadPicturePipeline together with IMAGES_STORE, which points to scrapy's built-in ImagesPipeline. The listing below is a minimal sketch of such a pipeline, not necessarily identical to the original:
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class DownloadPicturePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Ask scrapy to download each image URL the spider yielded
        yield scrapy.Request(url=item['img_url'], meta={'title': item['title']})
settings.py
BOT_NAME = 'biaoqingbao'
SPIDER_MODULES = ['biaoqingbao.spiders']
NEWSPIDER_MODULE = 'biaoqingbao.spiders'

DOWNLOADER_MIDDLEWARES = {
    'biaoqingbao.middlewares.BiaoqingbaoDownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
    'biaoqingbao.pipelines.DownloadPicturePipeline': 300,
}

IMAGES_STORE = './images'
標(biāo)清
import scrapy
from ..items import BiaoqingbaoItem
class BiaoqingSpider(scrapy.Spider):
name = 'biaoqing'
allowed_domains = ['fabiaoqing.com']
start_urls = [f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html' for page in range(1, 201)]
def parse(self, response):
divs = response.css('#bqb div.ui.segment.imghover div')
for div in divs:
img_url = div.css('a img::attr(data-original)').get()
title = div.css('a img::attr(title)').get()
yield BiaoqingbaoItem(img_url=img_url, title=title)
A quick comparison:
The biggest difference among the three programs is crawl speed. Measured by time spent writing code, however, the normal crawler is the simplest: a static site like this needs no debugging, the script can be written top to bottom in one pass, and it comes to just 29 lines including blank lines.