Plain crawler vs multithreaded crawler vs framework crawler: a Python comparison
Preface
The text and images in this article come from the internet and are for learning and exchange only, not for any commercial use. If there is any problem, please contact us promptly so we can deal with it.
Basic development environment
Python 3.6
PyCharm
Target page analysis
The target will be the fabiaoqing meme site.
It is a static site: all the data sits in div tags, so it is not hard to scrape.
We only need to extract each meme's image URL and its title from the tags.
Plain crawler implementation
import requests
import parsel
import re


def change_title(title):
    """Replace characters that are illegal in file names with underscores."""
    pattern = re.compile(r'[\/\\\:\*\?\"\<\>\|]')  # / \ : * ? " < > |
    new_title = re.sub(pattern, "_", title)
    return new_title


headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
for page in range(1, 201):
    url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
    response = requests.get(url=url, headers=headers)
    selector = parsel.Selector(response.text)
    divs = selector.css('.tagbqppdiv')
    for div in divs:
        img_url = div.css('a img::attr(data-original)').get()
        title_ = img_url.split('.')[-1]  # file extension, e.g. jpg / gif
        title = div.css('a img::attr(title)').get()
        new_title = change_title(title) + '.' + title_
        img_content = requests.get(url=img_url, headers=headers).content
        path = 'img\\' + new_title
        with open(path, mode='wb') as f:
            f.write(img_content)
            print(title)
A brief walkthrough of the code:
1. Title sanitization. Some image titles contain characters that are not allowed in file names, so a regular expression replaces any of them with an underscore before the file is created:

def change_title(title):
    pattern = re.compile(r'[\/\\\:\*\?\"\<\>\|]')  # / \ : * ? " < > |
    new_title = re.sub(pattern, "_", title)
    return new_title
2. Paging and impersonating a browser. Click through a few pages and watch how the URL changes to spot the pattern. The site answers plain GET requests, so requests.get() with a User-Agent header is enough:

for page in range(1, 201):
    url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
    response = requests.get(url=url, headers=headers)

The headers disguise the script as a browser; without them, some sites can tell that a Python crawler is making the request. For this particular site it barely matters either way.
3. Parsing out the wanted data. The parsel library with CSS selectors pulls the image URL and the title straight out of the tag attributes:

divs = selector.css('.tagbqppdiv')
for div in divs:
    img_url = div.css('a img::attr(data-original)').get()
    title_ = img_url.split('.')[-1]  # file extension
    title = div.css('a img::attr(title)').get()
    new_title = change_title(title) + '.' + title_
4. Saving the data. Requesting the meme's URL returns the content as binary data; images, video, and other files are all saved as bytes (text content would come from response.text instead). path is where the file is written, and because the data is binary the file is opened in 'wb' mode:

img_content = requests.get(url=img_url, headers=headers).content
path = 'img\\' + new_title
with open(path, mode='wb') as f:
    f.write(img_content)
    print(title)
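The binary-versus-text distinction is easy to demonstrate with plain file I/O, no network needed: bytes go out through mode 'wb' and come back through 'rb'. (The byte string below is hypothetical stand-in data, not a real image.)

```python
# A stand-in for downloaded image bytes.
data = b'\x89PNG fake image bytes'

with open('demo.bin', mode='wb') as f:   # 'wb' = write binary
    f.write(data)

with open('demo.bin', mode='rb') as f:   # 'rb' = read binary
    restored = f.read()

print(restored == data)  # → True
```

Opening the same file in text mode 'r' would try to decode the bytes to str, which is why images must use the binary modes.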
Multithreaded crawler implementation
import requests
import parsel
import re
import concurrent.futures


def get_response(html_url):
    """Request a URL while pretending to be a browser; return the response."""
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def change_title(title):
    """Replace characters that are illegal in file names with underscores."""
    pattern = re.compile(r'[\/\\\:\*\?\"\<\>\|]')  # / \ : * ? " < > |
    new_title = re.sub(pattern, "_", title)
    return new_title


def save(img_url, title):
    """Download one meme image and save it locally."""
    img_content = get_response(img_url).content
    path = 'img\\' + title
    with open(path, mode='wb') as f:
        f.write(img_content)
        print(title)


def main(html_url):
    """Parse one listing page and save every meme on it."""
    response = get_response(html_url)
    selector = parsel.Selector(response.text)
    divs = selector.css('.tagbqppdiv')
    for div in divs:
        img_url = div.css('a img::attr(data-original)').get()
        title_ = img_url.split('.')[-1]  # file extension
        title = div.css('a img::attr(title)').get()
        new_title = change_title(title) + '.' + title_
        save(img_url, new_title)


if __name__ == '__main__':
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
    for page in range(1, 201):
        url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
        executor.submit(main, url)
    executor.shutdown()
A quick note on the code: the groundwork was already laid above. The multithreaded crawler simply wraps each step in its own function, so every block of code has one job, and then feeds the page URLs to a thread pool:

executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)

max_workers sets the maximum number of worker threads.
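To see what the pool buys, here is a toy benchmark with a hypothetical fake_download() that sleeps instead of hitting the network: ten 0.2-second tasks complete in roughly two batches with five workers, rather than ten serial steps.

```python
import time
import concurrent.futures


def fake_download(page):
    """Stand-in for one network request (hypothetical, no real I/O)."""
    time.sleep(0.2)
    return page


start = time.time()
executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
results = list(executor.map(fake_download, range(10)))
executor.shutdown()
elapsed = time.time() - start

print(results)            # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(round(elapsed, 1))  # roughly 0.4: two batches of five tasks
```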
Scrapy framework crawler implementation
Creating a scrapy project was covered in detail in an earlier article, so it is not repeated here; see that post:
"Simple use of the scrapy crawler framework to batch-collect website data"
items.py
items.py declares the fields each scraped item carries: the spider below yields a BiaoqingbaoItem holding an image URL and a title.
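The original item definition is not reproduced in this post; a minimal items.py consistent with the spider's yield BiaoqingbaoItem(img_url=..., title=...) could be:

```python
import scrapy


class BiaoqingbaoItem(scrapy.Item):
    # The two fields the spider fills in for every meme.
    img_url = scrapy.Field()
    title = scrapy.Field()
```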
middlewares.py
middlewares.py appears to be left as Scrapy scaffolds it; the BiaoqingbaoDownloaderMiddleware it defines is simply enabled in settings.py.
pipelines.py
pipelines.py holds DownloadPicturePipeline, the pipeline registered in settings.py, which downloads the image at each item's img_url into the IMAGES_STORE directory.
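The pipeline code is not reproduced either; one sketch, built on Scrapy's stock ImagesPipeline and assuming the class name DownloadPicturePipeline from settings.py, might look like:

```python
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class DownloadPicturePipeline(ImagesPipeline):
    """Download each item's image and store it under IMAGES_STORE."""

    def get_media_requests(self, item, info):
        # Hand the image URL to Scrapy's media-download machinery.
        yield scrapy.Request(item['img_url'], meta={'title': item['title']})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Name the file after its title, keeping the original extension.
        ext = request.url.split('.')[-1]
        return f"{request.meta['title']}.{ext}"
```

Note that ImagesPipeline requires Pillow; FilesPipeline would avoid the image-format checks.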
settings.py

BOT_NAME = 'biaoqingbao'
SPIDER_MODULES = ['biaoqingbao.spiders']
NEWSPIDER_MODULE = 'biaoqingbao.spiders'

DOWNLOADER_MIDDLEWARES = {
    'biaoqingbao.middlewares.BiaoqingbaoDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'biaoqingbao.pipelines.DownloadPicturePipeline': 300,
}
IMAGES_STORE = './images'
biaoqing.py (the spider)

import scrapy
from ..items import BiaoqingbaoItem


class BiaoqingSpider(scrapy.Spider):
    name = 'biaoqing'
    allowed_domains = ['fabiaoqing.com']
    start_urls = [f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
                  for page in range(1, 201)]

    def parse(self, response):
        divs = response.css('#bqb div.ui.segment.imghover div')
        for div in divs:
            img_url = div.css('a img::attr(data-original)').get()
            title = div.css('a img::attr(title)').get()
            yield BiaoqingbaoItem(img_url=img_url, title=title)
A quick comparison:
The biggest difference between the three programs is crawling speed. Measured by time spent writing code, though, the plain crawler is the simplest: a static site like this needs no debugging, the script can be written top to bottom in one pass, and it comes to just 29 lines including blank lines.