

Ordinary crawler vs. multithreaded crawler vs. framework crawler: a Python comparison

Published: 2025/3/15

Preface

The text and images in this article come from the internet and are for learning and exchange only, with no commercial use; if there is any problem, please contact us promptly so we can handle it.

Basic development environment

Python 3.6

PyCharm

Target page analysis

We will use the meme site fabiaoqing.com.

It is a static site: all the data sits in div tags, so it is not hard to crawl.

All we need is to extract each meme's image URL and title from those tags.

Ordinary crawler implementation

```python
import os
import re

import parsel
import requests


def change_title(title):
    # Replace characters that are illegal in file names: / \ : * ? " < > |
    pattern = re.compile(r'[\\/:*?"<>|]')
    return re.sub(pattern, "_", title)


headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
os.makedirs('img', exist_ok=True)  # make sure the output directory exists

for page in range(1, 201):
    url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
    response = requests.get(url=url, headers=headers)
    selector = parsel.Selector(response.text)
    divs = selector.css('.tagbqppdiv')
    for div in divs:
        img_url = div.css('a img::attr(data-original)').get()
        ext = '.' + img_url.split('.')[-1]            # keep the original file extension
        title = div.css('a img::attr(title)').get()
        new_title = change_title(title) + ext
        img_content = requests.get(url=img_url, headers=headers).content
        path = os.path.join('img', new_title)
        with open(path, mode='wb') as f:              # binary write
            f.write(img_content)
        print(title)
```

A brief walkthrough of the code:

1,標(biāo)題的替換,因為有一些圖片的標(biāo)題,其中會包含特殊字符,在創(chuàng)建文件的時候特殊字符是不能命名的,所以需要使用正則把有可能出現(xiàn)的特殊字符替換掉。

```python
divs = selector.css('.tagbqppdiv')
for div in divs:
    img_url = div.css('a img::attr(data-original)').get()
    ext = '.' + img_url.split('.')[-1]   # keep the original file extension
    title = div.css('a img::attr(title)').get()
    new_title = change_title(title) + ext
```
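As a quick sanity check, here is the sanitizer on its own, applied to a title containing illegal characters (a standalone sketch of the same regex, with the full set of forbidden characters in the character class):

```python
import re

def change_title(title):
    # Characters Windows forbids in file names: / \ : * ? " < > |
    pattern = re.compile(r'[\\/:*?"<>|]')
    return re.sub(pattern, "_", title)

print(change_title('funny/meme: "why?"'))  # funny_meme_ _why__
print(change_title('a\\b:c'))              # a_b_c
```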

2. Pagination and masquerading as a browser.

```python
for page in range(1, 201):
    url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
```

Click through a few pages and watch how the URL changes to find the pattern. The site serves plain GET requests, so requests.get is enough. Adding a user-agent header makes the request look like it comes from a browser; without it the site could recognize a Python crawler, although for this particular site it makes little difference.

3. Parsing the response and extracting the data.

```python
selector = parsel.Selector(response.text)
divs = selector.css('.tagbqppdiv')
for div in divs:
    img_url = div.css('a img::attr(data-original)').get()
    title = div.css('a img::attr(title)').get()
```

Here we use the parsel library with CSS selectors: each piece of data is pulled out by matching tag attributes.

4. Saving the data.

```python
img_content = requests.get(url=img_url, headers=headers).content
path = 'img\\' + new_title
with open(path, mode='wb') as f:
    f.write(img_content)
print(title)
```

Requesting the image URL returns the body as binary data; images, video, and files in general are all saved as binary (use .text only for textual content).

path is where the file is written, and because the payload is binary the file is opened in 'wb' mode.
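A minimal, offline illustration of the 'wb' round trip (the byte string below is a stand-in for response.content):

```python
import os
import tempfile

data = b'\x89PNG\r\n\x1a\n'          # stand-in for response.content (PNG magic bytes)
path = os.path.join(tempfile.mkdtemp(), 'demo.png')

with open(path, mode='wb') as f:     # 'wb' because the payload is bytes, not text
    f.write(data)

with open(path, mode='rb') as f:
    print(f.read() == data)          # True: the bytes round-trip unchanged
```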

Multithreaded crawler implementation

```python
import os
import re
import concurrent.futures

import parsel
import requests


def get_response(html_url):
    """Request a URL while masquerading as a browser; return the response."""
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def change_title(title):
    """Replace characters that are illegal in file names: / \\ : * ? " < > |"""
    pattern = re.compile(r'[\\/:*?"<>|]')
    return re.sub(pattern, "_", title)


def save(img_url, title):
    """Download one meme image and save it locally."""
    img_content = get_response(img_url).content
    path = os.path.join('img', title)
    with open(path, mode='wb') as f:
        f.write(img_content)
    print(title)


def main(html_url):
    """Crawl one listing page: parse it and save every meme on it."""
    response = get_response(html_url)
    selector = parsel.Selector(response.text)
    divs = selector.css('.tagbqppdiv')
    for div in divs:
        img_url = div.css('a img::attr(data-original)').get()
        ext = '.' + img_url.split('.')[-1]   # keep the original file extension
        title = div.css('a img::attr(title)').get()
        save(img_url, change_title(title) + ext)


if __name__ == '__main__':
    os.makedirs('img', exist_ok=True)
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
    for page in range(1, 201):
        url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
        executor.submit(main, url)
    executor.shutdown()
```

A brief explanation: the groundwork was already laid above. The multithreaded crawler simply wraps each step in its own function, each with a single responsibility, and drives those functions from a thread pool. In

executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)

max_workers=5 caps the number of worker threads.
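The same submit/shutdown pattern with a trivial task in place of main, just to show the pool mechanics (work here is a made-up stand-in function):

```python
import concurrent.futures

def work(n):
    return n * n                 # stand-in for main(url)

executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
futures = [executor.submit(work, n) for n in range(10)]
executor.shutdown()              # blocks until all submitted tasks finish

results = sorted(f.result() for f in futures)
print(results)                   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```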

scrapy framework crawler implementation

Creating a scrapy project is not covered again here; a previous article walks through it in detail:

Simple use of the scrapy crawler framework to batch-collect website data

items.py

The item only needs the two fields the spider yields, the image URL and its title:

```python
import scrapy


class BiaoqingbaoItem(scrapy.Item):
    img_url = scrapy.Field()
    title = scrapy.Field()
```


pipelines.py

Assuming DownloadPicturePipeline builds on scrapy's built-in ImagesPipeline (which is what the IMAGES_STORE setting implies), a minimal sketch:

```python
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class DownloadPicturePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request each image, passing the title along for naming.
        yield scrapy.Request(item['img_url'], meta={'title': item['title']})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Save under IMAGES_STORE, named after the meme title.
        ext = request.url.split('.')[-1]
        return f"{request.meta['title']}.{ext}"
```

settings.py

```python
BOT_NAME = 'biaoqingbao'

SPIDER_MODULES = ['biaoqingbao.spiders']
NEWSPIDER_MODULE = 'biaoqingbao.spiders'

DOWNLOADER_MIDDLEWARES = {
    'biaoqingbao.middlewares.BiaoqingbaoDownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
    'biaoqingbao.pipelines.DownloadPicturePipeline': 300,
}

IMAGES_STORE = './images'
```

spiders/biaoqing.py

```python
import scrapy

from ..items import BiaoqingbaoItem


class BiaoqingSpider(scrapy.Spider):
    name = 'biaoqing'
    allowed_domains = ['fabiaoqing.com']
    start_urls = [f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
                  for page in range(1, 201)]

    def parse(self, response):
        divs = response.css('#bqb div.ui.segment.imghover div')
        for div in divs:
            img_url = div.css('a img::attr(data-original)').get()
            title = div.css('a img::attr(title)').get()
            yield BiaoqingbaoItem(img_url=img_url, title=title)
```
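The start_urls comprehension simply expands to the 200 listing-page URLs up front:

```python
start_urls = [f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
              for page in range(1, 201)]

print(len(start_urls))   # 200
print(start_urls[0])     # https://www.fabiaoqing.com/biaoqing/lists/page/1.html
```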

A quick summary: the biggest difference between the three programs is crawl speed. Measured by time spent writing code, though, the ordinary crawler is the simplest: a static site like this needs essentially no debugging, the script can be written straight through, and it comes to about 29 lines including blank lines.
