Scrapy Performance

Published: 2024/7/23

Reference: https://blog.csdn.net/s150503/article/details/72571680

CONCURRENT_REQUESTS and DOWNLOAD_DELAY

To see how CONCURRENT_REQUESTS and DOWNLOAD_DELAY relate to each other in Scrapy, start by creating a project to experiment with.

The example crawls the Douban Movie Top 250.

douban_spider.py

# -*- coding: utf-8 -*-
import re
import time

import scrapy
from lxml import etree

# Garbled response when logging in to Douban with Scrapy:
# https://www.jianshu.com/p/9974fc338242


class ExampleSpider(scrapy.Spider):
    name = 'douban'
    # Was 'example.com', which does not match the URLs being crawled.
    allowed_domains = ['movie.douban.com']
    # The real Top 250 pages:
    # start_urls = ['https://movie.douban.com/top250?start={}&filter='.format(i) for i in range(0, 250, 25)]
    # 10000 URLs, so there are enough requests to observe the concurrency.
    start_urls = ['https://movie.douban.com/top250?start={}&filter='.format(i) for i in range(10000)]

    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
                      '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Connection': 'keep-alive',
            'Host': 'movie.douban.com',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                          ' (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36',
        },
        # The two settings under test.
        'CONCURRENT_REQUESTS': 10,
        'DOWNLOAD_DELAY': 0.01,
        'CONCURRENT_REQUESTS_PER_IP': 0,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 10000,
        'FEED_EXPORT_ENCODING': 'utf-8',
    }

    def parse(self, response):
        current_url = response.url
        print(current_url)
        # Sleep, then return early: the concurrency tests only need the printed URL.
        time.sleep(3)
        return

        # Not reached while the early return above is in place.
        offset = re.findall(r'start=(\d+)', current_url)[0]
        page_num = int(offset) // 25
        html = etree.HTML(text=response.text)
        # Locate the li tags first; data is a list of 25 li elements, one per movie.
        data = html.xpath('//ol[@class="grid_view"]/li')
        for index, d in enumerate(data):
            data_title = d.xpath('div/div[2]/div[@class="hd"]/a/span[1]/text()')
            data_info = d.xpath('div/div[2]/div[@class="bd"]/p[1]/text()')
            data_quote = d.xpath('div/div[2]/div[@class="bd"]/p[2]/span/text()')
            data_score = d.xpath('div/div[2]/div[@class="bd"]/div/span[@class="rating_num"]/text()')
            data_num = d.xpath('div/div[2]/div[@class="bd"]/div/span[4]/text()')
            data_pic_url = d.xpath('div/div[1]/a/img/@src')
            print(f"No: {page_num * 25 + index + 1} {data_title}")


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute('scrapy crawl douban'.split())

Test 1:

'CONCURRENT_REQUESTS': 10,
'DOWNLOAD_DELAY': 0.01,

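With CONCURRENT_REQUESTS set to 10, up to 10 requests can in theory be in flight at once. With DOWNLOAD_DELAY set to 0.01, the delay on its own would allow about 1 / 0.01 = 100 requests per second. The smaller of the two values is 10, so 10 requests run concurrently.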
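The estimate can be written as a small helper. This is only a sketch of the rule of thumb described here, not Scrapy's actual scheduling logic; the function name `estimated_parallel_requests` is hypothetical, and treating 1 / DOWNLOAD_DELAY as a cap assumes that each request takes roughly one second to complete.

```python
def estimated_parallel_requests(concurrent_requests: int, download_delay: float) -> float:
    """Rule of thumb from the text: min(CONCURRENT_REQUESTS, 1 / DOWNLOAD_DELAY)."""
    if download_delay <= 0:
        # No delay configured: only the concurrency limit applies.
        return float(concurrent_requests)
    return min(float(concurrent_requests), 1.0 / download_delay)


if __name__ == "__main__":
    # Settings from this test: CONCURRENT_REQUESTS = 10, DOWNLOAD_DELAY = 0.01
    print(estimated_parallel_requests(10, 0.01))  # prints 10.0
```

Scrapy also randomizes the delay by default (RANDOMIZE_DOWNLOAD_DELAY), so measured numbers will only approximate this estimate.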

About 10 requests are sent within almost the same second.
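One way to check an observation like this is to record a timestamp for each request (for example with `time.time()` in `parse`) and count how many fall into each second. The sketch below assumes such a list of timestamps already exists; the helper name `requests_per_second` and the sample values are made up for illustration.

```python
from collections import Counter


def requests_per_second(timestamps):
    """Count timestamps (non-negative seconds, as floats) per whole-second bucket."""
    return Counter(int(ts) for ts in timestamps)


if __name__ == "__main__":
    # Hypothetical timestamps, in seconds since the crawl started.
    sample = [0.02, 0.03, 0.05, 0.40, 1.10, 1.15]
    for second, count in sorted(requests_per_second(sample).items()):
        print(f"second {second}: {count} requests")
```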

Test 2:

'CONCURRENT_REQUESTS': 10,
'DOWNLOAD_DELAY': 0.5,

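With CONCURRENT_REQUESTS set to 10, up to 10 requests can in theory be in flight at once. With DOWNLOAD_DELAY set to 0.5, the delay on its own allows only about 1 / 0.5 = 2 requests per second. The smaller of the two values is 2, so only 2 requests run concurrently.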

Summary:

DOWNLOAD_DELAY limits the concurrency that CONCURRENT_REQUESTS can actually reach: with a large enough delay, the configured concurrency never shows up.
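The effect can be reproduced with a toy model. The sketch below is a simplified simulation, not Scrapy's downloader: it assumes every request takes a fixed `latency` to complete, that at most `concurrency` requests are in flight, and that consecutive requests start at least `delay` seconds apart. The function name and parameters are hypothetical.

```python
import heapq


def simulate_dispatch(n_requests, concurrency, delay, latency):
    """Return the start time of each request under the simplified model."""
    in_flight = []        # min-heap of finish times of running requests
    starts = []
    next_allowed = 0.0    # earliest start permitted by the delay
    for _ in range(n_requests):
        start = next_allowed
        if len(in_flight) >= concurrency:
            # All slots are busy: wait for the earliest request to finish.
            start = max(start, heapq.heappop(in_flight))
        starts.append(start)
        heapq.heappush(in_flight, start + latency)
        next_allowed = start + delay
    return starts


if __name__ == "__main__":
    for delay in (0.01, 0.5):
        starts = simulate_dispatch(30, concurrency=10, delay=delay, latency=1.0)
        first_second = sum(1 for s in starts if s < 1.0)
        print(f"delay={delay}: {first_second} requests started in the first second")
```

With a one-second latency, the model starts 10 requests in the first second for a delay of 0.01 and 2 for a delay of 0.5, which matches the two tests. A different latency gives different numbers.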

Thoughts:

1. With CONCURRENT_REQUESTS set and no DOWNLOAD_DELAY, the server receives a large number of requests at the same time.

'CONCURRENT_REQUESTS': 10,
# 'DOWNLOAD_DELAY': 0.5,

With DOWNLOAD_DELAY commented out, the default value of 0 is used.

2. With both CONCURRENT_REQUESTS and DOWNLOAD_DELAY in effect, the server does not receive a large number of requests at the same time.

# 'CONCURRENT_REQUESTS': 0,
'DOWNLOAD_DELAY': 0.5,

With CONCURRENT_REQUESTS commented out, the default value of 16 is used.
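The fallback to defaults can be pictured as a dictionary merge. This is an illustration only: Scrapy's real settings object uses priorities rather than a plain merge, and the defaults below (16 and 0) are the ones stated in the text, so check them against your installed version. The names `DEFAULTS` and `effective_settings` are hypothetical.

```python
# Defaults as stated in the text; verify them for your Scrapy version.
DEFAULTS = {"CONCURRENT_REQUESTS": 16, "DOWNLOAD_DELAY": 0}


def effective_settings(custom_settings):
    """Overlay the spider's custom settings on top of the defaults."""
    merged = dict(DEFAULTS)
    merged.update(custom_settings)
    return merged


if __name__ == "__main__":
    # Case 1: DOWNLOAD_DELAY commented out, so its default applies.
    print(effective_settings({"CONCURRENT_REQUESTS": 10}))
    # Case 2: CONCURRENT_REQUESTS commented out, so its default applies.
    print(effective_settings({"DOWNLOAD_DELAY": 0.5}))
```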
