19 - Crawlers: Large File Downloads with the Scrapy Framework (06)
Large File Downloads
Create a crawler project: scrapy startproject proName
Enter the project directory and create a spider source file: scrapy genspider spiderName www.xxx.com
Run the project: scrapy crawl spiderName
For large files, the file data is requested inside the pipeline rather than the spider.
The download pipeline class is already provided by Scrapy; import it and use it directly:

from scrapy.pipelines.images import ImagesPipeline  # this pipeline provides data-download support (images, video, and audio can all use this class)
Override three methods of the pipeline class (a minimal skeleton follows this list):
def get_media_requests
- issue a request for each image URL
def file_path
- return just the image file name
def item_completed
- return the item, handing it to the next pipeline class to be executed
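A minimal sketch of the three overrides, assuming item fields src and name as in this project (the full pipelines.py for this project is shown further below; the class name here is illustrative only):

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class BigFilePipeline(ImagesPipeline):  # hypothetical name; this project uses ImgsproPipeline
    def get_media_requests(self, item, info):
        # request the file data for the URL carried by the item
        yield scrapy.Request(url=item['src'], meta={'item': item})

    def file_path(self, request, response=None, info=None):
        # return only a file name; Scrapy saves it under IMAGES_STORE
        return request.meta['item']['name']

    def item_completed(self, results, item, info):
        # hand the item to the next pipeline class in ITEM_PIPELINES
        return item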
Add IMAGES_STORE to the configuration file:
- IMAGES_STORE = 'dirname'
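Concretely, the two settings entries this project relies on look like the following (both appear in the full settings.py below; the directory name imgLibs is this project's choice):

# settings.py: register the custom pipeline and define the storage directory
ITEM_PIPELINES = {
    'imgPro.pipelines.ImgsproPipeline': 300,
}
IMAGES_STORE = 'imgLibs'  # storage folder; created automatically if it does not exist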
img.py (the spider source file)
import scrapy
from imgPro.items import ImgproItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']

    def parse(self, response):
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
        for li in li_list:
            img_src = 'http://www.521609.com' + li.xpath('./a[1]/img/@src').extract_first()  # image URL
            img_name = li.xpath('./a[1]/img/@alt').extract_first() + '.jpg'  # image name
            item = ImgproItem()
            item['name'] = img_name
            item['src'] = img_src
            yield item

items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ImgproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    src = scrapy.Field()

settings.py
# Scrapy settings for imgPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'imgPro'

SPIDER_MODULES = ['imgPro.spiders']
NEWSPIDER_MODULE = 'imgPro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"

IMAGES_STORE = 'imgLibs'  # storage directory (created automatically if it does not exist)

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'imgPro.middlewares.ImgproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'imgPro.middlewares.ImgproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'imgPro.pipelines.ImgsproPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
# from itemadapter import ItemAdapter

# The default pipeline cannot request data for us, so we do not use it
# class ImgproPipeline:
#     def process_item(self, item, spider):
#         return item

# The pipeline receives the image URL and name from the item, then requests
# the image data inside the pipeline and persists it
from scrapy.pipelines.images import ImagesPipeline  # provides data-download support (images, video, and audio)
import scrapy


class ImgsproPipeline(ImagesPipeline):
    # issue a request for each image URL
    def get_media_requests(self, item, info):
        print(item)
        yield scrapy.Request(url=item['src'], meta={'item': item})

    # return just the image file name
    def file_path(self, request, response=None, info=None):
        # recover the item via request.meta
        item = request.meta['item']
        filePath = item['name']
        return filePath  # return the image name

    # pass the item on to the next pipeline class to be executed
    def item_completed(self, results, item, info):
        return item

Results

The crawled images are saved under the imgLibs directory defined by IMAGES_STORE (result screenshot not reproduced here).
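Two hedged notes on the pipeline code above. First, in newer Scrapy releases (2.4 and later), file_path also receives the source item directly, so the request.meta round-trip is not needed; a sketch assuming such a version:

from scrapy.pipelines.images import ImagesPipeline


class ImgsproPipeline(ImagesPipeline):
    # assumes Scrapy >= 2.4, where file_path receives the item directly;
    # get_media_requests can then yield scrapy.Request(url=item['src']) without meta
    def file_path(self, request, response=None, info=None, *, item=None):
        return item['name']

Second, for non-image large files (video, audio, archives), Scrapy also provides the more general scrapy.pipelines.files.FilesPipeline, which is subclassed the same way but reads its storage directory from FILES_STORE instead of IMAGES_STORE.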
An overview of Scrapy's settings.py configuration file
# -*- coding: utf-8 -*-

# Scrapy settings for demo1 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# Name of the Scrapy project; used to construct the default User-Agent and for
# logging; assigned automatically when you create a project with startproject.
BOT_NAME = ''

SPIDER_MODULES = ['']  # list of modules where Scrapy looks for spiders; default: ['xxx.spiders']
NEWSPIDER_MODULE = ''  # module where the genspider command creates new spiders; default: 'xxx.spiders'

# default User-Agent used for crawling, unless overridden
#USER_AGENT = ''

# if enabled, Scrapy obeys robots.txt policies
ROBOTSTXT_OBEY = True

# maximum number of concurrent requests performed by the Scrapy downloader (default: 16)
#CONCURRENT_REQUESTS = 32

# configure a delay for requests to the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Time in seconds the downloader waits before fetching the next page from the
# same site; use it to limit crawl speed and reduce server load; decimals such
# as 0.25 are supported.
#DOWNLOAD_DELAY = 3

# only one of the two download-delay-related limits takes effect:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16  # maximum concurrent requests to a single website
# Maximum concurrent requests to a single IP. If non-zero, the
# CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored and this one is used, i.e.
# the concurrency limit applies per IP rather than per website. It also affects
# DOWNLOAD_DELAY: if CONCURRENT_REQUESTS_PER_IP is non-zero, the download delay
# is applied per IP rather than per website.
#CONCURRENT_REQUESTS_PER_IP = 16

# disable cookies (enabled by default)
#COOKIES_ENABLED = False

# disable the Telnet console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'demo1.middlewares.Demo1SpiderMiddleware': 543,
#}

# enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'demo1.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'demo1.pipelines.Demo1Pipeline': 300,
#}

# enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True

# initial download delay
#AUTOTHROTTLE_START_DELAY = 5

# maximum download delay to be set in case of high latency
#AUTOTHROTTLE_MAX_DELAY = 60

# average number of requests Scrapy should send in parallel to each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
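Beyond settings.py, most of these options can also be overridden for a single spider via the custom_settings class attribute; a minimal sketch (spider name and values are illustrative only):

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'  # hypothetical spider
    start_urls = ['http://www.xxx.com/']
    # these values override settings.py for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 0.5,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
    }

    def parse(self, response):
        pass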
Summary

To download large files with Scrapy, request the file data inside a pipeline rather than the spider: subclass ImagesPipeline, override get_media_requests, file_path, and item_completed, register the pipeline in ITEM_PIPELINES, and point IMAGES_STORE at a storage directory in settings.py.