19 - Crawlers: Large File Downloads with the Scrapy Framework (06)
Large File Downloads
Create a crawler project: scrapy startproject proName
Enter the project directory and create a spider source file: scrapy genspider spiderName www.xxx.com
Run the project: scrapy crawl spiderName
For large files, the file data is requested inside the pipeline rather than the spider.
The download pipeline class is already provided by Scrapy; import it and use it directly:

from scrapy.pipelines.images import ImagesPipeline  # this pipeline provides data-download support (images, video, and audio can all use this class)
Override three methods of the pipeline class (a minimal skeleton follows this list):
def get_media_requests
- issue a request for each image URL
def file_path
- return just the image file name
def item_completed
- return the item, handing it to the next pipeline class to be executed
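A minimal sketch of the three overrides, assuming item fields src and name as in this project (the full pipelines.py for this project is shown further below; the class name here is illustrative only):

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class BigFilePipeline(ImagesPipeline):  # hypothetical name; this project uses ImgsproPipeline
    def get_media_requests(self, item, info):
        # request the file data for the URL carried by the item
        yield scrapy.Request(url=item['src'], meta={'item': item})

    def file_path(self, request, response=None, info=None):
        # return only a file name; Scrapy saves it under IMAGES_STORE
        return request.meta['item']['name']

    def item_completed(self, results, item, info):
        # hand the item to the next pipeline class in ITEM_PIPELINES
        return item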
Add IMAGES_STORE to the configuration file:
- IMAGES_STORE = 'dirname'
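Concretely, the two settings entries this project relies on look like the following (both appear in the full settings.py below; the directory name imgLibs is this project's choice):

# settings.py: register the custom pipeline and define the storage directory
ITEM_PIPELINES = {
    'imgPro.pipelines.ImgsproPipeline': 300,
}
IMAGES_STORE = 'imgLibs'  # storage folder; created automatically if it does not exist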
img.py (the spider source file)
import scrapy
from imgPro.items import ImgproItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']

    def parse(self, response):
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
        for li in li_list:
            img_src = 'http://www.521609.com' + li.xpath('./a[1]/img/@src').extract_first()  # image URL
            img_name = li.xpath('./a[1]/img/@alt').extract_first() + '.jpg'  # image name
            item = ImgproItem()
            item['name'] = img_name
            item['src'] = img_src
            yield item

items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ImgproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    src = scrapy.Field()

settings.py
# Scrapy settings for imgPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'imgPro'

SPIDER_MODULES = ['imgPro.spiders']
NEWSPIDER_MODULE = 'imgPro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"

IMAGES_STORE = 'imgLibs'  # storage directory (created automatically if it does not exist)

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'imgPro.middlewares.ImgproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'imgPro.middlewares.ImgproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'imgPro.pipelines.ImgsproPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
# from itemadapter import ItemAdapter

# The default pipeline cannot request data for us, so we do not use it
# class ImgproPipeline:
#     def process_item(self, item, spider):
#         return item

# The pipeline receives the image URL and name from the item, then requests
# the image data inside the pipeline and persists it
from scrapy.pipelines.images import ImagesPipeline  # provides data-download support (images, video, and audio)
import scrapy


class ImgsproPipeline(ImagesPipeline):
    # issue a request for each image URL
    def get_media_requests(self, item, info):
        print(item)
        yield scrapy.Request(url=item['src'], meta={'item': item})

    # return just the image file name
    def file_path(self, request, response=None, info=None):
        # recover the item via request.meta
        item = request.meta['item']
        filePath = item['name']
        return filePath  # return the image name

    # pass the item on to the next pipeline class to be executed
    def item_completed(self, results, item, info):
        return item

Results

The crawled images are saved under the imgLibs directory defined by IMAGES_STORE (result screenshot not reproduced here).
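Two hedged notes on the pipeline code above. First, in newer Scrapy releases (2.4 and later), file_path also receives the source item directly, so the request.meta round-trip is not needed; a sketch assuming such a version:

from scrapy.pipelines.images import ImagesPipeline


class ImgsproPipeline(ImagesPipeline):
    # assumes Scrapy >= 2.4, where file_path receives the item directly;
    # get_media_requests can then yield scrapy.Request(url=item['src']) without meta
    def file_path(self, request, response=None, info=None, *, item=None):
        return item['name']

Second, for non-image large files (video, audio, archives), Scrapy also provides the more general scrapy.pipelines.files.FilesPipeline, which is subclassed the same way but reads its storage directory from FILES_STORE instead of IMAGES_STORE.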
An overview of Scrapy's settings.py configuration file
# -*- coding: utf-8 -*-

# Scrapy settings for demo1 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# Name of the Scrapy project; used to construct the default User-Agent and for
# logging; assigned automatically when you create a project with startproject.
BOT_NAME = ''

SPIDER_MODULES = ['']  # list of modules where Scrapy looks for spiders; default: ['xxx.spiders']
NEWSPIDER_MODULE = ''  # module where the genspider command creates new spiders; default: 'xxx.spiders'

# default User-Agent used for crawling, unless overridden
#USER_AGENT = ''

# if enabled, Scrapy obeys robots.txt policies
ROBOTSTXT_OBEY = True

# maximum number of concurrent requests performed by the Scrapy downloader (default: 16)
#CONCURRENT_REQUESTS = 32

# configure a delay for requests to the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Time in seconds the downloader waits before fetching the next page from the
# same site; use it to limit crawl speed and reduce server load; decimals such
# as 0.25 are supported.
#DOWNLOAD_DELAY = 3

# only one of the two download-delay-related limits takes effect:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16  # maximum concurrent requests to a single website
# Maximum concurrent requests to a single IP. If non-zero, the
# CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored and this one is used, i.e.
# the concurrency limit applies per IP rather than per website. It also affects
# DOWNLOAD_DELAY: if CONCURRENT_REQUESTS_PER_IP is non-zero, the download delay
# is applied per IP rather than per website.
#CONCURRENT_REQUESTS_PER_IP = 16

# disable cookies (enabled by default)
#COOKIES_ENABLED = False

# disable the Telnet console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'demo1.middlewares.Demo1SpiderMiddleware': 543,
#}

# enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'demo1.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'demo1.pipelines.Demo1Pipeline': 300,
#}

# enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True

# initial download delay
#AUTOTHROTTLE_START_DELAY = 5

# maximum download delay to be set in case of high latency
#AUTOTHROTTLE_MAX_DELAY = 60

# average number of requests Scrapy should send in parallel to each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
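Beyond settings.py, most of these options can also be overridden for a single spider via the custom_settings class attribute; a minimal sketch (spider name and values are illustrative only):

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'  # hypothetical spider
    start_urls = ['http://www.xxx.com/']
    # these values override settings.py for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 0.5,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
    }

    def parse(self, response):
        pass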
Summary

To download large files with Scrapy, request the file data inside a pipeline rather than the spider: subclass ImagesPipeline, override get_media_requests, file_path, and item_completed, register the pipeline in ITEM_PIPELINES, and point IMAGES_STORE at a storage directory in settings.py.