Detailed Configuration of Each Scrapy Component
About Scrapy
Scrapy is a crawler framework implemented in pure Python; simplicity, ease of use, and high extensibility are its main strengths. This article does not cover Scrapy basics. Instead it focuses on that extensibility and walks through how to configure each major component. It is not exhaustive, but it should cover most needs :). For more detail, read the official documentation. As a refresher, keep Scrapy's data-flow diagram in mind. The concrete examples below use a Douban spider. Project creation commands:
scrapy startproject <project_name>
scrapy genspider <spider_name> <domains>

To scaffold a CrawlSpider for whole-site crawling instead, use:

scrapy genspider -t crawl <spider_name> <domains>
spider.py
Let's start with the core component, spider.py. The code and its comments speak for themselves:
import scrapy
import json
# Import the item for persistence; adjust the relative import to your layout
from ..items import DoubanItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    # Request headers for this single spider
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
        }
    }

    # Usually you do not need to override this; do so to customize the start
    # URLs or to set headers on individual requests
    def start_requests(self):
        page = 18
        base_url = 'https://xxxx'  # the real URL needs a '{}' placeholder for .format() below
        for i in range(page):
            url = base_url.format(i * 20)
            req = scrapy.Request(url=url, callback=self.parse)
            # Headers for one specific request; later requests work the same way
            # req.headers['User-Agent'] = ''
            yield req

    # Nothing special: ordinary page parsing, with results handed on
    # downstream (the data-flow diagram makes this clear)
    def parse(self, response):
        json_str = response.body.decode('utf-8')
        res_dict = json.loads(json_str)
        for i in res_dict['subjects']:
            url = i['url']
            yield scrapy.Request(url=url, callback=self.parse_detailed_page)

    # Scrapy responses support XPath directly; the basics need no elaboration
    def parse_detailed_page(self, response):
        title = response.xpath('//h1/span[1]/text()').extract_first()
        year = response.xpath('//h1/span[2]/text()').extract()[0]
        image = response.xpath('//img[@rel="v:image"]/@src').extract_first()

        item = DoubanItem()
        item['title'] = title
        item['year'] = year
        item['image'] = image
        # Downloading images requires ImagesPipeline, with matching entries
        # in both settings and pipelines
        item['image_urls'] = [image]
        yield item
For whole-site crawling, the beginning of the spider differs slightly:

rules = (Rule(LinkExtractor(allow=r'http://digimons.net/digimon/.*/index.html'), callback='parse_item', follow=False),)

The key is the follow setting; whether you have reached the intended depth and pages is for you to judge. One more note: request headers can be set in three places, which determines their scope:

- in settings, the widest scope, affecting every spider in the project;
- as a spider class attribute (custom_settings), affecting all requests of that spider;
- on an individual request, affecting only that request.

The three scopes run from global, to a single spider, to a single request. When several are present, the per-request headers take the highest priority!
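That precedence can be illustrated with a plain-Python sketch (no Scrapy involved; the dict contents and names here are purely illustrative): narrower layers are merged last, so a per-request header beats custom_settings, which beats the project settings.

```python
# Illustrative only: emulate Scrapy's header layering with dict merging.
# Narrower scopes are merged last, so their keys win.
project_settings = {'User-Agent': 'project-wide UA', 'Accept-Language': 'en'}
spider_settings = {'User-Agent': 'spider-wide UA'}       # custom_settings
request_headers = {'User-Agent': 'this request only'}    # req.headers

def resolve_headers(*layers):
    """Merge header layers from widest to narrowest scope."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

headers = resolve_headers(project_settings, spider_settings, request_headers)
print(headers['User-Agent'])       # the per-request value wins
print(headers['Accept-Language'])  # keys unset in narrower layers fall back
```

Keys a narrower layer never sets, such as Accept-Language here, still fall through to the project-wide defaults.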
items.py
import scrapy


class DoubanItem(scrapy.Item):
    title = scrapy.Field()
    year = scrapy.Field()
    image = scrapy.Field()
    # ImagesPipeline also needs its field declared in items
    image_urls = scrapy.Field()

    # I use MySQL for persistence; not expanded on here
    def get_insert_sql_and_data(self):
        # CREATE TABLE douban(
        #     id int not null auto_increment primary key,
        #     title text, `year` int, image text) ENGINE=INNODB DEFAULT CHARSET=utf8mb4;
        insert_sql = 'INSERT INTO douban(title,`year`,image) ' \
                     'VALUES(%s,%s,%s)'  # reserved words need backticks
        data = (self['title'], self['year'], self['image'])
        return (insert_sql, data)
middlewares.py
Middleware is where things get interesting. Many people never need it, but it matters a great deal when configuring proxies. Ordinary needs do not call for a custom SpiderMiddleware; the changes below target the DownloaderMiddleware.
# Signals: this term matters a lot in custom Scrapy extensions
from scrapy import signals
# Locally defined class (code below); works with your own IP pool,
# or with a paid proxy service (as I use)
from proxyhelper import Proxyhelper
# Multiple threads touching one object need a lock: instantiate it,
# then pair every acquire with a release
from twisted.internet.defer import DeferredLock


class DoubanSpiderMiddleware(object):  # spider middleware left unconfigured
    pass


class DoubanDownloaderMiddleware(object):
    def __init__(self):
        # Instantiate the proxy helper and the lock
        self.helper = Proxyhelper()
        self.lock = DeferredLock()

    @classmethod
    def from_crawler(cls, crawler):  # unchanged
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Fires when a request reaches the downloader middleware.
        # Note: Scrapy's HttpProxyMiddleware reads the lowercase 'proxy' key
        self.lock.acquire()
        request.meta['proxy'] = self.helper.get_proxy()
        self.lock.release()
        return None

    def process_response(self, request, response, spider):
        # Inspect the response; on failure, swap the proxy and retry
        if response.status != 200:
            self.lock.acquire()
            self.helper.update_proxy(request.meta['proxy'])
            self.lock.release()
            return request
        return response

    def process_exception(self, request, exception, spider):
        self.lock.acquire()
        self.helper.update_proxy(request.meta['proxy'])
        self.lock.release()
        return request

    def spider_opened(self, spider):  # unchanged
        spider.logger.info('Spider opened: %s' % spider.name)

Here is the code for the proxyhelper:

import requests
class Proxyhelper(object):
    def __init__(self):
        self.proxy = self._get_proxy_from_xxx()

    def get_proxy(self):
        return self.proxy

    def update_proxy(self, proxy):
        # Only rotate if the failing proxy is still the current one
        if proxy == self.proxy:
            print('Updating a proxy')
            self.proxy = self._get_proxy_from_xxx()

    def _get_proxy_from_xxx(self):
        url = ''  # fill in your proxy API here, ideally one IP per response
        response = requests.get(url)
        return 'http://' + response.text.strip()
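The guard in update_proxy (rotate only when the reported proxy is still the current one, so several concurrent failures do not burn through several proxies) can be exercised in isolation. This is a minimal sketch with the network fetch stubbed out by an iterator; the pool values are made up for illustration:

```python
class Proxyhelper(object):
    """Same rotation logic as above, with the HTTP fetch replaced by a stub."""
    def __init__(self, pool):
        self._pool = iter(pool)          # stand-in for the proxy API
        self.proxy = self._get_proxy_from_xxx()

    def get_proxy(self):
        return self.proxy

    def update_proxy(self, proxy):
        # Rotate only if the failing proxy is still the current one
        if proxy == self.proxy:
            self.proxy = self._get_proxy_from_xxx()

    def _get_proxy_from_xxx(self):
        return 'http://' + next(self._pool)

helper = Proxyhelper(['1.1.1.1:80', '2.2.2.2:80'])
stale = helper.get_proxy()     # 'http://1.1.1.1:80'
helper.update_proxy(stale)     # rotates: the reported proxy was current
helper.update_proxy(stale)     # no-op: the report is stale now
print(helper.get_proxy())      # → http://2.2.2.2:80
```

The second call is a no-op precisely because another failure report already triggered the rotation.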
pipelines.py
# Local MySQL persistence class; write your own as needed
from mysqlhelper import Mysqlhelper
# Subclass ImagesPipeline to customize its behaviour
from scrapy.pipelines.images import ImagesPipeline
import hashlib
from scrapy.utils.python import to_bytes
from scrapy.http import Request


class DoubanImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        request_lst = []
        for x in item.get(self.images_urls_field, []):
            req = Request(x)
            req.meta['movie_name'] = item['title']  # carry the name along
            request_lst.append(req)
        return request_lst

    # Overridden to rename the downloaded image
    def file_path(self, request, response=None, info=None):
        # Default naming would use the URL's SHA1 digest:
        # image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return 'full/%s.jpg' % (request.meta['movie_name'])
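For reference, the digest-style name that the stock file_path would have produced can be reproduced with the standard library alone (str.encode stands in for Scrapy's to_bytes here, and the URL is a placeholder):

```python
import hashlib

def default_image_name(url):
    """SHA1-of-URL naming, in the style of ImagesPipeline's stock file_path."""
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % image_guid

name = default_image_name('https://example.com/poster.jpg')
print(name)  # 'full/<40 hex chars>.jpg'
```

The override above trades this opaque but collision-resistant digest for a human-readable movie title.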
# Nothing special: the SQL already lives in items.py, keeping pipelines
# and items functionally separate
class DoubanPipeline(object):
    def __init__(self):
        self.mysqlhelper = Mysqlhelper()

    def process_item(self, item, spider):
        if 'get_insert_sql_and_data' in dir(item):
            (insert_sql, data) = item.get_insert_sql_and_data()
            self.mysqlhelper.execute_sql(insert_sql, data)
        return item
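Mysqlhelper is left to the reader above. A minimal sketch could look like the following; to keep the example runnable anywhere it uses the stdlib sqlite3 (with '?' placeholders), whereas a real MySQL version would use pymysql and the '%s' placeholders shown earlier. Only the class name and execute_sql signature match what the pipeline calls; everything else is an assumption:

```python
import sqlite3

class Mysqlhelper(object):
    """Minimal persistence helper; swap sqlite3 for pymysql in production."""
    def __init__(self, db_path=':memory:'):
        self.conn = sqlite3.connect(db_path)
        # Mirror the douban table from the items.py comment
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS douban '
            '(id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, year INTEGER, image TEXT)')

    def execute_sql(self, insert_sql, data):
        self.conn.execute(insert_sql, data)
        self.conn.commit()

helper = Mysqlhelper()
helper.execute_sql('INSERT INTO douban(title, year, image) VALUES(?,?,?)',
                   ('Movie', 2020, 'http://example.com/p.jpg'))
rows = helper.conn.execute('SELECT title, year FROM douban').fetchall()
print(rows)  # → [('Movie', 2020)]
```

A production version would also handle connection loss and close the connection when the spider finishes.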
settings.py
An extremely important component; the notes are in the code comments.
# Bot name
BOT_NAME = 'Douban'

SPIDER_MODULES = ['Douban.spiders']
NEWSPIDER_MODULE = 'Douban.spiders'

# Client request header
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Download delay
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# per-domain / per-IP concurrency, which override the global setting above
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default); useful for monitoring the crawler
#TELNETCONSOLE_ENABLED = False
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]
# To connect: cmd -> telnet 127.0.0.1 6023 -> est()

# Override the default request headers:
# Project-wide default headers, effective for every spider in the project
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
# }

# Spider middlewares
# SPIDER_MIDDLEWARES = {
#    # 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
#    'Douban.middlewares.DoubanSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'Douban.middlewares.DoubanDownloaderMiddleware': 560,
   # 560 because the built-in middlewares fall into many subgroups; their
   # order numbers decide when the request and response flows touch each
   # one -- see the official docs for details
}

# Per-request download timeout in seconds
# (the actual setting name is DOWNLOAD_TIMEOUT; a bare TIMEOUT is ignored)
DOWNLOAD_TIMEOUT = 10

# Crawl depth limit
# DEPTH_LIMIT = 1

# Custom extensions
EXTENSIONS = {
   'Douban.extends.MyExtension': 500,
}

# Item pipelines
ITEM_PIPELINES = {
   # 'scrapy.pipelines.images.ImagesPipeline': 1,  # register the stock image downloader
   'Douban.pipelines.DoubanImagesPipeline': 300,
}

# Algorithmic, adaptive rate limiting
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default); rarely needed
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# ImagesPipeline storage location; enable as needed
IMAGES_STORE = 'download'
extends.py
Custom extensions. Configuring this component calls for an understanding of signals: which ones fire at which moments of a Scrapy run, which in turn requires a solid grasp of the data flow. In the code below I use a class of my own whose whole job is to push a notification (via the Miao Tixing reminder service) at particular moments. You could equally build on logging or other facilities, wiring handlers to whichever signal moments suit your needs.
The file has to be created by hand, inside the project package so that the EXTENSIONS path above resolves (here, Douban/extends.py):
from scrapy import signals
from message import Message


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        ext = cls(val)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        print('spider running')

    def spider_closed(self, spider):
        message = Message('spider finished')
        message.push()
        print('spider closed')
running.py
One last file, running.py: it simply runs the crawl command from within Python instead of the shell.

from scrapy.cmdline import execute

execute('scrapy crawl douban'.split())
That covers configuring each Scrapy component for everyday needs. Use it as a reference while you get familiar with the framework; hands-on Scrapy case studies will follow in later posts.