Setting Request Headers: Detailed Configuration of Every Scrapy Component

Today we'll walk through the detailed configuration of each Scrapy component.

About Scrapy

Scrapy is a crawler framework implemented in pure Python; its main selling points are simplicity, ease of use, and high extensibility. This post won't dwell on the basics. Instead, it focuses on that extensibility, walking through how to configure each major component. It isn't exhaustive either, but it should cover most people's needs :). For anything beyond that, read the official documentation carefully. Before diving in, it's worth reviewing the Scrapy data-flow diagram from the official docs. The concrete examples below use a Douban spider.

Creation commands

scrapy startproject <Project_name>
scrapy genspider <spider_name> <domains>

To create the convenient whole-site crawling skeleton, CrawlSpider, use instead:

scrapy genspider -t crawl <spider_name> <domains>
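For the Douban example used throughout this post, the concrete commands would look like this (project and spider names taken from the code that follows):

scrapy startproject Douban
cd Douban
scrapy genspider douban douban.com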

spider.py

Let's start with the core component, spider.py. The code and its comments speak for themselves:

import scrapy
# Standard-library import; nothing special if you know basic Python
import json
# Import the item class for persistence; adjust the relative path to your layout
from ..items import DoubanItem

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    # Request headers for this single spider
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
        }
    }

    # Often you don't need to override this method; do so when you want
    # custom start URLs or per-request headers
    def start_requests(self):
        page = 18
        base_url = 'https://xxxx'  # the real URL contains a '{}' placeholder for the offset
        for i in range(page):
            url = base_url.format(i * 20)
            req = scrapy.Request(url=url, callback=self.parse)
            # To set a header on an individual request (the same works for
            # any later request):
            # req.headers['User-Agent'] = ''
            yield req

    # Nothing special to explain: a routine page parse that hands URLs on
    # to the next callback (see the data-flow diagram)
    def parse(self, response):
        json_str = response.body.decode('utf-8')
        res_dict = json.loads(json_str)
        for i in res_dict['subjects']:
            url = i['url']
            yield scrapy.Request(url=url, callback=self.parse_detailed_page)

    # Scrapy responses support XPath directly; the basics are assumed
    def parse_detailed_page(self, response):
        title = response.xpath('//h1/span[1]/text()').extract_first()
        year = response.xpath('//h1/span[2]/text()').extract()[0]
        image = response.xpath('//img[@rel="v:image"]/@src').extract_first()

        item = DoubanItem()
        item['title'] = title
        item['year'] = year
        item['image'] = image
        # Downloading images requires ImagesPipeline, with matching entries
        # in both settings and pipelines (see below)
        item['image_urls'] = [image]
        yield item

For whole-site crawling, the opening of the spider class differs slightly:

rules = (Rule(LinkExtractor(allow=r'http://digimons.net/digimon/.*/index.html'), callback='parse_item', follow=False),)
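For context, a minimal CrawlSpider skeleton around that rules tuple might look like this (class, name, and callback names are illustrative, not from the original project):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DigimonSpider(CrawlSpider):
    name = 'digimon'
    allowed_domains = ['digimons.net']
    start_urls = ['http://digimons.net/digimon/']

    # follow=False: matched pages are parsed by the callback, but links
    # found on them are not followed any further
    rules = (
        Rule(LinkExtractor(allow=r'http://digimons.net/digimon/.*/index.html'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # parse the detail page here
        pass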

The key knob is the follow setting; whether the crawl reaches the intended depth and pages is up to you to verify. One more point: request headers can be set in three places, and where you set them determines their scope (a small demo follows this list):

  • Set in settings.py: the widest scope, affecting every spider in the project

  • Set as a spider class attribute (custom_settings): affects every request made by that spider

  • Set on an individual request: affects only that request

The three scopes thus run from the whole project, to a single spider, to a single request. When they overlap, the per-request headers take the highest priority.
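A compact sketch of the narrowest scope, overriding one header on a single request (spider name and URL are placeholders):

import scrapy

class HeaderDemoSpider(scrapy.Spider):
    name = 'header_demo'

    def start_requests(self):
        req = scrapy.Request(url='https://example.com', callback=self.parse)
        # Per-request headers win over both the spider's custom_settings
        # and the project-wide DEFAULT_REQUEST_HEADERS
        req.headers['User-Agent'] = 'my-custom-agent/1.0'
        yield req

    def parse(self, response):
        self.logger.info(response.request.headers.get('User-Agent'))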

items.py

import scrapy

class DoubanItem(scrapy.Item):
    title = scrapy.Field()
    year = scrapy.Field()
    image = scrapy.Field()
    # The image-downloading ImagesPipeline needs its own item field
    image_urls = scrapy.Field()

    # I persist to MySQL; the details are out of scope here
    def get_insert_sql_and_data(self):
        # CREATE TABLE douban(
        #     id int not null auto_increment primary key,
        #     title text, `year` int, image text
        # ) ENGINE=INNODB DEFAULT CHARSET=UTF8mb4;
        # `year` is a MySQL reserved word, so it must be backtick-quoted
        insert_sql = ('INSERT INTO douban(title, `year`, image) '
                      'VALUES(%s, %s, %s)')
        data = (self['title'], self['year'], self['image'])
        return (insert_sql, data)

middlewares.py

Middleware is where things get flexible. Many readers may never need it, but it matters a great deal when configuring proxies. Ordinary projects rarely touch the SpiderMiddleware; the changes below all target the DownloaderMiddleware.

# Signals: an important concept for custom Scrapy extensions (see extends.py)
from scrapy import signals
# A locally defined class (code below); back it with your own IP pool,
# or hook it up to a paid proxy service (as I do)
from proxyhelper import Proxyhelper
# Multiple threads operate on the same helper object, so guard it with a
# lock: instantiate once, then acquire/release around each access
from twisted.internet.defer import DeferredLock

class DoubanSpiderMiddleware(object):  # spider middleware left unconfigured
    pass

class DoubanDownloaderMiddleware(object):
    def __init__(self):
        # Instantiate the proxy helper and its lock
        self.helper = Proxyhelper()
        self.lock = DeferredLock()

    @classmethod
    def from_crawler(cls, crawler):  # unchanged from the template
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Triggered when a request reaches the downloader middleware.
        # Note: Scrapy's HttpProxyMiddleware reads the lowercase 'proxy'
        # meta key, so it must be 'proxy', not 'Proxy'
        self.lock.acquire()
        request.meta['proxy'] = self.helper.get_proxy()
        self.lock.release()
        return None

    def process_response(self, request, response, spider):
        # Inspect the response; on failure, swap the proxy and re-queue
        # the request
        if response.status != 200:
            self.lock.acquire()
            self.helper.update_proxy(request.meta['proxy'])
            self.lock.release()
            return request
        return response

    def process_exception(self, request, exception, spider):
        self.lock.acquire()
        self.helper.update_proxy(request.meta['proxy'])
        self.lock.release()
        return request

    def spider_opened(self, spider):  # unchanged from the template
        spider.logger.info('Spider opened: %s' % spider.name)

Here is the code for the Proxyhelper class used above:

import requests

class Proxyhelper(object):
    def __init__(self):
        self.proxy = self._get_proxy_from_xxx()

    def get_proxy(self):
        return self.proxy

    def update_proxy(self, proxy):
        # Only refresh if the caller still holds the current proxy;
        # this avoids double updates from concurrent callers
        if proxy == self.proxy:
            print('Updating a proxy')
            self.proxy = self._get_proxy_from_xxx()

    def _get_proxy_from_xxx(self):
        url = ''  # put your proxy-API URL here; ideally it returns one IP per call
        response = requests.get(url)
        return 'http://' + response.text.strip()

pipelines.py

# Import the local MySQL persistence class; write your own as needed
from mysqlhelper import Mysqlhelper
# Import ImagesPipeline so we can subclass it and customize its behavior
from scrapy.pipelines.images import ImagesPipeline
import hashlib
from scrapy.utils.python import to_bytes
from scrapy.http import Request

class DoubanImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        request_lst = []
        for x in item.get(self.images_urls_field, []):
            req = Request(x)
            req.meta['movie_name'] = item['title']  # carry the title along
            request_lst.append(req)
        return request_lst

    # Overridden to rename the saved files
    def file_path(self, request, response=None, info=None):
        # The default implementation names files by this hash; kept for reference
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return 'full/%s.jpg' % (request.meta['movie_name'])  # use the movie title instead

# Nothing special: the SQL lives in items.py, keeping pipelines and items
# functionally separate
class DoubanPipeline(object):
    def __init__(self):
        self.mysqlhelper = Mysqlhelper()

    def process_item(self, item, spider):
        if 'get_insert_sql_and_data' in dir(item):
            (insert_sql, data) = item.get_insert_sql_and_data()
            self.mysqlhelper.execute_sql(insert_sql, data)
        return item
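Mysqlhelper is the author's own local class and isn't shown in the post. A minimal sketch of what it might look like, assuming pymysql and placeholder connection details:

import pymysql

class Mysqlhelper(object):
    def __init__(self):
        # Connection parameters are placeholders; adjust to your environment
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='***', db='spider',
                                    charset='utf8mb4')

    def execute_sql(self, sql, data):
        # One cursor per statement; commit immediately for simplicity
        with self.conn.cursor() as cursor:
            cursor.execute(sql, data)
        self.conn.commit()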

settings.py

An absolutely key component; the comments in the code annotate everything.

# Bot name
BOT_NAME = 'Douban'

SPIDER_MODULES = ['Douban.spiders']
NEWSPIDER_MODULE = 'Douban.spiders'

# Client request header
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Number of concurrent requests
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32


# Download delay
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# Per-domain and per-IP concurrency caps; these override the setting above
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# Lets you monitor a running crawl
#TELNETCONSOLE_ENABLED = False
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]
# To connect: open a terminal -> telnet 127.0.0.1 6023 -> est()

# Override the default request headers:
# Default headers, applied to every spider in the project
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#   'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
# }


# Spider middleware
# SPIDER_MIDDLEWARES = {
#    # 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None
#    'Douban.middlewares.DoubanSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'Douban.middlewares.DoubanDownloaderMiddleware': 560,
# 560 because the built-in middlewares are split into many ordered sub-groups;
# these numbers decide the order in which requests and responses pass through
# them. See the official docs for the full ordering.
}
# Per-request download timeout, in seconds
DOWNLOAD_TIMEOUT = 10
# Depth limit
# DEPTH_LIMIT = 1

# Custom extensions (see extends.py below)
EXTENSIONS = {
   'Douban.extends.MyExtension': 500,
}


# Item pipeline configuration
ITEM_PIPELINES = {
   # 'scrapy.pipelines.images.ImagesPipeline': 1,  # the stock image downloader
   # would be registered like this; here its subclass is used instead
   'Douban.pipelines.DoubanImagesPipeline': 300,
   'Douban.pipelines.DoubanPipeline': 400,  # the MySQL pipeline from pipelines.py
}

# AutoThrottle: algorithm-driven rate limiting
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# HTTP cache, rarely used
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# ImagesPipeline storage directory; enable as needed
IMAGES_STORE = 'download'
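To recap how the image-download pieces fit together across the three files touched above, here is a sketch of just the relevant lines (the image_urls field name matches ImagesPipeline's default IMAGES_URLS_FIELD):

# settings.py -- register the pipeline and choose a storage directory
IMAGES_STORE = 'download'
ITEM_PIPELINES = {'Douban.pipelines.DoubanImagesPipeline': 300}

# items.py -- ImagesPipeline reads download URLs from 'image_urls' by default
# image_urls = scrapy.Field()

# pipelines.py -- DoubanImagesPipeline (shown earlier) subclasses
# ImagesPipeline and renames the saved files in file_path()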

extends.py

Custom extensions. Configuring this component calls for some understanding of signals: which ones Scrapy fires at which moments of a run, which in turn comes back to a solid grasp of the data flow. In the code below I use a class of my own that pushes a notification through the 喵提醒 service at chosen moments (no, 喵提醒 doesn't pay me). You could just as well hook in logging or other functionality, targeting different signal moments as needed.

The file has to be created by hand, inside the project package alongside settings.py, so that it matches the 'Douban.extends.MyExtension' dotted path registered in EXTENSIONS:

from scrapy import signals
from message import Message

class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        # 'MMMM' is a placeholder custom setting name
        val = crawler.settings.getint('MMMM')
        ext = cls(val)

        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)

        return ext

    def spider_opened(self, spider):
        print('spider running')

    def spider_closed(self, spider):
        message = Message('spider run finished')
        message.push()
        print('spider closed')
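The Message class is the author's own wrapper around the 喵提醒 push service and isn't shown. A minimal hypothetical sketch, assuming a notification endpoint that accepts an id and a text parameter (both the URL and parameters here are assumptions, not the service's documented API):

import requests

class Message(object):
    def __init__(self, text):
        self.text = text

    def push(self):
        # Hypothetical endpoint and id; substitute your own service's URL
        requests.get('http://miaotixing.com/trigger',
                     params={'id': 'your_reminder_id', 'text': self.text})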

running.py

Finally, a word on running.py: it simply runs the scrapy command line from inside Python.

from scrapy.cmdline import execute
execute('scrapy crawl douban'.split())
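An equivalent that doesn't hand control over to the CLI is CrawlerProcess, which also lets you run several spiders from one script. A sketch, assuming the script lives in the project directory so the project settings can be found:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('douban')
process.start()  # blocks until the crawl finishes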

That covers a Scrapy component configuration sufficient for most basic needs; come back to it as a reference if anything is still unfamiliar. We'll follow up with some hands-on Scrapy case studies.

