
22 - Scrapy Framework: Distributed Crawling (09)


Distributed Crawling

  • How distributed crawling is implemented: scrapy + redis (Scrapy combined with the scrapy-redis component)
  • The native Scrapy framework cannot do distributed crawling on its own
    • What is distributed crawling?
      • Build a cluster of machines, have every machine in the cluster run the same program, and let them crawl the same pool of resources jointly, with the work split across the cluster.
      • Because the scheduler and the pipeline cannot be shared across a cluster, the native Scrapy architecture cannot be distributed.
      • The scrapy-redis component provides native Scrapy with a shared pipeline and a shared scheduler, which makes distributed crawling possible.
        • pip install scrapy-redis (a quick import check follows this list)
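An optional way to confirm the install worked is simply to import the package from Python. This is only a sanity-check sketch; the __version__ attribute is assumed to exist in your scrapy-redis release, and a clean import is already enough.

# optional sanity check after "pip install scrapy-redis"
import scrapy_redis
print(scrapy_redis.__version__)  # assumed attribute; a successful import is the real test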

Implementation Workflow

Create the Project

Create a Scrapy project: scrapy startproject proName
Enter the project directory and create a CrawlSpider-based spider file:
scrapy genspider -t crawl spiderName www.xxx.com
Run the project: scrapy crawl spiderName

1. Modify the spider file

  • 1.1 Import: from scrapy_redis.spiders import RedisCrawlSpider
  • 1.2 Change the spider class's parent class to RedisCrawlSpider
  • 1.3 Replace start_urls with a redis_key attribute; its value can be any string
    • redis_key = 'xxxx'  # name of the shared scheduler queue; later we will manually push a start URL into the queue this key names
  • 1.4 Fill in the data-parsing logic

fbs.py (the spider source file)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from fbsPro.items import FbsproItem


class FbsSpider(RedisCrawlSpider):
    name = 'fbs'
    # allowed_domains = ['www.xxx.com']
    # start_urls = ['http://www.xxx.com/']
    redis_key = 'sunQueue'  # name of the shared scheduler queue
    # later we will manually push a start URL into the queue named by redis_key

    rules = (
        Rule(LinkExtractor(allow=r'id=1&page=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # grab the titles from every listing page of the site
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            title = li.xpath('./span[3]/a/text()').extract_first()
            item = FbsproItem()
            item['title'] = title
            yield item
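fbs.py imports FbsproItem from fbsPro.items, which is not shown in the original write-up. Below is a minimal sketch of what that items.py could look like, assuming title is the only field the spider fills in; the real project may declare more fields.

# fbsPro/items.py - minimal sketch (only the field used by parse_item above)
import scrapy


class FbsproItem(scrapy.Item):
    title = scrapy.Field()  # the title extracted in parse_item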

2. Configure settings.py

  • Specify the scheduler
# use the deduplication (fingerprint) filter provided by scrapy-redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# use the scheduler provided by scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# allow the crawl to be paused and resumed
SCHEDULER_PERSIST = True
  • Specify the pipeline
# enable the ready-made pipeline shipped with scrapy-redis
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}
# this pipeline can only write items into Redis
  • Specify the Redis connection
# connection settings the spider uses to reach Redis:
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
# REDIS_ENCODING = 'utf-8'
# REDIS_PARAMS = {'password': '123456'}

Complete settings.py

# Scrapy settings for fbsPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'fbsPro'

SPIDER_MODULES = ['fbsPro.spiders']
NEWSPIDER_MODULE = 'fbsPro.spiders'
#LOG_LEVEL = 'ERROR'  # restrict log output (only error messages)

# set a User-Agent (UA spoofing)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'

# Obey robots.txt rules
# set to False so the robots protocol is not followed
ROBOTSTXT_OBEY = False

# use the deduplication (fingerprint) filter provided by scrapy-redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# use the scheduler provided by scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# allow the crawl to be paused and resumed
SCHEDULER_PERSIST = True

# enable the ready-made pipeline shipped with scrapy-redis
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}

# connection settings the spider uses to reach Redis:
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
# REDIS_ENCODING = 'utf-8'
# REDIS_PARAMS = {'password': '123456'}

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 5

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'fbsPro.middlewares.FbsproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'fbsPro.middlewares.FbsproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#    'fbsPro.pipelines.FbsproPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
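As a side note, scrapy-redis also accepts the whole connection as a single REDIS_URL setting in place of REDIS_HOST and REDIS_PORT. The host, port, and password below are placeholders rather than values from this project.

# alternative to REDIS_HOST / REDIS_PORT (placeholder credentials)
# REDIS_URL = 'redis://:yourpassword@127.0.0.1:6379/0'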

3. Edit the Redis configuration file redis.windows.conf

  • Remove the default bind restriction

    • #bind 127.0.0.1 — comment the line out (around line 56 of the file)
  • Turn off protected mode

    • protected-mode no — change yes to no (around line 75 of the file)
  • If Redis then fails to start with the error
    Creating Server TCP listening socket 127.0.0.1:6379: bind: No error

  • fix it by running the following commands in order:

    • redis-cli.exe
    • shutdown
    • exit
    • redis-server redis.windows.conf
4. Start the Redis server and the client

  • Server: redis-server redis.windows.conf
  • Client: redis-cli
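Before launching the spiders it is worth verifying that every machine in the cluster can actually reach the shared Redis instance. The sketch below uses the redis-py package (pip install redis) and assumes Redis is listening on 127.0.0.1:6379 as configured above; swap in the real host when checking from other nodes.

# quick connectivity check, run from any node in the cluster (uses redis-py)
import redis

conn = redis.Redis(host='127.0.0.1', port=6379)
print(conn.ping())  # True means this node can reach the shared Redis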

5. Run the Scrapy project

  • Do not set LOG_LEVEL = 'ERROR' in the settings file (leave it commented out), otherwise the log message below stays hidden
  • After the project starts, the program will sit at the "listening" stage, waiting for the start URL to be added to the queue

6. Push the start URL into the queue named by redis_key

  • Run the following command in the Redis client (the scheduler queue lives in Redis):
  • lpush sunQueue http://wz.sun0769.com/political/index/politicsNewest?id=1&page=
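If you would rather push the start URL from Python than from the redis-cli prompt, the sketch below is equivalent to the lpush command above; it again relies on the redis-py package and a local Redis instance.

# push the start URL into the shared scheduler queue (same effect as the lpush above)
import redis

conn = redis.Redis(host='127.0.0.1', port=6379)
conn.lpush('sunQueue', 'http://wz.sun0769.com/political/index/politicsNewest?id=1&page=')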

Once the crawl has run, we can look in the Redis database and see the scraped data.
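To peek at those items without a GUI client, they can be read straight back out of Redis. By default the scrapy-redis RedisPipeline stores JSON-serialized items in a list named "<spider name>:items", so for this spider the key should be fbs:items; treat that key name as an assumption and adjust it if the pipeline settings were changed.

# inspect scraped items stored by RedisPipeline (key name assumed: 'fbs:items')
import json
import redis

conn = redis.Redis(host='127.0.0.1', port=6379)
print(conn.llen('fbs:items'))                # number of items stored so far
for raw in conn.lrange('fbs:items', 0, 4):   # first five items
    print(json.loads(raw))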

Summary

The steps above cover the whole distributed setup for this Scrapy crawl with scrapy-redis; hopefully they help you work through any problems you run into.
