23 - Scrapy Framework: Incremental Crawling for Real-Time Data Monitoring (10)
Incremental crawling
Concept: monitor a website for data updates so that only newly published data is crawled.
- Core of the implementation: deduplication.
- In practice, deduplication is done with a record table.
  - The record table stores identifying data about everything that has already been crawled.
  - The identifying data can be the URL, the title, or any other unique field (here we use each movie's detail-page URL).
  - Any piece (or combination) of data that uniquely identifies the page content will do; such data is collectively called a data fingerprint (a short sketch follows this list).
- Which record table to use for deduplication:
  - A Python set does not work: it cannot be persisted, so it is lost between runs.
  - A Redis set does work, because Redis persists it.
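To make the data-fingerprint idea concrete, here is a minimal sketch that is not part of the project below: when the identifier is long or built from several fields, one common option is to hash it. The helper name and the sample URL/title are made up for illustration.

```python
import hashlib


def make_fingerprint(*fields):
    # Combine one or more identifying fields (e.g. detail-page URL, title)
    # into a fixed-length data fingerprint.
    raw = "|".join(fields).encode("utf-8")
    return hashlib.md5(raw).hexdigest()


# The project below simply stores the raw detail-page URL as the fingerprint;
# hashing is just one option for long or composite identifiers.
print(make_fingerprint("https://www.example.com/movie/1.html", "Some Movie Title"))
```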
Implementation workflow
Create the project
Create a Scrapy project: scrapy startproject proName
Enter the project directory and create a spider file based on CrawlSpider:
scrapy genspider -t crawl spiderName www.xxx.com
Run the project: scrapy crawl spiderName
Start the Redis server (redis-server)
Start the Redis client (redis-cli)
In the client, insert a test value (e.g. gemoumou) into a set: the first SADD returns 1; inserting the same value again returns 0, which means the duplicate cannot be added.
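The same check can be done from Python with redis-py (the library the spider uses below); a minimal sketch, assuming a local Redis server on the default port:

```python
from redis import Redis

conn = Redis(host="127.0.0.1", port=6379)

# sadd returns the number of values actually added to the set
print(conn.sadd("movie_urls", "gemoumou"))  # 1 -> new value, inserted
print(conn.sadd("movie_urls", "gemoumou"))  # 0 -> duplicate, rejected
```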
zls.py (spider source file)
```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis

from zlsPro.items import ZlsproItem


class ZlsSpider(CrawlSpider):
    name = 'zls'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567kan.com/index.php/vod/show/class/%E5%8A%A8%E4%BD%9C/id/1.html']
    conn = Redis(host="127.0.0.1", port=6379)  # connect to the Redis server

    rules = (
        # pagination links
        Rule(LinkExtractor(allow=r'page/\d+\.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # parse the movie title + detail-page url
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            title = li.xpath('./div/a/@title').extract_first()
            detail_url = "https://www.4567kan.com" + li.xpath('./div/a/@href').extract_first()
            ex = self.conn.sadd("movie_urls", detail_url)
            # ex == 1: inserted, ex == 0: insert failed
            if ex == 1:  # detail_url is not in the record table yet
                # new data: issue the request
                item = ZlsproItem()
                item["title"] = title
                print("New data found, crawling:", title)
                yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={"item": item})
            else:  # already in the record table
                print("No new data.")

    def parse_detail(self, response):
        # parse the movie synopsis
        desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
        item = response.meta["item"]
        item["desc"] = desc
        print(desc)
        yield item
```
items.py

```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ZlsproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    desc = scrapy.Field()
```
pipelines.py

```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ZlsproPipeline:
    def process_item(self, item, spider):
        # Redis can only store bytes/str/numbers, so serialize the item
        # before pushing it onto the movie_data list
        spider.conn.lpush('movie_data', json.dumps(dict(item), ensure_ascii=False))
        return item
```
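To inspect what the pipeline has stored, a quick read-back sketch (assuming the same local Redis instance and the movie_data key used above):

```python
import json

from redis import Redis

conn = Redis(host="127.0.0.1", port=6379)

print("items stored:", conn.llen("movie_data"))

# decode and print the five most recently pushed items
for raw in conn.lrange("movie_data", 0, 4):
    print(json.loads(raw))
```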
settings.py

```python
# Scrapy settings for zlsPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zlsPro'

SPIDER_MODULES = ['zlsPro.spiders']
NEWSPIDER_MODULE = 'zlsPro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'zlsPro.middlewares.ZlsproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'zlsPro.middlewares.ZlsproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'zlsPro.pipelines.ZlsproPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
Results
On a second run, the data is already in the record table, so the spider reports that there is no new data to crawl.
Four movies on the site cannot be opened, so the result set is four items short.
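One way to see that gap is to compare the dedup record table with the items the pipeline actually stored; a hedged sketch, assuming the key names used above:

```python
from redis import Redis

conn = Redis(host="127.0.0.1", port=6379)

# every detail-page URL that was seen vs. every item that was actually stored
print("recorded urls:", conn.scard("movie_urls"))
print("stored items :", conn.llen("movie_data"))
```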
Summary
Incremental crawling comes down to deduplication: keep a persistent record table (here, a Redis set of detail-page URLs used as data fingerprints) and only request and store pages whose fingerprints are not yet in it.