20 - Web Scraping with the Scrapy Framework: CrawlSpider (07)


CrawlSpider

CrawlSpider is a subclass of Spider, the parent spider class used in ordinary spider files.
- As a subclass, it has at least all of the functionality of its parent, plus extras of its own.

  • Purpose: designed specifically for full-site crawling
    • i.e. crawling the data behind every page-number page under a given section
  • Basic usage (the generated skeleton is sketched below)
    • Create a project: scrapy startproject proName
    • cd into the project and create a CrawlSpider-based spider file:
      • scrapy genspider -t crawl spiderName www.xxx.com
    • Run the project: scrapy crawl spiderName
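
For reference, this is roughly the skeleton that scrapy genspider -t crawl spiderName www.xxx.com produces (the exact template can differ slightly between Scrapy versions); the examples below all start from it:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SpidernameSpider(CrawlSpider):
    name = 'spiderName'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    # One Rule per LinkExtractor: the extractor finds urls, the rule requests
    # them and routes the responses to the callback.
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # populate item from the response here
        return item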

Notes

  • One link extractor is paired with one rule parser (a spider may have several link extractors and several rule parsers)
  • To implement deep (detail-page) crawling, you usually combine CrawlSpider with scrapy.Request()
  • link = LinkExtractor(allow=r'')  # with an empty allow and follow=True, every link on the site gets extracted (see the sketch below)
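
To illustrate that last point, a LinkExtractor can also be exercised directly against a response via extract_links(); with an empty allow it matches every link on the page. A minimal sketch, meant to run inside a spider callback (the regex is a placeholder):

from scrapy.linkextractors import LinkExtractor

def parse_item(self, response):
    # empty allow => no filtering: every <a href> on the page is extracted
    all_links = LinkExtractor(allow=r'').extract_links(response)
    # a non-empty allow keeps only the urls matching the regex
    page_links = LinkExtractor(allow=r'list8\d+\.html').extract_links(response)
    for link in page_links:
        print(link.url)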

Basic crawl

Spider source code

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TestSpider(CrawlSpider):
    name = 'test'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']

    # Link extractor: extracts links (urls) from the page according to the rule given in allow.
    # allow = "regex": the rule that decides which links get extracted.
    link = LinkExtractor(allow=r'list8\d+\.html')  # instantiate a LinkExtractor
    # link = LinkExtractor(allow=r'')  # empty allow + follow=True extracts every link on the site

    rules = (
        # Instantiate a Rule object.
        # Rule parser: takes the links produced by the link extractor, requests them,
        # and parses the responses with the specified callback.
        Rule(link, callback='parse_item', follow=True),
    )
    # follow=True: keep applying the link extractor to the page-number pages
    # reached through the links it has already extracted.

    def parse_item(self, response):
        print(response)
        # parse the data from the response here

Deep crawl

Deep crawling with CrawlSpider

  • Common approach: CrawlSpider + Spider (manual requests) to implement deep crawling

  • Create a project: scrapy startproject proName

  • cd into the project and create a CrawlSpider-based spider file

    • scrapy genspider -t crawl spiderName www.xxx.com
  • Run the project: scrapy crawl spiderName

settings.py

# Scrapy settings for sunPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'sunPro'

SPIDER_MODULES = ['sunPro.spiders']
NEWSPIDER_MODULE = 'sunPro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'sunPro.middlewares.SunproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'sunPro.middlewares.SunproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'sunPro.pipelines.SunproPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SunproItem(scrapy.Item):
    title = scrapy.Field()
    status = scrapy.Field()


class SunProItemDetail(scrapy.Item):
    content = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class SunproPipeline:
    def process_item(self, item, spider):
        if item.__class__.__name__ == 'SunproItem':
            title = item['title']
            status = item['status']
            print(title + ":" + status)
        else:
            content = item['content']
            print(content)
        return item
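
Checking item.__class__.__name__ works; an equivalent check, since both item classes are importable from the project, is isinstance. A short sketch under the same project layout:

from sunPro.items import SunproItem, SunProItemDetail


class SunproPipeline:
    def process_item(self, item, spider):
        # dispatch on the concrete item class yielded by the spider
        if isinstance(item, SunproItem):
            print(item['title'] + ":" + item['status'])
        elif isinstance(item, SunProItemDetail):
            print(item['content'])
        return item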

sun.py (spider source file)

This approach does crawl the data, but during persistence there is no way to match each title with its content one-to-one; to fix that we have to issue the detail-page requests manually, as shown in the next section.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunPro.items import SunproItem, SunProItemDetail


class TestSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']

    # Link extractor: extracts links (urls) from the page according to the rule given in allow.
    # allow = "regex": the rule that decides which links get extracted.
    link = LinkExtractor(allow=r'id=1&page=\d+')  # page-number links
    link_detail = LinkExtractor(allow=r'dindex\?id=\d+')  # detail-page urls
    # link = LinkExtractor(allow=r'')  # empty allow + follow=True extracts every link on the site

    rules = (
        # Rule parser: takes the links produced by the link extractor, requests them,
        # and parses the responses with the specified callback.
        Rule(link, callback='parse_item', follow=True),
        Rule(link_detail, callback='parse_detail'),
    )
    # follow=True: keep applying the link extractor to the page-number pages it discovers.

    # title & status
    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            title = li.xpath('./span[3]/a/text()').extract_first()
            status = li.xpath('./span[2]/text()').extract_first()
            item = SunproItem()
            item['title'] = title
            item['status'] = status
            yield item

    # Deep crawl: scrape the data on the detail pages.
    # 1. Capture the detail-page urls.
    # 2. Request the detail-page urls and extract their data.
    def parse_detail(self, response):
        content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
        item = SunProItemDetail()
        item['content'] = content
        yield item

    # The spider submits two different kinds of item to the pipeline, so the pipeline
    # has to check which kind it received.
    # With this approach, persistence cannot match each title with its content one-to-one;
    # we need to issue the detail-page requests manually instead.

CrawlSpider + Spider: full-site deep crawl

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SunproItem(scrapy.Item):
    title = scrapy.Field()
    status = scrapy.Field()
    content = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class SunproPipeline:
    def process_item(self, item, spider):
        print(item)
        return item

sun.py (spider source file)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunPro.items import SunproItem


class TestSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']

    # Link extractor: extracts the page-number links.
    # allow = "regex": the rule that decides which links get extracted.
    link = LinkExtractor(allow=r'id=1&page=\d+')
    # link_detail = LinkExtractor(allow=r'dindex\?id=\d+')  # detail-page urls (not needed here)

    rules = (
        # Rule parser: requests the extracted links and parses the responses with the callback.
        Rule(link, callback='parse_item', follow=True),
        # Rule(link_detail, callback='parse_detail'),
    )
    # follow=True: keep applying the link extractor to the page-number pages it discovers.

    # title & status
    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            title = li.xpath('./span[3]/a/text()').extract_first()
            status = li.xpath('./span[2]/text()').extract_first()
            # detail-page url, built manually from the list entry
            detail_url = "http://wz.sun0769.com" + li.xpath('./span[3]/a/@href').extract_first()
            item = SunproItem()
            item['title'] = title
            item['status'] = status
            # Issue the detail-page request manually and pass the half-filled item
            # along via meta, so title/status and content end up in the same item.
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
        item = response.meta['item']
        item['content'] = content
        yield item
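
Passing the half-filled item through meta is the classic pattern; Scrapy 1.7+ also offers cb_kwargs, which delivers it as a plain keyword argument of the callback. A sketch of the same two callbacks (everything else in the spider unchanged):

    # inside TestSpider
    def parse_item(self, response):
        for li in response.xpath('/html/body/div[2]/div[3]/ul[2]/li'):
            item = SunproItem()
            item['title'] = li.xpath('./span[3]/a/text()').extract_first()
            item['status'] = li.xpath('./span[2]/text()').extract_first()
            detail_url = "http://wz.sun0769.com" + li.xpath('./span[3]/a/@href').extract_first()
            # cb_kwargs hands the item directly to parse_detail as an argument
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, cb_kwargs={'item': item})

    def parse_detail(self, response, item):
        item['content'] = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
        yield item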
