20 - Web Scraping with the Scrapy Framework: CrawlSpider (07)


CrawlSpider

CrawlSpider is a subclass of Spider, the parent spider class used in ordinary spider files.
- As a subclass, it has at least all of the functionality of its parent, plus extras of its own.

  • Purpose: designed specifically for full-site crawling
    • i.e. crawling the data behind every page-number page under a given section
  • Basic usage (the generated skeleton is sketched below)
    • Create a project: scrapy startproject proName
    • cd into the project and create a CrawlSpider-based spider file:
      • scrapy genspider -t crawl spiderName www.xxx.com
    • Run the project: scrapy crawl spiderName
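
For reference, this is roughly the skeleton that scrapy genspider -t crawl spiderName www.xxx.com produces (the exact template can differ slightly between Scrapy versions); the examples below all start from it:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SpidernameSpider(CrawlSpider):
    name = 'spiderName'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    # One Rule per LinkExtractor: the extractor finds urls, the rule requests
    # them and routes the responses to the callback.
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # populate item from the response here
        return item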

Notes

  • One link extractor is paired with one rule parser (a spider may have several link extractors and several rule parsers)
  • To implement deep (detail-page) crawling, you usually combine CrawlSpider with scrapy.Request()
  • link = LinkExtractor(allow=r'')  # with an empty allow and follow=True, every link on the site gets extracted (see the sketch below)
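
To illustrate that last point, a LinkExtractor can also be exercised directly against a response via extract_links(); with an empty allow it matches every link on the page. A minimal sketch, meant to run inside a spider callback (the regex is a placeholder):

from scrapy.linkextractors import LinkExtractor

def parse_item(self, response):
    # empty allow => no filtering: every <a href> on the page is extracted
    all_links = LinkExtractor(allow=r'').extract_links(response)
    # a non-empty allow keeps only the urls matching the regex
    page_links = LinkExtractor(allow=r'list8\d+\.html').extract_links(response)
    for link in page_links:
        print(link.url)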

Basic crawl

Spider source code

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TestSpider(CrawlSpider):
    name = 'test'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']

    # Link extractor: extracts links (urls) from the page according to the rule given in allow.
    # allow = "regex": the rule that decides which links get extracted.
    link = LinkExtractor(allow=r'list8\d+\.html')  # instantiate a LinkExtractor
    # link = LinkExtractor(allow=r'')  # empty allow + follow=True extracts every link on the site

    rules = (
        # Instantiate a Rule object.
        # Rule parser: takes the links produced by the link extractor, requests them,
        # and parses the responses with the specified callback.
        Rule(link, callback='parse_item', follow=True),
    )
    # follow=True: keep applying the link extractor to the page-number pages
    # reached through the links it has already extracted.

    def parse_item(self, response):
        print(response)
        # parse the data from the response here

Deep crawl

Deep crawling with CrawlSpider

  • Common approach: CrawlSpider + Spider (manual requests) to implement deep crawling

  • Create a project: scrapy startproject proName

  • cd into the project and create a CrawlSpider-based spider file

    • scrapy genspider -t crawl spiderName www.xxx.com
  • Run the project: scrapy crawl spiderName

settings.py

# Scrapy settings for sunPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'sunPro'

SPIDER_MODULES = ['sunPro.spiders']
NEWSPIDER_MODULE = 'sunPro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'sunPro.middlewares.SunproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'sunPro.middlewares.SunproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'sunPro.pipelines.SunproPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SunproItem(scrapy.Item):
    title = scrapy.Field()
    status = scrapy.Field()


class SunProItemDetail(scrapy.Item):
    content = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class SunproPipeline:
    def process_item(self, item, spider):
        if item.__class__.__name__ == 'SunproItem':
            title = item['title']
            status = item['status']
            print(title + ":" + status)
        else:
            content = item['content']
            print(content)
        return item
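
Checking item.__class__.__name__ works; an equivalent check, since both item classes are importable from the project, is isinstance. A short sketch under the same project layout:

from sunPro.items import SunproItem, SunProItemDetail


class SunproPipeline:
    def process_item(self, item, spider):
        # dispatch on the concrete item class yielded by the spider
        if isinstance(item, SunproItem):
            print(item['title'] + ":" + item['status'])
        elif isinstance(item, SunProItemDetail):
            print(item['content'])
        return item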

sun.py (spider source file)

This approach does crawl the data, but during persistence there is no way to match each title with its content one-to-one; to fix that we have to issue the detail-page requests manually, as shown in the next section.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunPro.items import SunproItem, SunProItemDetail


class TestSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']

    # Link extractor: extracts links (urls) from the page according to the rule given in allow.
    # allow = "regex": the rule that decides which links get extracted.
    link = LinkExtractor(allow=r'id=1&page=\d+')  # page-number links
    link_detail = LinkExtractor(allow=r'dindex\?id=\d+')  # detail-page urls
    # link = LinkExtractor(allow=r'')  # empty allow + follow=True extracts every link on the site

    rules = (
        # Rule parser: takes the links produced by the link extractor, requests them,
        # and parses the responses with the specified callback.
        Rule(link, callback='parse_item', follow=True),
        Rule(link_detail, callback='parse_detail'),
    )
    # follow=True: keep applying the link extractor to the page-number pages it discovers.

    # title & status
    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            title = li.xpath('./span[3]/a/text()').extract_first()
            status = li.xpath('./span[2]/text()').extract_first()
            item = SunproItem()
            item['title'] = title
            item['status'] = status
            yield item

    # Deep crawl: scrape the data on the detail pages.
    # 1. Capture the detail-page urls.
    # 2. Request the detail-page urls and extract their data.
    def parse_detail(self, response):
        content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
        item = SunProItemDetail()
        item['content'] = content
        yield item

    # The spider submits two different kinds of item to the pipeline, so the pipeline
    # has to check which kind it received.
    # With this approach, persistence cannot match each title with its content one-to-one;
    # we need to issue the detail-page requests manually instead.

CrawlSpider + Spider: full-site deep crawl

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SunproItem(scrapy.Item):
    title = scrapy.Field()
    status = scrapy.Field()
    content = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class SunproPipeline:
    def process_item(self, item, spider):
        print(item)
        return item

sun.py (spider source file)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunPro.items import SunproItem


class TestSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']

    # Link extractor: extracts the page-number links.
    # allow = "regex": the rule that decides which links get extracted.
    link = LinkExtractor(allow=r'id=1&page=\d+')
    # link_detail = LinkExtractor(allow=r'dindex\?id=\d+')  # detail-page urls (not needed here)

    rules = (
        # Rule parser: requests the extracted links and parses the responses with the callback.
        Rule(link, callback='parse_item', follow=True),
        # Rule(link_detail, callback='parse_detail'),
    )
    # follow=True: keep applying the link extractor to the page-number pages it discovers.

    # title & status
    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            title = li.xpath('./span[3]/a/text()').extract_first()
            status = li.xpath('./span[2]/text()').extract_first()
            # detail-page url, built manually from the list entry
            detail_url = "http://wz.sun0769.com" + li.xpath('./span[3]/a/@href').extract_first()
            item = SunproItem()
            item['title'] = title
            item['status'] = status
            # Issue the detail-page request manually and pass the half-filled item
            # along via meta, so title/status and content end up in the same item.
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
        item = response.meta['item']
        item['content'] = content
        yield item
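
Passing the half-filled item through meta is the classic pattern; Scrapy 1.7+ also offers cb_kwargs, which delivers it as a plain keyword argument of the callback. A sketch of the same two callbacks (everything else in the spider unchanged):

    # inside TestSpider
    def parse_item(self, response):
        for li in response.xpath('/html/body/div[2]/div[3]/ul[2]/li'):
            item = SunproItem()
            item['title'] = li.xpath('./span[3]/a/text()').extract_first()
            item['status'] = li.xpath('./span[2]/text()').extract_first()
            detail_url = "http://wz.sun0769.com" + li.xpath('./span[3]/a/@href').extract_first()
            # cb_kwargs hands the item directly to parse_detail as an argument
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, cb_kwargs={'item': item})

    def parse_detail(self, response, item):
        item['content'] = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
        yield item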
