

21 - Crawlers: Using Selenium with the Scrapy Framework (08)


Using Selenium in Scrapy
Case study: crawl all of the news data (title + body) from five sections of NetEase News (网易新闻) -- Domestic, International, Military, Aviation, and Drones.

Basic Usage

Create a scrapy project: scrapy startproject proName
Go into the project and generate a spider file:
scrapy genspider spiderName www.xxx.com
Run the project: scrapy crawl spiderName
The concrete commands used for this walkthrough are shown right below.
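For reference, the project in this post is named SeleniumTest and the spider is named test (both names match the code further down), so the concrete commands look roughly like this; the domain passed to genspider is only a placeholder, and the generated allowed_domains line is commented out afterwards:

scrapy startproject SeleniumTest
cd SeleniumTest
scrapy genspider test news.163.com
scrapy crawl test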

Analysis

  • The home page data is not dynamically loaded
    • Scrape the url of each target section from the home page
  • The news inside each section page is dynamically loaded
    • Scrape the news title + detail-page url
    • The data inside each news detail page is not dynamically loaded
      • Scrape the news body from the detail page
  • Workflow for using selenium inside scrapy (a bare-bones skeleton follows this list)
    • 1. Instantiate a browser object in the spider class and keep it as a class attribute
    • 2. Perform the browser automation in the downloader middleware
    • 3. Override closed(self, spider) in the spider class and close the browser object inside it
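The full project files follow below. As a quick orientation first, here is a bare-bones sketch of just those three touch points; the class names here are hypothetical, the real spider and middleware are shown afterwards:

import scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver


class NewsSpider(scrapy.Spider):            # hypothetical name, see the real spider below
    name = 'news'
    start_urls = ['http://news.163.com/']
    bro = webdriver.Chrome()                # 1. browser object kept as a class attribute

    def parse(self, response):
        pass

    def closed(self, spider):               # 3. runs once when the spider shuts down
        self.bro.quit()


class SeleniumDownloaderMiddleware:         # 2. browser automation lives in the middleware
    def process_response(self, request, response, spider):
        spider.bro.get(request.url)         # render the page with the real browser
        return HtmlResponse(url=request.url, body=spider.bro.page_source,
                            encoding='utf-8', request=request)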

settings.py

# Scrapy settings for SeleniumTest project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'SeleniumTest'

SPIDER_MODULES = ['SeleniumTest.spiders']
NEWSPIDER_MODULE = 'SeleniumTest.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'SeleniumTest.middlewares.SeleniumtestDownloaderMiddleware': 543,
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'SeleniumTest.pipelines.SeleniumtestPipeline': 300,
}

# (The remaining commented-out defaults generated by `scrapy startproject` --
# concurrency, cookies, AutoThrottle, HTTP cache, etc. -- are left unchanged
# and omitted here for brevity.)
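Everything else in this file is the stock output of scrapy startproject; the deliberate changes are the browser-like USER_AGENT, ROBOTSTXT_OBEY = False, LOG_LEVEL = "ERROR" to keep the console output readable, and the two dictionaries that enable the downloader middleware and the item pipeline -- without those two entries the Selenium interception and the item processing would never run.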

middlewares.py

from scrapy import signals
from itemadapter import is_item, ItemAdapter
from scrapy.http import HtmlResponse  # scrapy's bundled response class
import time


class SeleniumtestDownloaderMiddleware:
    def process_request(self, request, spider):
        return None

    # Intercept every response.
    # The whole project issues 1 + 5 + n requests, so there are also 1 + 5 + n response objects.
    # Only the 5 designated section-page responses fail to meet our needs,
    # so only the bodies of those 5 responses get replaced.
    def process_response(self, request, response, spider):
        # Pick the 5 section-page responses out of all intercepted responses
        if request.url in spider.model_urls:
            bro = spider.bro
            # `response` here is one of the 5 responses that lack the data we need.
            # To fix it, first obtain the fully rendered page data (which selenium can fetch),
            # then build it into a new response object.
            bro.get(request.url)  # request the section url with the real browser
            time.sleep(2)
            bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
            time.sleep(1)
            # page_source now holds everything rendered on the section page,
            # including the dynamically loaded data
            page_text = bro.page_source
            # response.text = page_text
            # return response
            # Return a new response object that replaces the old, incomplete one
            return HtmlResponse(url=request.url, body=page_text, encoding="utf-8", request=request)  # the 5 section pages
        else:
            return response  # the other 1 + n responses

    def process_exception(self, request, exception, spider):
        pass
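A note on the return value: the commented-out response.text = page_text line hints at why a brand-new HtmlResponse is returned instead -- Scrapy's Response.text is a read-only property, so the body of an existing response cannot simply be overwritten. Also, the two time.sleep calls cost a fixed three seconds per section page; below is a rough sketch of the same interception using an explicit wait instead (the CSS selector is a placeholder assumption, not taken from the actual NetEase markup):

from scrapy.http import HtmlResponse
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def process_response(self, request, response, spider):
    if request.url in spider.model_urls:
        bro = spider.bro
        bro.get(request.url)
        # Wait (up to 10 s) for the news list container to appear instead of sleeping a fixed time.
        # 'div.ndi_main' is a hypothetical selector used purely for illustration.
        WebDriverWait(bro, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'div.ndi_main')))
        bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        return HtmlResponse(url=request.url, body=bro.page_source,
                            encoding='utf-8', request=request)
    return response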

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class SeleniumtestPipeline:
    def process_item(self, item, spider):
        print(item)
        return item
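The pipeline above only prints each item. If you want to persist the scraped news instead, a minimal file-writing pipeline could look like the sketch below (the class name and output file are my own; it would also have to be registered in ITEM_PIPELINES in settings.py):

import json


class NewsSavePipeline:
    def open_spider(self, spider):
        # Opened once when the spider starts
        self.fp = open('news.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line, keeping Chinese characters readable
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Closed once when the spider finishes
        self.fp.close()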

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SeleniumtestItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()

test.py (the spider source file)

import scrapy
from selenium import webdriver
from SeleniumTest.items import SeleniumtestItem


class TestSpider(scrapy.Spider):
    name = 'test'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://news.163.com/']
    model_urls = []  # holds the url of each target section

    # Instantiate one browser object shared by the whole spider
    bro = webdriver.Chrome()

    # Parse the home page: extract the url of each target section
    def parse(self, response):
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        indexs = [3, 4, 6, 7, 8]
        for index in indexs:
            model_li = li_list[index]
            mode_url = model_li.xpath('./a/@href').extract_first()
            self.model_urls.append(mode_url)
        # Issue a request for every section page
        for url in self.model_urls:
            yield scrapy.Request(url=url, callback=self.parse_model)

    # Parse a section page: news title + detail-page url (dynamically loaded data)
    def parse_model(self, response):
        div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            new_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
            if new_detail_url:
                item = SeleniumtestItem()  # instantiate an item
                item['title'] = title
                # Request the news detail page
                yield scrapy.Request(url=new_detail_url, callback=self.parse_new_detail, meta={'item': item})

    def parse_new_detail(self, response):
        # Parse the news body text
        content = response.xpath('//*[@id="endText"]/p/text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content
        yield item

    # Close the browser. closed() comes from the spider base class and runs once,
    # right when the spider finishes.
    def closed(self, spider):
        self.bro.quit()
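Running scrapy crawl test opens a visible Chrome window, since the spider instantiates a plain webdriver.Chrome(). If you would rather keep the browser invisible, a small sketch of the change (assuming a Selenium/Chrome combination that supports the headless flag):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')       # run Chrome without opening a window
options.add_argument('--disable-gpu')
bro = webdriver.Chrome(options=options)  # replaces the plain webdriver.Chrome() above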

Summary

That is the full content of 21 - Crawlers: Using Selenium with the Scrapy Framework (08); hopefully it helps you solve the problems you run into.
