

Scrapy with Selenium, and pipeline data persistence


Contents

  • 1. Scrapy with Selenium
  • 2. Pipeline data persistence

1. Scrapy with Selenium

Dynamic data loading:
1. Ajax:
  ① the API URLs follow a pattern, so you can construct them yourself and crawl the endpoints directly
  ② use Selenium (a browser automation framework) to capture the dynamically loaded data
2. Data rendered by JavaScript:
  ① reverse-engineer the JS
  ② capture the rendered page with Selenium

Selenium can capture dynamically loaded data.
Scrapy on its own cannot: if the data comes from an Ajax request you can crawl the API endpoint directly, but if it is rendered by JavaScript you need to combine Scrapy with Selenium. The approach below keeps one browser instance on the spider and has a downloader middleware return the browser-rendered page for the start URLs.

Spider:

```python
import scrapy
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from ..items import WynewsItem


class NewsSpider(scrapy.Spider):
    name = 'news'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://news.163.com/domestic/']

    # hide the "Chrome is being controlled by automated software" banner
    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    # one shared browser instance; the downloader middleware reaches it via spider.bro
    bro = webdriver.Chrome(
        executable_path=r'C:\Users\Administrator\Desktop\news\wynews\wynews\spiders\chromedriver.exe',
        options=option)

    def detail_parse(self, response):
        content_list = response.xpath('//div[@id="endText"]/p//text()').extract()
        content = ''
        title = response.meta['title']
        for s in content_list:
            content += s
        item = WynewsItem()
        item["title"] = title
        item["content"] = content
        yield item

    def parse(self, response):
        div_list = response.xpath('//div[contains(@class, "data_row")]')
        for div in div_list:
            link = div.xpath('./a/@href').extract_first()
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            yield scrapy.Request(url=link, callback=self.detail_parse, meta={"title": title})
```

Downloader middleware:

```python
import time

from scrapy.http import HtmlResponse


class WynewsDownloaderMiddleware(object):
    def process_response(self, request, response, spider):
        bro = spider.bro
        if request.url in spider.start_urls:
            # render the index page in the browser and scroll to the bottom
            # so the lazily loaded news entries appear in the page source
            bro.get(request.url)
            time.sleep(3)
            js = 'window.scrollTo(0, document.body.scrollHeight)'
            bro.execute_script(js)
            time.sleep(3)
            response_selenium = bro.page_source
            return HtmlResponse(url=bro.current_url, body=response_selenium,
                                encoding="utf-8", request=request)
        return response
```

Pipeline:

```python
import pymongo


class WynewsPipeline(object):
    conn = pymongo.MongoClient('localhost', 27017)
    db = conn.wynews
    table = db.newsinfo

    def process_item(self, item, spider):
        self.table.insert_one(dict(item))
        return item
```
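For the middleware and the pipeline above to take effect they have to be enabled in settings.py. A minimal sketch, assuming the project module is named wynews with the default layout generated by scrapy startproject (the module paths and priority numbers are assumptions, not taken from the original project):

```python
# settings.py (sketch; module paths and priorities are assumptions)
DOWNLOADER_MIDDLEWARES = {
    'wynews.middlewares.WynewsDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'wynews.pipelines.WynewsPipeline': 300,
}
```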

2. Pipeline data persistence

Overview:
1. Pipelines are used for data persistence.
2. Data can be persisted in many ways: MongoDB, MySQL, Redis, CSV, ...
3. The one method every pipeline must implement is process_item.

Core methods of a pipeline class (the names must match exactly, because Scrapy calls them by name):
  • open_spider(self, spider): called when the spider is opened
  • close_spider(self, spider): called when the spider is closed
  • from_crawler(cls, crawler): a class method (marked with @classmethod) that gives access to the settings
  • process_item(self, item, spider): talks to the database and stores the data; this method must be implemented

MongoDB pipeline:

```python
import pymongo


class MongoPipeline(object):
    # __init__ is the initializer; __new__ is the constructor that allocates the object in memory
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db['news'].insert_one(dict(item))
        # a project may have several pipeline classes; if the pipelines that run
        # after this one still need the item, process_item must return it
        return item

    def close_spider(self, spider):
        self.client.close()
```

MySQL pipeline:

```python
import pymysql


class MysqlPipeline(object):
    table = 'news'  # target table name

    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'))

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # build "insert into <table> (<columns>) values (%s, ...)" from the item fields
        data = dict(item)
        keys = ','.join(data.keys())
        values = ','.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (self.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item
```
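Both pipeline classes read their connection parameters from settings.py through from_crawler. A minimal sketch of the matching settings with placeholder values (the keys follow the settings.get() calls above; the module path wynews.pipelines and the concrete values are assumptions to adapt to your own project):

```python
# settings.py (sketch; values and module path are placeholders)
MONGO_URI = 'localhost'
MONGO_DB = 'wynews'

MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'wynews'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'your_password'
MYSQL_PORT = 3306

# enable the pipelines; lower numbers run first
ITEM_PIPELINES = {
    'wynews.pipelines.MongoPipeline': 300,
    'wynews.pipelines.MysqlPipeline': 301,
}
```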

Pipeline class for file downloads

Spider:

```python
import scrapy
from ..items import XhxhItem


class XhSpider(scrapy.Spider):
    name = 'xh'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.521609.com/qingchunmeinv/']

    def parse(self, response):
        li_list = response.xpath('//div[@class="index_img list_center"]/ul/li')
        for li in li_list:
            item = XhxhItem()
            link = li.xpath('./a[1]/img/@src').extract_first()
            item['img_link'] = 'http://www.521609.com' + link
            print(item)
            yield item
```

Items:

```python
import scrapy


class XhxhItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_link = scrapy.Field()
```

Pipelines:

```python
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class XhxhPipeline(object):
    def process_item(self, item, spider):
        return item


class ImgPipeLine(ImagesPipeline):
    def get_media_requests(self, item, info):
        # turn each image link into a download request
        yield scrapy.Request(url=item['img_link'])

    def file_path(self, request, response=None, info=None):
        # save the image under its original file name
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        return item
```

Settings:

```python
ITEM_PIPELINES = {
    'xhxh.pipelines.XhxhPipeline': 300,
    'xhxh.pipelines.ImgPipeLine': 301,
}
IMAGES_STORE = './mvs'  # directory where downloaded images are stored
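A usage note: Scrapy's ImagesPipeline depends on the Pillow imaging library, so it needs to be installed (pip install Pillow) for ImgPipeLine to be activated. With that in place the spider can be run from the project directory with scrapy crawl xh (assuming the project is named xhxh, as in the settings above), and the downloaded images end up under the IMAGES_STORE directory ./mvs.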
