4. The Scrapy Crawler Framework: Using Scrapy Pipelines

Published: 2024/7/5

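Using Scrapy pipelines

Learning objective:

Master the use of Scrapy pipelines (pipelines.py).

The earlier section on getting started with Scrapy covered the basics of pipelines. This section looks at them in more depth.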

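1. Common methods in a pipeline:

process_item(self, item, spider):

The method every pipeline class must define
Processes each item
Must return the item (or raise DropItem to discard it)

open_spider(self, spider): runs exactly once, when the spider is opened

close_spider(self, spider): runs exactly once, when the spider is closed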
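
To make the three hooks concrete, here is a minimal sketch that is not part of the wangyi project. CountingPipeline and FakeSpider are hypothetical names. A pipeline is a plain Python class, so the sketch runs without Scrapy by calling the hooks by hand in the order Scrapy would call them.

```python
class CountingPipeline:
    """Hypothetical pipeline that counts the items of one crawl."""

    def open_spider(self, spider):
        # Called once when the spider starts: set up state or resources.
        self.count = 0

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.count += 1
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes: release resources, report.
        self.summary = f"{spider.name}: {self.count} items"


class FakeSpider:
    """Stand-in for a scrapy.Spider; only the name attribute is used."""
    name = "job"


# Call the hooks by hand in the order Scrapy would call them.
pipeline = CountingPipeline()
spider = FakeSpider()
pipeline.open_spider(spider)
results = [pipeline.process_item({"title": t}, spider) for t in ("a", "b")]
pipeline.close_spider(spider)
print(pipeline.summary)
```

In a real project Scrapy creates the pipeline object and calls these methods itself; the manual calls here only illustrate the order.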

2. Editing the pipeline file

Continue building out the wangyi spider by completing the code in pipelines.py.

My code:

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

    # useful for handling different item types with a single interface
    from itemadapter import ItemAdapter
    import json

    from pymongo import MongoClient


    class WangyiPipeline:
        # def __init__(self):
        #     self.file = open('wangyi.json','w')

        def open_spider(self, spider):
            if spider.name == 'job':
                self.file = open('wangyi.json', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            if spider.name == 'job':
                # Convert the item object to a dict
                item = dict(item)
                # Serialize the dict to a JSON string
                str_data = json.dumps(item, ensure_ascii=False) + ',\n'
                self.file.write(str_data)
            return item

        def close_spider(self, spider):
            if spider.name == 'job':
                self.file.close()


    class WangyiSimplePipeline:
        # def __init__(self):
        #     self.file = open('wangyi.json','w')

        def open_spider(self, spider):
            if spider.name == 'job_simple':
                self.file = open('wangyiSimple.json', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            if spider.name == 'job_simple':
                # Convert the item object to a dict
                item = dict(item)
                # Serialize the dict to a JSON string
                str_data = json.dumps(item, ensure_ascii=False) + ',\n'
                self.file.write(str_data)
            return item

        def close_spider(self, spider):
            if spider.name == 'job_simple':
                self.file.close()


    class MongoPipeline(object):
        def open_spider(self, spider):
            self.client = MongoClient('127.0.0.1', 27017)
            self.db = self.client['itcast']
            self.col = self.db['wangyi']

        def process_item(self, item, spider):
            # Convert the item object to a dict
            data = dict(item)
            # Write the data to the database
            # (Collection.insert() was removed in PyMongo 4; insert_one() replaces it)
            self.col.insert_one(data)
            return item

        def close_spider(self, spider):
            self.client.close()

Reference code:

    import json

    from pymongo import MongoClient


    class WangyiFilePipeline(object):
        def open_spider(self, spider):  # runs once, when the spider is opened
            if spider.name == 'itcast':
                self.f = open('json.txt', 'a', encoding='utf-8')

        def close_spider(self, spider):  # runs once, when the spider is closed
            if spider.name == 'itcast':
                self.f.close()

        def process_item(self, item, spider):
            if spider.name == 'itcast':
                self.f.write(json.dumps(dict(item), ensure_ascii=False, indent=2) + ',\n')
            # Without this return, the next pipeline (the one with lower priority)
            # never receives the item
            return item


    class WangyiMongoPipeline(object):
        def open_spider(self, spider):  # runs once, when the spider is opened
            if spider.name == 'itcast':
                # isinstance() can also be used to tell spider classes apart
                self.client = MongoClient(host='127.0.0.1', port=27017)  # create the MongoClient
                # Handle for the collection "teachers" in the database "itcast"
                self.collection = self.client.itcast.teachers

        def process_item(self, item, spider):
            if spider.name == 'itcast':
                # insert_one() needs a dict, so convert Item objects with dict() first
                self.collection.insert_one(dict(item))
            # Without this return, the next pipeline (the one with lower priority)
            # never receives the item
            return item

        def close_spider(self, spider):  # runs once, when the spider is closed
            if spider.name == 'itcast':
                self.client.close()
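
The code comments warn that a pipeline which does not return the item starves the next one. The sketch below illustrates that under a simplifying assumption: run_chain is a hypothetical stand-in for Scrapy's internal chaining, in which each pipeline receives the return value of the one before it. ForgetfulPipeline and NextPipeline are made-up names.

```python
class ForgetfulPipeline:
    def process_item(self, item, spider):
        item["seen"] = True
        # Bug on purpose: no `return item`, so the method returns None.


class NextPipeline:
    def process_item(self, item, spider):
        # Record whatever the previous pipeline handed over.
        self.received = item
        return item


def run_chain(pipelines, item, spider=None):
    """Simplified stand-in for Scrapy's chaining: each pipeline
    receives the return value of the previous one."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


second = NextPipeline()
final = run_chain([ForgetfulPipeline(), second], {"title": "engineer"})
print(second.received)  # prints None
```

Adding `return item` to ForgetfulPipeline.process_item makes the second pipeline receive the dict again.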

3. Enabling pipelines

Enable the pipelines in settings.py:

    ......
    ITEM_PIPELINES = {
        'myspider.pipelines.WangyiFilePipeline': 400,   # 400 is the weight
        'myspider.pipelines.WangyiMongoPipeline': 500,  # the smaller the weight, the earlier it runs
    }
    ......

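Don't forget to start the MongoDB service: sudo service mongodb start
Then check the stored data from the MongoDB shell: mongo
(On newer installations the service may be named mongod and the shell mongosh.)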
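
The effect of the weights can be sketched in plain Python. The dictionary below has the same shape as the ITEM_PIPELINES setting, but the three dotted paths are hypothetical and the sorting is only an illustration of the resulting order, not Scrapy's own code.

```python
# Same shape as the ITEM_PIPELINES setting: dotted path -> integer weight.
# The paths are hypothetical examples.
ITEM_PIPELINES = {
    "myspider.pipelines.StorePipeline": 500,
    "myspider.pipelines.CleanPipeline": 300,
    "myspider.pipelines.FilePipeline": 400,
}

# Sort by weight, ascending: the pipeline with the smallest number runs first.
execution_order = [
    path for path, weight in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]

for position, path in enumerate(execution_order, start=1):
    print(position, path)
```

The order in which the entries are written in the dictionary does not matter; only the numbers do.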

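Question: settings.py can enable several pipelines at once. Why would you need more than one?

Different pipelines can handle data from different spiders, told apart by the spider.name attribute.

Different pipelines can apply different processing steps to the items of one or more spiders, for example one pipeline cleans the data and another stores it.

A single pipeline class can also handle data from different spiders, again told apart by spider.name.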
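
A rough sketch of the "one cleans, one stores" split, combined with the spider.name check. All names here (CleanPipeline, JobStorePipeline, FakeSpider, run) are hypothetical, storage is an in-memory list rather than a database, and run is a simplified stand-in for Scrapy passing an item through the enabled pipelines in weight order.

```python
class CleanPipeline:
    """Cleaning step for every spider: strips whitespace from string fields."""

    def process_item(self, item, spider):
        return {key: value.strip() if isinstance(value, str) else value
                for key, value in dict(item).items()}


class JobStorePipeline:
    """Storage step that only keeps items from the spider named 'job'."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        if spider.name == "job":
            self.items.append(item)
        return item  # always pass the item on, stored or not


class FakeSpider:
    """Stand-in for a scrapy.Spider; only the name attribute is used."""

    def __init__(self, name):
        self.name = name


def run(pipelines, item, spider):
    # Simplified stand-in for Scrapy: pass the item through each pipeline in order.
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


clean, store = CleanPipeline(), JobStorePipeline()
job, other = FakeSpider("job"), FakeSpider("job_simple")
store.open_spider(job)

run([clean, store], {"title": "  engineer \n"}, job)
other_result = run([clean, store], {"title": " designer "}, other)
print(store.items)
```

Both items are cleaned, but only the one from the 'job' spider is stored, which is the separation of duties the answers above describe.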

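4. Notes on using pipelines

A pipeline must be enabled in settings before it is used.

In the ITEM_PIPELINES setting, each key is the pipeline's location (its import path, so the class can live anywhere in the project) and each value indicates its distance from the engine: the closer a pipeline is, the earlier the data passes through it, so pipelines with smaller weights run first.

When there are several pipelines, process_item must return the item; otherwise the next pipeline receives None.

Every pipeline must define process_item; otherwise it has no way to receive and process items.

process_item receives item and spider, where spider is the spider that passed the item in.

open_spider(self, spider): runs once when the spider is opened

close_spider(self, spider): runs once when the spider is closed

These two methods are typically used for the interaction between the spider and a database: connect to the database when the spider opens, and disconnect when it closes.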
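
The connect-on-open, disconnect-on-close pattern can be sketched without a running database server. This example is an assumption-laden illustration: it swaps MongoDB for an in-memory SQLite database from the standard library, and SqlitePipeline, the jobs table, and the title field are all hypothetical.

```python
import sqlite3


class SqlitePipeline:
    """Hypothetical pipeline; SQLite stands in for the real database."""

    def open_spider(self, spider):
        # Connect once per crawl, not once per item.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE jobs (title TEXT)")

    def process_item(self, item, spider):
        # Reuse the connection opened in open_spider.
        self.conn.execute("INSERT INTO jobs (title) VALUES (?)",
                          (dict(item)["title"],))
        return item

    def close_spider(self, spider):
        # Count the rows (only so the sketch has something to show),
        # then commit and disconnect.
        self.saved = self.conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
        self.conn.commit()
        self.conn.close()


# Call the hooks by hand in the order Scrapy would call them.
pipeline = SqlitePipeline()
pipeline.open_spider(spider=None)
for title in ("engineer", "designer"):
    pipeline.process_item({"title": title}, spider=None)
pipeline.close_spider(spider=None)
print(pipeline.saved)
```

Opening the connection in process_item instead would reconnect for every item, which is why the setup belongs in open_spider.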

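Summary

Pipelines clean and store data, and several pipelines can be defined to do different jobs. A pipeline has three methods:
- process_item(self, item, spider): processes each item
- open_spider(self, spider): runs exactly once, when the spider is opened
- close_spider(self, spider): runs exactly once, when the spider is closed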
