Getting Started with the Scrapy Framework

Published: 2024/4/17

1. Introduction

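Scrapy is a crawling framework written in Python and built on Twisted, an asynchronous networking framework. Its architecture is clear, its modules are loosely coupled, and it is highly extensible and flexible, which has made it one of the most widely used crawling frameworks in Python.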

[Figure: Scrapy architecture diagram]

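It consists of the following components:

Engine: handles the data flow of the whole system and triggers events; it is the core of the framework.
Item: defines the data structure of the scraped results; scraped data is assigned to an Item object.
Scheduler: accepts requests sent by the Engine, adds them to a queue, and hands them back when the Engine asks for the next request.
Downloader: downloads the page content and returns the response, which the Engine passes on to the Spiders.
Spiders: define the crawling logic and the page parsing rules; their main job is to parse responses and produce extracted results and new requests.
Item Pipeline: processes the items that the Spiders extract from pages; its main job is to clean, validate, and store the data.
Downloader Middlewares: a hook framework between the Engine and the Downloader that processes the requests and responses passing between them.
Spider Middlewares: a hook framework between the Engine and the Spiders that processes the responses going into the Spiders and the results and new requests coming out.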
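The data flow between these components can be sketched with a toy model in plain Python. This is not Scrapy's implementation: FAKE_SITE, download, parse, and run are hypothetical names invented for the illustration, and the "website" is an in-memory dictionary, so nothing is fetched over the network.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (quotes on the page, next URL or None).
FAKE_SITE = {
    "page1": (["quote A", "quote B"], "page2"),
    "page2": (["quote C"], None),
}


def download(url):
    """Downloader stand-in: return the 'response' for a URL."""
    return FAKE_SITE[url]


def parse(response):
    """Spider stand-in: yield extracted items and follow-up requests."""
    quotes, next_url = response
    for text in quotes:
        yield {"text": text}      # an extracted item
    if next_url is not None:
        yield next_url            # a new request


def run(start_url):
    """Engine stand-in: move requests and items between the components."""
    scheduler = deque([start_url])  # Scheduler: queue of pending requests
    stored = []                     # Item Pipeline stand-in: collects the items
    while scheduler:
        url = scheduler.popleft()
        for result in parse(download(url)):
            if isinstance(result, dict):
                stored.append(result)     # items go to the pipeline
            else:
                scheduler.append(result)  # requests go back to the scheduler
    return stored


items = run("page1")
print(items)
```

The real framework does the same routing asynchronously and adds the two middleware layers around the Downloader and the Spiders.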

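Project structure

A Scrapy project is created from the command line and the code is then written in an IDE. The project file structure looks like this:

scrapy.cfg              # Scrapy project configuration file
project/
    __init__.py
    items.py            # defines the Item data structures
    pipelines.py        # defines the Item Pipeline implementations
    settings.py         # defines the project-wide settings
    middlewares.py      # defines the Spider and Downloader middleware implementations
    spiders/            # contains the individual spider implementations
        __init__.py
        spider1.py
        spider2.py
        ...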

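2. Scrapy Getting-Started Demo

Goals:

Create a Scrapy project.
Create a Spider to crawl a site and process the data.
Export the scraped content from the command line.
Save the scraped content to a MongoDB database.

Create a Scrapy project: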

scrapy startproject tutorial

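The generated folder structure matches the project layout described earlier, with tutorial as the project package.

Create the Spider

A custom Spider class must inherit from scrapy.Spider. Use the command line to generate a Quotes spider.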

cd tutorial    # enter the tutorial directory just created, i.e. the project root
scrapy genspider quotes quotes.toscrape.com    # run genspider: the first argument is the spider name, the second is the site domain

A quotes.py file now appears under spiders:

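# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    # Unique name that distinguishes this spider from the others
    name = 'quotes'
    # Domains allowed to be crawled; links outside them are filtered out
    allowed_domains = ['quotes.toscrape.com']
    # URLs the spider requests when it starts
    start_urls = ['http://quotes.toscrape.com/']

    # Called with the response once each request above has been downloaded;
    # it parses the response, extracts data, or generates further requests
    def parse(self, response):
        pass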

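Create the Item

An Item is a container (data structure) for scraped data. It is used much like a dictionary but adds a protection mechanism against misspelled field names. A custom Item must inherit from scrapy.Item and declare its fields with scrapy.Field. Modify items.py as follows: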

import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
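The protection against misspelled field names can be illustrated without Scrapy installed. The sketch below is a simplified stand-in, not Scrapy's own Item class: StrictItem and QuoteRecord are hypothetical names, and the only behaviour modelled is that assigning to an undeclared field raises KeyError, as scrapy.Item does.

```python
class StrictItem(dict):
    """Dict-like container that only accepts declared field names."""

    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class QuoteRecord(StrictItem):
    # Mirrors the three fields declared on QuoteItem
    fields = ("text", "author", "tags")


record = QuoteRecord()
record["text"] = "A sample quote"

try:
    record["auther"] = "misspelled field name"
    rejected = False
except KeyError:
    rejected = True

print(record, rejected)
```

A plain dictionary would silently accept the misspelled key, and the mistake would only surface later as missing data.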

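Parse the Response

First open the spider's initial request URL, http://quotes.toscrape.com/, and inspect the page structure. Every page contains several blocks with the class quote, and each block contains a text, an author, and tags.

So modify the parse method of the custom Spider as follows: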

# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Use a CSS selector to pick the elements with the class quote
        quotes = response.css('.quote')
        for quote in quotes:
            # Text of the first .text element inside the quote
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            # Text of every tag
            tags = quote.css('.tags .tag::text').extract()
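What the three selectors pull out of one quote block can be illustrated with the standard library alone. The sketch below is only a rough stand-in for response.css: SAMPLE is an invented fragment shaped like the blocks described above, QuoteParser is a hypothetical helper, and matching is done on class names only, which is far cruder than real CSS selection.

```python
from html.parser import HTMLParser

# Invented fragment shaped like one .quote block; not copied from the site.
SAMPLE = (
    '<div class="quote">'
    '<span class="text">Sample quote</span>'
    '<small class="author">Sample Author</small>'
    '<div class="tags"><a class="tag">life</a><a class="tag">books</a></div>'
    '</div>'
)


class QuoteParser(HTMLParser):
    """Collect the text of elements whose class is text, author, or tag."""

    WANTED = {"text", "author", "tag"}

    def __init__(self):
        super().__init__()
        self.current = None
        self.found = {"text": [], "author": [], "tag": []}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        matches = self.WANTED.intersection(classes)
        self.current = matches.pop() if matches else None

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current:
            self.found[self.current].append(data)


parser = QuoteParser()
parser.feed(SAMPLE)
text = parser.found["text"][0]      # like extract_first() on '.text::text'
author = parser.found["author"][0]  # like extract_first() on '.author::text'
tags = parser.found["tag"]          # like extract() on '.tags .tag::text'
print(text, author, tags)
```

The first two values are single strings and the third is a list, which matches the difference between extract_first() and extract() in the spider.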

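Use the Item

QuotesSpider is rewritten so that each quote is stored in a QuoteItem and yielded; the complete code is given in the next section together with the follow-up requests.

Follow-up Requests

The follow-up request here is the request for the next page. To see how to build it, inspect the page: the pager at the bottom contains a Next link whose href attribute holds the relative URL of the next page.

quotes.py (QuotesSpider):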

# -*- coding: utf-8 -*-
import scrapy

from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Use a CSS selector to pick the elements with the class quote
        quotes = response.css('.quote')
        for quote in quotes:
            # Instantiate a QuoteItem
            item = QuoteItem()
            # Text of the first .text element inside the quote
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            # Text of every tag
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item

        # Relative URL of the next page (None on the last page)
        next_page = response.css('.pager .next a::attr("href")').extract_first()
        if next_page is not None:
            # Absolute URL of the next page
            url = response.urljoin(next_page)
            # Build a new request; when it completes, the response is handled
            # by parse again, and so on until there is no next page
            yield scrapy.Request(url=url, callback=self.parse)
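Turning the relative href into an absolute URL follows the ordinary URL-joining rules, which can be tried out with urllib.parse.urljoin from the standard library. In this minimal sketch the page URL and the href value "/page/2/" are assumed examples, and build_next_url is a hypothetical helper that also covers the last page, where no next link is found.

```python
from urllib.parse import urljoin


def build_next_url(page_url, href):
    """Return the absolute URL of the next page, or None when there is none."""
    if href is None:
        return None
    return urljoin(page_url, href)


page_url = "http://quotes.toscrape.com/page/1/"
next_url = build_next_url(page_url, "/page/2/")  # assumed href value
last_page = build_next_url(page_url, None)       # no next link found
print(next_url, last_page)
```

Checking for a missing link before building the request avoids scheduling a request back to the current page.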

Run the Spider

scrapy crawl quotes

The console output lists the current Scrapy version, the enabled Middlewares and Pipelines, and the items scraped from each page.

Save to a file

scrapy crawl quotes -o quotes.json: saves the scraped results above as a JSON file.

scrapy crawl quotes -o quotes.jsonlines: writes one line of JSON per Item.
scrapy crawl quotes -o quotes.csv: writes CSV.
scrapy crawl quotes -o quotes.xml: writes XML.
scrapy crawl quotes -o quotes.pickle: writes pickle.
scrapy crawl quotes -o quotes.marshal: writes marshal.
scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/to/quotes.csv: writes to a remote FTP location.
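The difference between the JSON and JSON lines outputs can be shown with the standard json module. This is an illustration of the two layouts only, not of Scrapy's exporters, and the two sample items are invented.

```python
import json

# Invented sample items with the same fields as QuoteItem
items = [
    {"text": "Sample quote one", "author": "Author A", "tags": ["life"]},
    {"text": "Sample quote two", "author": "Author B", "tags": []},
]

# JSON: the whole result is a single array
as_json = json.dumps(items)

# JSON lines: one independent JSON object per line
as_json_lines = "\n".join(json.dumps(item) for item in items)

# A JSON lines file can be read back line by line
parsed_back = [json.loads(line) for line in as_json_lines.splitlines()]
print(as_json)
print(as_json_lines)
```

Because each line is a complete object, the JSON lines layout can be appended to and processed as a stream, whereas a JSON array has to be read as a whole.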

Use an Item Pipeline to save to the database

For more complex operations, such as saving the results to a MongoDB database or keeping only the useful Items, we can define custom Item Pipelines. Modify pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.exceptions import DropItem


class TextPipeline(object):
    def __init__(self):
        self.limit = 50

    # process_item must be implemented; an enabled Item Pipeline
    # has this method called automatically for every item
    def process_item(self, item, spider):
        '''
        Raise DropItem if the field is empty. Otherwise, if the text is
        longer than the limit, truncate it to the limit and append an
        ellipsis; then return the item.
        '''
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    # @classmethod marks this as a class method. It is a form of dependency
    # injection: through crawler we can read every value of the global
    # configuration (settings.py)
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    # Insert the item into the collection named after its class
    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
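The truncation rule used by TextPipeline can be checked in isolation. The sketch below extracts it into a standalone function; truncate_text is a hypothetical name, and the default limit of 50 is taken from the pipeline above.

```python
def truncate_text(text, limit=50):
    """Cut text to at most `limit` characters and append an ellipsis if cut."""
    if len(text) > limit:
        # rstrip() removes a trailing space left by the cut
        return text[:limit].rstrip() + "..."
    return text


short_result = truncate_text("Short quote")
long_result = truncate_text("word " * 20)  # 100 characters of input
print(short_result)
print(long_result)
```

Note that the result can be slightly longer than the limit, because the three-character ellipsis is appended after the cut.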

Add the following to settings.py:

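# ITEM_PIPELINES is a dictionary: each key is the import path of a pipeline
# class and each value is its priority, a number; the lower the number,
# the earlier the pipeline is called
ITEM_PIPELINES = {
    'tutorial.pipelines.TextPipeline': 300,
    'tutorial.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'localhost'
MONGO_DB = 'tutorial'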
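The effect of the priority numbers can be sketched in plain Python. This models the ordering rule only and is not how Scrapy loads pipelines: DropRecord, require_text, add_length, STAGES, and run_stages are hypothetical names, and the stages are simple functions rather than pipeline classes.

```python
class DropRecord(Exception):
    """Hypothetical stand-in for an exception that discards an item."""


calls = []  # records the order in which the stages run


def require_text(item):
    calls.append("require_text")
    if not item.get("text"):
        raise DropRecord("Missing text")
    return item


def add_length(item):
    calls.append("add_length")
    return {**item, "length": len(item["text"])}


# Same shape as ITEM_PIPELINES: stage -> priority (lower runs first)
STAGES = {add_length: 400, require_text: 300}


def run_stages(item, stages):
    """Pass the item through every stage in ascending priority order."""
    for stage, _priority in sorted(stages.items(), key=lambda pair: pair[1]):
        try:
            item = stage(item)
        except DropRecord:
            return None  # a dropped item skips the remaining stages
    return item


kept = run_stages({"text": "hello"}, STAGES)
dropped = run_stages({"text": ""}, STAGES)
print(kept, dropped, calls)
```

With the settings above, this ordering means TextPipeline (300) validates and trims each item before MongoPipeline (400) stores it, so dropped items never reach the database.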

Run the crawl again

scrapy crawl quotes

3. Reference

Cui Qingcai. 《Python3 網絡爬蟲開發實戰》 (Python 3 Web Crawler Development in Practice)

Reposted from: https://www.cnblogs.com/yunche/p/10357232.html
