Scrapy Framework Architecture

Published: 2025/3/17

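Introduction

Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping). It lets you extract the data you need from websites in a fast, simple, and extensible way. Today Scrapy is used much more broadly: for data mining, monitoring, and automated testing, for retrieving data returned by APIs (for example, Amazon Associates Web Services), and as a general-purpose web crawler.

Scrapy is built on Twisted, a popular event-driven networking framework for Python. As a result, Scrapy achieves concurrency with non-blocking (that is, asynchronous) code.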
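To make the idea of non-blocking concurrency concrete, here is a minimal sketch. It deliberately uses asyncio from the Python standard library rather than Twisted, and it does not show how Scrapy is implemented; fake_download, crawl_all, and timed_crawl are hypothetical names, and the "download" is only a timed sleep.

```python
import asyncio
import time


async def fake_download(url: str, delay: float) -> str:
    # asyncio.sleep stands in for waiting on the network. It hands control
    # back to the event loop instead of blocking the thread.
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl_all(urls, delay: float = 0.2):
    # Start every simulated download at once and wait for all of them.
    # gather() returns the results in the same order as the inputs.
    return await asyncio.gather(*(fake_download(u, delay) for u in urls))


def timed_crawl(urls, delay: float = 0.2):
    # Run the event loop and measure how long the whole batch takes.
    start = time.perf_counter()
    pages = asyncio.run(crawl_all(urls, delay))
    return pages, time.perf_counter() - start
```

With three URLs and a delay of 0.2 seconds each, the batch should finish in roughly 0.2 seconds rather than 0.6, because the waits overlap in a single thread.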

Architecture

Data flow

1. The Engine gets the initial Requests to crawl from the Spider.

2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.

3. The Scheduler returns the next Requests to the Engine.

4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).

5. Once the page finishes downloading, the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).

6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).

7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).

8. The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.

9. The process repeats (from step 1) until there are no more requests from the Scheduler.
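The loop described above can be sketched as a toy, single-threaded program. This is only an illustration of the order of the hand-offs, not Scrapy's code: FAKE_WEB, download, parse, and run_engine are hypothetical names, the "web" is an in-memory dictionary, and both middleware layers are left out.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page title, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("Home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("Page A", ["http://example.com/b"]),
    "http://example.com/b": ("Page B", []),
}


def download(url):
    """Downloader: turn a request (here just a URL) into a response dict."""
    title, links = FAKE_WEB[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Spider callback: yield one item, then the new requests to follow."""
    yield {"item": {"url": response["url"], "title": response["title"]}}
    for link in response["links"]:
        yield {"request": link}


def run_engine(start_urls):
    scheduler = deque()    # Scheduler: queue of pending requests
    seen = set()           # duplicate filter
    pipeline_output = []   # stands in for the Item Pipelines

    def schedule(url):
        if url not in seen:
            seen.add(url)
            scheduler.append(url)

    for url in start_urls:                # initial requests from the spider
        schedule(url)
    while scheduler:                      # stop when the scheduler is empty
        url = scheduler.popleft()         # scheduler returns the next request
        response = download(url)          # engine -> downloader -> engine
        for result in parse(response):    # engine -> spider -> engine
            if "item" in result:
                pipeline_output.append(result["item"])   # items -> pipelines
            else:
                schedule(result["request"])              # requests -> scheduler
    return pipeline_output
```

Calling run_engine(["http://example.com/"]) visits each of the three fake pages once, even though page B is linked twice, because the duplicate filter rejects the second request.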

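Components

Engine (ENGINE):

The engine controls the data flow between all components of the system and triggers events when certain actions occur.

Scheduler (SCHEDULER):

Receives requests from the engine, pushes them onto a queue, and returns them when the engine asks for the next one. It can be thought of as a priority queue of URLs: it decides which URL to crawl next and filters out duplicate URLs.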
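The "priority queue with duplicate removal" picture can be sketched in a few lines. ToyScheduler is a hypothetical class, not Scrapy's scheduler: it deduplicates on the raw URL string (Scrapy uses request fingerprints), and the first-in, first-out tie-breaking among equal priorities is an assumption made for this sketch. It does follow Scrapy's convention that a higher priority value is served earlier.

```python
import heapq
import itertools


class ToyScheduler:
    """Priority queue of URLs that silently rejects duplicates."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: keeps insertion order

    def enqueue(self, url, priority=0):
        """Queue a URL. Return False if it was already seen, True otherwise."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated to make
        # higher-priority URLs pop first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def next_request(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

A URL stays in the seen set after it has been popped, so re-enqueueing a page that was already crawled is also rejected.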

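Downloader (DOWNLOADER)

Downloads page content and returns it to the ENGINE. The downloader is built on Twisted's efficient asynchronous model.

Spiders (SPIDERS)

SPIDERS are classes written by the developer to parse responses and either extract items or issue new requests.

Item Pipelines (ITEM PIPELINES)

Process items after they have been extracted. Typical tasks are cleaning, validation, and persistence (for example, storing items in a database).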
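As an illustration of cleaning and persistence, here is a sketch of an item pipeline that stores items in SQLite. A pipeline is an ordinary Python class; open_spider, process_item, and close_spider are the hook names Scrapy documents for pipelines, though the exact signatures may differ between Scrapy versions. The class name SQLitePipeline, the pages table, and the assumption that items are dicts with url and title keys are all hypothetical.

```python
import sqlite3


class SQLitePipeline:
    """Item pipeline sketch: clean each item, then persist it to SQLite."""

    def __init__(self, db_path=":memory:"):
        # ":memory:" keeps the sketch self-contained; a real project would
        # point this at a file or read the path from the settings.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the database.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
        )

    def process_item(self, item, spider):
        # Cleaning: normalize surrounding whitespace in the title.
        title = (item.get("title") or "").strip()
        item["title"] = title
        # Persistence: insert the row, or overwrite it if the URL exists.
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
            (item["url"], title),
        )
        self.conn.commit()
        return item  # returning the item hands it to the next pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.conn.close()
```

In a project, a class like this would be enabled through the ITEM_PIPELINES setting, using a module path such as tutorial.pipelines.SQLitePipeline (a hypothetical path). Validation is omitted here; in Scrapy an invalid item is normally rejected by raising DropItem from scrapy.exceptions.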

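Downloader Middlewares

Sit between the Scrapy engine and the downloader. They process requests passing from the ENGINE to the DOWNLOADER and responses passing from the DOWNLOADER back to the ENGINE. You can use a downloader middleware to:

process a request before it is sent to the downloader (that is, before the site is crawled)
modify a response before it is passed to the spider
send a new request instead of passing the received response to the spider
return a response to the spider without fetching the page
drop some requests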
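The first two uses in the list above can be sketched with the process_request() and process_response() hooks named in the data flow. A downloader middleware is a plain class; DefaultUserAgentMiddleware, the user-agent string, and the status_counts attribute are hypothetical, and the code assumes only that the request exposes a dict-like headers attribute and the response a status attribute, as Scrapy's objects do.

```python
class DefaultUserAgentMiddleware:
    """Downloader middleware sketch: set a header and count status codes."""

    def __init__(self, user_agent="tutorial-bot/0.1"):
        self.user_agent = user_agent
        self.status_counts = {}

    def process_request(self, request, spider):
        # Runs before the request reaches the downloader. setdefault leaves
        # a User-Agent that the spider already set untouched.
        request.headers.setdefault("User-Agent", self.user_agent)
        # Returning None tells Scrapy to keep processing this request.
        return None

    def process_response(self, request, response, spider):
        # Runs on the way back from the downloader to the engine.
        count = self.status_counts.get(response.status, 0)
        self.status_counts[response.status] = count + 1
        # Returning the response passes it on unchanged.
        return response
```

The other uses in the list correspond to different return values: in Scrapy, returning a Request from process_response() reschedules it instead of delivering the response, returning a Response from process_request() skips the download, and raising IgnoreRequest drops the request. A middleware is enabled through the DOWNLOADER_MIDDLEWARES setting.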

Spider Middlewares

Sit between the ENGINE and the SPIDERS. Their main job is to process spider input (responses) and spider output (items and requests).
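A spider middleware can be sketched the same way, using the process_spider_input() and process_spider_output() hooks from the data flow. RequireTitleMiddleware is a hypothetical name, and the sketch assumes that items are plain dicts while anything else in the spider's output is a request that should pass through untouched.

```python
class RequireTitleMiddleware:
    """Spider middleware sketch: filter incomplete items out of the output."""

    def process_spider_input(self, response, spider):
        # Runs before the response reaches the spider callback.
        # Returning None lets the response continue to the spider.
        return None

    def process_spider_output(self, response, result, spider):
        # `result` is the iterable of items and requests the callback produced.
        for obj in result:
            if isinstance(obj, dict) and not obj.get("title"):
                continue  # drop items that have no usable title
            yield obj     # pass everything else on to the engine
```

A middleware like this is enabled through the SPIDER_MIDDLEWARES setting. If the spider callback is an asynchronous generator, Scrapy expects an asynchronous variant of the output hook, which this sketch does not cover.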

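Command-line tools

Global commands

startproject: create a project

genspider: scrapy genspider [-t template] <name> <domain> generates a spider; -l lists the available templates, -t selects a template, name is the spider name, and domain is the target domain

settings: show setting values

runspider: run a spider contained in a standalone Python file, without creating a project

shell: scrapy shell [url] starts an interactive console, which is convenient for debugging

--spider=SPIDER bypass spider autodetection and force the use of the given spider

-c evaluate the code, print the result, and exit:

$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')

--no-redirect do not follow HTTP redirects

--nolog do not print log output

response.status the HTTP status code of the response

response.url the URL of the response

response.text; response.body the response body as text; the response body as bytes

view(response) open the downloaded page in a local browser, which helps when analyzing the page (for example, to spot dynamically loaded elements)

fetch: download a URL the way the spider would, to see what the spider actually receives; common options:

--spider=SPIDER bypass spider autodetection and force the use of the given spider

--headers print the response headers

--no-redirect do not follow HTTP redirects

view: same as view(response) in the interactive shell

version: print the Scrapy version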

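Project commands

crawl: scrapy crawl <spider> starts crawling with the given spider (set ROBOTSTXT_OBEY = False in settings.py if requests should not be filtered by the site's robots.txt rules)

check: scrapy check [-l] <spider> runs the spider's contract checks; -l lists the contracts without running them

list: list the spiders available in the project

edit: edit a spider from the command line (rarely useful)

parse: scrapy parse <url> [options] fetches the URL and parses it with the specified callback (useful for verifying that a callback works correctly)

--callback or -c specifies the callback

bench: run a quick benchmark test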
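The ROBOTSTXT_OBEY option mentioned for crawl lives in the project's settings.py, which is an ordinary Python module of upper-case assignments. The fragment below is a sketch: the setting names are real Scrapy settings, but every value, and the pipeline path tutorial.pipelines.TutorialPipeline, is an assumption chosen for illustration.

```python
# tutorial/settings.py (fragment; all values are illustrative assumptions)

BOT_NAME = "tutorial"

# When True, requests disallowed by the site's robots.txt are filtered out.
ROBOTSTXT_OBEY = False

# Upper bound on the number of requests the downloader handles concurrently.
CONCURRENT_REQUESTS = 16

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 0.5

# Enabled item pipelines; lower numbers run earlier (conventionally 0-1000).
ITEM_PIPELINES = {
    "tutorial.pipelines.TutorialPipeline": 300,
}
```

The effective value of any of these can be checked from inside the project with the settings command, for example scrapy settings --get ROBOTSTXT_OBEY.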

Project structure and spider application overview

scrapy startproject tutorial

tutorial/
    scrapy.cfg            # main project configuration, used when deploying Scrapy; crawler-related settings live in settings.py
    tutorial/             # project module
        __init__.py
        items.py          # item definitions that give scraped data a structure, similar to a Django Model
        pipelines.py      # item processing behavior, typically persisting the structured data
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

To be updated.
