
Python 3 Scrapy tutorial: Scrapy Tutorial

Published: 2025/3/13 python 16 豆豆


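Scrapy Tutorial

In this tutorial, we assume that Scrapy is already installed on your system. If it is not, see the installation guide.

We are going to scrape quotes.toscrape.com, a website that lists quotes from famous authors.

This tutorial will walk you through these tasks:

Creating a new Scrapy project

Writing a spider to crawl a site and extract data

Exporting the scraped data using the command line

Changing the spider to recursively follow links

Using spider arguments

Scrapy is written in Python. If you are new to the language, you may want to start by getting an idea of what the language is like, to get the most out of Scrapy.

If you are already familiar with other languages and want to learn Python quickly, the Python Tutorial is a good resource.

If you are new to programming and want to start with Python, it is worth working through an introductory Python book or course first.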

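Creating a project

Before you start scraping, you have to set up a new Scrapy project. Enter a directory where you would like to store your code and run:

scrapy startproject tutorial

This will create a tutorial directory with the following contents: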

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

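Our first Spider

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

This is the code for our first spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project: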

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The URLs end in ".../page/N/", so the second-to-last piece is N.
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

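As you can see, our spider subclasses scrapy.Spider and defines some attributes and methods:

name: identifies the spider. It must be unique within a project, that is, you cannot set the same name for different spiders.

start_requests(): must return an iterable of requests (you can return a list of requests or write a generator function) which the spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

The parse() method usually parses the response, extracts the scraped data as dicts, and also finds new URLs to follow and creates new requests (Request) from them.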
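The parse() callback in the spider above names each saved file after the page number it reads out of the URL. As a minimal sketch of just that step, assuming URLs shaped like the two in start_requests (the helper name page_filename is hypothetical and not part of Scrapy):

```python
def page_filename(url: str) -> str:
    """Build the output filename the same way parse() does."""
    # "http://quotes.toscrape.com/page/1/".split("/") ends in
    # [..., "page", "1", ""], so index -2 is the page number
    # whenever the URL has a trailing slash.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


print(page_filename("http://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(page_filename("http://quotes.toscrape.com/page/2/"))  # quotes-2.html
# Without the trailing slash, index -2 is "page" rather than the number.
print(page_filename("http://quotes.toscrape.com/page/2"))   # quotes-page.html
```

The last call shows why this shortcut only suits URLs with a known, fixed shape.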

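How to run our spider

To put our spider to work, go to the project's top-level directory and run:

scrapy crawl quotes

This command runs the spider named quotes that we have just added, which will send some requests to the quotes.toscrape.com domain. You will get output similar to this: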

... (omitted for brevity)

2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened

2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023

2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) (referer: None)

2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)

2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html

2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html

2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)

...

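Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.

Note

If you are wondering why we haven't parsed the HTML yet, hold on, we will cover that soon.

What just happened under the hood?

Scrapy schedules the scrapy.Request objects returned by the start_requests method of the spider. Upon receiving a response for each one, it instantiates a Response object and calls the callback method associated with the request (in this case, the parse method), passing the response as an argument.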
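The request-then-callback flow described above can be imitated in a few lines of plain Python. This is a toy model only: ToyRequest, ToyResponse, FAKE_PAGES and run are hypothetical names, the pages are canned data on a made-up host, and the loop is sequential, whereas Scrapy's real engine is asynchronous and far more involved.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class ToyRequest:
    url: str
    callback: Callable  # called with the response once it "arrives"


@dataclass
class ToyResponse:
    url: str
    body: bytes


# Stand-in for the network: canned bodies keyed by URL (made-up data).
FAKE_PAGES: Dict[str, bytes] = {
    "http://example.test/page/1/": b"<html>first page</html>",
    "http://example.test/page/2/": b"<html>second page</html>",
}


def run(start_requests: Iterable[ToyRequest]) -> List[str]:
    """Fetch each request in order and hand the response to its callback."""
    results = []
    for request in start_requests:
        response = ToyResponse(request.url, FAKE_PAGES[request.url])
        results.append(request.callback(response))
    return results


def parse(response: ToyResponse) -> str:
    page = response.url.split("/")[-2]
    return f"page {page}: {len(response.body)} bytes"


results = run(ToyRequest(url, parse) for url in FAKE_PAGES)
for line in results:
    print(line)
```

The point of the sketch is the pairing: each request carries its own callback, and whatever drives the crawl decides when that callback runs.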

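A shortcut to the start_requests method

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider: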

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

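The parse() method will be called to handle each of the requests for those URLs, even though we haven't explicitly told Scrapy to do so. This happens because parse() is Scrapy's default callback method, which is called for requests without an explicitly assigned callback.

Extracting data

The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell. Run: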

scrapy shell 'http://quotes.toscrape.com/page/1/'

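Note

Remember to always enclose URLs in quotes when running the Scrapy shell from the command line; otherwise, URLs containing arguments (i.e. the & character) will not work.

On Windows, use double quotes instead:

scrapy shell "http://quotes.toscrape.com/page/1/"

You will see something like: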

[ ... Scrapy log here ... ]

2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)

[s] Available Scrapy objects:

[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)

[s] crawler

[s] item {}

[s] request

[s] response <200 http://quotes.toscrape.com/page/1/>

[s] settings

[s] spider

[s] Useful shortcuts:

[s] shelp() Shell help (print this help)

[s] fetch(req_or_url) Fetch request (or URL) and update local objects

[s] view(response) View response in a browser

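Using the shell, you can try selecting elements using CSS with the response object:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to refine the selection or extract the data.

To extract the text from the title above, you can do:

>>> response.css('title::text').getall()
['Quotes to Scrape']

There are two things to note here. One is that we have added ::text to the CSS query, meaning we want to select only the text directly inside the <title> element. If we don't specify ::text, we get the full title element, including its tags:

>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']

The other thing is that the result of calling .getall() is a list: a selector may return more than one result, so we extract them all. When you know you just want the first result, as in this case, you can do:

>>> response.css('title::text').get()
'Quotes to Scrape'

As an alternative, you could have written:

>>> response.css('title::text')[0].get()
'Quotes to Scrape'

However, using .get() directly on a SelectorList instance avoids an IndexError and returns None when it doesn't find any element matching the selection.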
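The difference between the two spellings can be imitated with a plain Python list. This is only an analogy, assuming a list stands in for a SelectorList; first_or_default is a hypothetical helper, not a Scrapy function:

```python
def first_or_default(results, default=None):
    """Return the first item, or `default` when there are no results."""
    return results[0] if results else default


matches = ["Quotes to Scrape"]
no_matches = []

print(first_or_default(matches))     # Quotes to Scrape
print(first_or_default(no_matches))  # None

try:
    no_matches[0]
except IndexError:
    print("indexing an empty result raises IndexError")
```

A callback written in the first style keeps running when an element is missing; one written in the second style stops at the first page that lacks it.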

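There's a lesson here: for most scraping code, you want it to be resilient to errors caused by things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.

Besides the .getall() and .get() methods, you can also use the .re() method to extract using regular expressions: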

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
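The same three patterns can be tried with the standard library's re module on the plain title string. The flatten_matches helper below is hypothetical; it flattens capture groups so that its results line up with the outputs shown above, and it is a sketch of that observable behaviour rather than Scrapy's own implementation:

```python
import re


def flatten_matches(pattern: str, text: str) -> list:
    """Return whole matches, or the captured groups when the pattern has any."""
    found = []
    for match in re.finditer(pattern, text):
        if match.groups():
            found.extend(match.groups())
        else:
            found.append(match.group(0))
    return found


title = "Quotes to Scrape"
print(flatten_matches(r"Quotes.*", title))        # ['Quotes to Scrape']
print(flatten_matches(r"Q\w+", title))            # ['Quotes']
print(flatten_matches(r"(\w+) to (\w+)", title))  # ['Quotes', 'Scrape']
```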

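To find the proper CSS selectors to use, you might find it useful to open the response page from the shell in your web browser using view(response). You can use your browser's developer tools to inspect the HTML and come up with a selector (see "Using your browser's Developer Tools for scraping").

Selector Gadget is also a nice tool to quickly find the CSS selector for visually selected elements, and it works in many browsers.

XPath: a brief intro

Besides CSS, Scrapy selectors also support XPath expressions:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'

XPath expressions are very powerful and are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood. You can see that if you read the text representation of the selector objects in the shell closely.

While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you can select things like: select the link that contains the text "Next Page". This makes XPath very fitting to the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.

We won't cover much of XPath here, but you can read more about using XPath with Scrapy selectors in the Scrapy documentation. To learn more about XPath, we recommend the tutorial on learning XPath through examples and the tutorial on "how to think in XPath".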

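Extracting quotes and authors

Now that you know a bit about selection and extraction, let's complete our spider by writing the code to extract the quotes from the web page.

Each quote on http://quotes.toscrape.com is represented by HTML elements that look like this: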

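<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>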

Let's open up the Scrapy shell and play a bit to find out how to extract the data we want:

$ scrapy shell 'http://quotes.toscrape.com'

We get a list of selectors for the quote HTML elements with:

>>> response.css("div.quote")
[<Selector ...>, <Selector ...>, ...]

Each of the selectors returned by the query above allows us to run further queries over its sub-elements. Let's assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:

>>> quote = response.css("div.quote")[0]

Now, let's extract the text, author and tags from that quote using the quote object we just created:

>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'

Given that the tags are a list of strings, we can use the .getall() method to get all of them:

>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary:

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))

{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}

{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}

...

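Extracting data in our spider

Let's get back to our spider. Until now, it doesn't extract any data in particular; it just saves the whole HTML page to a local file. Let's integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below: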

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

If you run this spider, it will output the extracted data along with the log:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>

{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>

{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

Storing the scraped data

The simplest way to store the scraped data is by using Feed exports, with the following command:

scrapy crawl quotes -O quotes.json

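That will generate a quotes.json file containing all scraped items, serialized in JSON.

The -O command-line switch overwrites any existing file; use -o instead to append new content to any existing file. However, appending to a JSON file makes the file contents invalid JSON. When appending to a file, consider using a different serialization format, such as JSON Lines:

scrapy crawl quotes -o quotes.jl

The JSON Lines format is useful because it is stream-like: you can easily append new records to it. It doesn't have the same problem as JSON when you run the command twice. Also, as each record is a separate line, you can process big files without having to fit everything in memory, and there are tools like JQ to help do that at the command line.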
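Why appending breaks JSON but not JSON Lines can be checked with the standard json module alone. The snippet is a sketch that works on in-memory strings instead of files, and the two one-item runs are made-up sample data:

```python
import json

first_run = [{"author": "Albert Einstein"}]
second_run = [{"author": "J.K. Rowling"}]

# Appending a second JSON array after the first gives "[...][...]",
# which is not a single valid JSON document.
appended_json = json.dumps(first_run) + json.dumps(second_run)
try:
    json.loads(appended_json)
    json_still_valid = True
except json.JSONDecodeError:
    json_still_valid = False

# JSON Lines: one object per line, so appending only adds more lines.
appended_jl = "".join(json.dumps(item) + "\n" for item in first_run)
appended_jl += "".join(json.dumps(item) + "\n" for item in second_run)

# Each line parses on its own, so records can be read one at a time.
records = [json.loads(line) for line in appended_jl.splitlines()]

print(json_still_valid)  # False
print(records)
```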

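In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. A placeholder file for item pipelines was set up for you when the project was created, in tutorial/pipelines.py. You don't need to implement any item pipelines if you just want to store the scraped items.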

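Following links

Let's say that, instead of just scraping the content of the first two pages of http://quotes.toscrape.com, you want quotes from all the pages on the website.

Now that you know how to extract data from pages, let's see how to follow links from them.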

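The first thing to do is extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>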

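We can try extracting it in the shell:

>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gets the anchor element, but we want the href attribute. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:

>>> response.css('li.next a::attr(href)').get()
'/page/2/'

There is also an attrib property available (see "Selecting element attributes" for more):

>>> response.css('li.next a').attrib['href']
'/page/2/'

Let's see now our spider modified to recursively follow the link to the next page, extracting data from it: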

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

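Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative), and yields a new request to the next page, registering itself as the callback to handle the data extraction for the next page and to keep the crawl going through all the pages.

What you see here is Scrapy's mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page being visited.

In our example, it creates a sort of loop, following all the links to the next page until it doesn't find one, which is handy for crawling blogs, forums and other sites with pagination.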
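The urljoin() step resolves a possibly relative link against the URL of the page it was found on. The standard library's urllib.parse.urljoin performs the same kind of resolution, so it can be used outside a spider to see what to expect. This is a sketch with the page URL hard-coded as an assumption; inside a callback the base comes from the response.

```python
from urllib.parse import urljoin

page_url = "http://quotes.toscrape.com/page/1/"

# A root-relative href, as found in the "next" link.
from_root = urljoin(page_url, "/page/2/")
# A path relative to the current page.
from_relative = urljoin(page_url, "../2/")
# An already absolute URL is returned unchanged.
from_absolute = urljoin(page_url, "http://quotes.toscrape.com/tag/humor/")

print(from_root)      # http://quotes.toscrape.com/page/2/
print(from_relative)  # http://quotes.toscrape.com/page/2/
print(from_absolute)  # http://quotes.toscrape.com/tag/humor/
```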

A shortcut for creating Requests

As a shortcut for creating Request objects, you can use response.follow:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this request.

You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attributes:

for href in response.css('ul.pager a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:

for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)

To create multiple requests from an iterable, you can use response.follow_all instead:

anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)

or, shortening it further:

yield from response.follow_all(css='ul.pager a', callback=self.parse)

More examples and patterns

Here is another spider that illustrates callbacks and following links, this time for scraping author information:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

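This spider will start from the main page. It will follow all the links to the author pages, calling the parse_author callback for each of them, and also the pagination links with the parse callback, as we saw before.

Here we pass callbacks to response.follow_all as positional arguments to make the code shorter; it also works for Request.

The parse_author callback defines a helper function to extract and clean up the data from a CSS query, and yields a Python dict with the author data.

Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured with the DUPEFILTER_CLASS setting.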
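The idea behind duplicate filtering can be sketched with a set of URLs already seen. SeenUrlFilter is a hypothetical class written for illustration; it compares raw URL strings only, and the filter Scrapy actually uses (the one selected by DUPEFILTER_CLASS) is more involved. The author links are sample values chosen to include one repeat.

```python
class SeenUrlFilter:
    """Remember URLs and report whether one has been requested before."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, url: str) -> bool:
        if url in self._seen:
            return True
        self._seen.add(url)
        return False


author_links = [
    "http://quotes.toscrape.com/author/Albert-Einstein",
    "http://quotes.toscrape.com/author/J-K-Rowling",
    "http://quotes.toscrape.com/author/Albert-Einstein",  # linked from a second quote
]

url_filter = SeenUrlFilter()
to_visit = [url for url in author_links if not url_filter.is_duplicate(url)]
print(to_visit)
```

Only two of the three links survive, which is why the spider can follow every author link it sees without fetching the same page repeatedly.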

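Hopefully by now you have a good understanding of how to use the mechanism of following links and callbacks with Scrapy.

As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class, a generic spider that implements a small rules engine that you can use to write your crawlers on top of.

Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.

Using spider arguments

You can provide command-line arguments to your spiders by using the -a option when running them: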

scrapy crawl quotes -O quotes-humor.json -a tag=humor

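These arguments are passed to the spider's __init__ method and become spider attributes by default.

In this example, the value provided for the tag argument will be available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument: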

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

If you pass the tag=humor argument to this spider, you'll notice that it only visits URLs from the humor tag, such as http://quotes.toscrape.com/tag/humor.
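The getattr(self, 'tag', None) pattern can be tried without running a crawl. ToySpider below is a hypothetical stand-in that copies keyword arguments onto the instance, which is an assumption made to mirror the "arguments become attributes" behaviour described above; it is not the Scrapy base class.

```python
class ToySpider:
    """Stand-in for a spider: keyword arguments become attributes."""

    base_url = "http://quotes.toscrape.com/"

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

    def start_url(self) -> str:
        # getattr with a default avoids AttributeError when no tag was passed.
        tag = getattr(self, "tag", None)
        if tag is not None:
            return self.base_url + "tag/" + tag
        return self.base_url


print(ToySpider().start_url())             # http://quotes.toscrape.com/
print(ToySpider(tag="humor").start_url())  # http://quotes.toscrape.com/tag/humor
```

Using getattr with a default keeps the argument optional, so the same spider works with or without -a tag=... on the command line.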

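You can learn more about handling spider arguments in the Scrapy documentation.

Next steps

This tutorial covered only the basics of Scrapy, but there are a lot of other features not mentioned here. Check the "What else?" section in the "Scrapy at a glance" chapter for a quick overview of the most important ones.

You can continue from the "Basic concepts" section to learn more about the command-line tool, spiders, selectors and other things the tutorial hasn't covered, such as modeling the scraped data. If you prefer to play with an example project, check the "Examples" section.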
