[Rollo's Python Road] Learning Scrapy Selectors

Published: 2025/4/14


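Selectors

When scraping web pages, the most common task is extracting data from the HTML source. Several existing libraries can do this:

BeautifulSoup is a very popular web page parsing library among programmers. It builds a Python object from the structure of the HTML and handles bad markup reasonably well, but it has one drawback: it is slow.
lxml is an XML parsing library (it can also parse HTML) with a Pythonic API based on ElementTree. lxml is not part of the Python standard library.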

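Scrapy has its own mechanism for extracting data. These objects are called selectors because they "select" parts of an HTML document through XPath or CSS expressions.

Scrapy selectors are built on top of the lxml library, which means they are very similar to it in speed and parsing accuracy.

A Scrapy selector is a Selector instance constructed from text or from a TextResponse object. It automatically chooses the best parsing rules (XML vs HTML) based on the input type: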

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

Open a site with scrapy shell to practice using selectors (the PyCharm terminal works for this):

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

HTML source:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

1.1.1 Selecting the title with XPath. A selector query returns a list of selectors. Note that extract without parentheses only shows the bound method; call it to get the values.

In [1]: response.selector.xpath('//title/text()')
Out[1]: [<Selector xpath='//title/text()' data='Example website'>]

In [2]: response.selector.xpath('//title/text()').extract
Out[2]: <bound method SelectorList.getall of [<Selector xpath='//title/text()' data='Example website'>]>

In [3]: response.selector.xpath('//title/text()').extract_first()
Out[3]: 'Example website'

Selecting the title with CSS:

In [4]: response.selector.css('title::text')
Out[4]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

In [5]: response.selector.css('title::text').extract
Out[5]: <bound method SelectorList.getall of [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]>

In [6]: response.selector.css('title::text').extract_first
Out[6]: <bound method SelectorList.get of [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]>

In [7]: response.selector.css('title::text').extract_first()
Out[7]: 'Example website'

Scrapy also builds the selector into the response as an attribute, so you can skip typing selector and extract directly with response.xpath or response.css.

Using XPath:

In [8]: response.xpath('//title/text()')
Out[8]: [<Selector xpath='//title/text()' data='Example website'>]

In [9]: response.xpath('//title/text()').extract
Out[9]: <bound method SelectorList.getall of [<Selector xpath='//title/text()' data='Example website'>]>

In [10]: response.xpath('//title/text()').extract_first()
Out[10]: 'Example website'

Using CSS:

In [11]: response.css('title::text')
Out[11]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

In [12]: response.css('title::text').extract()
Out[12]: ['Example website']

In [13]: response.css('title::text').extract
Out[13]: <bound method SelectorList.getall of [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]>

Note: extract_first() accepts a default value that is returned when nothing matches: extract_first(default)

1.1.2 Selecting text

>>> response.xpath('//base/@href').extract()
['http://example.com/']

>>> response.css('base::attr(href)').extract()
['http://example.com/']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']

1.1.3 Extracting with regular expressions:

In [18]: response.css('a::text').re(r'Name\:(.*)')
Out[18]: [' My image 1 ', ' My image 2 ', ' My image 3 ', ' My image 4 ', ' My image 5 ']

In [19]: response.css('a::text').re_first(r'Name\:(.*)')
Out[19]: ' My image 1 '

In [21]: response.xpath('//a/text()')
Out[21]:
[<Selector xpath='//a/text()' data='Name: My image 1 '>,
 <Selector xpath='//a/text()' data='Name: My image 2 '>,
 <Selector xpath='//a/text()' data='Name: My image 3 '>,
 <Selector xpath='//a/text()' data='Name: My image 4 '>,
 <Selector xpath='//a/text()' data='Name: My image 5 '>]

In [22]: response.xpath('//a/text()').extract
Out[22]: <bound method SelectorList.getall of [<Selector xpath='//a/text()' data='Name: My image 1 '>, <Selector xpath='//a/text()' data='Name: My image 2 '>, <Selector xpath='//a/text()' data='Name: My image 3 '>, <Selector xpath='//a/text()' data='Name: My image 4 '>, <Selector xpath='//a/text()' data='Name: My image 5 '>]>

In [23]: response.xpath('//a/text()').extract()
Out[23]: ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

In [26]: response.xpath('//a/text()').re(r'Name\:(.*)')
Out[26]: [' My image 1 ', ' My image 2 ', ' My image 3 ', ' My image 4 ', ' My image 5 ']

In [27]: response.xpath('//a/text()').re_first(r'Name\:(.*)')
Out[27]: ' My image 1 '
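To see why the results keep their surrounding spaces, the same pattern can be run with the standard library's re module over the extracted strings. This is only a rough approximation of what re() and re_first() return for this pattern, not Scrapy's actual implementation; the input list is copied from the first three extract() results.

```python
import re

# Text nodes as returned by extract() on the sample page (first three only).
texts = ["Name: My image 1 ", "Name: My image 2 ", "Name: My image 3 "]

pattern = re.compile(r"Name:(.*)")

# Roughly what .re() yields: the captured group from every text node.
matches = [group for text in texts for group in pattern.findall(text)]

# Roughly what .re_first() yields: the first captured group, or None.
first = matches[0] if matches else None

# The capture group starts right after the colon, so whitespace is kept;
# strip() removes it when a clean value is needed.
cleaned = [m.strip() for m in matches]

print(matches, first, cleaned)
```

An alternative is to tighten the pattern itself, for example `r"Name:\s*(.*\S)"`, so that the captured group excludes the padding.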

1.1.4 Extracting attributes:

# Extract URLs with XPath
In [28]: response.xpath('//base/@href').extract()
Out[28]: ['http://example.com/']

In [29]: response.xpath('//a/@href').extract()
Out[29]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

# Extract URLs with CSS
In [32]: response.css('a::attr(href)').extract_first()
Out[32]: 'image1.html'

In [33]: response.css('a::attr(href)').extract()
Out[33]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

# XPath: href attributes that contain "image"
In [42]: response.xpath('//a[contains(@href,"image")]/@href')
Out[42]:
[<Selector xpath='//a[contains(@href,"image")]/@href' data='image1.html'>,
 <Selector xpath='//a[contains(@href,"image")]/@href' data='image2.html'>,
 <Selector xpath='//a[contains(@href,"image")]/@href' data='image3.html'>,
 <Selector xpath='//a[contains(@href,"image")]/@href' data='image4.html'>,
 <Selector xpath='//a[contains(@href,"image")]/@href' data='image5.html'>]

In [43]: response.xpath('//a[contains(@href,"image")]/@href').extract()
Out[43]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [44]: response.xpath('//a[contains(@href,"image")]/@href').extract_first()
Out[44]: 'image1.html'

# CSS: href attributes that contain "image"
In [45]: response.css('a[href*=image]::attr(href)')
Out[45]:
[<Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image1.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image2.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image3.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image4.html'>,
 <Selector xpath="descendant-or-self::a[@href and contains(@href, 'image')]/@href" data='image5.html'>]

In [46]: response.css('a[href*=image]::attr(href)').extract()
Out[46]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In [47]: response.css('a[href*=image]::attr(href)').extract_first()
Out[47]: 'image1.html'

# Image src attributes inside the matching links
In [48]: response.xpath('//a[contains(@href,"image")]/img/@src').extract()
Out[48]: ['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']

In [50]: response.css('a[href*=image] img::attr(src)').extract()
Out[50]: ['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']
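The extracted href values are relative, while the page declares a base URL. One way to turn them into absolute URLs is the standard library's urljoin, shown here as a minimal sketch with the values copied from the outputs above. This is one possible follow-up step, not something the original session performs.

```python
from urllib.parse import urljoin

# Values copied from the extraction results above.
base = "http://example.com/"
hrefs = ["image1.html", "image2.html", "image3.html"]

# Resolve each relative link against the base URL.
absolute = [urljoin(base, href) for href in hrefs]

for url in absolute:
    print(url)
```

Absolute URLs are what you would normally hand to a follow-up request or store in an item.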

CSS: use ::text to get text and ::attr(name) to get an attribute.

XPath: use text() to get text and @name to get an attribute.

In [53]: response.css('a img::attr(src)').extract()
Out[53]: ['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']

In [55]: response.xpath('//a/img/@src').extract()
Out[55]: ['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']

Reprinted from: https://www.cnblogs.com/rollost/p/10917172.html
