
Scrapy: XPath Usage and Examples

Published: 2025/3/21


XPath syntax

XPath syntax: predicates

Creating a Scrapy project

scrapy startproject ArticleSpider

Creating a Scrapy spider

cd ArticleSpider
scrapy genspider jobbole blog.jobbole.com

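Usage

You can copy an element's XPath directly from Chrome's developer tools (F12).

Here is a convenient way to test XPath expressions.

Running the spider from PyCharm or another editor sends a new request to the site on every run. The Scrapy shell avoids this by fetching the page once and letting you try expressions interactively.
Suppose we want to scrape an article from Jobbole (伯乐在线) at http://blog.jobbole.com/112614/
We will extract the article's title, publish date, number of likes, number of bookmarks, and number of comments.

All fields are stored as strings.

Run the following on the command line: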

scrapy shell http://blog.jobbole.com/112614/

After that, the commands below retrieve the content. response.xpath() returns an object of type <class 'scrapy.selector.unified.SelectorList'>; calling extract() on it returns the matched content as a list.

In [1]: title = response.xpath('//*[@id="post-112614"]/div[1]/h1/text()')
In [2]: print(title)
[<Selector xpath='//*[@id="post-112614"]/div[1]/h1/text()' data='为什么 SQL 正在击败 NoSQL，数据的未来是什么？'>]
In [3]: print(title.extract())
['为什么 SQL 正在击败 NoSQL，数据的未来是什么？']
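Because extract() returns a plain list, indexing it with [0] raises IndexError whenever the XPath matches nothing. The sketch below shows one possible guard. Note that `first_or_default` is a hypothetical helper name, not part of Scrapy, and the sample lists merely stand in for the result of extract().

```python
def first_or_default(values, default=""):
    """Return the first item of a list, or `default` if the list is empty."""
    return values[0] if values else default


# Stand-ins for what extract() might return (hypothetical sample data).
matched = ["Sample title"]
unmatched = []

title = first_or_default(matched)
missing = first_or_default(unmatched, default="n/a")
print(title, missing)
```

Scrapy's selector lists also offer extract_first(), which returns None (or a default you pass in) instead of raising when there is no match.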

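#### Getting the publish date
Next, the publish date. The expression is long because I merged all the steps into one line; I originally worked it out piece by piece while debugging (the original post showed that process in a screenshot).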

create_date = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace("·","").strip()

strip() removes the specified characters from both ends of a string; with no argument it removes whitespace.
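To make the chained calls easier to follow, here is the same cleanup applied one step at a time. The raw string is a hypothetical example of the text node the XPath returns (the date itself is made up): whitespace around a date followed by a "·" separator.

```python
# Hypothetical raw text node: surrounding whitespace plus a trailing "·".
raw = "\r\n\r\n            2017/06/21 ·  "

step1 = raw.strip()              # drop leading/trailing whitespace
step2 = step1.replace("·", "")   # remove the separator
create_date = step2.strip()      # drop the space the separator left behind

# strip() also accepts the set of characters to remove from both ends.
also_clean = raw.strip(" \r\n·")
print(repr(create_date), repr(also_clean))
```

Passing the characters to strip() does the job in one call here, but only because the separator sits at the end of the string; replace() is still needed if it can appear in the middle.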

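#### Getting the title
The code below uses an XPath copied directly from the browser. It only works for this one article, because the path contains id="post-112614", which is specific to this page. For any other article it matches nothing, so we need a different XPath selector.

create_date = response.xpath('//*[@id="post-112614"]/div[2]/p/text()').extract()[0].strip().replace("·","").strip()

Testing shows that the entry-header class is unique on the page, so we can extract the title like this: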

title = response.xpath('//*[@class="entry-header"]/h1/text()').extract()[0]
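The difference between the two selectors can be checked offline. The sketch below deliberately uses the standard library's xml.etree.ElementTree rather than Scrapy, so it runs without a network request; ElementTree supports only a subset of XPath (no text(), for example), and both pages are hypothetical, simplified stand-ins for the real markup.

```python
import xml.etree.ElementTree as ET

# Two hypothetical article pages; only the post id differs.
PAGE_A = ('<html><body><div id="post-112614">'
          '<div class="entry-header"><h1>Title A</h1></div>'
          '</div></body></html>')
PAGE_B = ('<html><body><div id="post-999999">'
          '<div class="entry-header"><h1>Title B</h1></div>'
          '</div></body></html>')

ID_PATH = ".//*[@id='post-112614']/div[1]/h1"   # tied to one article
CLASS_PATH = ".//*[@class='entry-header']/h1"   # works on any article


def first_text(page, path):
    """Return the text of the first node matching `path`, or None."""
    node = ET.fromstring(page).find(path)
    return node.text if node is not None else None


print(first_text(PAGE_A, ID_PATH), first_text(PAGE_B, ID_PATH))
print(first_text(PAGE_A, CLASS_PATH), first_text(PAGE_B, CLASS_PATH))
```

The id-based path finds the title only on the page it was copied from, while the class-based path works on both.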

#### Getting the number of likes

praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]

contains(): matches when an attribute value contains the given substring.
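The reason for using contains() instead of an exact comparison can be illustrated in plain Python. The class attribute below is a hypothetical example; the point is only that an element often carries several class names in a single attribute.

```python
# Hypothetical class attribute of the "like" button.
class_attr = "btn-bluet-bigger href-style vote-post-up register-user-only"

# [@class='vote-post-up'] compares the whole attribute value.
exact_match = class_attr == "vote-post-up"

# contains(@class, 'vote-post-up') is a substring test.
contains_match = "vote-post-up" in class_attr

# Caveat: a substring test also matches longer class names.
false_positive = "vote-post-up" in "vote-post-upvoted"
print(exact_match, contains_match, false_positive)
```

The exact comparison fails as soon as the element has more than one class, which is why the substring test is used here; the trade-off is that it can also match unrelated classes that merely share the prefix.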

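#### Getting the number of bookmarks (the text contains both the count and the word '收藏')

import re

fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0].strip()
# Lazy .*? so that the group captures every digit of the count
match_re = re.match(r'.*?(\d+).*', fav_nums)
if match_re:
    # Convert the captured bookmark count to an integer
    fav_nums = int(match_re.group(1))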
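The leading part of the pattern matters more than it looks. A greedy `.*` swallows as many characters as it can, leaving only the last digit for the capture group, whereas the lazy `.*?` stops at the first digit. The sketch below compares the two on a hypothetical button text; `extract_count` is an illustrative helper name, not part of Scrapy.

```python
import re


def extract_count(text):
    """Return the first run of digits in `text` as an int, or 0 if none."""
    match = re.match(r".*?(\d+).*", text)
    return int(match.group(1)) if match else 0


sample = " 12 收藏"  # hypothetical button text: count plus the word "收藏"

greedy = re.match(r".*(\d+).*", sample).group(1)   # greedy .* leaves one digit
lazy = re.match(r".*?(\d+).*", sample).group(1)    # lazy .*? keeps all digits
print(greedy, lazy, extract_count(sample), extract_count(" 收藏"))
```

Returning 0 when no digits are present covers the case where an article has no bookmarks yet and the button shows only the word.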

#### Getting the number of comments

comment_nums = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/a[2]/text()').extract()[0].strip()

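#### Getting the article's tags
This step involves filtering. The comment count appears both at the top of the article and at the end, so the list of links picks up a comment-count entry alongside the tags (highlighted in red in the original post's screenshot). We remove it with a condition:

tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
tag_list = [element for element in tag_list if not element.strip().endswith('评论')]
tag = ','.join(tag_list)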
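The filter is easy to try on its own. In this sketch the list is a hypothetical stand-in for what extract() returns: three tags plus one comment-count link ending in '评论'.

```python
# Hypothetical result of extract() on the links in the meta line.
tag_list = ["IT技术", " 9 评论 ", "数据库", "SQL"]

# Drop any entry whose text ends with "评论" (the comment-count link).
tag_list = [element for element in tag_list
            if not element.strip().endswith("评论")]
tags = ",".join(tag_list)
print(tags)
```

Calling strip() before endswith() matters, because the link text may carry trailing whitespace that would otherwise hide the suffix.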

The original post illustrated this process with a screenshot.

#### Getting the article content

content = response.xpath('//*[@class="entry"]').extract()[0]

Complete code

import re

def parse_detail(self, response):
    # Title. The id-based path //*[@id="post-112614"]/div[1]/h1/text()
    # also works, but only for that one article.
    title = response.xpath('//*[@class="entry-header"]/h1/text()').extract()[0]
    # print('title', title)
    # re1_selector = response.xpath('//div[@class="entry_header"]/h1/text()')

    # Publish date: take the first text node, then strip whitespace and the "·" separator
    create_date = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace("·", "").strip()

    # Number of likes
    praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]

    # Number of bookmarks; the text contains both the count and the word '收藏'
    fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0].strip()
    match_re = re.match(r'.*?(\d+).*', fav_nums)
    if match_re:
        fav_nums = int(match_re.group(1))
    else:
        fav_nums = 0

    # Number of comments
    comment_nums = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/a[2]/text()').extract()[0].strip()
    match_re = re.match(r'.*?(\d+).*', comment_nums)
    if match_re:
        comment_nums = int(match_re.group(1))
    else:
        comment_nums = 0

    # Tags: drop the comment-count link, then join the rest with commas
    tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
    tag_list = [element for element in tag_list if not element.strip().endswith('评论')]
    tag = ','.join(tag_list)

    # Article body, kept as raw HTML
    content = response.xpath('//*[@class="entry"]').extract()[0]
