
Web Scraping with Python (Part 9): XPath and Predicates, a Love Story

Published: 2023/12/19 | python


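Study notes
Editor: Sublime
Note: the theory here is about XML documents, but because HTML and XML are so similar, you can draw the analogy yourself; in places I will also compare HTML and XML directly.

Previous post: XPath path expressions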

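Contents

- Predicates
  - Predicate format and purpose
  - Finding nodes at particular positions
  - Finding nodes with particular attributes
  - Finding nodes that have particular related nodes
  - Finding nodes with nested predicates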

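Predicates

Predicates are the most appealing part of a location step. Really appealing!!!

Predicate format and purpose

A predicate states a further condition that the nodes found by a location step must satisfy. The current node first uses the step's axis and node test to find a set of nodes, and the predicate then filters that set down to the nodes that meet the condition.

Predicate format:

[conditional expression]

The conditional expression is evaluated against each candidate node and its value is true or false. When the value is true, the node satisfies the predicate; otherwise it does not.

Inside a predicate, or and and express logical relations, and =, !=, <=, <, >=, > express comparisons.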

For example:

[position() >= 4 and position() <= 6]

selects the nodes at positions 4, 5 and 6.

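If the values being compared are numeric, any of =, !=, <=, <, >=, > can be used to compare their magnitude. If the values are strings, only = and != can be used, to test whether the two strings are identical.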
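A minimal sketch of the two kinds of comparison, assuming lxml is installed. The rabbit list is a small inline sample made up for illustration (so the snippet runs without any file), and the variable names are my own.

```python
from lxml import etree

# Hypothetical inline sample: four rabbits with a name, an age and a favourite food.
HTML = (
    '<html><body><ol>'
    '<li><name>Huang</name><age>8</age><food>carrot</food></li>'
    '<li><name>Bai</name><age>10</age><food>celery</food></li>'
    '<li><name>Jack</name><age>20</age><food>cucumber</food></li>'
    '<li><name>Tim</name><age>30</age><food>straw</food></li>'
    '</ol></body></html>'
)
tree = etree.HTML(HTML)

# Numeric comparison: the text of <age> is converted to a number before comparing.
older = tree.xpath('/html/body/ol/li[age > 9 and age < 30]/name/text()')

# String comparison: != only tests whether the two strings differ.
not_carrot = tree.xpath('/html/body/ol/li[food != "carrot"]/name/text()')

print(older)       # ['Bai', 'Jack']
print(not_carrot)  # ['Bai', 'Jack', 'Tim']
```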

Here is a simple HTML document (we will keep using it below). We name it test11.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Xpath</title>
</head>
<body>
    <ol>
        <li class="Ra01">
            <name class = 'Bunny01'>Huang</name>
            <age>8</age>
            <food>carrot</food>
        </li>
        <li class="Ra01">
            <name class = 'Bunny02'>Bai</name>
            <age>10</age>
            <food>celery</food>
        </li>
        <li class="Ra02">
            <name class = 'Bunny03'>Jack</name>
            <age>20</age>
            <food>cucumber</food>
        </li>
        <li class="Ra03">
            <name class = 'Bunny03'>Tim</name>
            <age>30</age>
            <food>straw</food>
        </li>
    </ol>
    <div class='description'>
        <p>Bunny like to eat ....<a href="http://www.bunny.com/">Anything</a></p>
    </div>
    <div class='advertising'>
        <p>welcome to buy our product ....<a href="http://www.bunnyproduct.com/">Perfact!</a></p>
    </div>
</body>
</html>

For the HTML file above, let us write an XPath expression, first in full and then in the abbreviated location-step form:

/child::html/child::body/child::div[position()=2]/child::p/child::text()

Abbreviated:

/html/body/div[2]/p/text()

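The first location step is:

the Document node (that is, the root node), selected by the leading /

The second location step is:

child::html

An HTML document has only one root tag, named html, so the node set the Document node finds with this step contains exactly one Element node, named html.

The third location step is:

child::body

Applying the third step to the node found by the second step gives a node set with exactly one Element node, named body.

The fourth location step is:

child::div[position()=2]

Applying the fourth step to the node found by the third step selects the div Element nodes, and the predicate keeps only the one at position 2. The resulting node set holds a single div Element node (the second div tag).

The fifth location step is:

child::p

Applying the fifth step to the node found by the fourth step gives a node set with exactly one Element node, named p.

The sixth location step is:

child::text()

Applying the sixth step to the node found by the fifth step selects Text nodes. The resulting node set holds one Text node, whose name is #text and whose text content is: welcome to buy our product ....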
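To check that the full and abbreviated forms really select the same thing, here is a small sketch. It assumes lxml is installed and inlines only the two div blocks of the sample page, so that it runs without the file; the variable names are my own.

```python
from lxml import etree

# Only the two <div> blocks of the sample page, inlined for a self-contained run.
HTML = (
    '<html><body>'
    "<div class='description'><p>Bunny like to eat ...."
    '<a href="http://www.bunny.com/">Anything</a></p></div>'
    "<div class='advertising'><p>welcome to buy our product ...."
    '<a href="http://www.bunnyproduct.com/">Perfact!</a></p></div>'
    '</body></html>'
)
tree = etree.HTML(HTML)

full_form = '/child::html/child::body/child::div[position()=2]/child::p/child::text()'
short_form = '/html/body/div[2]/p/text()'

full_result = tree.xpath(full_form)
short_result = tree.xpath(short_form)

print(full_result)                  # ['welcome to buy our product ....']
print(full_result == short_result)  # True
```

Note that the text inside the a tag is not returned: child::text() only looks at Text nodes that are direct children of p.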

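Finding nodes at particular positions

Inside a predicate, the position() and last() functions let you find nodes at a given position.

I think going straight to an example is the fastest way to understand this, so let's start typing!

Using the test11.html file from above, we run the matches in Python: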

from lxml import etree

with open('test11.html', 'rb') as f:
    html = f.read().decode('utf-8')
# print(html)
parse_html = etree.HTML(html)

# The second <li>
my_xpath = '/html/body/ol/li[position() = 2]/name/text()'
data_list = parse_html.xpath(my_xpath)
print(data_list)
print('-'*15)

# The last <li>
my_xpath = '/html/body/ol/li[last()]/name/text()'
data_list = parse_html.xpath(my_xpath)
print(data_list)
print('-'*15)

# The second-to-last <li>
my_xpath = '/html/body/ol/li[last()-1]/name/text()'
data_list = parse_html.xpath(my_xpath)
print(data_list)
print('-'*15)

# The third and fourth <li>
my_xpath = '/html/body/ol/li[position() >= 3 and position() <= 4]/name/text()'
data_list = parse_html.xpath(my_xpath)
print(data_list)

Sublime output:

['Bai']
---------------
['Tim']
---------------
['Jack']
---------------
['Jack', 'Tim']
[Finished in 0.6s]

Great!!!
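One detail worth seeing in code: positions are counted among the nodes that the step finds from each context node, not across the whole document. The sketch below assumes lxml is installed and uses a made-up page with two ol lists (not the test11.html file); the variable names are my own.

```python
from lxml import etree

# Hypothetical page with two separate lists, to show how positions are counted.
HTML = (
    '<html><body>'
    '<ol><li>Huang</li><li>Bai</li></ol>'
    '<ol><li>Jack</li><li>Tim</li></ol>'
    '</body></html>'
)
tree = etree.HTML(HTML)

# li[1] is evaluated once per <ol>, so each list contributes its own first item.
first_in_each = tree.xpath('/html/body/ol/li[1]/text()')

# Parentheses build the full node set first; [1] then picks its first member.
first_overall = tree.xpath('(/html/body/ol/li)[1]/text()')

# last() is likewise relative to each <ol>.
last_in_each = tree.xpath('/html/body/ol/li[last()]/text()')

print(first_in_each)  # ['Huang', 'Jack']
print(first_overall)  # ['Huang']
print(last_in_each)   # ['Bai', 'Tim']
```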

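Finding nodes with particular attributes

Inside a predicate, use the attribute axis or its abbreviation @ to find nodes that have a particular attribute or attribute value.

Finding nodes with a given attribute and attribute value

To find nodes that have the given attributes, use this in the predicate:

attribute::name1 logical-operator attribute::name2 ... attribute::nameN

To find nodes that have the given attributes with the given values, use this in the predicate:

attribute::name1 comparison-operator 'value' logical-operator ... attribute::nameN comparison-operator 'value'

Again using the test11.html file from above, we run the matches in Python: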

from lxml import etree

with open('test11.html', 'rb') as f:
    html = f.read().decode('utf-8')
# print(html)
parse_html = etree.HTML(html)

# <li> elements whose class attribute equals "Ra02"
my_xpath = '/html/body/ol/li[@class="Ra02"]/name/text()'
data_list = parse_html.xpath(my_xpath)
print(data_list)
print('-'*15)

# Any descendant of a <div> that has an href attribute; return the href values
my_xpath = '/html/body/div//*[@href]/@href'
data_list = parse_html.xpath(my_xpath)
print(data_list)
print('-'*15)

Sublime output:

['Jack']
---------------
['http://www.bunny.com/', 'http://www.bunnyproduct.com/']
---------------
[Finished in 0.4s]
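The attribute:: spelling and the logical operators from the format lines above can be sketched like this. It assumes lxml is installed; the inline list is a trimmed sample, and the id attribute on one li is something I added purely for illustration.

```python
from lxml import etree

# Trimmed inline sample; the id="jack" attribute is hypothetical.
HTML = (
    '<html><body><ol>'
    '<li class="Ra01"><name>Huang</name></li>'
    '<li class="Ra01"><name>Bai</name></li>'
    '<li class="Ra02" id="jack"><name>Jack</name></li>'
    '<li class="Ra03"><name>Tim</name></li>'
    '</ol></body></html>'
)
tree = etree.HTML(HTML)

# attribute::class is the unabbreviated spelling of @class.
long_axis = tree.xpath('/html/body/ol/li[attribute::class="Ra01"]/name/text()')

# Two attribute tests joined with "or".
either = tree.xpath('/html/body/ol/li[@class="Ra01" or @class="Ra03"]/name/text()')

# Existence tests only: the <li> must carry both a class and an id attribute.
both_present = tree.xpath('/html/body/ol/li[@class and @id]/name/text()')

print(long_axis)     # ['Huang', 'Bai']
print(either)        # ['Huang', 'Bai', 'Tim']
print(both_present)  # ['Jack']
```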

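Using the contains() function

To find nodes that have a given attribute whose value contains a given string, use the contains() function in the predicate. The format is:

contains(attribute::name, 'substring')

Another example: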

from lxml import etree

with open('test11.html', 'rb') as f:
    html = f.read().decode('utf-8')
# print(html)
parse_html = etree.HTML(html)

# <li> elements whose class attribute contains the substring "Ra"
my_xpath = '/html/body/ol/li[contains(@class, "Ra")]/name/text()'
data_list = parse_html.xpath(my_xpath)
print(data_list)

Sublime output:

['Huang', 'Bai', 'Jack', 'Tim']
[Finished in 0.4s]
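Since every class in that list contains "Ra", the example above cannot show contains() filtering anything out. Here is a small sketch where it does. It assumes lxml is installed and inlines just the two div blocks of the sample page; the variable names are my own.

```python
from lxml import etree

# The two <div> blocks of the sample page, inlined for a self-contained run.
HTML = (
    '<html><body>'
    "<div class='description'><p>Bunny like to eat ...."
    '<a href="http://www.bunny.com/">Anything</a></p></div>'
    "<div class='advertising'><p>welcome to buy our product ...."
    '<a href="http://www.bunnyproduct.com/">Perfact!</a></p></div>'
    '</body></html>'
)
tree = etree.HTML(HTML)

# Keep only links whose href contains the substring "product".
product_links = tree.xpath('//a[contains(@href, "product")]/@href')

# A substring match on class: "desc" is found inside "description" only.
desc_text = tree.xpath('//div[contains(@class, "desc")]/p/a/text()')

print(product_links)  # ['http://www.bunnyproduct.com/']
print(desc_text)      # ['Anything']
```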

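Finding nodes that have particular related nodes

??? Why does that heading feel like a bit of a tongue twister?

Finding nodes that have particular related nodes

To find nodes that have a particular related node, with no requirement on what that related node contains, use this in the predicate:

axis::nodename1 logical-operator axis::nodename2 ... axis::nodenameN

To find nodes that have a particular related node, with a requirement on what that related node contains, use this in the predicate:

axis::nodename1 comparison-operator 'value' logical-operator ... axis::nodenameN comparison-operator 'value'

Another example: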

from lxml import etree

with open('test11.html', 'rb') as f:
    html = f.read().decode('utf-8')
# print(html)
parse_html = etree.HTML(html)

# Children of <body> that have both a <p> descendant and an <a> descendant
my_xpath = '/html/body/*[descendant::p and descendant::a]/p/a/text()'
data_list = parse_html.xpath(my_xpath)
print(data_list)
print('-'*15)

# <li> elements that have an <age> child whose value equals 20
my_xpath = '/html/body/ol/li[age=20]/name/text()'
data_list = parse_html.xpath(my_xpath)
print(data_list)
print('-'*15)

Sublime output:

['Anything', 'Perfact!']
---------------
['Jack']
---------------
[Finished in 0.6s]
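The related node does not have to be a child or a descendant; any axis can appear in the predicate. A minimal sketch, assuming lxml is installed and using a trimmed inline copy of the rabbit list (variable names are my own):

```python
from lxml import etree

# Trimmed inline copy of the rabbit list, so the snippet runs without a file.
HTML = (
    '<html><body><ol>'
    '<li><name>Huang</name><age>8</age><food>carrot</food></li>'
    '<li><name>Bai</name><age>10</age><food>celery</food></li>'
    '<li><name>Jack</name><age>20</age><food>cucumber</food></li>'
    '<li><name>Tim</name><age>30</age><food>straw</food></li>'
    '</ol></body></html>'
)
tree = etree.HTML(HTML)

# Sibling axis: every <li> that still has another <li> after it.
has_next = tree.xpath('/html/body/ol/li[following-sibling::li]/name/text()')

# Two child tests with content requirements, joined with "and".
celery_fan = tree.xpath('/html/body/ol/li[age >= 10 and food = "celery"]/name/text()')

print(has_next)    # ['Huang', 'Bai', 'Jack']
print(celery_fan)  # ['Bai']
```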

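Using the contains() function

To find nodes that have a particular related node whose text contains a given string, use the contains() function. The format is:

contains(axis::nodename, 'substring')

An example: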

from lxml import etree

with open('test11.html', 'rb') as f:
    html = f.read().decode('utf-8')
# print(html)
parse_html = etree.HTML(html)

# <li> elements whose <age> child text contains the character "0"
my_xpath = '/html/body//li[contains(age, "0")]/name/text()'
data_list = parse_html.xpath(my_xpath)
print(data_list)
print('-'*15)

Sublime output:

['Bai', 'Jack', 'Tim']
---------------
[Finished in 0.6s]
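Keep in mind that contains() works on strings, even when the text looks like a number, and that the match is case-sensitive. A short sketch, assuming lxml is installed and using a trimmed inline copy of the rabbit list (variable names are my own):

```python
from lxml import etree

# Trimmed inline copy of the rabbit list, so the snippet runs without a file.
HTML = (
    '<html><body><ol>'
    '<li><name>Huang</name><age>8</age><food>carrot</food></li>'
    '<li><name>Bai</name><age>10</age><food>celery</food></li>'
    '<li><name>Jack</name><age>20</age><food>cucumber</food></li>'
    '<li><name>Tim</name><age>30</age><food>straw</food></li>'
    '</ol></body></html>'
)
tree = etree.HTML(HTML)

# Lower-case "c" appears in carrot, celery and cucumber, but not in straw.
lower_c = tree.xpath('/html/body//li[contains(food, "c")]/name/text()')

# Upper-case "C" appears in none of them, so nothing matches.
upper_c = tree.xpath('/html/body//li[contains(food, "C")]/name/text()')

print(lower_c)  # ['Huang', 'Bai', 'Jack']
print(upper_c)  # []
```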

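Finding nodes with nested predicates

A predicate states a further condition that a node must satisfy, so a predicate is itself allowed to contain predicates.

For example, take the nested predicate:

[div[descendant::a="Anything"]]

A node that satisfies this predicate must have a child node named div, and that div node must have a descendant a node whose text content is "Anything".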
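A runnable sketch of this nested predicate, assuming lxml is installed. It inlines the two div blocks of the sample page and attaches the predicate to body, which is my own choice of context node for illustration.

```python
from lxml import etree

# The two <div> blocks of the sample page, inlined for a self-contained run.
HTML = (
    '<html><body>'
    "<div class='description'><p>Bunny like to eat ...."
    '<a href="http://www.bunny.com/">Anything</a></p></div>'
    "<div class='advertising'><p>welcome to buy our product ...."
    '<a href="http://www.bunnyproduct.com/">Perfact!</a></p></div>'
    '</body></html>'
)
tree = etree.HTML(HTML)

# <body> passes the nested predicate, so the path continues to all of its <div> children.
body_divs = tree.xpath('/html/body[div[descendant::a="Anything"]]/div/@class')

# Nesting one level lower filters the <div> elements themselves.
matching_div = tree.xpath('/html/body/div[p[a="Anything"]]/@class')

# No <a> has the text "Nothing", so <body> fails the predicate and nothing is returned.
no_match = tree.xpath('/html/body[div[descendant::a="Nothing"]]/div/@class')

print(body_divs)     # ['description', 'advertising']
print(matching_div)  # ['description']
print(no_match)      # []
```

The first query returns both class values because the predicate only decides whether body is kept; it does not filter the div children selected by the next step.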

If you spot any mistakes, please point them out.
