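Web scraping with Python (part10)--Xpath node-sets and functions
Study notes
Editor: Sublime
Note: the theory here is about XML documents, but given how similar HTML and XML are, you can draw the analogies yourself; in a few places I will also compare HTML and XML directly.
Previous post: The Love Between Xpath and Predicates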
Contents
- Using predicates on node-sets
- Union of node-sets
- Node-sets and functions
Using predicates on node-sets
Xpath lets you apply a predicate to the node-set that an absolute path finally selects. The predicate filters that node-set down to the nodes you actually need. The format is:
(absolute path)[predicate]
For the HTML document below (saved as test12.html), let's write some code to get a direct feel for this (or possibly no feel at all):
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Xpath</title>
    </head>
    <body>
        <ol>
            <li class="Ra01">
                <name class = 'Bunny01'>Huang</name>
                <age>8</age>
                <food>carrot</food>
            </li>
            <li class="Ra01">
                <name class = 'Bunny02'>Bai</name>
                <age>10</age>
                <food>celery</food>
            </li>
            <li class="Ra02">
                <name class = 'Bunny03'>Jack</name>
                <age>20</age>
                <food>cucumber</food>
            </li>
            <li class="Ra03">
                <name class = 'Bunny03'>Tim</name>
                <age>30</age>
                <food>straw</food>
            </li>
        </ol>
    </body>
    </html>
Python code:
    from lxml import etree

    with open('test12.html', 'rb') as f:
        html = f.read().decode('utf-8')
    # print(html)

    parse_html = etree.HTML(html)

    # Predicate applied to the whole node-set selected by the path
    my_xpath = '(/html//li)[1]'
    data_list = parse_html.xpath(my_xpath)
    print(data_list)
    print('-'*15)

    # Predicate applied to the last location step only
    my_xpath = '/html//li[1]'
    data_list = parse_html.xpath(my_xpath)
    print(data_list)
Sublime output:
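    [<Element li at 0xe2588b4ec8>]
    ---------------
    [<Element li at 0xe2588b4ec8>]
    [Finished in 0.5s]
Note that although the two Xpath expressions in the Python code return the same result, they find it in different ways. (/html//li)[1] first collects every li under html and then keeps the first node of that set. /html//li[1] keeps every li that is the first li child of its own parent. Here all four li elements share a single ol parent, so both expressions end up at the same node.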
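To make the difference visible, here is a minimal sketch using a made-up document with two lists (the inline HTML and the variable names are my own assumptions, not part of test12.html). With more than one parent, the two expressions no longer agree:

```python
from lxml import etree

# Hypothetical document: two <ol> parents, so "first li" is ambiguous.
DOC = """
<html><body>
  <ol><li>a1</li><li>a2</li></ol>
  <ol><li>b1</li><li>b2</li></ol>
</body></html>
"""

tree = etree.HTML(DOC)

# Predicate on the whole node-set: the first <li> in the document.
first_overall = [li.text for li in tree.xpath('(/html//li)[1]')]

# Predicate on the last step: every <li> that is the first <li> of its parent.
first_per_parent = [li.text for li in tree.xpath('/html//li[1]')]

print(first_overall)     # ['a1']
print(first_per_parent)  # ['a1', 'b1']
```

The parenthesised form is the one to reach for when you want "the n-th match in the whole document".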
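Union of node-sets
Xpath lets you combine the node-sets found by several Xpath expressions with |, the set-union operator. The nodes in the resulting node-set are ordered by where their tags or text appear in the XML document (document order), not by the order in which the expressions are written.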
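As a quick check of the ordering rule, the sketch below runs the same union with the operands swapped. The inline document is a cut-down stand-in I made up for illustration, and the behaviour shown is what I would expect from lxml, so treat it as a sketch rather than a guarantee for every Xpath engine:

```python
from lxml import etree

# Hypothetical two-item document, reduced from the kind of list used in this post.
DOC = """
<html><body><ol>
  <li><name>Huang</name><age>8</age></li>
  <li><name>Bai</name><age>10</age></li>
</ol></body></html>
"""

tree = etree.HTML(DOC)

# Same two node-sets, written in opposite order.
names_first = tree.xpath('//name/text() | //age/text()')
ages_first = tree.xpath('//age/text() | //name/text()')

print(names_first)
print(ages_first)

# Both come back interleaved in document order, not grouped by operand.
print(names_first == ages_first)
```

If you need the values grouped (all names, then all ages), run the two expressions separately and concatenate the lists in Python.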
Here is an example, again using test12.html from above:
    from lxml import etree

    with open('test12.html', 'rb') as f:
        html = f.read().decode('utf-8')
    # print(html)

    parse_html = etree.HTML(html)

    # Union of the name texts and the age texts of the "Ra01" items
    my_xpath = '/html//li[@class="Ra01"]/name/text() | /html//li[@class="Ra01"]/age/text()'
    data_list = parse_html.xpath(my_xpath)
    print(data_list)
    print('-'*15)
Sublime output:
    ['Huang', '8', 'Bai', '10']
    ---------------
    [Finished in 0.4s]
Node-sets and functions
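To compute values related to node-sets, the Xpath language provides several functions that operate on a node-set.
- count(node-set)
The count() function returns the number of nodes in the node-set given as its argument. In lxml the result comes back as a Python float.
Here is an example, still using the test12.html file from above: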
    from lxml import etree

    with open('test12.html', 'rb') as f:
        html = f.read().decode('utf-8')
    # print(html)

    parse_html = etree.HTML(html)

    # Number of name elements inside the "Ra01" items
    my_xpath = 'count(/html//li[@class="Ra01"]/name)'
    data_list = parse_html.xpath(my_xpath)
    print(data_list)
    print('-'*15)

    # Number of items whose food contains the letter "c"
    my_xpath = 'count(/html//li[contains(food, "c")])'
    data_list = parse_html.xpath(my_xpath)
    print(data_list)
Sublime output:
    2.0
    ---------------
    3.0
    [Finished in 0.5s]
- sum(node-set)
The sum() function converts the text contained in each node of the node-set to a number and returns the total. If the text of any node in the node-set cannot be converted to a number, sum() returns NaN.
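A minimal sketch of both cases, using a small made-up XML snippet (the element names n and bad are hypothetical). Since NaN never compares equal to itself, math.isnan is the safer way to detect it on the Python side:

```python
import math
from lxml import etree

# Hypothetical data: three numeric nodes and one that is not a number.
DOC = "<items><n>1</n><n>2.5</n><n> 4 </n><bad>four</bad></items>"
root = etree.fromstring(DOC)

# Every <n> converts cleanly; surrounding whitespace is ignored.
total = root.xpath('sum(/items/n)')

# Including <bad> makes the whole sum NaN.
mixed = root.xpath('sum(/items/*)')

print(total)              # 7.5
print(math.isnan(mixed))  # True
```

One non-numeric node is enough to spoil the result, so it can be worth filtering the node-set with a predicate before summing.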
For this example we build a new HTML document (saved as test13.html, the name the code below opens):
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Xpath</title>
    </head>
    <body>
        5
        <p>
            3
            <ol>
                <li>1</li>
                <li>2</li>
            </ol>
        </p>
        <p>4</p>
    </body>
    </html>
Python code:
    from lxml import etree

    with open('test13.html', 'rb') as f:
        html = f.read().decode('utf-8')
    # print(html)

    parse_html = etree.HTML(html)

    # Sum over the text nodes of the li elements
    my_xpath = 'sum(/html//li/text())'
    data_list = parse_html.xpath(my_xpath)
    print(data_list)
    print('-'*15)

    # Sum over the li elements themselves
    my_xpath = 'sum(/html//li)'
    data_list = parse_html.xpath(my_xpath)
    print(data_list)
    print('-'*15)

    # Sum over the p elements directly under body
    my_xpath = 'sum(/html/body/p)'
    data_list = parse_html.xpath(my_xpath)
    print(data_list)
    print('-'*15)

    # Sum over the body element
    my_xpath = 'sum(/html/body)'
    data_list = parse_html.xpath(my_xpath)
    print(data_list)
Sublime output:
    3.0
    ---------------
    3.0
    ---------------
    7.0
    ---------------
    nan
    [Finished in 0.4s]
From these results we can draw quite a few conclusions; please work them out for yourself (because the blogger isn't sure whether his own summary is right, hahaha).
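One possible reading of those results, offered as a sketch and not as the author's own summary: sum() works on each node's string-value, which for an element is all of its descendant text joined together. The made-up XML below (element names b, c, d are hypothetical) shows how nested text changes the number, and how whitespace between two numbers turns it into NaN:

```python
import math
from lxml import etree

# Hypothetical XML: <b> and <d> both contain a nested <c>.
DOC = "<root><b>3<c>4</c></b><d>3 <c>4</c></d></root>"
root = etree.fromstring(DOC)

# The string-value of <b> is "3" + "4" joined, i.e. "34".
b_string = root.xpath('string(/root/b)')
b_sum = root.xpath('sum(/root/b)')

# Only the direct text node of <b> is summed here.
b_text_sum = root.xpath('sum(/root/b/text())')

# The string-value of <d> is "3 4", which is not a single number.
d_sum = root.xpath('sum(/root/d)')

print(b_string)           # 34
print(b_sum)              # 34.0
print(b_text_sum)         # 3.0
print(math.isnan(d_sum))  # True
```

Under this reading, sum(/html/body) gives nan because the body's string-value is several numbers separated by whitespace. The 7.0 for sum(/html/body/p) would then suggest that the HTML parser did not keep the ol inside the first p, but that is my inference from the output, not something checked here.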
If you spot any mistakes, please point them out.
That wraps up the Xpath series for now.