Python Full-Stack Development, Day 135 (Web Scraping Series, Chapter 2: the BS and XPath Modules)

Published: 2023/12/29

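I. BeautifulSoup

1. Introduction

Simply put, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:

''' Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code. '''

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document, and it can save hours or even days of work. If you are looking for the Beautiful Soup 3 documentation: Beautiful Soup 3 is no longer developed, and the official site recommends Beautiful Soup 4 for current projects.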

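Installation

pip3 install beautifulsoup4

Parsers

Beautiful Soup supports the HTML parser in Python's standard library (html.parser) as well as third-party parsers. One of them is lxml:

pip3 install lxml

Another option is html5lib, a pure-Python parser that parses HTML the same way a browser does. Install it with:

pip3 install html5lib

For a comparison of the parsers, see the official documentation.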

Basic usage

The HTML below is used as the example throughout. It is a passage from Alice in Wonderland (referred to from here on as the "Alice" document):

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

This is not a complete HTML document. Why not? The closing tags for body and html are missing!

That is not a problem, because bs4 is fault-tolerant.
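To see this fault tolerance at work, here is a minimal sketch. It uses a hypothetical fragment (not the Alice document) in which three tags are never closed, and it assumes the built-in html.parser; other parsers may repair broken markup differently.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <html>, <body> and <p> are never closed.
broken = "<html><body><p>Hello"
soup = BeautifulSoup(broken, "html.parser")

repaired = str(soup)   # serializing the tree writes out the missing end tags
print(repaired)
print(soup.p.string)   # the text is still reachable through normal navigation
```

The parse tree is complete even though the source text was not, so the navigation and search methods described below behave the same on both.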

Parsing this document with BeautifulSoup gives a BeautifulSoup object:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

Find the links of all <a> tags in the document:

for link in soup.find_all('a'):
    print(link.get('href'))

Output:

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

Get all the text in the document:

print(soup.get_text())

Output (blank lines omitted):

The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

2. Tag objects

Put simply, a Tag object is one of the tags in the HTML. It corresponds to a tag in the original XML or HTML document:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(type(tag))

Output: <class 'bs4.element.Tag'>

Tag names

Taking the Alice html_doc again, the simplest way to navigate the tree is to name the tag you want. To get the <head> tag, just use soup.head:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
head = soup.head    # the <head> tag
print(head)
title = head.title  # the <title> tag inside <head>
print(title)

Output:

<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>

This trick can be applied repeatedly to walk down the tree. The code below gets the first <b> tag inside <body>:

first_b = soup.body.b  # the first <b> tag inside <body>
print(first_b)  # <b>The Dormouse's story</b>

Attribute-style access only ever returns the first tag with that name:

first_a = soup.a  # the first <a> tag
print(first_a)  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

To get all the <a> tags, or anything more than the first tag with a given name, use one of the methods described under "Searching the tree", such as find_all():

links = soup.find_all('a')  # all <a> tags
print(links)

The result is a list containing the three <a> tags.

So soup plus a tag name is an easy way to reach a tag, but note that it finds only the first matching tag in the whole document.

The name and attributes of a Tag

A Tag has many methods and properties; the main ones concern the tag's name, its attributes, and its text.

The two most important are name and attributes.

Name and attributes

Every tag has a name, available as .name:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
links = soup.find_all('a')  # all <a> tags
for link in links:
    print(link.name)          # tag name
    print(link.get('id'))     # id attribute
    print(link.get('class'))  # class attribute (a list)
    print(link.get('href'))   # href attribute
    print(link.attrs)         # all attributes, as a dict
# attrs of the last tag: {'href': 'http://example.com/tillie', 'class': ['sister'], 'id': 'link3'}

A tag's attributes can be added, removed, or modified. They are handled exactly like a dictionary.

Modifying

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
links = soup.find_all('a')      # all <a> tags
links[0]['class'] = 'verybold'  # change the class attribute
links[0]['id'] = 1              # change the id attribute
print(links[0])  # <a class="verybold" href="http://example.com/elsie" id="1">Elsie</a>

Deleting

links = soup.find_all('a')  # all <a> tags
del links[0]['class']
del links[0]['id']
print(links[0])  # <a href="http://example.com/elsie">Elsie</a>
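Modifying and deleting are shown above; adding an attribute and reading one safely work the same dictionary-like way. The following is a small sketch on a single hypothetical tag (the title attribute is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister">Elsie</a>', "html.parser"
)
tag = soup.a

tag["title"] = "first sister"      # add a new attribute, like setting a dict key
has_title = "title" in tag.attrs   # membership test on the attribute dict
missing = tag.get("id")            # .get() returns None instead of raising KeyError
classes = tag["class"]             # class is multi-valued, so it comes back as a list

print(tag)
print(has_title, missing, classes)
```

Using .get() rather than tag["id"] avoids a KeyError when a page omits an attribute on some of its tags.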

Getting text

Getting the text of a tag object

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.p.string)   # the text if <p> holds exactly one string, otherwise None
print(soup.p.strings)  # a generator over all the strings under <p>
print(soup.p.text)     # all the text under <p>
for line in soup.stripped_strings:  # every string in the document, whitespace stripped
    print(line)

Output:

The Dormouse's story
<generator object _all_strings at 0x000001AA3941C5C8>
The Dormouse's story
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
; and they lived at the bottom of a well.
...

If a tag contains more than one child node, .string cannot tell which child it should refer to, so it returns None. If there is only one child, .string returns that child's text. In the structure below, soup.p.string is None, but soup.p.strings still finds all the text:

from bs4 import BeautifulSoup

tmp_html = """<p id='list-1'>hahaha<a class='sss'><span><h1>aaaa</h1></span></a><b>bbbbb</b> </p>"""
soup = BeautifulSoup(tmp_html, 'html.parser')
print(soup.p.string)   # the text if <p> holds exactly one string, otherwise None
print(soup.p.strings)  # a generator over all the strings under <p>
for i in soup.stripped_strings:
    print('-->', i)

Output:

None
<generator object _all_strings at 0x000002017427C570>
--> hahaha
--> aaaa
--> bbbbb

So .string works only when the tag holds a single piece of text; otherwise it returns None.

.text, by contrast, returns all of the text.

That is the difference between tag.text and tag.string.
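The difference can be checked side by side. This sketch uses a short hypothetical paragraph and also shows get_text() with a separator and strip=True, which is often handier than .text when the pieces need to be kept apart:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b> !</p>", "html.parser")
p = soup.p

only_string = p.string                 # None: <p> has more than one child
all_text = p.text                      # every string joined as-is
pieces = list(p.stripped_strings)      # each string with surrounding whitespace removed
joined = p.get_text("|", strip=True)   # strip each piece, then join with a separator

print(only_string, repr(all_text), pieces, joined)
```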

3. Traversing the document tree

For background on the DOM tree, see:

https://www.cnblogs.com/majj/p/9102890.html

(1) Nested selection

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.head.title.string)  # The Dormouse's story
print(soup.body.a.string)      # Elsie

(2) Children and descendants

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.p.contents)  # a list of the direct children of <p>
print(soup.p.children)  # an iterator over the direct children of <p>
for i, child in enumerate(soup.p.children):
    print(i, child)
print(soup.p.descendants)  # a generator over every descendant of <p>
for i, child in enumerate(soup.p.descendants):
    print(i, child)

(3) Parents and ancestors

print(soup.a.parent)   # the parent of the first <a> tag
print(soup.a.parents)  # a generator over all ancestors of <a>: the parent, the parent's parent, and so on

soup.a.parents runs all the way to the top of the tree. Its last items are the <html> tag and then the BeautifulSoup object itself, which prints as the whole <html>...</html> document.
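Printing the name of each ancestor makes the chain easier to read than printing the ancestors themselves. A minimal sketch on a small hypothetical document, assuming html.parser:

```python
from bs4 import BeautifulSoup

doc = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# Walk upward from <a>; the final entry is the BeautifulSoup object itself.
ancestor_names = [parent.name for parent in soup.a.parents]
top = list(soup.a.parents)[-1]

print(ancestor_names)
print(top is soup)
```

The BeautifulSoup object reports its name as "[document]", which is how the top of the chain can be told apart from the <html> tag.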

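(4) Siblings

print('=====>')
print(soup.a.next_sibling)         # the next sibling
print(soup.a.previous_sibling)     # the previous sibling
print(list(soup.a.next_siblings))  # all following siblings (a generator, expanded with list())
print(soup.a.previous_siblings)    # all preceding siblings (prints the generator object)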
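One thing that often surprises people is that a sibling can be a piece of text rather than a tag. The sketch below, on a trimmed-down hypothetical version of the sisters paragraph, contrasts .next_sibling with find_next_sibling(), which skips over text nodes:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>', "html.parser"
)
first = soup.a

raw_next = first.next_sibling            # the text node ", " between the two tags
next_tag = first.find_next_sibling("a")  # skips text and returns the next <a> tag

print(repr(raw_next))
print(next_tag)
```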

4. Searching the document tree

BeautifulSoup defines many search methods. This section concentrates on two of them, find() and find_all(); the others take similar arguments and are used in much the same way.

(1) The five kinds of filter

The examples in this section use the following document, parsed with lxml:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')

The five filters are: a string, a regular expression, a list, True, and a function.

(1.1) String: the tag name

print(soup.find_all('b'))

Output:

[<b class="boldest" id="bbb">The Dormouse's story</b>]

(1.2) Regular expression

import re

print(soup.find_all(re.compile('^b')))  # tags whose names start with "b": <body> and <b>

Output:

[<body>
<p class="title" id="my p"><b class="boldest" id="bbb">The Dormouse's story</b>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>, <b class="boldest" id="bbb">The Dormouse's story</b>]

(1.3) List

If you pass a list, Beautiful Soup returns everything that matches any item in the list. The code below finds all the <a> tags and <b> tags in the document:

print(soup.find_all(['a', 'b']))

Output:

[<b class="boldest" id="bbb">The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(1.4) True

True matches everything. The code below finds every tag in the document but none of the text strings:

print(soup.find_all(True))
for tag in soup.find_all(True):
    print(tag.name)

This filter is rarely needed and can be skipped.

(1.5) Function

If none of the other filters fits, you can define a function that takes an element as its only argument. It should return True if the element matches and False otherwise.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')

def has_class_but_no_id(tag):
    # Match tags that have a class attribute but no id attribute
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))  # every tag is passed to the function for filtering

Output:

[<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

Everything in the result has a class attribute but no id attribute.

(2) find_all()

Syntax

find_all(name, attrs, recursive, text, limit, **kwargs)

(2.1) name

The name argument can be any kind of filter: a string, a regular expression, a list, a function, or True.

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all(name=re.compile('^t')))

Output:

[<title>The Dormouse's story</title>]

(2.2) Keyword arguments

These are written as key=value, where the value can be a filter: a string, a regular expression, a list, or True.

print(soup.find_all(id=re.compile('my')))  # tags whose id contains "my"
# Tags whose href contains "lacie" and whose id contains a digit.
# For the class attribute use class_, because class is a Python keyword.
print(soup.find_all(href=re.compile('lacie'), id=re.compile(r'\d')))
print(soup.find_all(id=True))  # tags that have an id attribute

Output:

[<p class="title" id="my p"><b class="boldest" id="bbb">The Dormouse's story</b>
</p>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<p class="title" id="my p"><b class="boldest" id="bbb">The Dormouse's story</b>
</p>, <b class="boldest" id="bbb">The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
data_soup.find_all(data-foo="value")  # SyntaxError: keyword can't be an expression

You can still search on such attributes by passing a dictionary as the attrs argument of find_all():

from bs4 import BeautifulSoup

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
print(data_soup.find_all(attrs={"data-foo": "value"}))  # [<div data-foo="value">foo!</div>]

(2.3) Searching by class

The keyword here is class_. In class_=value, the value can be any of the five filters.

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all('a', class_='sister'))       # <a> tags with the class "sister"
print(soup.find_all('a', class_='sister ssss'))  # matches only the exact class string "sister ssss"; a different order would not match either
print(soup.find_all(class_=re.compile('^sis')))  # every tag with a class that starts with "sis"

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
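The ordering caveat is easiest to see on a tag that really has two classes. The following sketch uses a hypothetical one-line document; the last line uses a CSS selector (via select(), covered later in this chapter) as one way to match both classes in any order:

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout">text</p>', "html.parser")

by_one_class = css_soup.find_all("p", class_="strikeout")       # matches on either single class
exact_string = css_soup.find_all("p", class_="body strikeout")  # matches the exact attribute string
wrong_order = css_soup.find_all("p", class_="strikeout body")   # no match: the order differs
any_order = css_soup.select("p.strikeout.body")                 # CSS selector: order does not matter

print(len(by_one_class), len(exact_string), len(wrong_order), len(any_order))
```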

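(2.4) attrs

To match on several attributes, add more key-value pairs to the attrs dictionary.

# Find <p> tags whose class attribute is "story"
print(soup.find_all('p', attrs={'class': 'story'}))

Output:

[<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

(2.5) text

The value can be a string, a list, True, or a regular expression.

print(soup.find_all(text='Elsie'))
print(soup.find_all('a', text='Elsie'))

Output:

['Elsie']
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]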
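Since Beautiful Soup 4.4.0 this argument is also available under the name string, and the official documentation now uses that spelling. The sketch below repeats the idea with string= on a short hypothetical paragraph; it assumes a Beautiful Soup version of 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p>Sisters: <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>',
    "html.parser",
)

exact = soup.find_all(string="Elsie")               # strings equal to "Elsie"
pattern = soup.find_all(string=re.compile("^L"))    # strings that start with "L"
several = soup.find_all(string=["Elsie", "Lacie"])  # strings equal to any item in the list
tags = soup.find_all("a", string="Lacie")           # <a> tags whose .string is "Lacie"

print(exact, pattern, several, tags)
```

Without a tag name the search returns the matching strings themselves; with a tag name it returns the tags that contain them.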

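(2.6) The limit argument

Searching a very large tree can be slow. If you do not need every result, use the limit argument to cap how many are returned. It works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and returns what it has.

print(soup.find_all('a', limit=2))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

(2.7) recursive

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.

print(soup.html.find_all('a'))
print(soup.html.find_all('a', recursive=False))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[]

Note

Calling a tag is like calling find_all(). Because find_all() is probably the most frequently used search method in Beautiful Soup, there is a shortcut for it: a BeautifulSoup object or a Tag object can be called as if it were a function, and the result is the same as calling find_all() on that object.

These two lines are equivalent:

soup.find_all("a")
soup("a")

So are these two:

soup.title.find_all(text=True)
soup.title(text=True)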

(3) find()

Syntax

find(name, attrs, recursive, text, **kwargs)

find_all() returns every tag in the document that matches, but sometimes you only want one result. If the document has a single <body> tag, for instance, scanning for all of them with find_all() is not a good fit, and calling find_all() with limit=1 is clumsier than simply calling find(). The two lines below are nearly equivalent:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all('title', limit=1))  # [<title>The Dormouse's story</title>]
print(soup.find('title'))               # <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing a single element, while find() returns the element itself.
When nothing matches, find_all() returns an empty list and find() returns None.

print(soup.find("nosuchtag"))  # None

soup.head.title is shorthand for navigating by tag name. It works by calling find() repeatedly on the current tag:

print(soup.head.title)                  # <title>The Dormouse's story</title>
print(soup.find("head").find("title"))  # <title>The Dormouse's story</title>
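Because find() returns None on a miss, a chained call such as soup.find("head").find("title") raises AttributeError as soon as one link in the chain is absent. The sketch below shows one way to guard against that; title_text is a hypothetical helper written for this example, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

def title_text(document):
    """Return the <title> text, or None when <head> or <title> is missing."""
    head = document.find("head")
    if head is None:  # find() found nothing, so stop before chaining further
        return None
    title = head.find("title")
    return title.get_text() if title is not None else None

with_title = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>", "html.parser"
)
without_head = BeautifulSoup("<p>no head here</p>", "html.parser")

found = title_text(with_title)
not_found = title_text(without_head)
print(found, not_found)
```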

(4) Other methods

See the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-parents-find-parent

(5) CSS selectors

In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same notation can be used here to pick out elements, through the soup.select() method, which returns a list.

1. Finding by tag name

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select("title"))  # [<title>The Dormouse's story</title>]
print(soup.select("b"))      # [<b class="boldest" id="bbb">The Dormouse's story</b>]

2. Finding by class name

print(soup.select(".sister"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

3. Finding by id

print(soup.select("#link1"))  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

4. Combined selectors

Combined selectors follow the same rules as combining tag names, class names, and ids in a CSS file. For example, to find the element with id link2 inside a p tag, separate the two with a space:

print(soup.select("p #link2"))  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Direct child selectors

print(soup.select("p > #link2"))  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

5. Attribute selectors

A selector can also test attributes, which are written in square brackets. The attribute belongs to the same node as the tag, so there must be no space between them; with a space, nothing will match.

print(soup.select("a[href='http://example.com/tillie']"))  # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

select() always returns its results as a list, so you can loop over them and call get_text() to get each element's content:

for title in soup.select('a'):
    print(title.get_text())
# Elsie
# Lacie
# Tillie
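A few related selector forms are worth knowing alongside the ones above. The sketch below assumes a reasonably recent Beautiful Soup, where select_one() and the CSS attribute operators ^= and $= are available, and uses a cut-down copy of the sisters paragraph:

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("a.sister")                      # the first match, or None
prefix = soup.select('a[href^="http://example.com/"]')   # href starts with the given text
suffix = soup.select('a[href$="tillie"]')                # href ends with the given text
hrefs = [a["href"] for a in soup.select("p.story > a")]  # read an attribute from each match

print(first["id"], len(prefix), suffix[0].get_text(), hrefs)
```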

5. Modifying the document tree

See the "Modifying the tree" section of the official documentation.
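As a rough taste of what that documentation covers, here is a minimal sketch of a few common edits on a hypothetical fragment: replacing text, renaming a tag, removing a tag, and appending a new one. It is an illustration only, not a summary of everything the library offers.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="title"><b>The Dormouse\'s story</b><i>draft</i></p>', "html.parser"
)

soup.b.string = "A new title"  # replace the text inside <b>
soup.b.name = "strong"         # rename the tag
soup.i.decompose()             # remove <i> and its contents from the tree

link = soup.new_tag("a", href="http://example.com/elsie")  # build a new tag
link.string = "Elsie"
soup.p.append(link)            # add it as the last child of <p>

result = str(soup)
print(result)
```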

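II. The XPath module

About XPath

XPath plays an important role in web scraping with Python. XPath and regular expressions (re) can do much the same job, but for analyzing web pages XPath is usually the more convenient of the two, which pushes re into a supporting role.

Introducing XPath

What is it? XPath stands for XML Path Language, a small query language.
As a language, XPath has these advantages:

It can find information in XML
It also works on HTML
It navigates by elements and attributes

Requirements for using XPath in Python: XPath support comes from the lxml library, so install lxml first.

The basic way to call XPath:

from lxml import etree

selector = etree.HTML(source)  # parse the page source into a tree that XPath can query
selector.xpath(expression)     # returns a list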

XPath syntax

Queries

html_doc = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <div class="d1">
        <div class="d2">
            <p class="story">
                <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
                <a href="http://example.com/tillie" id="link3">Tillie</a>
            </p>
        </div>
        <div>
            <p id="p1">ALex is dsb</p>
            <p id="p2">Egon too</p>
        </div>
    </div>
    <div class="d3">
        <a href="http://www.baidu.com">baidu</a>
        <p>百度</p>
    </div>
</body>
</html>
"""

from lxml import etree

selector = etree.HTML(html_doc)  # parse the source into a tree that XPath can query

1. Selecting nodes

Syntax:

Expression | Description | Example | Result
nodename | Selects the child nodes named nodename | xpath('//div') | Selects all div nodes
/ | Selects from the root node | xpath('/div') | Selects the div node at the root
// | Selects matching nodes anywhere in the document, regardless of position | xpath('//div') | Selects all div nodes
. | Selects the current node | xpath('./div') | Selects the div nodes under the current node
.. | Selects the parent of the current node | xpath('..') | Moves up to the parent node
@ | Selects attributes | xpath('//@class') | Selects all class attributes

Example code:

ret = selector.xpath("//div")  # every div element
ret = selector.xpath("/div")   # [] because the root element is html, not div
ret = selector.xpath("./div")  # [] because selector is the html element, which has no div children
ret = selector.xpath("//p[@id='p1']")
ret = selector.xpath("//div[@class='d1']/div/p[@class='story']")
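The difference between /, // and ./ is easier to see with the results printed. This sketch uses a smaller hypothetical document than the one above, and it also shows that xpath() can be called on any element, not just the root:

```python
from lxml import etree

doc = """
<html><body>
  <div class="d1"><div class="d2"><p id="p1">inner</p></div></div>
  <div class="d3"><p>outer</p></div>
</body></html>
"""
root = etree.HTML(doc)                   # root is the <html> element

all_divs = root.xpath("//div")           # every div, at any depth
from_root = root.xpath("/div")           # empty: the document root is html, not div
absolute = root.xpath("/html/body/div")  # only the two divs directly under body
d1 = root.xpath("//div[@class='d1']")[0]
relative = d1.xpath("./div")             # evaluated relative to d1
classes = root.xpath("//div/@class")     # attribute values come back as strings

print(len(all_divs), from_root, len(absolute), relative[0].get("class"), classes)
```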

2. Predicates

Syntax:

Expression | Result
xpath('//body/div[1]') | Selects the first div under body
xpath('//body/div[last()]') | Selects the last div under body
xpath('//body/div[last()-1]') | Selects the second-to-last div under body
xpath('//body/div[position()<3]') | Selects the first two div nodes under body
xpath('//body/div[@class]') | Selects the div nodes under body that have a class attribute
xpath('//body/div[@class="main"]') | Selects the div nodes under body whose class attribute is main
xpath('//body/div[price>35.00]') | Selects the div nodes under body whose price child element has a value greater than 35

Example code:

from lxml import etree

selector = etree.HTML(html_doc)  # parse the source into a tree that XPath can query
ret = selector.xpath("//p[@class='story']//a[2]")       # [<Element a at 0x20d9a188c88>]
ret = selector.xpath("//p[@class='story']//a[last()]")  # [<Element a at 0x27a64818d88>]

Each result is a list of Element objects.
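Positional predicates are easier to check when the query returns text instead of Element objects. A small sketch on a hypothetical list; note that XPath positions start at 1, not 0:

```python
from lxml import etree

root = etree.HTML("<ul><li>a</li><li>b</li><li>c</li><li class='x'>d</li></ul>")

first = root.xpath("//ul/li[1]/text()")                # positions start at 1
last = root.xpath("//ul/li[last()]/text()")
second_last = root.xpath("//ul/li[last()-1]/text()")
first_two = root.xpath("//ul/li[position()<3]/text()")
with_class = root.xpath("//ul/li[@class]/text()")      # only items that have a class attribute

print(first, last, second_last, first_two, with_class)
```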

3. Wildcards

XPath wildcards select XML elements whose names are not known in advance.

Expression | Result
xpath('//div/*') | Selects all child elements of div
xpath('//div[@*]') | Selects all div nodes that have at least one attribute

Example code:

ret = selector.xpath("//p[@class='story']/*")
# [<Element a at 0x22c8461bcc8>, <Element a at 0x22c8461bc88>, <Element a at 0x22c8461bd88>]
ret = selector.xpath("//p[@class='story']/a[@class]")
# [<Element a at 0x1ebe370bc88>, <Element a at 0x1ebe370bd88>]

4. Selecting several paths

The "|" operator selects several paths at once.

Expression | Result
xpath('//div|//table') | Selects all div and table nodes

Example code:

ret = selector.xpath("//p[@class='story']/a[@class]|//div[@class='d3']")
print(ret)
# [<Element a at 0x1d509bd9cc8>, <Element a at 0x1d509bd9c88>, <Element div at 0x1d509bd9d88>]

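5. XPath axes

An axis defines a set of nodes relative to the current node.

Axis | Expression | Description
ancestor | xpath('./ancestor::*') | Selects all ancestors of the current node (parent, grandparent, and so on)
ancestor-or-self | xpath('./ancestor-or-self::*') | Selects all ancestors of the current node and the node itself
attribute | xpath('./attribute::*') | Selects all attributes of the current node
child | xpath('./child::*') | Selects all children of the current node
descendant | xpath('./descendant::*') | Selects all descendants of the current node (children, grandchildren, and so on)
following | xpath('./following::*') | Selects everything in the document after the current node's closing tag
following-sibling | xpath('./following-sibling::*') | Selects the siblings after the current node
parent | xpath('./parent::*') | Selects the parent of the current node
preceding | xpath('./preceding::*') | Selects everything in the document before the current node's opening tag
preceding-sibling | xpath('./preceding-sibling::*') | Selects the siblings before the current node
self | xpath('./self::*') | Selects the current node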
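Axes make the most sense when evaluated from a specific element. The sketch below picks the middle paragraph of a hypothetical three-paragraph div as the current node and runs several axes from it:

```python
from lxml import etree

root = etree.HTML(
    "<div id='box'><p id='p1'>one</p><p id='p2'>two</p><p id='p3'>three</p></div>"
)
p2 = root.xpath("//p[@id='p2']")[0]  # use p2 as the current node

parent_ids = p2.xpath("./parent::*/@id")
ancestor_tags = [e.tag for e in p2.xpath("./ancestor::*")]  # returned in document order
after = p2.xpath("./following-sibling::*/@id")
before = p2.xpath("./preceding-sibling::*/@id")
attrs = p2.xpath("./attribute::*")                          # the attribute values of p2

print(parent_ids, ancestor_tags, after, before, attrs)
```

etree.HTML() wraps the fragment in html and body elements, which is why they show up among the ancestors.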

6. Functions

Functions make fuzzy matching much easier.

Function | Usage | Description
starts-with | xpath('//div[starts-with(@id,"ma")]') | Selects div nodes whose id starts with ma
contains | xpath('//div[contains(@id,"ma")]') | Selects div nodes whose id contains ma
and (operator) | xpath('//div[contains(@id,"ma") and contains(@id,"in")]') | Selects div nodes whose id contains both ma and in
text() | xpath('//div[contains(text(),"ma")]') | Selects div nodes whose text contains ma

Example 1:

# The text of the <a> with id link1 inside the <p> whose class is story
ret = selector.xpath("//p[@class='story']/a[@id='link1']/text()")
print(ret)  # ['Elsie']

Example 2:

# The text of every <a> whose id starts with link, inside the <p> whose class is story
ret = selector.xpath("//p[@class='story']/a[starts-with(@id,'link')]/text()")
print(ret)  # ['Elsie', 'Lacie', 'Tillie']

Example 3:

# The text of every <a> whose id contains k, inside the <p> whose class is story
ret = selector.xpath("//p[@class='story']/a[contains(@id,'k')]/text()")
print(ret)  # ['Elsie', 'Lacie', 'Tillie']

Example 4:

# The text of every <a> whose class contains is, inside the <p> whose class is story
ret = selector.xpath("//p[@class='story']/a[contains(@class,'is')]/text()")
print(ret)  # ['Elsie', 'Lacie']

Example 5:

# The text of every <a> whose text contains ie, inside the <p> whose class is story
ret = selector.xpath("//p[@class='story']/a[contains(text(),'ie')]/text()")
print(ret)  # ['Elsie', 'Lacie', 'Tillie']
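These functions can be combined, and the same predicates work when extracting attributes instead of text. The sketch below also uses not() and normalize-space(), two standard XPath 1.0 functions that are not in the table above; the markup is a hypothetical variant of the story paragraph with extra whitespace around one name.

```python
from lxml import etree

root = etree.HTML("""
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a href="http://example.com/tillie" id="link3">  Tillie  </a>
</p>
""")

both = root.xpath("//a[starts-with(@id,'link') and contains(@href,'lacie')]/text()")
no_class = root.xpath("//a[not(@class)]/@id")              # <a> tags without a class attribute
trimmed = root.xpath("normalize-space(//a[@id='link3'])")  # returns a string, not a list
hrefs = root.xpath("//a[contains(@class,'sister')]/@href")

print(both, no_class, repr(trimmed), hrefs)
```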

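The Element object

from lxml.etree import _Element

ret = selector.xpath("//p[@class='story']/a")  # a query that returns elements rather than text
for obj in ret:
    print(obj)
    print(type(obj))  # <class 'lxml.etree._Element'>

'''
The Element object

class xml.etree.ElementTree.Element(tag, attrib={}, **extra)

tag: string, the kind of data the element represents.
text: string, the element's content.
tail: string, the text that follows the element's end tag.
attrib: dictionary, the element's attributes.

# Working with attributes
clear(): removes the element's descendants and attributes, and sets text and tail to None.
get(key, default=None): returns the value of the attribute key, or default if the attribute does not exist.
items(): returns the attributes as a list of (key, value) pairs.
keys(): returns a list of the element's attribute names.
set(key, value): sets an attribute to the given value.

# Working with descendants
append(subelement): adds a direct child element.
extend(subelements): adds a sequence of elements as children. (New in Python 2.7.)
find(match): finds the first matching child element; match can be a tag or a path.
findall(match): finds all matching child elements; match can be a tag or a path.
findtext(match): finds the first matching child element and returns its text; match can be a tag or a path.
insert(index, element): inserts a child element at the given position.
iter(tag=None): returns an iterator over all descendants of the element, or over those with the given tag. (New in Python 2.7.)
iterfind(match): finds all matching descendants by tag or path.
itertext(): iterates over all descendants and returns their text.
remove(subelement): removes a child element.
'''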
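A short hands-on sketch of the most commonly used members listed above, run against a hypothetical one-paragraph document. lxml's elements follow the same ElementTree-style API.

```python
from lxml import etree

root = etree.HTML(
    '<p class="story"><a href="http://example.com/elsie" id="link1">Elsie</a>, and others</p>'
)
a = root.xpath("//a")[0]

tag = a.tag                     # the element's name
text = a.text                   # the text inside the element
tail = a.tail                   # the text after the end tag, up to the next element
href = a.get("href")            # one attribute; None (or a default) if it is missing
attrs = dict(a.attrib)          # all attributes as a plain dict
a.set("class", "sister")        # add or overwrite an attribute
p = root.find(".//p")           # find() accepts a tag name or a simple path
p_text = "".join(p.itertext())  # all text under <p>, in document order

print(tag, text, repr(tail), href, attrs, p_text)
```

The tail attribute is the one that most often trips people up: text that follows an element belongs to that element's tail, not to the parent's text.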

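Reference for this section:

https://www.cnblogs.com/yuanchenqi/articles/9487175.html

Case study: Anjuke

Open the Anjuke second-hand housing listings:

https://beijing.anjuke.com/sale/?from=navigation

The goal is to scrape the listing titles.

Be sure to open the page in Chrome, then right-click and choose Inspect. Find the relevant tag, right-click it, and choose Copy --> Copy XPath.

That gives:

//*[@id="leftContent"]/ul/li[1]/div/div[1]/a

Now write the code:

import requests
from lxml import etree

# Fetch the second-hand housing page
res = requests.get("https://beijing.anjuke.com/sale/?from=navigation")
selector = etree.HTML(res.content)  # parse the HTML source into a tree that XPath can query
# Match the title of the first listing
ret = selector.xpath('//*[@id="houselist-mod-new"]/li[1]/div[2]/div[1]/a/text()')
print(ret)

Output:

['\n 鳳凰城二期，高層兩居室，看房有鑰匙隨時，臨國際村，金星園瀾石']

Convenient, isn't it? Note that Chrome can quickly generate an XPath expression for you, but it is not a cure-all: on pages that load their content dynamically, the copied expression may not work.

The result above is a single item. What if you want several? Then you have to adjust the expression yourself.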
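One common adjustment is to drop the positional index that Chrome puts on the repeating element (for example li[1]), so that the path matches every item instead of the first. The sketch below shows the idea offline; the markup is hypothetical and only loosely shaped like a listings page, so the id and paths are illustrative and are not the real site's structure.

```python
from lxml import etree

# Hypothetical markup standing in for a listings page.
sample = """
<ul id="houselist">
  <li><div class="info"><a href="/1">  Two-bedroom near the park </a></div></li>
  <li><div class="info"><a href="/2">South-facing studio</a></div></li>
  <li><div class="info"><a href="/3">Loft with a view</a></div></li>
</ul>
"""
root = etree.HTML(sample)

first_only = root.xpath('//*[@id="houselist"]/li[1]/div/a/text()')  # copied-style path: one item
all_titles = root.xpath('//*[@id="houselist"]/li/div/a/text()')     # index removed: every item
titles = [t.strip() for t in all_titles]                            # tidy surrounding whitespace

print(first_only)
print(titles)
```

Stripping each result also removes the leading newline and spaces seen in the output above.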

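XPath Helper

XPath Helper is a Chrome extension for working out XPath expressions directly on a web page: you can try an expression against the page you are viewing and see what it matches.

I already have it installed here.

Once it is installed, an icon appears in the top-right corner of the browser.

Click the icon and a black panel pops up. Paste the XPath expression from above into the left-hand box, and the result appears on the right immediately.

Elements that match are highlighted in yellow on the page.

Homework:

Scrape second-hand housing listings from Lianjia. Three pages is enough.

Note: do it with XPath!

Step by step

Open the Lianjia second-hand housing page:

https://bj.lianjia.com/ershoufang/

Getting the titles

Get the title of the first listing

Select the title part of every listing

Implement it in Python

import requests
from lxml import etree

# Fetch the second-hand housing page
res = requests.get("https://bj.lianjia.com/ershoufang/")
selector = etree.HTML(res.content)  # parse the HTML source into a tree that XPath can query
# Match all the titles
ret = selector.xpath('//*[@id="leftContent"]/ul/li[*]//div/div/a/text()')
print(ret)

Output:

['南向兩居室電梯房簡單裝修適合居住', '惠民家園 ', ...]

That is all the titles on the page.

Reposted from: https://www.cnblogs.com/xiao987334176/p/9720562.html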
