Using Parsing Libraries in Python Web Scraping

Published: 2024/2/28 python 44 豆豆

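Chapters written so far

Chapter 1: Introduction to web crawlers
Chapter 2: Using the basic libraries
Chapter 3: Using parsing libraries
Chapter 4: Data storage
Chapter 5: Scraping dynamic web pages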

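Contents

- Chapters written so far
- Chapter 3: Using parsing libraries
  - 3.1 BeautifulSoup
    - 3.1.1 Installing BeautifulSoup
    - 3.1.2 Understanding the BeautifulSoup library
    - 3.1.3 Parsers
    - 3.1.4 Basic usage of BeautifulSoup
      - 3.1.4.1 Getting a tag with soup.<tag>
      - 3.1.4.2 Getting a tag's attributes
      - 3.1.4.3 Traversing the tag tree
      - 3.1.4.4 Selecting tags with BeautifulSoup's methods
        - The find_all() method
        - The find() method
        - The prettify() method
        - Other query methods
      - 3.1.4.6 Using lambda expressions
      - 3.1.4.5 CSS selectors
        - Nested selection
        - Getting attributes
        - Getting text
    - 3.1.5 Scraping QQ Music's new releases with BeautifulSoup
  - 3.2 XPath (key topic)
    - 3.2.1 Introduction to XPath
    - 3.2.2 Common XPath rules
    - 3.2.3 Using XPath
      - 3.2.3.1 Installing lxml
      - 3.2.3.2 Using XPath
      - 3.2.3.3 XPath tips
    - 3.2.4 Scraping popular Chinese cities from Qyer with XPath
  - 3.3 pyquery (for reference)
    - 3.3.1 Installing pyquery
    - 3.3.2 Initialization
    - 3.3.3 Using pyquery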

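Chapter 3: Using parsing libraries

3.1 BeautifulSoup

BeautifulSoup, sometimes nicknamed "delicious soup", provides simple, Pythonic functions for navigating, searching, modifying and analyzing a parse tree. It is a toolbox that parses a document and hands you the data you want to scrape. BeautifulSoup is as well regarded as lxml and html5lib among Python parsing tools, and it lets you choose between flexible parsing strategies and raw speed.

Author's note: once you have learned XPath you will probably find BeautifulSoup rather clumsy, and you may hardly use it afterwards. Still, learning BeautifulSoup strengthens your understanding of the tag tree, of child, parent and descendant tags, and of how tags are structured. All of that is groundwork for XPath, which is why many people pick BeautifulSoup as their first parsing library when learning web scraping in Python.

Beautiful Soup is a library for parsing, traversing and maintaining a "tag tree".

BeautifulSoup documentation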

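3.1.1 Installing BeautifulSoup

Run the following command in a cmd window to install BeautifulSoup with pip:

    pip install beautifulsoup4

The Beautiful Soup library is also referred to as beautifulsoup4 or bs4. The conventional import looks like this:

    from bs4 import BeautifulSoup  # or: import bs4

Put the import above in a Python file and run it. If no error is raised, BeautifulSoup is installed correctly.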

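3.1.2 Understanding the BeautifulSoup library

BeautifulSoup treats HTML code as a tag tree, and a BeautifulSoup object corresponds to the entire content of one HTML/XML document. Each tag in the tree is made up of the following elements:

Basic element	Description
Tag	A tag, the basic unit of information, opened with <> and closed with </>
Name	The tag's name; the name of <p>...</p> is 'p'. Access: <tag>.name
Attributes	The tag's attributes, organized as a dictionary. Access: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the string between <> and </>. Access: <tag>.string
Comment	The comment part of a tag's string, a special Comment type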

3.1.3 Parsers

Beautiful Soup relies on a parser when it parses a document:

    soup = BeautifulSoup('<html>data</html>', 'html.parser')

Parser	Usage	Requirement
bs4's HTML parser	BeautifulSoup(mk, 'html.parser')	install the bs4 library
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib's parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib

3.1.4 Basic usage of BeautifulSoup

The Tag was introduced above as the basic unit of information, opened with <> and closed with </>. The sections below cover the basic usage of BeautifulSoup.

3.1.4.1 Getting a tag with soup.<tag>

    from bs4 import BeautifulSoup

    # Use soup.<tagname> to get the specified tag
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
    ...
    <f></f><c>This is not a comment</c>
    </body>
    </html>
    '''
    soup = BeautifulSoup(text, "html.parser")
    print(soup.a)

soup.a fetches the a tag from the HTML. If there are several, the first one is returned. Output:

    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

Note: any tag that exists in the HTML can be accessed as soup.<tag>. When the HTML contains several identical <tag> elements, soup.<tag> returns the first one.

3.1.4.2 Getting a tag's attributes

    from bs4 import BeautifulSoup

    # Getting a tag's name, attributes and strings with BeautifulSoup
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
    ...
    <f><!--This is a comment--></f><c>This is not a comment</c>
    </body>
    </html>
    '''
    soup = BeautifulSoup(text, "html.parser")

    print("Name of the a tag:", soup.a.name)
    print(type(soup.a.attrs))
    print("Attributes of the a tag:", soup.a.attrs)
    print("class attribute of the a tag:", soup.a.attrs['class'])
    print(type(soup.a.string))
    print("Non-attribute string in the a tag:", soup.a.string)
    print(type(soup.f.string))
    print("Comment in the f tag:", soup.f.string)

Output:

    Name of the a tag: a
    <class 'dict'>
    Attributes of the a tag: {'class': ['sister'], 'href': 'http://example.com/elsie', 'id': 'link1'}
    class attribute of the a tag: ['sister']
    <class 'bs4.element.NavigableString'>
    Non-attribute string in the a tag: Elsie
    <class 'bs4.element.Comment'>
    Comment in the f tag: This is a comment

3.1.4.3 Traversing the tag tree

Downward traversal

Attribute	Description
.contents	A list of child nodes; all children of <tag> are stored in the list
.children	An iterator over the child nodes; like .contents, used to loop over the children
.descendants	An iterator over all descendant nodes, used to loop over every descendant

    from bs4 import BeautifulSoup

    # Downward traversal of the tag tree
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    <a><p>...</p>aaa</a>
    </body>
    </html>
    '''
    soup = BeautifulSoup(text, 'html.parser')

    print(len(soup.body.contents))
    print(soup.body.contents)
    print("-"*50)
    for i, child in enumerate(soup.body.children):
        print(i, child)
    print("-"*50)
    for i, descendant in enumerate(soup.body.descendants):
        print(i, descendant)

Output (the entries that look empty are the '\n' strings between tags):

    9
    ['\n', <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n', <a><p>...</p>aaa</a>, '\n']
    --------------------------------------------------
    0
    1 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    2
    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4
    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6
    7 <a><p>...</p>aaa</a>
    8
    --------------------------------------------------
    0
    1 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    2 Elsie
    3
    4 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    5 Lacie
    6
    7 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    8 Tillie
    9
    10 <a><p>...</p>aaa</a>
    11 <p>...</p>
    12 ...
    13 aaa
    14

Note: .descendants walks the children recursively, so it yields every descendant node.

Upward traversal

Attribute	Description
.parent	The node's parent tag
.parents	An iterator over the node's ancestor tags, used to loop over the ancestors

    from bs4 import BeautifulSoup

    # Upward traversal of the tag tree
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    <a><p>...</p>aaa</a>
    </body>
    </html>
    '''
    soup = BeautifulSoup(text, 'html.parser')

    print(soup.p.parent)
    print("-"*50)
    for i, parent in enumerate(soup.p.parents):
        print(i, parent)

Output (the last ancestor, number 3, is the BeautifulSoup document object itself):

    <a><p>...</p>aaa</a>
    --------------------------------------------------
    0 <a><p>...</p>aaa</a>
    1 <body>
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    <a><p>...</p>aaa</a>
    </body>
    2 <html><head><title>The Dormouse's story</title></head>
    <body>
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    <a><p>...</p>aaa</a>
    </body>
    </html>
    3
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    <a><p>...</p>aaa</a>
    </body>
    </html>

Sibling traversal

Attribute	Description
.next_sibling	The next sibling node in HTML text order
.previous_sibling	The previous sibling node in HTML text order
.next_siblings	An iterator over all following sibling nodes in HTML text order
.previous_siblings	An iterator over all preceding sibling nodes in HTML text order

    from bs4 import BeautifulSoup

    # Sibling traversal of the tag tree
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a></a>
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <p>pp</p>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    <a>...aaa</a>
    </body>
    </html>
    '''
    text = text.replace('\n', '')  # strip newlines so they do not show up as sibling nodes
    soup = BeautifulSoup(text, 'html.parser')

    print(soup.p.next_sibling)
    print("-"*50)
    print(soup.p.previous_sibling)
    print("-"*50)
    for i, previous_sibling in enumerate(soup.p.previous_siblings):
        print(i, previous_sibling)
    print("-"*50)
    for i, next_sibling in enumerate(soup.p.next_siblings):
        print(i, next_sibling)

Output:

    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    --------------------------------------------------
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    --------------------------------------------------
    0 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    1 <a></a>
    --------------------------------------------------
    0 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    1 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    2 <a>...aaa</a>

3.1.4.4 Selecting tags with BeautifulSoup's methods

The find_all() method

find_all() returns every element that matches the conditions you pass in, which can be attributes or text.

    find_all(name, attrs, recursive, text, limit, **kwargs)

name parameter: selects all tags whose name matches. It can be a single tag name or a Python list of several tag names.

    from bs4 import BeautifulSoup

    # Filtering by tag name with soup.find_all()
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p></p>
    <a href="http://example.com" class="sister">Elsie</a>
    <a href="http://example.com" class="sister">Tillie</a>
    <p>...</p>
    </body>
    </html>
    '''
    text = text.replace('\n', '')
    soup = BeautifulSoup(text, 'html.parser')

    print(soup.find_all('a'))             # all a tags
    print(soup.find_all('p'))             # all p tags
    print(soup.find_all(['p', 'title']))  # all p tags and title tags

Output:

    [<a class="sister" href="http://example.com">Elsie</a>, <a class="sister" href="http://example.com">Tillie</a>]
    [<p></p>, <p>...</p>]
    [<title>The Dormouse's story</title>, <p></p>, <p>...</p>]

attrs parameter: selects tags by their attributes. Pass a dictionary that maps attribute names to the values they must have.

    from bs4 import BeautifulSoup

    # Filtering by tag attributes with soup.find_all()
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b">Elsie</a>
    <a href="link1" class="a">Tillie</a>
    <a href="link1" class="c">df</a>
    <p></p>
    <p>...</p>
    </body>
    </html>
    '''
    text = text.replace('\n', '')
    soup = BeautifulSoup(text, 'html.parser')

    print(soup.find_all('a', {'href': 'link1'}))                # all a tags whose href is link1
    print('-'*50)
    print(soup.find_all('a', {'href': 'link1', 'class': 'a'}))  # all a tags whose href is link1 and whose class is a
    print('-'*50)
    print(soup.find_all('a', {'class': ['a', 'b']}))            # all a tags whose class is a or b
    print('-'*50)
    print(soup.find_all(['p', 'title']))                        # all p tags and title tags

Output:

    [<a class="b" href="link1">Elsie</a>, <a class="a" href="link1">Tillie</a>, <a class="c" href="link1">df</a>]
    --------------------------------------------------
    [<a class="a" href="link1">Tillie</a>]
    --------------------------------------------------
    [<a class="b" href="link1">Elsie</a>, <a class="a" href="link1">Tillie</a>]
    --------------------------------------------------
    [<title>The Dormouse's story</title>, <p></p>, <p>...</p>]

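recursive parameter: a boolean, True by default. It controls how deep find_all() searches in the tag structure. With True, find_all() searches all children of the tag it is called on, and the children of those children. With False, find_all() only looks at the direct children of that tag. You rarely need to set recursive, unless you know exactly what you are looking for and scraping speed really matters.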
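The article gives no example for recursive, so here is a minimal sketch of the difference. The HTML snippet and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one <p> is a direct child of <div>, one is nested deeper.
html = "<div><p>direct child</p><section><p>nested</p></section></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Default (recursive=True): every descendant of <div> is searched.
all_p = div.find_all("p")
# recursive=False: only the direct children of <div> are searched.
direct_p = div.find_all("p", recursive=False)

print([p.get_text() for p in all_p])
print([p.get_text() for p in direct_p])
```

The first call should find both paragraphs, the second only the one that sits directly under the div.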

text parameter: matches against the text content of tags.

    from bs4 import BeautifulSoup
    import re

    # Filtering by text with soup.find_all()
    # (newer bs4 versions also accept this argument under the name string=)
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b">Elsie</a>
    <a href="link1" class="a">Tillie</a>
    <a href="link1" class="c">df</a>
    <p>...</p>
    <p>dfd</p>
    </body>
    </html>
    '''
    text = text.replace('\n', '')
    soup = BeautifulSoup(text, 'html.parser')

    # Find every tag text that contains "df"; the result is a list of strings
    print(soup.find_all(text=re.compile('.*df.*')))

Output:

    ['df', 'dfd']

limit parameter: caps the number of results. It only applies to find_all(); find() is in effect find_all() with limit set to 1.

    from bs4 import BeautifulSoup

    # Limiting the number of results of soup.find_all()
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b">Elsie</a>
    <a href="link1" class="a">Tillie</a>
    <a href="link1" class="c">df</a>
    </body>
    </html>
    '''
    soup = BeautifulSoup(text, 'html.parser')
    print(soup.find_all('a', limit=2))

Output:

    [<a class="b" href="link1">Elsie</a>, <a class="a" href="link1">Tillie</a>]

keyword arguments: keyword arguments let you select tags that have a given attribute, for example:

    from bs4 import BeautifulSoup

    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b" id='id1'>Elsie</a>
    <a href="link1" class="a" id='id1'>Tillie</a>
    <a href="link1" class="c" id='id2'>df</a>
    </body>
    </html>
    '''
    soup = BeautifulSoup(text, 'html.parser')
    all_label = soup.find_all(id="id1")
    print(all_label[0].get_text())

Output:

    Elsie

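Things to keep in mind when using keyword arguments:

A keyword argument can always be replaced by the attrs dictionary. For example, these two lines do the same thing:

    bsobj.find_all(id="text")
    bsobj.find_all("", {"id": "text"})

Keyword arguments occasionally cause trouble, especially when searching by the class attribute: class is a reserved word in Python and cannot be used as an argument name. BeautifulSoup offers a slightly clumsy workaround, which is to append an underscore to class:

    bsobj.find_all(class_="green")

You can of course also write it like this:

    bsobj.find_all("", {"class": "green"})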
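To check the equivalence described above, the following sketch runs both spellings against a small made-up document and compares the results. The HTML and the variable names are assumptions for illustration; the attrs dictionary is passed by keyword here rather than positionally.

```python
from bs4 import BeautifulSoup

# Hypothetical document with one "green" paragraph.
html = '<p class="green" id="text">hello</p><p class="red">world</p>'
bsobj = BeautifulSoup(html, "html.parser")

# class is a reserved word, so the keyword form needs the trailing underscore.
by_keyword = bsobj.find_all(class_="green")
# The same query expressed through the attrs dictionary.
by_attrs = bsobj.find_all(attrs={"class": "green"})
# Ordinary attributes such as id need no underscore.
by_id = bsobj.find_all(id="text")

print(by_keyword == by_attrs)
print(by_id[0].get_text())
```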

The find() method

Besides find_all() there is find(). find() returns a single tag, the first match, whereas find_all() returns a list of all matching tags.

    from bs4 import BeautifulSoup

    # Using bs_object.find()
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b" id='id1'>Elsie</a>
    <a href="link1" class="a" id='id1'>Tillie</a>
    <a href="link1" class="c" id='id2'>df</a>
    </body>
    </html>
    '''
    soup = BeautifulSoup(text, 'html.parser')
    print(soup.find(id="id1"))  # find_all() here would return a list containing two tags

Output:

    <a class="b" href="link1" id="id1">Elsie</a>

Two tags have id=id1, but find() only returns the first tag that matches.

The prettify() method

This method completes the HTML text and formats it nicely:

    from bs4 import BeautifulSoup

    # Using bs_object.prettify()
    text = '''
    <html><head> <title>The Dormouse'sstory</title></head> <body> <a href="link1" class="b" id='id1'>Elsie</a> <a href="link1" class="a" id='id1'> Tillie</a> <a href= "link1" class="c" id='id2'>df</a> </body>'''
    soup = BeautifulSoup(text, 'html.parser')
    print(soup.prettify())

Output:

    <html>
     <head>
      <title>
       The Dormouse'sstory
      </title>
     </head>
     <body>
      <a class="b" href="link1" id="id1">
       Elsie
      </a>
      <a class="a" href="link1" id="id1">
       Tillie
      </a>
      <a class="c" href="link1" id="id2">
       df
      </a>
     </body>
    </html>

The HTML text passed in is missing a closing tag and is badly formatted. As the output shows, prettify() adds the missing closing tag and formats the HTML.

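Other query methods:

find_parents() and find_parent(): the former returns all ancestor nodes, the latter returns the direct parent node
find_next_siblings() and find_next_sibling(): the former returns all following sibling nodes, the latter returns the first following sibling node
find_previous_siblings() and find_previous_sibling(): the former returns all preceding sibling nodes, the latter returns the first preceding sibling node
find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter returns the first matching node
find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter returns the first matching node before it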
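These methods are only listed above, so here is a short sketch that exercises a few of them from a node in the middle of a list. The HTML and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical list with three items; we navigate starting from the middle one.
html = (
    '<ul id="menu">'
    '<li id="first">one</li>'
    '<li id="second">two</li>'
    '<li id="third">three</li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")
second = soup.find("li", id="second")

# Closest enclosing <ul>.
parent_id = second.find_parent("ul")["id"]
# Nearest sibling <li> after and before the current node.
next_text = second.find_next_sibling("li").get_text()
prev_text = second.find_previous_sibling("li").get_text()
# Every <li> that appears later in the document.
following = [li.get_text() for li in second.find_all_next("li")]

print(parent_id, next_text, prev_text, following)
```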

3.1.4.6 Using lambda expressions

BeautifulSoup lets you pass a function as the argument to find_all(). The only requirement is that the function takes a tag as its argument and returns a boolean. BeautifulSoup calls the function on every tag it encounters and keeps the tags for which it returns True.

The example below uses a lambda expression to select all tags that have exactly two attributes:

    from bs4 import BeautifulSoup

    # Using a lambda expression with BeautifulSoup
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b" id='id1'>Elsie</a>
    <a href="link2" class="a" id='id1'>Tillie</a>
    <a href="link3" class="c" >df</a>
    <a link="https"></a>
    </body>
    </html>
    '''
    soup = BeautifulSoup(text, 'html.parser')
    # Keep only the tags that have exactly 2 attributes
    print(soup.find_all(lambda tag: len(tag.attrs) == 2))

Output:

    [<a class="c" href="link3">df</a>]

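3.1.4.5 CSS selectors

BeautifulSoup offers one more kind of selector: CSS selectors. To use them, call the select() method and pass in a CSS selector.

Selector	Example	Meaning of the example
element	p	Selects all <p> elements
element,element	div,p	Selects all <div> elements and all <p> elements
element element	div p	Selects all <p> elements inside <div> elements
.class	.intro	Selects all elements with class="intro"
#id	#firstname	Selects the elements with id="firstname"
*	*	Selects all elements
[attribute]	[target]	Selects all elements that have a target attribute
[attribute=value]	[target=blank]	Selects all elements with target="blank"
[attribute*=value]	[title*=f]	Selects all elements whose title attribute contains the substring f
[attribute^=value]	[lang^=en]	Selects all elements whose lang attribute starts with en
[attribute$=value]	[href$=.jpg]	Selects all elements whose href attribute ends with .jpg

Note: get_text() is a method of the selected tag, not a CSS selector. Call it on a tag to get its text content, for example tag.get_text().

    from bs4 import BeautifulSoup

    # Using CSS selectors with bs_object.select()
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b" id='id1'>Elsie</a>
    <a href="link2" class="a" id='id1'>Tillie</a>
    <a href="link3" class="c" id='id2'>df</a>
    <a link="https"></a>
    <b link="af"></b>
    <b link="bba"></b>
    </body>
    </html>
    '''
    soup = BeautifulSoup(text, 'html.parser')

    print("All b tags:", soup.select('b'))
    # No space between a and .a: a space would mean "a descendant of a"
    print("All a tags with class a:", soup.select('a.a'))
    print("All a tags with class a and id id1:", soup.select('a.a#id1'))
    print("All a nodes whose link attribute is https:", soup.select('a[link="https"]'))
    print("All b nodes whose link attribute contains f:", soup.select('b[link*="f"]'))
    print("All b tags whose link attribute starts with b:", soup.select('b[link^="b"]'))

Output:

    All b tags: [<b link="af"></b>, <b link="bba"></b>]
    All a tags with class a: [<a class="a" href="link2" id="id1">Tillie</a>]
    All a tags with class a and id id1: [<a class="a" href="link2" id="id1">Tillie</a>]
    All a nodes whose link attribute is https: [<a link="https"></a>]
    All b nodes whose link attribute contains f: [<b link="af"></b>]
    All b tags whose link attribute starts with b: [<b link="bba"></b>]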

Nested selection

    from bs4 import BeautifulSoup

    # Nested selection with BeautifulSoup
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b" id='id1'>Elsie</a>
    <a href="link2" class="a" id='id1'>Tillie</a>
    <a href="link3" class="c" id='id2'>df</a>
    <a link="https"></a>
    </body>
    </html>
    '''
    soup = BeautifulSoup(text, 'html.parser')
    for i in soup.select('html'):
        print(i.select('head'))
        print(i.select('title'))

Output:

    [<head><title>The Dormouse's story</title></head>]
    [<title>The Dormouse's story</title>]

Getting attributes

    from bs4 import BeautifulSoup

    # Reading attributes of tags selected with a CSS selector
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <b link="af"></b>
    <b link="bba"></b>
    </body>
    </html>
    '''
    soup = BeautifulSoup(text, 'html.parser')
    for i in soup.select('b'):
        print(i['link'])
        print(i.attrs['link'])

Output:

    af
    af
    bba
    bba

Getting text

    from bs4 import BeautifulSoup

    # Getting the text content of tags
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <b>def</b>
    <b>oo</b>
    </body>
    </html>
    '''
    soup = BeautifulSoup(text, 'html.parser')
    for i in soup.select('b'):
        print(i.string)
        print(i.get_text())

Output:

    def
    def
    oo
    oo

3.1.5 Scraping QQ Music's new releases with BeautifulSoup

    from bs4 import BeautifulSoup
    import requests
    from fake_useragent import UserAgent

    # Scrape the newly released music on QQ Music with BeautifulSoup
    url = 'https://y.qq.com/'
    response = requests.get(url=url, headers={'user-Agent': UserAgent().chrome})
    soup = BeautifulSoup(response.text, 'html.parser')

    divs_1 = soup.find('div', {'id': 'new_album_box'})
    names = divs_1.select('span.playlist__title_txt > a')
    authors = divs_1.select('div.playlist__author > a')
    for name, author in zip(names, authors):
        print(name.text, author.text, sep=" ")

Output:

倒帶劉珂不再年少大志英雄如你洛天依愿你可以王琪瑋愛我的時候單依純鷂周興哲紅火中國年小5 除夕郭秀楓一起過年唐詩蒙著眼睛走大志

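3.2 XPath (key topic)

3.2.1 Introduction to XPath

XPath stands for XML Path Language. It is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works just as well for searching HTML documents.

W3C XPath tutorial

Runoob (菜鳥教程) XPath tutorial

XPath Helper extension

Official XPath documentation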

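3.2.2 Common XPath rules

Selecting nodes

XPath uses path expressions to select nodes in an XML document.

Expression	Description	Example	Meaning of the example
nodename	Selects the child nodes with this name	a	Selects the a child nodes of the current node
/	Selects starting from the root node	/a	Selects the root node a
//	Selects starting from any node	//a	Selects all a nodes, wherever they are in the document
.	Selects the current node	./a	Selects the a child nodes of the current node
..	Selects the parent of the current node	//a[@class="class1"]/..	Selects the parent of every a node whose class attribute is class1
@	Selects attributes	//a[@class="class1"]	Selects all a nodes whose class attribute is class1
text()	Selects the text in a node	//a[@class="class1"]/text()	Selects the text content of all a nodes whose class attribute is class1
*	Matches any node	/a/*	Selects all child elements of the a node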

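Predicates

Predicates are used to find specific nodes. A predicate is written inside square brackets.

Path expression	Result
/a/b[1]	Selects the first b node under the a node
/a/b[last()]	Selects the last b node under the a node
/a/b[last()-1]	Selects the second-to-last b node under the a node
/a/b[position()<3]	Selects the first 2 b nodes under the a node
//a[@class]	Selects all a nodes that have a class attribute
//a[@*]	Selects all a nodes that have any attribute
//a[@class="class1"]	Selects all a nodes whose class attribute is class1
/a/b[@price>20]	Selects all b nodes under the a node whose price attribute is greater than 20
//a[contains(@class, "a")]	Selects all a nodes whose class attribute contains the substring a
//a[@class="abc" and @id="abc"]	Selects all a nodes whose class attribute is abc and whose id attribute is abc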

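A good way to practice XPath syntax is the XPath Helper extension for Chrome. Once you are comfortable with it there, you can use it from Python.

XPath Helper extension:
Link: https://pan.baidu.com/s/1gOa8vewuIUhQ0GTefjrNsQ
Extraction code: sp4m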

3.2.3 Using XPath

3.2.3.1 Installing lxml

To use XPath in Python, install the lxml library:

    pip install lxml

3.2.3.2 Using XPath

Parsing HTML text with XPath

    from lxml import etree

    # Parse HTML text with etree.HTML()
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b" id='id1'>Elsie</a>
    <a href="link2" class="a" id='id1'>Tillie</a>
    <a href="link3" class="c" id='id2'>df</a>
    <a link="https"></a>
    </body>
    </html>
    '''
    html = etree.HTML(text, etree.HTMLParser())
    result = html.xpath('/html/head/title/text()')
    print(result)

Output:

    ["The Dormouse's story"]

The XPath expression /html/head/title/text() selects the text of the title tag under the head tag under the html tag, and the result is then printed.

The key to using XPath in a Python crawler is writing the XPath expression that matches the content you want to extract. Once the expression is right, you only need to swap it into the code above to get that content. So the focus when learning XPath is being able to write the expression for the information you need.

Parsing an HTML file with XPath

    from lxml import etree

    # Parse an HTML file with etree.parse()
    html = etree.parse('htmlTest.html', etree.HTMLParser())
    result = html.xpath('/html/head/title/text()')
    print(result)

Parsing an HTML file uses the etree.parse() method.
Contents of htmlTest.html:

    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b" id='id1'>Elsie</a>
    <a href="link2" class="a" id='id1'>Tillie</a>
    <a href="link3" class="c" id='id2'>df</a>
    <a link="https"></a>
    </body>
    </html>

Output:

    ["The Dormouse's story"]
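The predicates from section 3.2.2 can be tried the same way. The sketch below runs a few of them with lxml against a small made-up XML snippet; the element names, prices and class values are assumptions chosen so that each predicate selects something different.

```python
from lxml import etree

# Hypothetical XML: an <a> root with three <b> children.
text = '''
<a>
  <b price="10" class="cheap item">first</b>
  <b price="25" class="item">second</b>
  <b price="30">third</b>
</a>
'''
root = etree.XML(text)

first = root.xpath('/a/b[1]/text()')                 # first b under a
last = root.xpath('/a/b[last()]/text()')             # last b under a
first_two = root.xpath('/a/b[position()<3]/text()')  # first two b nodes
with_class = root.xpath('/a/b[@class]/text()')       # b nodes that have a class attribute
expensive = root.xpath('/a/b[@price>20]/text()')     # b nodes with price greater than 20
cheap = root.xpath('/a/b[contains(@class, "cheap")]/text()')  # class contains "cheap"

for name, value in [('first', first), ('last', last), ('first_two', first_two),
                    ('with_class', with_class), ('expensive', expensive), ('cheap', cheap)]:
    print(name, value)
```

Note that XPath positions start at 1, not 0, so b[1] is the first b node.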

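3.2.3.3 XPath tips

Use concat() to join the results of two XPath expressions

XPath expression used to scrape second-hand housing prices from Lianjia (鏈家網):

    concat(//div[@class="price"]/span[1]/text(),//div[@class="price"]/span[2]/span/text())

Use last() to pick tags counting from the end of the matched list

XPath expression used to scrape the heating information of second-hand homes from Lianjia:

    //div[@class="base"]/div[@class="content"]/ul/li[last()-1]/text()

Using contains()

    //div[contains(@name,'Tom')]

This finds all div tags whose name attribute contains the string Tom.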
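The three expressions above can be tested offline with lxml. The HTML below is a made-up stand-in that only mimics the class names used in the expressions; it is not the real markup of the Lianjia site.

```python
from lxml import etree

# Hypothetical markup shaped to fit the three expressions above.
text = '''
<div class="price"><span>350</span><span><span> wan</span></span></div>
<div class="base"><div class="content"><ul>
<li>3 rooms</li><li>central heating</li><li>has elevator</li>
</ul></div></div>
<div name="Tom Smith">hello</div>
'''
html = etree.HTML(text)

# concat() returns a single string built from the first node of each argument.
price = html.xpath(
    'concat(//div[@class="price"]/span[1]/text(),'
    '//div[@class="price"]/span[2]/span/text())'
)
# last()-1 picks the second-to-last <li>.
heating = html.xpath(
    '//div[@class="base"]/div[@class="content"]/ul/li[last()-1]/text()'
)
# contains() does a substring match on the attribute value.
toms = html.xpath('//div[contains(@name, "Tom")]/text()')

print(price)
print(heating)
print(toms)
```

Unlike most xpath() calls, the concat() call returns a string rather than a list.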

3.2.4 Scraping popular Chinese cities from Qyer with XPath

    import csv

    import parsel
    import requests
    from fake_useragent import UserAgent


    def getdata(url):
        headers = {"user-Agent": UserAgent().chrome}
        response = requests.get(url=url, headers=headers)
        response.encoding = response.apparent_encoding
        selector = parsel.Selector(response.text)  # turn the string into a selector object
        lis = selector.xpath('//ul[@class="plcCitylist"]/li')
        for li in lis:
            city_names = li.xpath('./h3/a/text()').get()
            city_names = city_names.rstrip()
            number_people = li.xpath('./p[2]/text()').get()
            place_hot = li.xpath('./p[@class="pois"]/a/text()').getall()
            place_hot = [place.strip() for place in place_hot]
            place_hot = '、'.join(place_hot)
            place_url = li.xpath('./p[@class="pics"]/a/@href').get()
            img_url = li.xpath('./p[@class="pics"]/a/img/@src').get()
            print(city_names, number_people, place_url, img_url, place_hot, sep='|')
            # Append one row per city to the CSV file
            with open('qiongyouData.csv', mode='a', encoding='utf-8', newline='') as file_object:
                csv_write = csv.writer(file_object)
                csv_write.writerow([city_names, number_people, place_url, img_url, place_hot])


    def main():
        for i in range(1, 172):
            url = "https://place.qyer.com/china/citylist-0-0-{}/".format(str(i))
            getdata(url)


    if __name__ == '__main__':
        main()

When run, the code prints the scraped data to the console and writes it to qiongyouData.csv. Because it does not use multithreading, scraping all of the data takes quite a while.

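3.3 pyquery (for reference)

XPath and Beautiful Soup are already powerful, but their syntax is still rather verbose. pyquery offers a more concise and elegant syntax that lets you extract data much as you would write jQuery.

If you have used jQuery before, the pyquery parsing library may suit you better.

pyquery official documentation

3.3.1 Installing pyquery

As before, run the following command in a cmd window to install pyquery with pip:

    pip install pyquery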

3.3.2 Initialization

As with BeautifulSoup, you initialize pyquery by passing in HTML text to create a PyQuery object. There are several ways to do this: passing a string, a URL, a file name, and so on.

Initializing from a string

    from pyquery import PyQuery as pq

    # Initialize a pyquery object from a string
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b" id='id1'>Elsie</a>
    <a href="link2" class="a" id='id1'>Tillie</a>
    <a href="link3" class="c" id='id2'>df</a>
    <a link="https"></a>
    </body>
    </html>
    '''
    doc = pq(text)
    print(doc('b'))

Output:

Initializing from a URL

The argument can also be a URL; just pass it as url. In practice a URL is rarely used for initialization, because a wrong encoding often leads to garbled text.

    from pyquery import PyQuery as pq

    # Initialize a pyquery object from a URL
    doc = pq(url='https://www.baidu.com', encoding='utf-8')
    print(doc('title'))

Output:

Initializing from a file

To initialize pyquery from a file, just pass filename:

    from pyquery import PyQuery as pq

    # Initialize a pyquery object from a file
    doc = pq(filename='pyqueryTest.html')
    print(doc('a'))

Contents of pyqueryTest.html:

Output:

    <a href="link1" class="b" id="id1">Elsie</a>
    <a href="link2" class="a" id="id1">Tillie</a>
    <a href="link3" class="c" id="id2">df</a>
    <a link="https"/>

3.3.3 Using pyquery

pyquery supports CSS selectors:

    from pyquery import PyQuery as pq

    # Using a CSS selector with pyquery
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b" id='id1'>Elsie</a>
    <a href="link2" class="a" id='id1'>Tillie</a>
    <a href="link3" class="c" id='id2'>df</a>
    </body>
    </html>
    '''
    html = pq(text)
    print(html('a.b#id1'))

The code first initializes pyquery with text and then passes in the selector 'a.b#id1', which selects all a tags whose class attribute is b and whose id attribute is id1.

Output:

    <a href="link1" class="b" id="id1">Elsie</a>

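Below are some commonly used query functions. They are used in much the same way as their jQuery counterparts:

Method	Description
PyQuery.find('css')	Searches all descendants of the current node for nodes matching the CSS selector. Returns a PyQuery object.
PyQuery.children('css')	Used like find(), but only searches the direct children.
PyQuery.parent()	Gets the parent of the current node. Returns a PyQuery object.
PyQuery.parents('css')	Gets the ancestors of the current node. Returns a PyQuery object.
PyQuery.siblings('css')	Gets the siblings of the current node. Returns a PyQuery object.
PyQuery.items()	Returns a generator over the current nodes, used to iterate over a PyQuery object.
PyQuery.attr()	Gets an attribute of the current node. When the PyQuery object contains several nodes, attr() only returns the attribute of the first one.
PyQuery.text()	Gets the text. When the PyQuery object contains several nodes, it returns one string that joins the text of all nodes.
PyQuery.html()	Gets the HTML inside the tag. When the PyQuery object contains several nodes, it returns the HTML of the first one.
PyQuery.addClass()	Adds a class to the node
PyQuery.removeClass()	Removes a class from the node
PyQuery.attr(' ', ' ')	Changes the value of one of the node's attributes
PyQuery.text(' ')	Changes the node's text
PyQuery.html(' ')	Changes the HTML inside the node
PyQuery.remove()	Removes the node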

The examples below show how these methods are used:

find()

    from pyquery import PyQuery as pq

    # Using pq_object.find()
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <li>
    <a href="link1" class="b" id='id1'>Elsie</a>
    <a href="link2" class="a" id='id1'>Tillie</a>
    <a href="link3" class="c" id='id2'>df</a>
    </li>
    </body>
    </html>
    '''
    pq_object = pq(text)       # initialize a pyquery object
    bod = pq_object('body')    # find the body tag in the pyquery object
    print(type(bod))

    a = bod.find('a.b#id1')    # among body's descendants, find the a tag with class b and id id1
    print(type(a))
    print(a.text())

Output:

    <class 'pyquery.pyquery.PyQuery'>
    <class 'pyquery.pyquery.PyQuery'>
    Elsie

children()

    from pyquery import PyQuery as pq

    # Using pq_object.children()
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <li>
    <a href="link1" class="b" id='id1'>Elsie</a>
    <a href="link2" class="a" id='id1'>Tillie</a>
    <a href="link3" class="c" id='id2'>df</a>
    </li>
    </body>
    </html>
    '''
    pq_object = pq(text)
    bod = pq_object('body')
    print(bod.children('a'))

Running this code prints nothing. children() looks for a tags among the direct children of body, and the only child of body is the li tag. To extract all the a tags, the code needs a small change:

Output:

    <a href="link1" class="b" id="id1">Elsie</a>
    <a href="link2" class="a" id="id1">Tillie</a>
    <a href="link3" class="c" id="id2">df</a>

parent()

    from pyquery import PyQuery as pq

    # Using pq_object.parent() and pq_object.parents()
    text = '''
    <body>
    <li>
    <a class="b" id='id1'>Elsie</a>
    </li>
    </body>
    '''
    pq_object = pq(text)
    a = pq_object('a.b#id1')
    print(a.parent())
    print('-'*50)
    print(a.parents())

Output:

    <li>
    <a class="b" id="id1">Elsie</a>
    </li>
    --------------------------------------------------
    <body>
    <li>
    <a class="b" id="id1">Elsie</a>
    </li>
    </body><li>
    <a class="b" id="id1">Elsie</a>
    </li>

siblings()

    from pyquery import PyQuery as pq

    # Using pq_object.siblings()
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b" id='id1'>Elsie</a>
    <a href="link2" class="a" id='id1'>Tillie</a>
    <a href="link3" class="c" id='id2'>df</a>
    </body>
    </html>
    '''
    pq_object = pq(text)
    a = pq_object('a.a#id1')
    print(a.siblings())

Output:

    <a href="link1" class="b" id="id1">Elsie</a>
    <a href="link3" class="c" id="id2">df</a>

items()

    from pyquery import PyQuery as pq

    # Using pq_object.items()
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b" id='id1'>Elsie</a>
    <a href="link2" class="a" id='id1'>Tillie</a>
    <a href="link3" class="c" id='id2'>df</a>
    </body>
    </html>
    '''
    pq_object = pq(text)
    a = pq_object('a')
    for i in a.items():
        if i.attr('href') == 'link1':
            print(i)

Output:

    <a href="link1" class="b" id="id1">Elsie</a>

html()

    from pyquery import PyQuery as pq

    # Using pq_object.html()
    text = '''
    <head>
    <title>The Dormouse's story</title>
    </head>
    '''
    pq_object = pq(text)
    head = pq_object('head')
    print(head.html())

Output:

    <title>The Dormouse's story</title>

Node operations

    from pyquery import PyQuery as pq

    # Manipulating nodes with pyquery
    text = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="link1" class="b" id='id1'>Elsietext</a>
    </body>
    </html>
    '''
    pq_object = pq(text)

    a = pq_object('a.b#id1')
    a.add_class('new_class')
    print("After adding class 'new_class' to the a node:", a)

    a.remove_class('b')
    print("After removing class b from the a node:", a)

    a.attr('href', 'new_link')
    print("After changing the href attribute of the a node to 'new_link':", a)

    a.text('new_text')
    print("After changing the text of the a node to 'new_text':", a)

    a.html('b_new_text')
    print("After changing the HTML of the a node to 'b_new_text':", a)

    doc = pq_object('html')
    a.remove()
    print("After removing the a node:", doc)

Output:

    After adding class 'new_class' to the a node: <a href="link1" class="b new_class" id="id1">Elsietext</a>
    After removing class b from the a node: <a href="link1" class="new_class" id="id1">Elsietext</a>
    After changing the href attribute of the a node to 'new_link': <a href="new_link" class="new_class" id="id1">Elsietext</a>
    After changing the text of the a node to 'new_text': <a href="new_link" class="new_class" id="id1">new_text</a>
    After changing the HTML of the a node to 'b_new_text': <a href="new_link" class="new_class" id="id1">b_new_text</a>
    After removing the a node: <html><head><title>The Dormouse's story</title></head>
    <body>
    </body>
    </html>

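That covers the four parsing tools commonly used in Python web scraping: BeautifulSoup, XPath (key topic), pyquery, and re (key topic). re was introduced in Chapter 2, Using the basic libraries. Learning these libraries well takes plenty of practice, so try to put what you have learned to use as much as you can.

Finally, thank you for reading.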
