
bs (Beautiful Soup)

Published: 2024/4/17


lxml: can be installed directly from a .whl file.

Fast, with good tolerance for malformed documents.

html5lib
Best tolerance for malformed markup.
Parses the document the way a web browser does and produces valid HTML5, but it is slow.
soup = BeautifulSoup(html_content, "html5lib")

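The BeautifulSoup constructor accepts either a string or an open file handle.
After parsing with Beautiful Soup, the whole document is converted to Unicode.
Beautiful Soup uses its encoding auto-detection sub-library (Unicode, Dammit) to identify the document's encoding and convert it to Unicode. The .original_encoding attribute of the BeautifulSoup object records the encoding that was detected.

Passing the from_encoding argument tells Beautiful Soup which encoding to use, so decoding is faster and more reliable.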

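from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"), "lxml")  # from a file handle; name the parser explicitly
soup = BeautifulSoup(html_content, "lxml")  # from a string
soup = BeautifulSoup(content, "lxml", from_encoding='utf-8')  # content must be bytes; from_encoding is ignored for str input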
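
A minimal sketch of from_encoding and .original_encoding. The sample markup is made up for illustration, and the built-in "html.parser" is used here only so that the example runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Bytes encoded as ISO-8859-1 (Latin-1), not UTF-8.
raw = "<p>Sacré bleu!</p>".encode("iso-8859-1")

# from_encoding applies to bytes input only.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")

print(soup.p.string)           # the decoded text
print(soup.original_encoding)  # the encoding Beautiful Soup settled on
```

If the markup is already a str, there is nothing to decode, so .original_encoding is None.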

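When Beautiful Soup writes a document out, the output is UTF-8 encoded regardless of the encoding of the input document.
To use a different encoding, pass it explicitly: print(soup.prettify("latin-1"))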
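
A short sketch of choosing the output encoding, assuming made-up sample markup and the built-in "html.parser". It compares the bytes produced for the same document in UTF-8 and Latin-1.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Sacré bleu!</p>", "html.parser")

utf8_bytes = soup.encode("utf-8")      # é is written as the two bytes C3 A9
latin1_bytes = soup.encode("latin-1")  # é is written as the single byte E9
pretty = soup.prettify("latin-1")      # with an encoding, prettify() returns bytes

print(utf8_bytes)
print(latin1_bytes)
```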

Parsing only part of a document: the parse_only argument and the SoupStrainer object
from bs4 import SoupStrainer

only_a_tags = SoupStrainer("a")

BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)
This produces only the content that belongs to <a> tags.
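
A runnable sketch of the same idea, with a made-up html_doc. Note that the Beautiful Soup documentation says parse_only does not work with the html5lib parser, so "html.parser" is used here.

```python
from bs4 import BeautifulSoup, SoupStrainer

html_doc = '<p>Intro <b>bold</b> <a href="/one">one</a> and <a href="/two">two</a></p>'

# Keep only <a> tags (and their contents) while parsing.
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)
print(soup.find("b"))  # <b> was never added to the tree
```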

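Improving performance
1. Use lxml directly.
2. Use the lxml parser.
3. Install cchardet.
4. Parsing only part of a document saves little parsing time, but it saves a lot of memory and makes searching faster.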

Most of the time the soup (BeautifulSoup) object can be treated as a Tag, but the soup object itself has no attributes.

attrs: the attributes of this tag
get('id'): returns the value of the id attribute

An attribute can also be read directly with soup.a['href'].
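
A small sketch of the three ways to read attributes, using a made-up <a> tag. The difference worth remembering is that get() returns None for a missing attribute, while indexing raises KeyError.

```python
from bs4 import BeautifulSoup

markup = '<a id="link1" class="sister big" href="http://example.com/elsie">Elsie</a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

all_attrs = tag.attrs      # dict of every attribute on the tag
tag_id = tag.get("id")     # value of the id attribute
title = tag.get("title")   # missing attribute: None
href = tag["href"]         # dictionary-style access
classes = tag["class"]     # class is multi-valued, so it is returned as a list

print(all_attrs)
```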

contents: list of child nodes
contents[0]: the first child node
children: iterator over this tag's direct child nodes (tags and text strings)
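
A sketch with made-up markup showing that both contents and children include text nodes as well as tags.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<ul><li>one</li>text<li>two</li></ul>", "html.parser")
ul = soup.ul

count = len(ul.contents)   # two <li> tags plus one text node
first = ul.contents[0]     # the first child node

# children yields the same direct children lazily, one at a time.
kinds = ["tag" if isinstance(child, Tag) else "text" for child in ul.children]
print(kinds)
```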

find: returns the first match
findAll / find_all: return all matches (findAll is the older name for find_all)
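
A quick sketch of the difference, on made-up markup: find returns a single tag or None, while find_all always returns a list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a href="/1">one</a><a href="/2">two</a></p>', "html.parser")

first_link = soup.find("a")       # first matching tag
all_links = soup.find_all("a")    # list of every matching tag
no_table = soup.find("table")     # no match: None
no_tables = soup.find_all("table")  # no match: empty list

print(first_link["href"], len(all_links))
```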

get_text / getText / text: return all of the text inside the tag

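string: returns the tag's text only when the tag has exactly one child node; with several child nodes it returns None
strings: returns an iterator over all of the text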
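
A sketch comparing string, strings and get_text on a made-up paragraph that has two child nodes (a text node and a <b> tag).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

p_string = soup.p.string         # <p> has two children, so this is None
b_string = soup.b.string         # <b> has a single text child
pieces = list(soup.p.strings)    # every text fragment under <p>
full_text = soup.p.get_text()    # all fragments joined together

print(pieces, full_text)
```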

parent: the direct parent node
parents: iterator that walks upward through every ancestor, one level at a time
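
A sketch of walking upward from a nested tag, using made-up markup. The last ancestor is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>x</b></p></body></html>", "html.parser")
b = soup.b

direct_parent = b.parent.name
ancestors = [node.name for node in b.parents]
print(direct_parent, ancestors)
```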

select

CSS selectors
Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:
soup.select("title")

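p:nth-child(2) selects every <p> element that is the second child of its parent.
p:nth-of-type(2) selects every <p> element that is the second <p> element of its parent.
Older bs4 releases supported only nth-of-type; since version 4.7.0, select() is backed by Soup Sieve, which also supports nth-child.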
soup.select("ul li:nth-of-type(3)")
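
A sketch of the difference between the two pseudo-classes, on made-up markup. It assumes bs4 4.7.0 or later, where select() is handled by the soupsieve package and nth-child is available.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><h2>Title</h2><p>first</p><p>second</p></div>", "html.parser")

# Second <p> among the <p> siblings.
of_type = [t.get_text() for t in soup.select("p:nth-of-type(2)")]

# A <p> that is the second child of its parent, counting all element types.
nth_child = [t.get_text() for t in soup.select("p:nth-child(2)")]

print(of_type, nth_child)
```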

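Searching level by level by tag name (matches descendants at any depth):
soup.select("body a")

Finding the direct children of a tag:
soup.select("head > title")

.class #id

Searching by CSS class name:
soup.select(".sister")

Searching by tag id:
soup.select("#link1")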
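
A combined sketch of the selectors above. The document is a shortened, made-up variant of the "three sisters" sample, and "html.parser" is an assumption made so the example has no extra dependencies.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<p class="story">'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

descendants = soup.select("body a")        # any depth below <body>
direct = soup.select("body > a")           # direct children only: none here
title = soup.select("head > title")        # <title> is a direct child of <head>
by_class = soup.select(".sister")
by_id = soup.select("#link1")

print(len(descendants), len(direct), by_id[0].get_text())
```

select() always returns a list; select_one() returns only the first match or None.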

http://www.w3school.com.cn/cssref/css_selectors.ASP

Beautiful Soup first converts the document to Unicode. If you pass in a byte string, Beautiful Soup assumes it is UTF-8 encoded; passing a Unicode string instead avoids decoding errors.

tag.body.li: reach a child tag directly by using HTML tag names as attributes
tag.next_sibling
tag.previous_sibling
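
A sketch of sibling navigation on made-up lists. The second half shows a common surprise: when the markup contains line breaks, the next sibling is often a whitespace text node, so find_next_sibling is used to skip to the next tag.

```python
from bs4 import BeautifulSoup

compact = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "html.parser")
first = compact.ul.li                  # tag-name navigation returns the first <li>
second = first.next_sibling            # the next node at the same level
before_first = first.previous_sibling  # nothing precedes the first child

# With line breaks in the markup, text nodes sit between the tags.
spaced = BeautifulSoup("<ul>\n<li>a</li>\n<li>b</li>\n</ul>", "html.parser")
raw_next = spaced.ul.li.next_sibling            # the "\n" text node
next_tag = spaced.ul.li.find_next_sibling("li")  # skips text nodes

print(second, repr(raw_next), next_tag)
```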

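1. Pass a string
soup.find_all('a')
2. Regular expression (requires import re)
soup.find_all(re.compile("^b"))
3. List
soup.find_all(["a", "b"])  # returns tags that match either a or b
4. Function
def func(tag):
    return tag.has_attr("id")  # example predicate; tags for which func returns True are kept
soup.find_all(func)
5. Attributes
soup.find_all(href=re.compile("elsie"), id='link1')
# both filters must match at the same time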

6. Searching by CSS class
soup.find_all("a", attrs={"class": "sister"})
css_soup.find_all("p", class_="body strikeout")  # the class order must match the document exactly

7. Searching by text
soup.find_all(string="Elsie")  # the string argument was named text before bs4 4.4.0
Combined with a tag name
soup.find_all("a", string="Elsie")
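
The seven filter types can be tried together in one sketch. The document is a shortened, made-up variant of the "three sisters" sample, has_id is a hypothetical predicate, and string= assumes bs4 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_name = soup.find_all("a")                                   # 1. string
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]   # 2. regex on tag names
by_list = [t.name for t in soup.find_all(["a", "b"])]          # 3. list: a or b


def has_id(tag):
    """Hypothetical predicate: keep tags that carry an id attribute."""
    return tag.has_attr("id")


by_func = soup.find_all(has_id)                                       # 4. function
by_attrs = soup.find_all(href=re.compile("elsie"), id="link1")        # 5. attributes
by_class = soup.find_all("a", class_="sister")                        # 6. CSS class
by_string = soup.find_all(string="Elsie")                             # 7. text nodes
by_both = soup.find_all("a", string="Elsie")                          # tag plus text

# A multi-class string must match the attribute value exactly, order included.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
same_order = css_soup.find_all("p", class_="body strikeout")
swapped = css_soup.find_all("p", class_="strikeout body")
single = css_soup.find_all("p", class_="strikeout")

print(by_regex, by_list, by_string)
```

Searching for a single class such as class_="strikeout" is usually the more robust choice, because it does not depend on attribute order.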

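limit: return at most 2 results
soup.find_all("a", limit=2)

recursive: controls whether the search recurses into descendants
soup.html.find_all("title", recursive=False)  # searches only the direct children of the current tag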
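
A sketch of limit and recursive on made-up markup. <title> sits inside <head>, so it is a grandchild of <html> and is not found when recursive=False.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>T</title></head>"
    "<body><a>1</a><a>2</a><a>3</a></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

first_two = soup.find_all("a", limit=2)                     # stops after two matches
deep = soup.html.find_all("title")                          # searches all descendants
shallow = soup.html.find_all("title", recursive=False)      # direct children only

print([a.get_text() for a in first_two], len(deep), len(shallow))
```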

Automatic encoding detection
from bs4 import UnicodeDammit
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")  # pass bytes, not str, on Python 3
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'utf-8'
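
When the likely encodings are known in advance, they can be handed to UnicodeDammit as a list of candidates to try first. A minimal sketch, assuming Latin-1 input bytes made up for illustration:

```python
from bs4 import UnicodeDammit

# The single byte E9 is é in Latin-1; it is not valid UTF-8 on its own.
dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])

print(dammit.unicode_markup)     # the decoded text
print(dammit.original_encoding)  # which candidate was used
```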

Reposted from: https://www.cnblogs.com/newpython/p/6308987.html
