Extracting HTML body text to plain text with Python: methods for extracting text from HTML

Published: 2023/12/3 python 41 豆豆


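When working on NLP problems, you sometimes need to obtain a large corpus of text. The internet is the largest source of text, but unfortunately, extracting text from arbitrary HTML pages is a hard and painful task.

Suppose we need to extract the full text from a variety of web pages and strip all HTML tags. The usual default is the get_text method of the BeautifulSoup package, used here with lxml as the underlying parser. It is a well-tested solution, but it can be very slow when you process hundreds of thousands of HTML documents.

By replacing BeautifulSoup with selectolax, you can get a 5-30x speedup almost for free!

Here is a simple benchmark that parses 10,000 HTML pages from commoncrawl (https://commoncrawl.org/):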

# coding: utf-8

from time import time

import warc
from bs4 import BeautifulSoup
from selectolax.parser import HTMLParser


def get_text_bs(html):
    # Parse with BeautifulSoup, using lxml as the underlying parser.
    tree = BeautifulSoup(html, 'lxml')

    body = tree.body
    if body is None:
        return None

    # Remove script and style elements so their contents do not end up in the text.
    for tag in body.select('script'):
        tag.decompose()
    for tag in body.select('style'):
        tag.decompose()

    text = body.get_text(separator='\n')
    return text


def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    # Same cleanup as above, using selectolax's CSS selector API.
    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='\n')
    return text


def read_doc(record, parser=get_text_selectolax):
    url = record.url
    text = None

    if url:
        payload = record.payload.read()
        # Split the HTTP headers from the response body at the first blank line.
        header, html = payload.split(b'\r\n\r\n', maxsplit=1)
        html = html.strip()

        if len(html) > 0:
            text = parser(html)

    return url, text


def process_warc(file_name, parser, limit=10000):
    warc_file = warc.open(file_name, 'rb')
    t0 = time()
    n_documents = 0
    for i, record in enumerate(warc_file):
        url, doc = read_doc(record, parser)

        # Skip records without a URL or without extractable text.
        if not doc or not url:
            continue

        n_documents += 1

        if i > limit:
            break

    warc_file.close()
    print('Parser: %s' % parser.__name__)
    print('Parsing took %s seconds and produced %s documents\n' % (time() - t0, n_documents))

>>> ! wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz
>>> file_name = "CC-MAIN-20180116070444-20180116090444-00000.warc.gz"
>>> process_warc(file_name, get_text_selectolax, 10000)
Parser: get_text_selectolax
Parsing took 16.170367002487183 seconds and produced 3317 documents

>>> process_warc(file_name, get_text_bs, 10000)
Parser: get_text_bs
Parsing took 432.6902508735657 seconds and produced 3283 documents

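Obviously, this is not the best way to benchmark something, but it gives an idea of the difference: in this run selectolax was about 27 times faster than BeautifulSoup with lxml (16.2 seconds versus 432.7 seconds), which fits the claim that it is sometimes up to 30 times faster.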
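One fragile step in the benchmark is the line in read_doc that splits the payload on the first blank line: if a record has no header/body separator, unpacking the result of split raises ValueError. The following is a minimal sketch of a more defensive split using only the standard library. The helper name split_http_payload is hypothetical and is not part of the original benchmark.

```python
def split_http_payload(payload: bytes):
    """Split a raw HTTP response into (headers, body).

    Hypothetical helper: returns (payload, b"") when no blank-line
    separator is present instead of raising ValueError.
    """
    headers, separator, body = payload.partition(b"\r\n\r\n")
    if not separator:
        # No header/body boundary found; treat the body as empty.
        return payload, b""
    return headers, body.strip()


raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>Hello</p>\n"
headers, body = split_http_payload(raw)
print(body)  # b'<p>Hello</p>'
```

With an empty body, the existing len(html) > 0 check in read_doc would simply skip the record.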

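selectolax is well suited to stripping HTML down to plain text. I have 10,000+ HTML snippets that I need to index into Elasticsearch as plain text. (Elasticsearch has an html_strip character filter, but it is not what I want or need in this context.) It turns out that stripping HTML to plain text at this scale is actually quite expensive. So, what is the most efficient way to do it?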

PyQuery

from pyquery import PyQuery as pq

text = pq(html).text()

selectolax

from selectolax.parser import HTMLParser

text = HTMLParser(html).text()

Regular expression

import re

# Non-greedy match of anything between angle brackets, i.e. a single tag.
clean_regex = re.compile(r'<.*?>')
text = clean_regex.sub('', html)

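Results

I wrote a script to measure the time. It iterates over 10,000 files that contain HTML snippets. Note! These snippets are not complete documents (with <head> and <body> etc.), just small blobs of HTML. The average size is 10,314 bytes (median 5,138 bytes). The results are as follows: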

pyquery
  SUM:    18.61 seconds
  MEAN:   1.8633 ms
  MEDIAN: 1.0554 ms

selectolax
  SUM:    3.08 seconds
  MEAN:   0.3149 ms
  MEDIAN: 0.1621 ms

regex
  SUM:    1.64 seconds
  MEAN:   0.1613 ms
  MEDIAN: 0.0881 ms

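I have run this many times and the results are quite stable. The point is: selectolax is about 6 times faster than PyQuery (3.08 seconds versus 18.61 seconds in total).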
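The timing script itself is not shown in the article. As a rough sketch of how SUM, MEAN and MEDIAN figures like those above could be collected, the following uses only the standard library. The function names benchmark and report, the regex stripper used as the function under test, and the in-memory snippet list are all assumptions for illustration; the original run read 10,000 files from disk.

```python
import re
import statistics
import time

TAG_RE = re.compile(r"<.*?>")


def strip_tags(markup):
    """Stand-in function under test; swap in any HTML-to-text function."""
    return TAG_RE.sub("", markup)


def benchmark(func, snippets):
    """Time func on each snippet; return (sum in s, mean in ms, median in ms)."""
    timings = []
    for snippet in snippets:
        t0 = time.perf_counter()
        func(snippet)
        timings.append(time.perf_counter() - t0)
    return (
        sum(timings),
        statistics.mean(timings) * 1000,
        statistics.median(timings) * 1000,
    )


def report(name, func, snippets):
    """Print the three figures in the same layout as the results above."""
    total, mean_ms, median_ms = benchmark(func, snippets)
    print(name)
    print("SUM:    %.2f seconds" % total)
    print("MEAN:   %.4f ms" % mean_ms)
    print("MEDIAN: %.4f ms" % median_ms)


# Hypothetical stand-in data, not the author's snippets.
snippets = ["<p>Foo <b>bar</b></p>"] * 1000
report("regex", strip_tags, snippets)
```

Timing each call separately is what makes the median available; a single timer around the whole loop would only give the sum.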

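Are regular expressions good enough? Really?

For the most basic HTML blobs, a regular expression probably works just fine. However, if the HTML is <p>Foo &amp; Bar</p>, I expect the plain-text conversion to be Foo & Bar, not Foo &amp; Bar.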
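The entity problem can be reproduced with the Foo & Bar example above. This sketch strips tags with the same kind of regular expression and then, as one possible patch that the article does not use, decodes entities with html.unescape from the standard library. The function names are hypothetical.

```python
import html
import re

# Non-greedy tag matcher; "." does not match newlines, so a tag that
# spans several lines would be missed.
TAG_RE = re.compile(r"<.*?>")


def strip_tags_regex(markup):
    """Remove tags only; character references are left untouched."""
    return TAG_RE.sub("", markup)


def strip_tags_and_unescape(markup):
    """Remove tags, then decode references such as &amp; into characters."""
    return html.unescape(strip_tags_regex(markup))


snippet = "<p>Foo &amp; Bar</p>"
print(strip_tags_regex(snippet))         # Foo &amp; Bar
print(strip_tags_and_unescape(snippet))  # Foo & Bar
```

Decoding entities fixes this one case, but it does nothing about removing whole elements together with their content.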

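More importantly, PyQuery and selectolax support something very specific that matters for my use case: before going any further, I need to remove certain tags (and their content). For example:

This should get stripped.

Please keep.

This should also get stripped.

A simple tag-stripping regular expression cannot do this.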
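To make the limitation concrete, here is a small sketch that runs a tag-stripping regular expression over markup carrying the three example lines above. The markup itself is hypothetical, since the original tags did not survive in this copy of the article; the class name and the inline style are assumptions chosen for illustration.

```python
import re

TAG_RE = re.compile(r"<.*?>")

# Hypothetical markup: the first and last elements are meant to be
# dropped together with their content.
fragment = (
    '<div class="warning">This should get stripped.</div>\n'
    '<p>Please keep.</p>\n'
    '<div style="display: none">This should also get stripped.</div>'
)

text = TAG_RE.sub("", fragment)
print(text)
# This should get stripped.
# Please keep.
# This should also get stripped.
```

The regular expression removes only the tags, so the text of the unwanted elements is still there. Dropping an element with its content requires knowing where the element ends, which is what a real parser provides.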

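Version 2.0

My requirements may change, but basically I want to remove certain tags. For example: <div class="warning">, <div class="hidden"> and <div style="display: none">. So let's implement that: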

PyQuery

import re

from pyquery import PyQuery as pq

_display_none_regex = re.compile(r'display:\s*none')

doc = pq(html)
# Drop elements that are marked by class.
doc.remove('div.warning, div.hidden')
# Drop elements hidden with an inline "display: none" style.
for div in doc('div[style]').items():
    style_value = div.attr('style')
    if _display_none_regex.search(style_value):
        div.remove()
text = doc.text()

selectolax

import re

from selectolax.parser import HTMLParser

_display_none_regex = re.compile(r'display:\s*none')

tree = HTMLParser(html)
# Drop elements that are marked by class.
for tag in tree.css('div.warning, div.hidden'):
    tag.decompose()
# Drop elements hidden with an inline "display: none" style.
for tag in tree.css('div[style]'):
    style_value = tag.attributes['style']
    if style_value and _display_none_regex.search(style_value):
        tag.decompose()
text = tree.body.text()

This actually works. When I run the same benchmark over the 10,000 snippets again, the new results are:

pyquery
  SUM:    21.70 seconds
  MEAN:   2.1701 ms
  MEDIAN: 1.3989 ms

selectolax
  SUM:    3.59 seconds
  MEAN:   0.3589 ms
  MEDIAN: 0.2184 ms

regex
  Skipped

Again, selectolax beats PyQuery by a factor of about 6.

Conclusion

Regular expressions are fast but weak. The efficiency of selectolax is impressive.
