Website Data Cleaning in Python for NLP
The most important step of any data-driven project is obtaining quality data. Without proper preprocessing, the results of a project can easily be biased or completely misunderstood. Here, we will focus on cleaning data composed of scraped web pages.
Obtaining the data
There are many tools for scraping the web. If you are looking for something quick and simple, the URL handling module in Python called urllib might do the trick for you. Otherwise, I recommend Scrapy because of its customizability and robustness.
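As a minimal sketch of the urllib route (standard library only; the commented-out URL is a placeholder, and the User-Agent string is just an illustrative choice):

```python
from urllib.request import Request, urlopen

def fetch_page(url):
    # A User-Agent header avoids trivial bot blocking on some sites.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=10) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

# html = fetch_page("https://example.com")  # placeholder URL
```

For anything beyond a handful of pages, you would also want retries, rate limiting, and error handling, which is where a full framework earns its keep.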
It is important to ensure that the pages you are scraping contain rich text data that is suitable for your use case.
From HTML to text
Once we have obtained our scraped web pages, we begin by extracting the text out of each web page. Websites have lots of tags that don’t contain useful information when it comes to NLP, such as <script> and <button>. Thankfully, there is a Python module called boilerpy3 that makes text extraction easy.
We use the ArticleExtractor to extract the text. This extractor has been tuned for news articles and works well for most HTML pages. You can try out the other extractors listed in the boilerpy3 documentation and see what works best for your dataset.
Next, we condense all newline characters (\n and \r) into one \n character. This is done so that when we split the text up into sentences by \n and periods, we don’t get sentences with no words.
```python
import os
import re

from boilerpy3 import extractors

# Condenses all repeating newline characters into one single newline character
def condense_newline(text):
    return '\n'.join([p for p in re.split('\n|\r', text) if len(p) > 0])

# Returns the text from an HTML file
def parse_html(html_path):
    # Text extraction with boilerpy3
    html_extractor = extractors.ArticleExtractor()
    return condense_newline(html_extractor.get_content_from_file(html_path))

# Extracts the text from all HTML files in a specified directory
def html_to_text(folder):
    parsed_texts = []
    filepaths = os.listdir(folder)
    for filepath in filepaths:
        filepath_full = os.path.join(folder, filepath)
        if filepath_full.endswith(".html"):
            parsed_texts.append(parse_html(filepath_full))
    return parsed_texts

# Your directory to the folder with scraped websites
scraped_dir = './scraped_pages'
parsed_texts = html_to_text(scraped_dir)
```

If the extractors from boilerpy3 are not working for your web pages, you can use beautifulsoup to build your own custom text extractor. Below is an example replacement of the parse_html method.
```python
from bs4 import BeautifulSoup

# Returns the text from an HTML file based on specified tags
def parse_html(html_path):
    with open(html_path, 'r') as fr:
        html_content = fr.read()
    soup = BeautifulSoup(html_content, 'html.parser')

    # Check that the file is valid HTML
    if not soup.find():
        raise ValueError("File is not a valid HTML file")

    # Check the language of the file
    tag_meta_language = soup.head.find("meta", attrs={"http-equiv": "content-language"})
    if tag_meta_language:
        document_language = tag_meta_language["content"]
        if document_language and document_language not in ["en", "en-us", "en-US"]:
            raise ValueError("Language {} is not english".format(document_language))

    # Get text from the specified tags. Add more tags if necessary.
    TAGS = ['p']
    return ' '.join([condense_newline(tag.text) for tag in soup.findAll(TAGS)])
```

Large N-Gram Cleaning
Once the text has been extracted, we want to continue with the cleaning process. It is common for web pages to contain repeated information, especially if you scrape multiple articles from the same domain. Elements such as website titles, company slogans, and page footers can be present in your parsed text. To detect and remove these phrases, we analyze our corpus by looking at the frequency of large n-grams.
N-grams are a concept from NLP where a “gram” is a contiguous sequence of words from a body of text, and “N” is the length of these sequences. They are frequently used to build language models, which can assist in tasks ranging from text summarization to word prediction. Below is an example for trigrams (3-grams):
```python
input = 'It is quite sunny today.'
output = ['It is quite', 'is quite sunny', 'quite sunny today.']
```
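A minimal sketch of how these windows are produced (plain Python, no NLP library required; tokenization here is a naive whitespace split):

```python
def make_ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'It is quite sunny today.'.split()
print(make_ngrams(tokens, 3))
# → ['It is quite', 'is quite sunny', 'quite sunny today.']
```

In practice we use nltk's ngrams helper for this, as shown later, but the underlying idea is just a sliding window.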
When we read articles, there are many single words (unigrams) that are repeated, such as “the” and “a”. However, as we increase our n-gram size, the probability of the n-gram repeating decreases. Trigrams start to become more rare, and it is almost impossible for the articles to contain the same sequence of 20 words. By searching for large n-grams that occur frequently, we are able to detect the repeated elements across websites in our corpus, and manually filter them out.
We begin this process by breaking the dataset up into sentences, splitting the text chunks at newline characters and periods. Next, we tokenize our sentences (break each sentence into individual word strings). With these tokenized sentences, we can generate n-grams of a specific size (we want to start large, around 15). We then sort the n-grams by frequency using the FreqDist function provided by nltk. Once we have our frequency dictionary, we print the top 10 n-grams. If an n-gram's frequency is higher than 1 or 2, the corresponding sentence might be something you would consider removing from the corpus. To remove a sentence, copy the entire sentence and add it as a single string to the filter_strs array. You can capture the entire sentence by increasing the n-gram size until the whole sentence appears as one n-gram on the console, or by simply printing parsed_texts and searching for it. If there are multiple unwanted sentences with slightly different wording, you can copy their common substring into filter_strs, and the regular expression will filter out all sentences containing that substring.
```python
import nltk
nltk.download('punkt')

import matplotlib.pyplot as plt
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

# Helper method for generating n-grams
def extract_ngrams_sentences(sentences, num):
    all_grams = []
    for sentence in sentences:
        n_grams = ngrams(sentence, num)
        all_grams += [' '.join(grams) for grams in n_grams]
    return all_grams

# Splits text up by newline and period
def split_by_newline_and_period(pages):
    sentences = []
    for page in pages:
        sentences += re.split('\n|\. ', page)
    return sentences

# Break the dataset up into sentences, split by newline characters and periods
sentences = split_by_newline_and_period(parsed_texts)

# Add unwanted strings into this array
filter_strs = []

# Filter out unwanted strings
sentences = [x for x in sentences
             if not any([re.search(filter_str, x, re.IGNORECASE)
                         for filter_str in filter_strs])]

# Tokenize the sentences
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

# Adjust NGRAM_SIZE to capture unwanted phrases
NGRAM_SIZE = 15
ngrams_all = extract_ngrams_sentences(tokenized_sentences, NGRAM_SIZE)

# Sort the n-grams by most common
n_gram_all = nltk.FreqDist(ngrams_all).most_common()

# Print out the top 10 most common n-grams
print(f'{NGRAM_SIZE}-Gram Frequencies')
for gram, count in n_gram_all[:10]:
    print(f'{count}\t"{gram}"')

# Plot the distribution of n-grams
plt.plot([count for _, count in n_gram_all])
plt.xlabel('n-gram')
plt.ylabel('frequency')
plt.title(f'{NGRAM_SIZE}-Gram Frequencies')
plt.show()
```

If you run the code above on your dataset without adding any filters to filter_strs, you might get a graph similar to the one below. In my dataset, you can see that there are several 15-grams that are repeated 6, 3, and 2 times.
Once we go through the process of populating filter_strs with unwanted sentences, our plot of 15-grams flattens out.
Keep in mind there is no optimal threshold for n-gram size and frequency that determines whether or not a sentence should be removed, so play around with these two parameters. Sometimes you will need to lower the n-gram size to 3 or 4 to pick up a repeated title, but be careful not to remove valuable data. This block of code is designed to be an iterative process, where you slowly build the filter_strs array after many different experiments.
Punctuation, Capitalization, and Tokenization
After we clean the corpus, the next step is to process the words of our corpus. We want to remove punctuation, lowercase all words, and break each sentence up into arrays of individual words (tokenization). To do this, I like to use the simple_preprocess library method from gensim. This function accomplishes all three of these tasks in one go and has a few parameters that allow some customization. By setting deacc=True, punctuation will be removed. When punctuation is removed, the punctuation itself is treated as a space, and the two substrings on each side of the punctuation are treated as two separate words. In most cases, one of the resulting substrings has a length of one; for example, “don’t” ends up as “don” and “t”. Since the default min_len value is 2, words with one letter are not kept.

If this is not suitable for your use case, you can also create a text processor from scratch. Python’s string class contains a punctuation attribute that lists all commonly used punctuation. Using this set of punctuation marks, you can use str.maketrans to remove all punctuation from a string while keeping words that contained punctuation as one single word (“don’t” becomes “dont”). Keep in mind this does not capture punctuation as thoroughly as gensim’s simple_preprocess. For example, there are three types of dashes (the em dash ‘—’, the en dash ‘–’, and the hyphen ‘-’); simple_preprocess removes them all, but string.punctuation does not contain the em dash and therefore does not remove it.
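The str.maketrans behavior described above can be checked directly with the standard library (the helper name strip_punct is just for illustration):

```python
import string

def strip_punct(text):
    # Deletes every character listed in string.punctuation (ASCII only).
    return text.translate(str.maketrans('', '', string.punctuation))

print(strip_punct("don't"))        # dont
print('—' in string.punctuation)   # False: the em dash survives this filter
```

This is why the manual approach keeps contractions intact as one word, while non-ASCII punctuation such as the em dash slips through unless you extend the character set yourself.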
```python
import gensim
import string

# Uses gensim to process the sentences
def sentence_to_words(sentences):
    for sentence in sentences:
        sentence_tokenized = gensim.utils.simple_preprocess(sentence,
                                                            deacc=True,
                                                            min_len=2,
                                                            max_len=15)
        # Make sure we don't yield empty arrays
        if len(sentence_tokenized) > 0:
            yield sentence_tokenized

# Process the sentences manually
def sentence_to_words_from_scratch(sentences):
    for sentence in sentences:
        sentence_tokenized = [token.lower() for token in word_tokenize(
            sentence.translate(str.maketrans('', '', string.punctuation)))]
        # Make sure we don't yield empty arrays
        if len(sentence_tokenized) > 0:
            yield sentence_tokenized

sentences = list(sentence_to_words(sentences))
```

Stop Words
Once we have our corpus nicely tokenized, we will remove all stop words from the corpus. Stop words are words that don’t provide much additional meaning to a sentence; English examples include “the”, “a”, and “in”. nltk contains a list of English stopwords, so we use that to filter our lists of tokens.
Lemmatization and Stemming
Lemmatization is the process of grouping together different forms of the same word and replacing these instances with the word’s lemma (dictionary form). For example, “functions” is reduced to “function”. Stemming is the process of reducing a word to its root word (without any suffixes or prefixes). For example, “running” is reduced to “run”. These two steps decrease the vocabulary size, making it easier for the machine to understand our corpus.
```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer

nltk.download('stopwords')
nltk.download('wordnet')

# Remove all stopwords
stop_words = stopwords.words('english')
def remove_stopwords(tokenized_sentences):
    for sentence in tokenized_sentences:
        yield([token for token in sentence if token not in stop_words])

# Lemmatize all words
wordnet_lemmatizer = WordNetLemmatizer()
def lemmatize_words(tokenized_sentences):
    for sentence in tokenized_sentences:
        yield([wordnet_lemmatizer.lemmatize(token) for token in sentence])

# Stem all words
snowball_stemmer = SnowballStemmer('english')
def stem_words(tokenized_sentences):
    for sentence in tokenized_sentences:
        yield([snowball_stemmer.stem(token) for token in sentence])

sentences = list(remove_stopwords(sentences))
sentences = list(lemmatize_words(sentences))
sentences = list(stem_words(sentences))
```

Now that you know how to extract and preprocess your text data, you can begin the data analysis. Best of luck with your NLP adventures!
Notes
- If you are tagging the corpus with part-of-speech tags, stop words should be kept in the dataset and lemmatization should not be done prior to tagging.
The GitHub repository for the Jupyter Notebook can be found here.
Translated from: https://towardsdatascience.com/website-data-cleaning-in-python-for-nlp-dda282a7a871