Artificial Intelligence: Python Implementation, Chapter 10, NLP Day 4: A Bag of Words

Published: 2023/12/20

Extracting frequent terms using a bag-of-words model

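One of the main goals of text analysis is to convert text into numerical form so that machine learning can be applied to it. Consider millions of text documents: to analyze them, we need to extract the text and convert it into a numerical representation.

Machine learning algorithms need numerical data to work with so that they can analyze it and extract useful information. The bag-of-words model extracts a vocabulary from all the words in the documents and builds a model using a document-term matrix. This lets us represent every document as a bag of words. We only keep track of word counts; grammar and word order are ignored.

So what does a document-term matrix look like? It is a table that records how many times each word occurs in each document. A document can therefore be represented as a weighted combination of words. We can set thresholds to keep only the more meaningful words. In effect, we build a histogram of all the words in a document, and that histogram is the feature vector. This feature vector is then used for text classification.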

Consider the following sentences:

Sentence 1: the children are playing in the hall
Sentence 2: The hall has a lot of space
Sentence 3: Lots of children like playing in an open space

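Looking at these three sentences, ignoring case and treating "Lots" as "lot", we get the following 14 unique words:

the, children, are, playing, in, hall, has, a, lot, of, space, like, an, open

We can build a histogram for each sentence from the number of times each word occurs in it. Each feature vector has 14 dimensions because there are 14 distinct words: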

Sentence 1: [2 1 1 1 1 1 0 0 0 0 0 0 0 0]

Sentence 2: [1 0 0 0 0 1 1 1 1 1 1 0 0 0]

Sentence 3: [0 1 0 1 1 0 0 0 1 1 1 1 1 1]
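The three vectors above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the names `NORMALIZE`, `tokenize`, and `to_vector` are made up for illustration, and the one-entry normalization table that maps "lots" to "lot" is an assumption standing in for a real stemmer or lemmatizer.

```python
from collections import Counter

sentences = [
    "the children are playing in the hall",
    "The hall has a lot of space",
    "Lots of children like playing in an open space",
]

# Assumption: a tiny hand-written table so that "Lots" is counted as "lot".
# A real pipeline would use a stemmer or lemmatizer instead.
NORMALIZE = {"lots": "lot"}


def tokenize(sentence):
    """Lowercase, split on whitespace, and apply the normalization table."""
    return [NORMALIZE.get(word, word) for word in sentence.lower().split()]


# Build the vocabulary in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in tokenize(sentence):
        if word not in vocabulary:
            vocabulary.append(word)


def to_vector(sentence):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]


vectors = [to_vector(sentence) for sentence in sentences]
for i, vector in enumerate(vectors, start=1):
    print(f"Sentence {i}: {vector}")
```

Because the vocabulary is built in order of first appearance, its word order matches the list of 14 words given earlier.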

Now that we have extracted these feature vectors, we can analyze the data with machine learning algorithms.

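How do we build a bag-of-words model in practice? We will use NLTK for the corpus and scikit-learn's CountVectorizer for the counting. Create a new Python program and import the following packages:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.corpus import brown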

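Read text from the Brown corpus. We will read 5,400 words here; you can use as many as you like.

    # Read the data from the Brown corpus
    # (the corpus must be installed first, e.g. via nltk.download('brown'))
    input_data = ' '.join(brown.words()[:5400])

Define the number of words in each chunk:

    # Number of words in each chunk
    chunk_size = 800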

Define the chunking function:

    # Split the input text into chunks of N words each
    def chunker(input_data, N):
        input_words = input_data.split(' ')
        output = []
        cur_chunk = []
        count = 0
        for word in input_words:
            cur_chunk.append(word)
            count += 1
            if count == N:
                output.append(' '.join(cur_chunk))
                count, cur_chunk = 0, []
        # Keep the leftover words, but do not emit an empty final chunk
        if cur_chunk:
            output.append(' '.join(cur_chunk))
        return output
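To see what chunking does on a small input, here is a compact sketch of the same idea using list slicing. The name `chunk_words` is hypothetical, and unlike the function above it splits on any whitespace rather than on single spaces, so treat it as an illustration rather than a drop-in replacement.

```python
def chunk_words(text, n):
    """Split text into chunks of n whitespace-separated words each."""
    words = text.split()
    # Step through the word list n words at a time; the last slice may be shorter.
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]


sample = "one two three four five six seven"
sample_chunks = chunk_words(sample, 3)
print(sample_chunks)
```

The last chunk holds whatever words are left over, so seven words in chunks of three give two full chunks and one chunk with a single word.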

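Split the input text into chunks:

    text_chunks = chunker(input_data, chunk_size)

Convert the chunks into dictionary items:

    # Convert to dict items
    chunks = []
    for count, chunk in enumerate(text_chunks):
        d = {'index': count, 'text': chunk}
        chunks.append(d)

Extract the document-term matrix, which holds the word counts. We use CountVectorizer for this and pass it two parameters: min_df, the minimum document frequency, and max_df, the maximum document frequency. Document frequency here means the number of documents (chunks, in our case) in which a word appears.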

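max_df: either a float in the range [0.0, 1.0] or an int; the default is 1.0. It acts as a threshold when the vocabulary is built: a word whose document frequency is higher than max_df is left out. A float is interpreted as a proportion of the documents in the corpus, while an int is an absolute number of documents. This parameter is ignored if a vocabulary is passed in explicitly.

min_df: works like max_df, except that a word whose document frequency is lower than min_df is left out.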
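The thresholding rule described for min_df and max_df can be sketched in plain Python. This is a hand-rolled illustration, not scikit-learn's own code: the function name `filter_by_document_frequency` and the three toy documents are made up for the example.

```python
from numbers import Integral


def filter_by_document_frequency(documents, min_df=1, max_df=1.0):
    """Return the sorted words whose document frequency lies within the thresholds."""
    n_docs = len(documents)

    # Document frequency: the number of documents that contain each word.
    doc_freq = {}
    for doc in documents:
        for word in set(doc.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    # An int is an absolute document count; a float is a proportion of n_docs.
    low = min_df if isinstance(min_df, Integral) else min_df * n_docs
    high = max_df if isinstance(max_df, Integral) else max_df * n_docs

    return sorted(word for word, freq in doc_freq.items() if low <= freq <= high)


docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
]

kept_default = filter_by_document_frequency(docs)
kept_common = filter_by_document_frequency(docs, min_df=2)
kept_rare = filter_by_document_frequency(docs, max_df=0.5)

print(kept_default)
print(kept_common)
print(kept_rare)
```

With the defaults every word is kept. Raising min_df to 2 keeps only words found in at least two documents, while max_df=0.5 drops every word that appears in more than half of them.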

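    # Extract the document term matrix
    # 5400 words in chunks of 800 give 7 chunks, so min_df=7 keeps only
    # words that occur in every chunk
    count_vectorizer = CountVectorizer(min_df=7, max_df=20)
    document_term_matrix = count_vectorizer.fit_transform([chunk['text'] for chunk in chunks])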

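Extract the vocabulary and display it. The vocabulary is the list of distinct words extracted in the previous step.

    # Extract the vocabulary and display it
    # (get_feature_names_out() replaces get_feature_names(), which was
    # removed in scikit-learn 1.2)
    vocabulary = np.array(count_vectorizer.get_feature_names_out())
    print("\nVocabulary:\n", vocabulary)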

Generate display names for the chunks:

    # Generate names for chunks
    chunk_names = []
    for i in range(len(text_chunks)):
        chunk_names.append('Chunk-' + str(i + 1))

Print the document-term matrix:

    # Print the document term matrix
    print("\nDocument term matrix:")
    formatted_text = '{:>12}' * (len(chunk_names) + 1)
    print('\n', formatted_text.format('Word', *chunk_names), '\n')
    # Convert to a dense array so that zero counts keep their column position
    for word, counts in zip(vocabulary, document_term_matrix.T.toarray()):
        output = [word] + [str(freq) for freq in counts]
        print(formatted_text.format(*output))
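The table layout relies on repeating the format specifier '{:>12}' once per column, which right-aligns every cell in a field 12 characters wide. The sketch below isolates that step. The helper name `format_table` is hypothetical, and the counts are invented numbers used only to show the layout, not results from the Brown corpus.

```python
def format_table(row_label, column_names, rows, width=12):
    """Return the table as a list of right-aligned text lines."""
    # One '{:>width}' field for the row label plus one per column.
    template = ('{:>' + str(width) + '}') * (len(column_names) + 1)
    lines = [template.format(row_label, *column_names)]
    for word, counts in rows:
        lines.append(template.format(word, *[str(count) for count in counts]))
    return lines


# Invented counts, for illustration only.
table = format_table(
    'Word',
    ['Chunk-1', 'Chunk-2'],
    [('and', [23, 9]), ('the', [56, 0])],
)
print('\n'.join(table))
```

Every row has to supply one value per column, including zeros; otherwise the cells shift left and no longer line up with their chunk names.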

The complete code is as follows:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.corpus import brown

    # Split the input text into chunks of N words each
    def chunker(input_data, N):
        input_words = input_data.split(' ')
        output = []
        cur_chunk = []
        count = 0
        for word in input_words:
            cur_chunk.append(word)
            count += 1
            if count == N:
                output.append(' '.join(cur_chunk))
                count, cur_chunk = 0, []
        # Keep the leftover words, but do not emit an empty final chunk
        if cur_chunk:
            output.append(' '.join(cur_chunk))
        return output

    # Read the data from the Brown corpus
    # (the corpus must be installed first, e.g. via nltk.download('brown'))
    input_data = ' '.join(brown.words()[:5400])

    # Number of words in each chunk
    chunk_size = 800

    text_chunks = chunker(input_data, chunk_size)

    # Convert to dict items
    chunks = []
    for count, chunk in enumerate(text_chunks):
        d = {'index': count, 'text': chunk}
        chunks.append(d)

    # Extract the document term matrix
    count_vectorizer = CountVectorizer(min_df=7, max_df=20)
    document_term_matrix = count_vectorizer.fit_transform([chunk['text'] for chunk in chunks])

    # Extract the vocabulary and display it
    # (get_feature_names_out() replaces get_feature_names(), which was
    # removed in scikit-learn 1.2)
    vocabulary = np.array(count_vectorizer.get_feature_names_out())
    print("\nVocabulary:\n", vocabulary)

    # Generate names for chunks
    chunk_names = []
    for i in range(len(text_chunks)):
        chunk_names.append('Chunk-' + str(i + 1))

    # Print the document term matrix
    print("\nDocument term matrix:")
    formatted_text = '{:>12}' * (len(chunk_names) + 1)
    print('\n', formatted_text.format('Word', *chunk_names), '\n')
    # Convert to a dense array so that zero counts keep their column position
    for word, counts in zip(vocabulary, document_term_matrix.T.toarray()):
        output = [word] + [str(freq) for freq in counts]
        print(formatted_text.format(*output))

The output shows the full document-term matrix, with the number of times each word occurs in each chunk.
