Natural Language Processing Library: NLTK

Published: 2025/3/21


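NLTK (www.nltk.org) is one of the most frequently encountered packages for working with corpora, classifying text, and analyzing linguistic structure. It provides comprehensive, easy-to-use interfaces to a large collection of public datasets and models, covering tokenization, part-of-speech tagging (POS tagging), named entity recognition (NER), syntactic parsing, and other NLP tasks.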

1. Tokenization

(1) Sentence splitting

(2) Word tokenization

2. Processing tokens

(1) Removing punctuation

(2) Removing stop words

3. Lexicon normalization

(1) Lemmatization

(2) Stemming

4. Part-of-speech tagging

5. Finding synonyms

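NLTK modules and features:

1. Tokenization

Text is made up of paragraphs, paragraphs are made up of sentences, and sentences are made up of words. Tokenization is the first step in text analysis: it breaks a passage of text into smaller units (such as words or sentences), each of which is called a token. A token is a word when a sentence is being split, and a sentence when a paragraph is being split. NLTK supports both sentence splitting and word tokenization.

(1) Sentence splitting

Sentence splitting breaks a paragraph into sentences: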

    # Requires NLTK's Punkt tokenizer data to be downloaded beforehand.
    from nltk.tokenize import sent_tokenize

    text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.The sky is pinkish-blue. You shouldn't eat cardboard"""
    tokenized_text = sent_tokenize(text)
    print(tokenized_text)

    # Result:
    # ['Hello Mr. Smith, how are you doing today?',
    #  'The weather is great, and city is awesome.The sky is pinkish-blue.',
    #  "You shouldn't eat cardboard"]

(2) Word tokenization

Word tokenization breaks a sentence into words:

    import nltk

    sent = "I am almost dead this time"
    token = nltk.word_tokenize(sent)
    print(token)

    # Result: ['I', 'am', 'almost', 'dead', 'this', 'time']
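`word_tokenize` keeps punctuation marks as separate tokens and relies on downloaded tokenizer data. As a hedged alternative, the sketch below uses NLTK's `RegexpTokenizer`, which needs no extra data. The pattern `\w+` and the sample sentence are assumptions chosen for illustration; the pattern drops punctuation entirely, which may not suit every task.

```python
from nltk.tokenize import RegexpTokenizer

# Assumption: treating every run of word characters as one token is acceptable here.
tokenizer = RegexpTokenizer(r"\w+")

sent = "I am almost dead this time."
tokens = tokenizer.tokenize(sent)
print(tokens)  # ['I', 'am', 'almost', 'dead', 'this', 'time']
```

Note that this pattern also splits contractions at the apostrophe, so it is a rough substitute rather than a drop-in replacement for `word_tokenize`.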

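2. Processing tokens

Processing the tokens involves removing punctuation, removing stop words, and normalizing the vocabulary.

(1) Removing punctuation

Apply one of the approaches below to each token to handle punctuation. string.punctuation contains the ASCII punctuation characters. The first method replaces each punctuation character in a string with a space; the second drops tokens that are themselves punctuation marks.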

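    # Method 1: replace every punctuation character with a space
    import string

    s = 'abc.'
    s = s.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    print(s)  # 'abc ' (the period becomes a trailing space)

    # Method 2: drop tokens that are punctuation marks
    english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']
    text_list = ['Hello', ',', 'world', '!']  # example token list (assumed; not defined in the original)
    text_list = [word for word in text_list if word not in english_punctuations]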
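The first method above substitutes a space for each punctuation character, so `'abc.'` ends up with a trailing space. If the goal is to delete punctuation rather than replace it, the three-argument form of `str.maketrans` can be used instead. This is a minimal sketch; the helper name `strip_punctuation` and the sample tokens are hypothetical.

```python
import string

# Build the translation table once; the third argument lists characters to delete.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def strip_punctuation(token: str) -> str:
    """Return the token with all ASCII punctuation characters removed."""
    return token.translate(_DELETE_PUNCT)

tokens = ["Hello", "Mr.", "Smith", ",", "today", "?"]
# Tokens made up only of punctuation become empty strings and are dropped.
cleaned = [t for t in (strip_punctuation(tok) for tok in tokens) if t]
print(cleaned)  # ['Hello', 'Mr', 'Smith', 'today']
```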

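(2) Removing stop words

Stop words are noise words in a text that carry little meaning on their own. Common English stop words include is, am, are, this, a, an, and the. NLTK's corpora include a stop word list, and it is up to the user to remove those words from the token list.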

    import nltk
    nltk.download('stopwords')
    # Downloading package stopwords to C:\Users\Administrator\AppData\Roaming\nltk_data...
    # Unzipping corpora\stopwords.zip.

    from nltk.corpus import stopwords
    stop_words = stopwords.words("english")

    text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome."""
    word_tokens = nltk.tokenize.word_tokenize(text.strip())
    filtered_word = [w for w in word_tokens if w not in stop_words]

    # word_tokens: ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.']
    # filtered_word: ['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'The', 'weather', 'great', ',', 'city', 'awesome', '.']
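In the result above, `The` survives the filter because the comparison is case-sensitive and the stop word entries are lowercase. One common adjustment is to lowercase each token for the lookup and to keep the stop words in a set. The sketch below uses a small hand-written set as a stand-in (an assumption, so that it runs without downloading the corpus); in practice `set(stopwords.words("english"))` would take its place.

```python
# Assumption: a tiny stand-in for stopwords.words("english"), for illustration only.
stop_words = {"how", "are", "you", "doing", "the", "is", "and"}

word_tokens = ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
               'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.']

# Lowercase only for the lookup so the kept tokens retain their original casing.
filtered = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered)
```

Using a set also makes each membership test constant time on average, which matters on longer texts.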

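3. Lexicon normalization

Lexicon normalization converts the various derived forms of a word to a common base form. NLTK offers two approaches, shown here with the WordNet lemmatizer and the Porter stemmer.

(1) Lemmatization

Lemmatization uses a word's context and part of speech to resolve an inflected form to its base form, called the lemma. The result is a real word.

(2) Stemming

Stemming strips affixes from a word and returns the stem, which may not be a real word.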

    from nltk.stem.wordnet import WordNetLemmatizer  # or: from nltk.stem import WordNetLemmatizer
    lem = WordNetLemmatizer()  # lemmatization (requires the 'wordnet' corpus)

    from nltk.stem.porter import PorterStemmer  # or: from nltk.stem import PorterStemmer
    stem = PorterStemmer()  # stemming

    word = "flying"
    print("Lemmatized Word:", lem.lemmatize(word, "v"))
    print("Stemmed Word:", stem.stem(word))

    # Lemmatized Word: fly
    # Stemmed Word: fli
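To make the stemming behavior more concrete, here is a minimal sketch that runs the Porter stemmer over a few inflected forms. The word list is an assumption chosen for illustration; `PorterStemmer` works without any downloaded data.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Assumption: sample words picked to show that stems are not always dictionary words.
words = ["flying", "running", "studies"]
stems = {w: stemmer.stem(w) for w in words}

for word, stemmed in stems.items():
    print(f"{word} -> {stemmed}")
```

With NLTK's default settings this should yield `fli`, `run`, and `studi`: only the second is an ordinary English word, which is the trade-off stemming makes for speed and simplicity.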

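4. Part-of-speech tagging

The main goal of part-of-speech (POS) tagging is to identify the grammatical category of each word. The tagger looks at relationships within the sentence and assigns each word a corresponding tag.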

    import nltk  # the tokenizer and tagger data must be downloaded beforehand

    sent = "Albert Einstein was born in Ulm, Germany in 1879."
    tokens = nltk.word_tokenize(sent)
    tags = nltk.pos_tag(tokens)
    print(tags)

    # [('Albert', 'NNP'), ('Einstein', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'), ('Ulm', 'NNP'), (',', ','), ('Germany', 'NNP'), ('in', 'IN'), ('1879', 'CD'), ('.', '.')]
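The tags returned by `pos_tag` follow the Penn Treebank tag set, while `WordNetLemmatizer.lemmatize` expects a WordNet part-of-speech letter (`n`, `v`, `a`, or `r`). A common way to connect the two is a small mapping on the tag prefix. The helper `penn_to_wordnet` below is hypothetical, not part of NLTK, and falling back to noun for every other tag is an assumption.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # assumption: treat everything else as a noun

# A few tagged tokens copied from the example output above.
tags = [('Albert', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'), ('1879', 'CD')]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tags]
print(wordnet_pos)
# [('Albert', 'n'), ('was', 'v'), ('born', 'v'), ('in', 'n'), ('1879', 'n')]
```

Each pair can then be passed to the lemmatizer as `lem.lemmatize(word, pos)`, so that verbs are reduced as verbs rather than being treated as nouns by default.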

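5. Finding synonyms

To look up the synonym sets of a word, use synsets(); it accepts a pos parameter that restricts the lookup to a given part of speech. The WordNet interface is a semantically oriented English dictionary, similar to a traditional dictionary, and it is part of the NLTK corpora.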

    import nltk
    nltk.download('wordnet')
    # Downloading package wordnet to C:\Users\Administrator\AppData\Roaming\nltk_data...
    # Unzipping corpora\wordnet.zip.

    from nltk.corpus import wordnet

    word = wordnet.synsets('spectacular')
    print(word)
    # [Synset('spectacular.n.01'), Synset('dramatic.s.02'), Synset('spectacular.s.02'), Synset('outstanding.s.02')]

    print(word[0].definition())
    print(word[1].definition())
    print(word[2].definition())
    print(word[3].definition())

    # a lavishly produced performance
    # sensational in appearance or thrilling in effect
    # characteristic of spectacles or drama
    # having a quality that thrusts itself into attention
