1. Using the NLTK Toolkit

Published: 2024/10/8

@Author : By Runsen

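Contents

  • Natural language processing
  • Applications of natural language processing
  • NLTK
  • Installing the corpora
  • Understanding tokenization
  • Tagging text
  • Loading the built-in corpora
  • Tokenization (English only)
  • Stop words
  • Using the stop-word list
  • Filtering out stop words
  • Part-of-speech tagging
  • Chunking
  • Named entity recognition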

Natural language processing

Natural language processing (NLP) is an important area of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. NLP is a discipline that combines linguistics, computer science, and mathematics.

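Applications of natural language processing

  • Search engines, such as Google and Yahoo. A search engine can use NLP to learn that you are, say, a technology enthusiast, and then return technology-related results.
  • Social network feeds, such as Facebook's news feed. The feed algorithm uses NLP to learn your interests and shows you relevant ads and posts rather than unrelated ones.
  • Voice assistants, such as Apple's Siri.
  • Spam filters, such as Google's. These go beyond conventional spam filtering: modern filters analyze the content of an email to decide whether it is spam.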

NLTK

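NLTK is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to more than 50 corpora and lexical resources, such as WordNet, together with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. It is a well-known natural language processing library for Python that ships with its own corpora and with built-in part-of-speech tagging, classification, and tokenization. NLTK has been described as "a wonderful tool for teaching, and working in, computational linguistics using Python" and "an amazing library to play with natural language".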

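Installing the corpora

pip install nltk

Note that this installs only the library itself; no corpus data is included yet.

# Start a new IPython session and run:
import nltk
nltk.download()

In my experience, downloading the book and popular collections is enough.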


With everything installed, let's start playing with it.

Understanding tokenization

Tokenization splits a long sentence into small "meaningful" pieces. NLTK provides nltk.word_tokenize for this:

>>> import nltk
>>> sentence = "hello,,world"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['hello', ',', ',world']
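To make the idea of tokenization concrete, here is a minimal pure-Python sketch that separates words from punctuation with a regular expression. It is not the algorithm behind nltk.word_tokenize; the name simple_tokenize and the pattern are assumptions made for illustration, and the results differ from NLTK's on contractions and on inputs such as "hello,,world".

```python
import re

# Hypothetical pattern: a word (optionally with an apostrophe part, as in
# "didn't"), or any single character that is neither a word character nor
# whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("hello,,world"))
# ['hello', ',', ',', 'world']
print(simple_tokenize("Arthur didn't feel very good."))
# ['Arthur', "didn't", 'feel', 'very', 'good', '.']
```

Unlike this sketch, NLTK keeps ",world" together in the first case and splits "didn't" into "did" and "n't", as the outputs in this article show.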

Tagging text

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)  # tag each token with its part of speech
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]

Loading the built-in corpora

Tokenization (English only)

>>> from nltk.tokenize import word_tokenize
>>> from nltk.text import Text
>>> input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow."
>>> tokens = word_tokenize(input_str)
>>> tokens[:5]
['Today', "'s", 'weather', 'is', 'good']
>>> tokens = [word.lower() for word in tokens]  # lowercase every token
>>> tokens[:5]
['today', "'s", 'weather', 'is', 'good']

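Look up how often a word occurs and where it first appears:

>>> t = Text(tokens)
>>> t.count('good')
1
>>> t.index('good')
4

You can also plot the frequencies of the most common tokens, here the top 8 (this needs matplotlib):

t.plot(8)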
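The count, the position lookup, and the frequency ranking behind the plot can also be reproduced in plain Python, which helps show what the numbers mean. The sketch below hard-codes the lowercased tokens of the example sentence and uses collections.Counter instead of NLTK; it is an illustrative stand-in and makes no claim about NLTK's internals.

```python
from collections import Counter

# Lowercased tokens of the example sentence used above.
tokens = ["today", "'s", "weather", "is", "good", ",", "very", "windy", "and",
          "sunny", ",", "we", "have", "no", "classes", "in", "the",
          "afternoon", ",", "we", "have", "to", "play", "basketball",
          "tomorrow", "."]

good_count = tokens.count("good")   # number of occurrences of "good"
good_index = tokens.index("good")   # position of its first occurrence
frequencies = Counter(tokens)       # maps each token to its count
top_three = frequencies.most_common(3)

print(good_count, good_index)
print(top_three)
```

The most frequent token here is the comma, which is one reason punctuation and stop words are usually filtered out before counting.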

Stop words

from nltk.corpus import stopwords

stopwords.fileids()  # the available languages; as the output shows, Chinese is not among them
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']

# Look at the English stop words
stopwords.raw('english').replace('\n', ' ')  # the raw text has one word per line, so replace the newlines with spaces
"i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't "

Using the stop-word list

test_words = [word.lower() for word in tokens]  # tokens comes from the sentence above
test_words_set = set(test_words)  # deduplicate into a set
test_words_set.intersection(set(stopwords.words('english')))
{'and', 'have', 'in', 'is', 'no', 'the', 'to', 'very', 'we'}

So the sentence "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow." contains nine distinct stop words: 'and', 'have', 'in', 'is', 'no', 'the', 'to', 'very', 'we'.

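Filtering out stop words

english_stopwords = set(stopwords.words('english'))  # build the lookup set once
filtered = [w for w in test_words_set if w not in english_stopwords]
filtered  # the order may vary, because test_words_set is a set
['today', 'good', 'windy', 'sunny', 'afternoon', 'play', 'basketball', 'tomorrow', 'weather', 'classes', ',', '.', "'s"]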
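Because the filter above iterates over a set, the original word order and any repeated words are lost. If order matters, one option is to filter the token list itself. The sketch below does this in plain Python with a small hand-picked stop-word set (only the nine words found in the example sentence, not NLTK's full English list) and also drops punctuation tokens; the names SAMPLE_STOPWORDS and remove_stopwords are hypothetical.

```python
# Assumption: only the nine stop words found in the example sentence,
# not NLTK's full English list.
SAMPLE_STOPWORDS = {"and", "have", "in", "is", "no", "the", "to", "very", "we"}


def remove_stopwords(tokens, stop_words):
    """Keep alphabetic tokens that are not stop words, preserving order."""
    return [w for w in tokens if w.isalpha() and w not in stop_words]


tokens = ["today", "'s", "weather", "is", "good", ",", "very", "windy", "and",
          "sunny", ",", "we", "have", "no", "classes", "in", "the",
          "afternoon", ",", "we", "have", "to", "play", "basketball",
          "tomorrow", "."]

content_words = remove_stopwords(tokens, SAMPLE_STOPWORDS)
print(content_words)
# ['today', 'weather', 'good', 'windy', 'sunny', 'classes', 'afternoon',
#  'play', 'basketball', 'tomorrow']
```

With NLTK installed, set(stopwords.words('english')) can be passed as stop_words in place of the sample set.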

Part-of-speech tagging

from nltk import pos_tag
from nltk.tokenize import word_tokenize

tags = pos_tag(word_tokenize(input_str))  # tag the original-case tokens; tokens was lowercased above
tags
[('Today', 'NN'), ("'s", 'POS'), ('weather', 'NN'), ('is', 'VBZ'), ('good', 'JJ'), (',', ','),
 ('very', 'RB'), ('windy', 'JJ'), ('and', 'CC'), ('sunny', 'JJ'), (',', ','), ('we', 'PRP'),
 ('have', 'VBP'), ('no', 'DT'), ('classes', 'NNS'), ('in', 'IN'), ('the', 'DT'),
 ('afternoon', 'NN'), (',', ','), ('We', 'PRP'), ('have', 'VBP'), ('to', 'TO'), ('play', 'VB'),
 ('basketball', 'NN'), ('tomorrow', 'NN'), ('.', '.')]
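The tag abbreviations come from the Penn Treebank tag set. As a reading aid, the sketch below maps the tags that appear in the output above to short descriptions and groups the words by tag. The glossary covers only these tags, and the names TAG_GLOSSARY and group_by_tag are made up for this example.

```python
from collections import defaultdict

# Short descriptions of the Penn Treebank tags seen in the example output.
TAG_GLOSSARY = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "POS": "possessive ending",
    "VB": "verb, base form",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
    "JJ": "adjective",
    "RB": "adverb",
    "CC": "coordinating conjunction",
    "PRP": "personal pronoun",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "TO": "to",
}


def group_by_tag(tagged):
    """Collect words under their tag, skipping tags not in the glossary."""
    groups = defaultdict(list)
    for word, tag in tagged:
        if tag in TAG_GLOSSARY:  # punctuation tags such as ',' are skipped
            groups[tag].append(word)
    return dict(groups)


# The first few (word, tag) pairs from the output above.
tagged = [("Today", "NN"), ("'s", "POS"), ("weather", "NN"), ("is", "VBZ"),
          ("good", "JJ"), (",", ","), ("very", "RB"), ("windy", "JJ")]

for tag, words in group_by_tag(tagged).items():
    print(f"{tag:4} {TAG_GLOSSARY[tag]:40} {words}")
```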

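Chunking

from nltk.chunk import RegexpParser

sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('died', 'VBD')]
grammar = "MY_NP: {<DT>?<JJ>*<NN>}"  # an optional determiner, any number of adjectives, then a noun
cp = RegexpParser(grammar)  # build the chunker from the rule
result = cp.parse(sentence)  # chunk the tagged sentence
print(result)
(S (MY_NP the/DT little/JJ yellow/JJ dog/NN) died/VBD)
result.draw()  # opens a window that draws the tree (uses Tkinter, not matplotlib)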
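To show what the rule {<DT>?<JJ>*<NN>} expresses, here is a plain-Python sketch that writes the tag sequence as a string and matches it with the re module. It handles only this one pattern and is a simplified illustration, not RegexpParser itself; chunk_noun_phrases is a hypothetical helper.

```python
import re

# The chunk rule as a regular expression over a string of <TAG> markers:
# an optional determiner, any number of adjectives, then a singular noun.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")


def chunk_noun_phrases(tagged):
    """Return the word groups whose tags match DT? JJ* NN (illustrative)."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)

    # Map the character offset where each tag starts to its token index.
    offset_to_index = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged):
        offset_to_index[offset] = index
        offset += len(tag) + 2  # the tag plus the two angle brackets

    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        first = offset_to_index[match.start()]
        length = match.group().count("<")  # number of tags in the match
        chunks.append([word for word, _ in tagged[first:first + length]])
    return chunks


sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("died", "VBD")]
print(chunk_noun_phrases(sentence))
# [['the', 'little', 'yellow', 'dog']]
```

The verb "died" stays outside the chunk because its tag, VBD, does not fit anywhere in the pattern.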

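Named entity recognition

Named entity recognition (NER) is a basic NLP task: it identifies named referring expressions in text and lays the groundwork for tasks such as relation extraction. In the narrow sense, it means recognizing three kinds of named entities: person names, place names, and organization names. Entity types with an obviously regular structure, such as times and currency amounts, can be recognized with regular expressions and similar methods. In specific domains, of course, domain-specific entity types are defined as needed.

from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize

sentence = "Edison went to Tsinghua University today."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
(S
  (PERSON Edison/NNP)
  went/VBD
  to/TO
  (ORGANIZATION Tsinghua/NNP University/NNP)
  today/NN
  ./.)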
