
Corpora and Lexical Resources

Published: 2024/1/23


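Modern natural language processing is largely statistical, and statistics needs a lot of samples, so corpora and lexical resources are indispensable. This section explains why they matter and how to get them.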

NLTK corpora

NLTK ships with many corpora. Take the Gutenberg corpus as an example. Run:

nltk.corpus.gutenberg.fileids()

This returns the file identifiers of the Gutenberg corpus:

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

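nltk.corpus.gutenberg is the corpus reader for the Gutenberg corpus. It has many useful methods, for example:

nltk.corpus.gutenberg.raw('chesterton-brown.txt'): returns the raw text of chesterton-brown.txt

nltk.corpus.gutenberg.words('chesterton-brown.txt'): returns the words of chesterton-brown.txt as a list

nltk.corpus.gutenberg.sents('chesterton-brown.txt'): returns the sentences of chesterton-brown.txt as a list, each sentence being a list of words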
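To make the relationship between these three views concrete, here is a plain-Python sketch on a short string. It is only a stand-in: the sample text, the regular expressions, and the names split_sentences and split_words are assumptions made for illustration, and NLTK's own tokenizers are more careful than this.

```python
import re

# Made-up sample text, not taken from the Gutenberg corpus
raw_text = "Father Brown smiled. The sky was dark! Was it late?"


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a crude heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


# raw view: the whole text as one string
raw = raw_text
# sents view: a list of sentences, each one a list of tokens
sents = [split_words(s) for s in split_sentences(raw)]
# words view: the same tokens, flattened into a single list
words = split_words(raw)

print(len(sents), "sentences,", len(words), "tokens")
print(sents[0])
```

The point to take away is that the word view is the sentence view flattened, and both are derived from the raw string.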

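Similar corpora include:

from nltk.corpus import webtext: the web text corpus, containing web and chat text

from nltk.corpus import brown: the Brown corpus, 500 texts from different sources, categorized by genre

from nltk.corpus import reuters: the Reuters corpus, more than 10,000 news documents

from nltk.corpus import inaugural: the inaugural address corpus, 55 presidential inaugural addresses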

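General structure of corpora

The corpora above were built separately, so they differ in small ways, but they all follow one of a few organizational structures: isolated (a collection of standalone texts), categorized (organized by category, with no overlap between categories), overlapping (a text may belong to several categories), and temporal (language use changes over time).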
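One way to picture the first three structures is as a mapping from each file to the set of categories it belongs to. The sketch below uses that representation; the function name corpus_structure and all file ids and labels are hypothetical, invented for illustration. The temporal structure is about ordering texts by time and is not modeled here.

```python
def corpus_structure(file_categories):
    """Classify a corpus by how its files map to categories.

    file_categories: dict mapping a file id to the set of its categories.
    """
    largest = max((len(cats) for cats in file_categories.values()), default=0)
    if largest == 0:
        return "isolated"      # no file carries a category
    if largest == 1:
        return "categorized"   # every file has at most one category
    return "overlapping"       # at least one file has several categories


# Hypothetical file ids and labels, invented for illustration only
isolated = {"novel-a.txt": set(), "novel-b.txt": set()}
categorized = {"doc1": {"news"}, "doc2": {"fiction"}}
overlapping = {"doc1": {"grain", "trade"}, "doc2": {"trade"}}

for corpus in (isolated, categorized, overlapping):
    print(corpus_structure(corpus))
```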

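Common corpus interface

fileids(): returns the files of the corpus

categories(): returns the categories of the corpus

raw(): returns the raw content of the corpus

words(): returns the words of the corpus

sents(): returns the sentences of the corpus

abspath(): returns the location of the given file on disk

open(): opens a stream for reading a corpus file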

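Loading your own corpus

Collect your own corpus files (plain text files) under some path (for example /tmp), then run:

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/tmp'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()

This lists the files of your own corpus. You can also call methods such as wordlists.sents('a.txt') and wordlists.words('a.txt') to get its sentences and words.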
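For a version that can be run without preparing any files by hand, the sketch below writes two small files into a temporary directory first. The file names and their contents are made up for illustration, and the pattern is narrowed to .txt files so that unrelated files in the directory are not picked up. sents() is left out because, as far as I know, NLTK's default sentence tokenizer needs the Punkt model data to be downloaded separately.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Create a throwaway corpus directory with two small text files
corpus_root = tempfile.mkdtemp()
samples = {
    "a.txt": "Hello world. This is a test.",
    "b.txt": "Another small file.",
}
for name, content in samples.items():
    with open(os.path.join(corpus_root, name), "w", encoding="utf-8") as f:
        f.write(content)

# Only match .txt files instead of everything with '.*'
wordlists = PlaintextCorpusReader(corpus_root, r".*\.txt")

print(wordlists.fileids())             # the files found under corpus_root
print(list(wordlists.words("a.txt")))  # word and punctuation tokens of a.txt
print(wordlists.raw("b.txt"))          # b.txt as one string
```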

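Conditional frequency distributions

Most readers already know conditional distributions: the probability distribution of an event under a given condition. In natural language processing, a conditional frequency distribution is the frequency distribution of an event under a given condition.

For example, to show how often each word occurs in each category of the Brown corpus: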

# coding: utf-8
import nltk
from nltk.corpus import brown

# List comprehension: genre runs over all categories of the Brown corpus,
# word runs over the words of that category,
# so each (genre, word) is a category/word pair
genre_word = [(genre, word)
              for genre in brown.categories()
              for word in brown.words(categories=genre)]

# Build the conditional frequency distribution
cfd = nltk.ConditionalFreqDist(genre_word)

# Plot the given conditions and samples
cfd.plot(conditions=['news', 'adventure'],
         samples=['stock', 'sunbonnet', 'Elevated', 'narcotic', 'four', 'woods', 'railing', 'Until',
                  'aggression', 'marching', 'looking', 'eligible', 'electricity', '$25-a-plate',
                  'consulate', 'Casey', 'all-county', 'Belgians', 'Western', '1959-60', 'Duhagon',
                  'sinking', '1,119', 'co-operation', 'Famed', 'regional', 'Charitable',
                  'appropriation', 'yellow', 'uncertain', 'Heights', 'bringing', 'prize', 'Loen',
                  'Publique', 'wooden', 'Loeb', '963', 'specialties', 'Sands', 'succession', 'Paul',
                  'Phyfe'])

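Note: if you replace plot with tabulate here, the same information is printed as a table instead of drawn as a graph.

We can also use a conditional frequency distribution to chain bigrams by always picking the most likely next word, and in this way generate a piece of text.

This can use the bigrams() function directly, which produces the sequence of adjacent word pairs.

Create a Python file as follows: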
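To see what such a distribution actually holds, here is a plain-Python sketch that keeps one counter per condition and prints the counts as a table. It is a stand-in for the idea, not NLTK's implementation: the pairs are hypothetical, and the tabulate function below is my own helper that only imitates the layout.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, sample) pairs standing in for (genre, word)
pairs = [
    ("news", "stock"), ("news", "stock"), ("news", "prize"),
    ("adventure", "woods"), ("adventure", "stock"), ("adventure", "woods"),
]

# One Counter per condition: cfd[condition][sample] is a count
cfd = defaultdict(Counter)
for condition, sample in pairs:
    cfd[condition][sample] += 1


def tabulate(cfd, conditions, samples):
    """Print counts as a table: one row per condition, one column per sample."""
    width = max(len(s) for s in list(samples) + list(conditions)) + 1
    print(" " * width + "".join(s.rjust(width) for s in samples))
    for c in conditions:
        print(c.ljust(width) + "".join(str(cfd[c][s]).rjust(width) for s in samples))


tabulate(cfd, ["news", "adventure"], ["stock", "woods", "prize"])
```

Each row is an ordinary frequency distribution; the condition only decides which row a sample is counted in.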

# coding: utf-8
import nltk

# Loop num times: print the current word, then move on to the word
# that most often follows it according to cfdist
def generate_model(cfdist, word, num=10):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()

# Load the corpus
text = nltk.corpus.genesis.words('english-kjv.txt')

# Build the bigrams
bigrams = nltk.bigrams(text)

# Build the conditional frequency distribution
cfd = nltk.ConditionalFreqDist(bigrams)

# Generate a string of words starting from 'the'
generate_model(cfd, 'the')

Running it produces:

the land of the land of the land of the

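The most likely word after "the" is "land", the most likely word after "land" is "of", and the most likely word after "of" is "the", so from there the output just cycles.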
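The cycle is easy to reproduce without any corpus. The sketch below builds the follower counts with plain Python and applies the same greedy rule; the toy text and the names build_cfd and generate are assumptions chosen for illustration, not part of NLTK.

```python
from collections import Counter, defaultdict


def build_cfd(tokens):
    """Count, for every word, which words follow it."""
    cfd = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        cfd[first][second] += 1
    return cfd


def generate(cfd, word, num=10):
    """Greedy generation: always step to the most frequent follower."""
    out = []
    for _ in range(num):
        out.append(word)
        if not cfd[word]:
            break  # no known follower, e.g. the last token of the text
        word = cfd[word].most_common(1)[0][0]
    return out


# A made-up toy text, not taken from the Genesis corpus
tokens = "the land of the land of the sea and the land of milk".split()
cfd = build_cfd(tokens)
print(" ".join(generate(cfd, "the")))
```

Because each step depends only on the current word, the greedy rule must repeat itself as soon as it reaches a word it has already visited.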

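Other lexical resources

Some resources are just collections of words or phrases together with some associated information. These are called lexical resources.

Word list corpus: nltk.corpus.words.words(), a large list of English words, which can be used to spot misspelled or unusual words

Stopword corpus: nltk.corpus.stopwords.words('english'), used to identify very frequent words that carry little meaning

Pronouncing dictionary: nltk.corpus.cmudict.dict(), gives the pronunciation of English words

Comparative wordlists: nltk.corpus.swadesh, about 200 core words aligned across many languages, which can serve as a starting point for translating between languages

Synonym sets: WordNet, a semantically oriented English dictionary made up of synonym sets that are organized into a network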
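As a rough sketch of how the first two resources are typically used, the code below checks tokens against a word list and measures how much of a text is left after removing stopwords. The two small sets are hand-made stand-ins so that the example runs without any downloaded data; in practice they would be built from nltk.corpus.words.words() and nltk.corpus.stopwords.words('english'). The function names are my own.

```python
# Tiny hand-made stand-ins for the real word list and stopword list
english_vocab = {"the", "cat", "sat", "on", "mat", "a", "dog"}
stop_words = {"the", "on", "a"}


def unusual_words(tokens, vocab):
    """Return the alphabetic tokens that are missing from the word list, sorted."""
    seen = {t.lower() for t in tokens if t.isalpha()}
    return sorted(seen - vocab)


def content_fraction(tokens, stopwords):
    """Return the fraction of tokens that are not stopwords."""
    content = [t for t in tokens if t.lower() not in stopwords]
    return len(content) / len(tokens)


tokens = "The cat sat on the matt".split()
print(unusual_words(tokens, english_vocab))   # words not found in the list
print(content_fraction(tokens, stop_words))   # share of non-stopword tokens
```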
