

[Natural Language Processing with Python, 2nd Edition] Reading Notes 2: Accessing Text Corpora and Lexical Resources

Published: 2025/3/19

Contents

  • I. Accessing Text Corpora
    • 1. The Gutenberg Corpus
      • (1) Listing the File Identifiers in the Corpus
      • (2) Word Counts and Concordances
      • (3) Text Statistics
    • 2. Web and Chat Text
    • 3. The Brown Corpus
      • (1) First Look
      • (2) Comparing Modal Verb Usage Across Genres
    • 4. The Reuters Corpus
      • (1) First Look
      • (2) Looking Up Words by Topic and Fileid
      • (3) Finding Words or Sentences by Document or Category
    • 5. The Inaugural Address Corpus
      • (1) First Look
      • (2) Conditional Frequency Distribution Plot
    • 6. Annotated Text Corpora
    • 7. Corpora in Other Languages
    • 8. The Structure of Text Corpora
    • 9. Loading Your Own Corpus
      • (1) PlaintextCorpusReader
      • (2) BracketParseCorpusReader
  • II. Conditional Frequency Distributions
    • 1. Conditions and Events
    • 2. Counting Words by Genre
    • 3. Plotting and Tabulating Distributions
    • 4. Generating Random Text with Bigrams
  • III. Code Reuse (Python)
  • IV. Lexical Resources
    • 1. Wordlist Corpora
      • (1) Filtering a Text
      • (2) The Stopwords Corpus
      • (3) A Word Puzzle
      • (4) The Names Corpus
    • 2. A Pronouncing Dictionary
    • 3. Comparative Wordlists
  • V. WordNet
    • 1. Senses and Synonyms
    • 2. The WordNet Hierarchy
    • 3. More Lexical Relations
    • 4. Semantic Similarity

This chapter is about working with large bodies of linguistic data, i.e. corpora.

I. Accessing Text Corpora

1. The Gutenberg Corpus

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, a project that hosts some 25,000 free electronic books.

(1) Listing the File Identifiers in the Corpus

import nltk

# Print the file identifiers in the corpus
print(nltk.corpus.gutenberg.fileids())

Output:

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

(2) Word Counts and Concordances

import nltk
from nltk.corpus import gutenberg

emma = gutenberg.words('austen-emma.txt')
print(len(emma))

emma = nltk.Text(gutenberg.words('austen-emma.txt'))
emma.concordance("surprize")  # concordance() prints its matches itself

Output:

192427 Displaying 25 of 37 matches: er father , was sometimes taken by surprize at his being still able to pity ` hem do the other any good ." " You surprize me ! Emma must do Harriet good : a Knightley actually looked red with surprize and displeasure , as he stood up , r . Elton , and found to his great surprize , that Mr . Elton was actually on d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great , father was quite taken up with the surprize of so sudden a journey , and his f y , in all the favouring warmth of surprize and conjecture . She was , moreove he appeared , to have her share of surprize , introduction , and pleasure . Th ir plans ; and it was an agreeable surprize to her , therefore , to perceive t talking aunt had taken me quite by surprize , it must have been the death of m f all the dialogue which ensued of surprize , and inquiry , and congratulationthe present . They might chuse to surprize her ." Mrs . Cole had many to agre the mode of it , the mystery , the surprize , is more like a young woman ' s sto her song took her agreeably by surprize -- a second , slightly but correct " " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ; t to be considered . Emma ' s only surprize was that Jane Fairfax should accep of your admiration may take you by surprize some day or other ." Mr . Knightle ation for her will ever take me by surprize .-- I never had a thought of her iexpected by the best judges , for surprize -- but there was great joy . Mr . sound of at first , without great surprize . " So unreasonably early !" she w d Frank Churchill , with a look of surprize and displeasure .-- " That is easy ; and Emma could imagine with what surprize and mortification she must be retu tled that Jane should go . Quite a surprize to me ! I had not the least idea !. It is impossible to express our surprize . He came to speak to his father o g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai

(3) Text Statistics

for fileid in gutenberg.fileids():
    # raw() gives the contents of the file without any linguistic processing
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    # sents() divides the text into sentences, each of which is a list of words
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print('avg word length:', round(num_chars/num_words),
          'avg sentence length:', round(num_words/num_sents),
          'avg uses per word:', round(num_words/num_vocab),
          'from:', fileid)

Output:

avg word length: 5 avg sentence length: 25 avg uses per word: 26 from: austen-emma.txt
avg word length: 5 avg sentence length: 26 avg uses per word: 17 from: austen-persuasion.txt
avg word length: 5 avg sentence length: 28 avg uses per word: 22 from: austen-sense.txt
avg word length: 4 avg sentence length: 34 avg uses per word: 79 from: bible-kjv.txt
avg word length: 5 avg sentence length: 19 avg uses per word: 5 from: blake-poems.txt
avg word length: 4 avg sentence length: 19 avg uses per word: 14 from: bryant-stories.txt
avg word length: 4 avg sentence length: 18 avg uses per word: 12 from: burgess-busterbrown.txt
avg word length: 4 avg sentence length: 20 avg uses per word: 13 from: carroll-alice.txt
avg word length: 5 avg sentence length: 20 avg uses per word: 12 from: chesterton-ball.txt
avg word length: 5 avg sentence length: 23 avg uses per word: 11 from: chesterton-brown.txt
avg word length: 5 avg sentence length: 18 avg uses per word: 11 from: chesterton-thursday.txt
avg word length: 4 avg sentence length: 21 avg uses per word: 25 from: edgeworth-parents.txt
avg word length: 5 avg sentence length: 26 avg uses per word: 15 from: melville-moby_dick.txt
avg word length: 5 avg sentence length: 52 avg uses per word: 11 from: milton-paradise.txt
avg word length: 4 avg sentence length: 12 avg uses per word: 9 from: shakespeare-caesar.txt
avg word length: 4 avg sentence length: 12 avg uses per word: 8 from: shakespeare-hamlet.txt
avg word length: 4 avg sentence length: 12 avg uses per word: 7 from: shakespeare-macbeth.txt
avg word length: 5 avg sentence length: 36 avg uses per word: 12 from: whitman-leaves.txt

This displays three statistics for each text: average word length, average sentence length, and the average number of times each word appears (our lexical diversity score). Average word length appears to be a general property of English, since it hardly varies across texts. (In fact the average word length is really about one less than shown, since the num_chars variable also counts the whitespace after each word.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.
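Since num_chars counts the whitespace after each word, one way to avoid the bias is to average over the lengths of the tokens themselves. A minimal sketch on a toy token list (the tokens below are only an illustrative stand-in for gutenberg.words(fileid)):

```python
def avg_word_length(tokens):
    # Average length of the tokens themselves, so the whitespace
    # between words is never counted
    return sum(len(w) for w in tokens) / len(tokens)

tokens = ['Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich']
print(round(avg_word_length(tokens), 2))  # 4.11
```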

2. Web and Chat Text

  • from nltk.corpus import webtext: a small collection of web text
  • from nltk.corpus import nps_chat: a corpus of instant messaging chat sessions

# A small collection of web text
from nltk.corpus import webtext
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')

# Instant messaging chat sessions corpus
# 10-19-20s_706posts.xml contains 706 posts collected on
# October 19, 2006 from the 20s chat room.
from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')
print(chatroom[123])

Output:

firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there! [clop ...
overheard.txt White guy: So, do you have any plans for this evening? Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']

3. The Brown Corpus

Created at Brown University in 1961, this was the first million-word electronic corpus of English; it contains text from 500 different sources.

(1) First Look

Sample documents from each section of the Brown Corpus:

from nltk.corpus import brown

print(brown.categories())
print(brown.words(categories='news'))
print(brown.words(fileids=['cg22']))
print(brown.sents(categories=['news', 'editorial', 'reviews']))

Output:

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction'] ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...] ['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...] [['Assembly', 'session', 'brought', 'much', 'good'], ['The', 'General', 'Assembly', ',', 'which', 'adjourns', 'today', ',', 'has', 'performed', 'in', 'an', 'atmosphere', 'of', 'crisis', 'and', 'struggle', 'from', 'the', 'day', 'it', 'convened', '.'], ...]

(2) Comparing Modal Verb Usage Across Genres

Counting modal verbs within a single genre:

import nltk
from nltk.corpus import brown

news_text = brown.words(categories='news')
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')

Output:

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Counting the frequencies of words of interest across different genres:

# Conditional frequency distribution
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
# The genres we want to display
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
# The words we want to count
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
cfd.plot(conditions=genres, samples=modals)

Output:

                  can  could  may  might  must  will
           news    93     86   66     38    50   389
       religion    82     59   78     12    54    71
        hobbies   268     58  131     22    83   264
science_fiction    16     49    4     12     8    16
        romance    74    193   11     51    45    43
          humor    16     30    8      8     9    13

4. The Reuters Corpus

10,788 news documents totaling 1.3 million words, classified into 90 topics and split into "training" and "test" sets.

(1) First Look

from nltk.corpus import reuters

print(reuters.fileids())
print(reuters.categories())  # topics

Output:

['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843', ..., 'test/21567', 'test/21568', 'test/21570', 'test/21571', 'test/21573', 'test/21574', 'test/21575', 'test/21576', 'training/1', 'training/10', 'training/100', 'training/1000', 'training/10000', 'training/10002', 'training/10005', ..., 'training/9988', 'training/9989', 'training/999', 'training/9992', 'training/9993', 'training/9994', 'training/9995']
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', ..., 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']

(2) Looking Up Words by Topic and Fileid

Categories in the Reuters Corpus overlap with one another, since a news story often covers multiple topics.

# Topics covered by a fileid
print(reuters.categories('training/9865'))
print(reuters.categories(['training/9865', 'training/9880']))

# Fileids covering a topic
print(reuters.fileids('barley'))
print(reuters.fileids(['barley', 'corn']))

Output:

['barley', 'corn', 'grain', 'wheat']
['barley', 'corn', 'grain', 'money-fx', 'wheat']

['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/15875', 'test/15952', 'test/17767', 'test/17769', ..., 'training/8257', 'training/8759', 'training/9865', 'training/9958']
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106', 'test/15287', 'test/15341', 'test/15618', 'test/15648', ..., 'training/9058', 'training/9093', 'training/9094', 'training/934', 'training/9470', 'training/9521', 'training/9667', 'training/97', 'training/9865', 'training/9958', 'training/9989']

(3) Finding Words or Sentences by Document or Category

print(reuters.words('training/9865')[:14])
print(reuters.words(['training/9865', 'training/9880']))
print(reuters.words(categories='barley'))
print(reuters.words(categories=['barley', 'corn']))

Output:

['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export'] ['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...] ['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...] ['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]

5. The Inaugural Address Corpus

(1) First Look

import nltk
from nltk.corpus import inaugural

print(inaugural.fileids())
print([fileid[:4] for fileid in inaugural.fileids()])

Output:

['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', ..., '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt'] ['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825', '1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865', '1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905', '1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945', '1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985', '1989', '1993', '1997', '2001', '2005', '2009']

(2) Conditional Frequency Distribution Plot

cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()

Output: a conditional frequency distribution plot, counting all words in the Inaugural Address Corpus that begin with america or citizen.

6. Annotated Text Corpora
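The original notes leave this section without an example. Annotated corpora pair each token with linguistic information such as a part-of-speech tag; in text form such corpora conventionally write tokens as word/TAG, which corpus readers expose as (word, tag) tuples (this is what NLTK's tagged_words() returns). A hand-rolled sketch of that convention, not NLTK's own implementation:

```python
def str2tuple(s, sep='/'):
    # Split a 'word/TAG' token into a (word, TAG) tuple, mirroring
    # the convention used by tagged corpora; assumes the separator
    # is present in the input
    word, _, tag = s.rpartition(sep)
    return (word, tag.upper())

tagged = [str2tuple(t) for t in "The/AT grand/JJ jury/NN".split()]
print(tagged)  # [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN')]
```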

7. Corpora in Other Languages

print(nltk.corpus.cess_esp.words())
print(nltk.corpus.floresta.words())
print(nltk.corpus.indian.words('hindi.pos'))
print(nltk.corpus.udhr.fileids())
print(nltk.corpus.udhr.words('Javanese-Latin1')[11:])

Using a conditional frequency distribution to study differences in word length across the different-language versions of the Universal Declaration of Human Rights (udhr) corpus:

from nltk.corpus import udhr

languages = ['Chickasaw', 'English', 'German_Deutsch',
             'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.plot(cumulative=True)

8. The Structure of Text Corpora


Common structures of text corpora:

  • isolated: a collection of isolated texts with no particular organization;
  • categorized: texts grouped into categories;
  • overlapping: categories that overlap, e.g. topic categories (the Reuters Corpus);
  • temporal: language use changing over time (the Inaugural Address Corpus).

9. Loading Your Own Corpus

(1) PlaintextCorpusReader

PlaintextCorpusReader is suited to plain text files, e.g. loading a corpus located under corpus_root:

from nltk.corpus import PlaintextCorpusReader

corpus_root = '/usr/share/dict'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
print(wordlists.fileids())
print(wordlists.words('american-english'))

Output:

['README.select-wordlist', 'american-english', 'british-english', 'cracklib-small', 'words', 'words.pre-dictionaries-common'] ['A', 'A', "'", 's', 'AMD', 'AMD', "'", 's', 'AOL', ...]

(2) BracketParseCorpusReader

BracketParseCorpusReader is suited to corpora that have already been parsed:

from nltk.corpus import BracketParseCorpusReader

corpus_root = ''  # path to the corpus
file_pattern = r".*/wsj_.*\.mrg"  # file-matching pattern
# Initialize the reader with the corpus directory and the file pattern;
# the encoding defaults to utf8
ptb = BracketParseCorpusReader(corpus_root, file_pattern)
print(ptb.fileids())
print(len(ptb.sents()))
print(ptb.sents(fileids='20/wsj_2013.mrg')[19])

II. Conditional Frequency Distributions

1. Conditions and Events

Each pair has the form (condition, event). If we process the whole Brown Corpus by genre, there are 15 conditions (one per genre) and 1,161,192 events (one per word).

text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
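The (condition, event) pairing can be sketched without NLTK, using a dict of Counters as a stand-in for ConditionalFreqDist (the toy pairs below are illustrative, not corpus data):

```python
from collections import Counter, defaultdict

pairs = [('news', 'The'), ('news', 'The'), ('news', 'could'),
         ('romance', 'could'), ('romance', 'could')]

# One frequency distribution per condition
cfd = defaultdict(Counter)
for condition, event in pairs:
    cfd[condition][event] += 1

print(sorted(cfd))              # ['news', 'romance']
print(cfd['news']['The'])       # 2
print(cfd['romance']['could'])  # 2
```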

2. Counting Words by Genre

# Build (genre, word) pairs
genre_word = [(genre, word)
              for genre in ['news', 'romance']
              for word in brown.words(categories=genre)]
print(len(genre_word))
print(genre_word[:4])
print(genre_word[-4:], '\n')

# Conditional frequency distribution
cfd = nltk.ConditionalFreqDist(genre_word)
print(cfd)
print(cfd.conditions(), '\n')

print(cfd['news'])
print(cfd['romance'])
print(cfd['romance'].most_common(20))
print(cfd['romance']['could'])

Output:

170576 [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')] [('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')] <ConditionalFreqDist with 2 conditions> ['news', 'romance'] <FreqDist with 14394 samples and 100554 outcomes> <FreqDist with 8452 samples and 70022 outcomes> [(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502), ('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993), ('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690), ('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496)] 193

3. Plotting and Tabulating Distributions

from nltk.corpus import inaugural

cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()

from nltk.corpus import udhr

languages = ['Chickasaw', 'English', 'German_Deutsch',
             'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.tabulate(conditions=['English', 'German_Deutsch'],
             samples=range(10), cumulative=True)
cfd.plot(cumulative=True)

4. Generating Random Text with Bigrams

Building a generation model from bigrams:

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=" ")
        word = cfdist[word].max()

text = nltk.corpus.genesis.words("english-kjv.txt")
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
print(cfd)
print(list(cfd))
print(cfd["so"])
print(cfd["living"])

generate_model(cfd, "so")
generate_model(cfd, "living")

Output:

<ConditionalFreqDist with 2789 conditions> ['In', 'the', 'beginning', 'God', 'created', 'heaven', 'and', 'earth', '.', 'And', 'was', 'without', 'form', ',', 'void', ';', 'darkness', 'upon', 'face', 'of', 'deep', 'Spirit', 'moved', 'waters', 'said', 'Let', 'there', 'be', 'light', ':', 'saw', 'that', 'it', 'good', 'divided', 'from', 'called', 'Day', 'he', ..., ', 'embalmed', 'past', 'elders', 'chariots', 'horsemen', 'threshingfloor', 'Atad', 'lamentati', 'floor', 'Egyptia', 'Abelmizraim', 'requite', 'messenger', 'Forgive', 'forgive', 'meant', 'Machir', 'visit', 'coffin']FreqDist({'that': 8, '.': 7, ',': 4, 'the': 3, 'I': 2, 'doing': 2, 'much': 2, ':': 2, 'did': 1, 'Noah': 1, ...}) FreqDist({'creature': 7, 'thing': 4, 'substance': 2, 'soul': 1, '.': 1, ',': 1})so that he said , and the land of the land of the land of living creature that he said , and the land of the land of

Common methods of conditional frequency distributions

Example: Description
cfdist = ConditionalFreqDist(pairs): create a conditional frequency distribution from a list of pairs
cfdist.conditions(): the conditions, sorted alphabetically
cfdist[condition]: the frequency distribution for this condition
cfdist[condition][sample]: frequency of the given sample for this condition
cfdist.tabulate(): tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions): tabulate, limited to the given samples and conditions
cfdist.plot(): plot the conditional frequency distribution
cfdist.plot(samples, conditions): plot, limited to the given samples and conditions
cfdist1 < cfdist2: test whether samples in cfdist1 occur less frequently than in cfdist2

III. Code Reuse (Python)

  • Functions and methods
  • Module: a collection of variable and function definitions in a single file. Your own functions can then be accessed by importing that file.
  • Package: a collection of related modules.

Note: when Python imports a module, it first looks in the current directory (folder).
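The module mechanism above can be sketched end to end; the module name my_text_utils and its helper function are illustrative assumptions, not part of the original notes:

```python
import pathlib
import sys
import tempfile

# Write a tiny module to a temporary directory; in practice you would just
# keep my_text_utils.py next to the scripts that import it.
module_dir = tempfile.mkdtemp()
pathlib.Path(module_dir, 'my_text_utils.py').write_text(
    "def lexical_diversity(tokens):\n"
    "    return len(set(tokens)) / len(tokens)\n")

sys.path.insert(0, module_dir)  # Python searches this directory first
import my_text_utils

print(my_text_utils.lexical_diversity(['a', 'b', 'a', 'a']))  # 0.5
```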

IV. Lexical Resources

A lexicon, or lexical resource, is a collection of words and/or phrases together with associated information; it is an accessory to a text, and is usually created and enriched with the help of texts.


The figure above illustrates lexicon terminology: two lexical entries with the same spelling but different meanings (homonyms). Each lexical entry consists of a headword (also called a lemma) plus additional information such as the part of speech and a gloss.

1. Wordlist Corpora

One wordlist corpus is the Unix file /usr/share/dict/words, used by some spell checkers. We can use it to find unusual or misspelled words in a text corpus.

(1) Filtering a Text

This program computes the vocabulary of a text, then removes every element that appears in an existing wordlist, leaving just the rare or misspelled words.

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

un_words1 = unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
print(un_words1, '\n')
un_words2 = unusual_words(nltk.corpus.nps_chat.words())
print(un_words2)

Output:

['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', 'abuses', 'accents', 'accepting', 'accommodations', ..., 'wiping', 'wisest', 'wishes', 'withdrew', 'witnessed', 'witnesses', 'witnessing', 'witticisms', 'wittiest', 'wives', 'women', 'wondered', 'woods', 'words', 'workmen', 'worlds', 'wrapt', 'writes', 'yards', 'years', 'yielded', 'youngest'] ['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abortions', 'abou', 'abourted', 'abs', 'ack', 'acros', 'actualy', ...,'yuuuuuuuuuuuummmmmmmmmmmm', 'yvw', 'yw', 'zebrahead', 'zoloft', 'zyban', 'zzzzzzzing', 'zzzzzzzz']

(2) The Stopwords Corpus

Stopwords usually carry little lexical content of their own, e.g. the, to, and also.

from nltk.corpus import stopwords

print(stopwords.words('english'))

# Compute the fraction of words in a text that are not on the stopword list
def content_fraction(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)

frac = content_fraction(nltk.corpus.reuters.words())
print(frac)

Output:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's",..., 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"] 0.735240435097661

(3) A Word Puzzle

Given a grid of randomly chosen letters, the player must make words out of the letters in the grid; this puzzle is known as "Target".

Rules:

  • words must be at least 6 letters long
  • every word must include the center letter
  • each letter may be used at most once per word

import nltk

puzzle_letters = nltk.FreqDist('egivrvonl')
obligatory = 'r'
wordlist = nltk.corpus.words.words()
print([w for w in wordlist if len(w) >= 6
       and obligatory in w
       and nltk.FreqDist(w) <= puzzle_letters])

Output:

['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor', 'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi', 'revolving', 'ringle', 'roving', 'violer', 'virole']

FreqDist comparison: this lets us check that the frequency of each letter in a candidate word is less than or equal to the frequency of the corresponding letter in the puzzle.
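The same containment test can be sketched with collections.Counter standing in for FreqDist (the puzzle letters are the ones from the example above; the helper name fits is an illustrative assumption):

```python
from collections import Counter

puzzle = Counter('egivrvonl')

def fits(word, puzzle, obligatory='r'):
    # A word qualifies if it is long enough, uses the obligatory letter,
    # and each of its letters is available in the puzzle (respecting
    # multiplicities): an empty Counter difference means containment.
    return (len(word) >= 6
            and obligatory in word
            and not Counter(word) - puzzle)

print(fits('govern', puzzle))      # True
print(fits('grovelling', puzzle))  # False: needs two g's and two l's
```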

(4) The Names Corpus

Contains 8,000 first names categorized by gender; male and female names are stored in separate files.

names = nltk.corpus.names
print(names.fileids())
male_names = names.words('male.txt')
female_names = names.words('female.txt')

# Names used by both genders
print([w for w in male_names if w in female_names])

# Conditional frequency distribution over the final letters of male and female names
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()

Output:

['female.txt', 'male.txt'] ['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis', 'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel',..., 'Ted', 'Teddie', 'Teddy', 'Terri', 'Terry', 'Theo', 'Tim', 'Timmie', 'Timmy', 'Tobe', 'Tobie', 'Toby', 'Tommie', 'Tommy', 'Tony', 'Torey', 'Trace', 'Tracey', 'Tracie', 'Tracy', 'Val', 'Vale', 'Valentine', 'Van', 'Vin', 'Vinnie', 'Vinny', 'Virgie', 'Wallie', 'Wallis', 'Wally', 'Whitney', 'Willi', 'Willie', 'Willy', 'Winnie', 'Winny', 'Wynn']


Conditional frequency distribution showing the final letters of male and female names: most names ending in a, e, or i are female; names ending in h or l are about equally likely to be male or female; and names ending in k, o, r, s, or t are more likely to be male.

2. A Pronouncing Dictionary

The CMU Pronouncing Dictionary, designed for use by speech synthesizers.

entries = nltk.corpus.cmudict.entries()
print(len(entries))
for entry in entries[39943:39951]:
    print(entry)

# Three-phone words that begin with P and end with T
for word, pron in entries:
    if len(pron) == 3:
        ph1, ph2, ph3 = pron
        if ph1 == 'P' and ph3 == 'T':
            print(word, ph2, end=' ')

Output:

133737
('explorer', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0'])
('explorers', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0', 'Z'])
('explores', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'Z'])
('exploring', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'IH0', 'NG'])
('explosion', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N'])
('explosions', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N', 'Z'])
('explosive', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V'])
('explosively', ['EH2', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V', 'L', 'IY0'])

pait EY1 pat AE1 pate EY1 patt AE1 peart ER1 peat IY1 peet IY1 peete IY1 pert ER1 pet EH1 pete IY1 pett EH1 piet IY1 piette IY1 pit IH1 pitt IH1 pot AA1 pote OW1 pott AA1 pout AW1 puett UW1 purt ER1 put UH1 putt AH1

Finding all words whose pronunciation ends like nicks:

syllable = ['N', 'IH0', 'K', 'S']
print([word for word, pron in entries if pron[-4:] == syllable])

Output:

["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics', 'centronics', 'chamonix', 'chetniks', "clinic's", 'clinics', 'conics', 'conics', 'cryogenics', 'cynics', 'diasonics', "dominic's", 'ebonics', 'electronics', "electronics'", "endotronics'", 'endotronics', 'enix', 'environics', 'ethnics', 'eugenics', 'fibronics', 'flextronics', 'harmonics', 'hispanics', 'histrionics', 'identics', 'ionics', 'kibbutzniks', 'lasersonics', 'lumonics', 'mannix', 'mechanics', "mechanics'", 'microelectronics', 'minix', 'minnix', 'mnemonics', 'mnemonics', 'molonicks', 'mullenix', 'mullenix', 'mullinix', 'mulnix', "munich's", 'nucleonics', 'onyx', 'organics', "panic's", 'panics', 'penix', 'pennix', 'personics', 'phenix', "philharmonic's", 'phoenix', 'phonics', 'photronics', 'pinnix', 'plantronics', 'pyrotechnics', 'refuseniks', "resnick's", 'respironics', 'sconnix', 'siliconix', 'skolniks', 'sonics', 'sputniks', 'technics', 'tectonics', 'tektronix', 'telectronics', 'telephonics', 'tonics', 'unix', "vinick's", "vinnick's", 'vitronics']

3. Comparative Wordlists

The Swadesh wordlists of common core vocabulary.

from nltk.corpus import swadesh

print(swadesh.fileids())
print(swadesh.words('en'))

# entries(): pass a list of languages to access cognate words across them
fr2en = swadesh.entries(['fr', 'en'])
print(fr2en)
translate = dict(fr2en)
print(translate['chien'])
print(translate['jeter'])

de2en = swadesh.entries(['de', 'en'])  # German-English
es2en = swadesh.entries(['es', 'en'])  # Spanish-English
translate.update(dict(de2en))
translate.update(dict(es2en))
print(translate['Hund'])
print(translate['perro'])

languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'la']
for i in [139, 140, 141, 142]:
    print(swadesh.entries(languages)[i])

Output:

['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk']['I', 'you (singular), thou', 'he', 'we', 'you (plural)', 'they', 'this', 'that', 'here', 'there', 'who', 'what', 'where', 'when', 'how', 'not', 'all', 'many', 'some', 'few', 'other', 'one', 'two', 'three', 'four', 'five', 'big', 'long', 'wide', 'thick', 'heavy', 'small', 'short', 'narrow', 'thin', 'woman', 'man (adult male)', 'man (human being)', 'child', 'wife', 'husband', 'mother', 'father', 'animal', 'fish', 'bird', 'dog', 'louse', 'snake', 'worm', 'tree', 'forest', 'stick', 'fruit', 'seed', 'leaf', 'root', 'bark (from tree)', 'flower', 'grass', 'rope', 'skin', 'meat', 'blood', 'bone', 'fat (noun)', 'egg', 'horn', 'tail', 'feather', 'hair', 'head', 'ear', 'eye', 'nose', 'mouth', 'tooth', 'tongue', 'fingernail', 'foot', 'leg', 'knee', 'hand', 'wing', 'belly', 'guts', 'neck', 'back', 'breast', 'heart', 'liver', 'drink', 'eat', 'bite', 'suck', 'spit', 'vomit', 'blow', 'breathe', 'laugh', 'see', 'hear', 'know (a fact)', 'think', 'smell', 'fear', 'sleep', 'live', 'die', 'kill', 'fight', 'hunt', 'hit', 'cut', 'split', 'stab', 'scratch', 'dig', 'swim', 'fly (verb)', 'walk', 'come', 'lie', 'sit', 'stand', 'turn', 'fall', 'give', 'hold', 'squeeze', 'rub', 'wash', 'wipe', 'pull', 'push', 'throw', 'tie', 'sew', 'count', 'say', 'sing', 'play', 'float', 'flow', 'freeze', 'swell', 'sun', 'moon', 'star', 'water', 'rain', 'river', 'lake', 'sea', 'salt', 'stone', 'sand', 'dust', 'earth', 'cloud', 'fog', 'sky', 'wind', 'snow', 'ice', 'smoke', 'fire', 'ashes', 'burn', 'road', 'mountain', 'red', 'green', 'yellow', 'white', 'black', 'night', 'day', 'year', 'warm', 'cold', 'full', 'new', 'old', 'good', 'bad', 'rotten', 'dirty', 'straight', 'round', 'sharp', 'dull', 'smooth', 'wet', 'dry', 'correct', 'near', 'far', 'right', 'left', 'at', 'in', 'with', 'and', 'if', 'because', 'name'][('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 
'he'), ('nous', 'we'), ('vous', 'you (plural)'), ('ils, elles', 'they'), ('ceci', 'this'), ('cela', 'that'), ('ici', 'here'), ('là', 'there'), ('qui', 'who'), ('quoi', 'what'), ('où', 'where'), ('quand', 'when'), ('comment', 'how'), ('ne...pas', 'not'), ('tout', 'all'), ('plusieurs', 'many'), ... , ('sec', 'dry'), ('juste, correct', 'correct'), ('proche', 'near'), ('loin', 'far'), ('à droite', 'right'), ('à gauche', 'left'), ('à', 'at'), ('dans', 'in'), ('avec', 'with'), ('et', 'and'), ('si', 'if'), ('parce que', 'because'), ('nom', 'name')]dog throwdog dog('say', 'sagen', 'zeggen', 'decir', 'dire', 'dizer', 'dicere') ('sing', 'singen', 'zingen', 'cantar', 'chanter', 'cantar', 'canere') ('play', 'spielen', 'spelen', 'jugar', 'jouer', 'jogar, brincar', 'ludere') ('float', 'schweben', 'zweven', 'flotar', 'flotter', 'flutuar, boiar', 'fluctuare')

V. WordNet

A semantically oriented dictionary of English, with 155,287 words and 117,659 synonym sets (synsets).

1. Senses and Synonyms

from nltk.corpus import wordnet as wn

# The synsets (sets of words, or "lemmas", with the same meaning) containing 'motorcar'
print(wn.synsets('motorcar'))
print(wn.synset('car.n.01').lemma_names())
# The definition of this synset
print(wn.synset('car.n.01').definition())
# Example sentences for this synset
print(wn.synset('car.n.01').examples())

# All lemmas of a given synset
print(wn.synset('car.n.01').lemmas())
# Look up a particular lemma
print(wn.lemma('car.n.01.automobile'))
# The synset a lemma belongs to
print(wn.lemma('car.n.01.automobile').synset())
# The "name" of a lemma
print(wn.lemma('car.n.01.automobile').name(), '\n')

print(wn.synsets('car'))
for synset in wn.synsets('car'):
    print(synset.lemma_names())
print(wn.lemmas('car'))

Output:

[Synset('car.n.01')] ['car', 'auto', 'automobile', 'machine', 'motorcar'] a motor vehicle with four wheels; usually propelled by an internal combustion engine ['he needs a car to get to work'][Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'), Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')] Lemma('car.n.01.automobile') Synset('car.n.01') automobile [Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')] ['car', 'auto', 'automobile', 'machine', 'motorcar'] ['car', 'railcar', 'railway_car', 'railroad_car'] ['car', 'gondola'] ['car', 'elevator_car'] ['cable_car', 'car'][Lemma('car.n.01.car'), Lemma('car.n.02.car'), Lemma('car.n.03.car'), Lemma('car.n.04.car'), Lemma('cable_car.n.01.car')]

2. The WordNet Hierarchy

WordNet synsets correspond to abstract concepts, and these concepts do not always have corresponding words in English. The concepts are linked to one another in a hierarchy.

from nltk.corpus import wordnet as wn

motorcar = wn.synset('car.n.01')
type_of_motorcar = motorcar.hyponyms()
print(type_of_motorcar[26])

# Hyponyms
print(sorted(lemma.name()
             for synset in type_of_motorcar
             for lemma in synset.lemmas()))

# Hypernyms
print(motorcar.hypernyms())
paths = motorcar.hypernym_paths()
print(len(paths))
print([synset.name() for synset in paths[0]])
print([synset.name() for synset in paths[1]])

# The most general hypernym (root hypernym) of this synset
print(motorcar.root_hypernyms())

Output:

Synset('stanley_steamer.n.01')

['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon', 'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible', 'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car', 'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap', 'horseless_carriage', 'hot-rod', 'hot_rod', 'jalopy', 'jeep', 'landrover', 'limo', 'limousine', 'loaner', 'minicar', 'minivan', 'pace_car', 'patrol_car', 'phaeton', 'police_car', 'police_cruiser', 'prowl_car', 'race_car', 'racer', 'racing_car', 'roadster', 'runabout', 'saloon', 'secondhand_car', 'sedan', 'sport_car', 'sport_utility', 'sport_utility_vehicle', 'sports_car', 'squad_car', 'station_waggon', 'station_wagon', 'stock_car', 'subcompact', 'subcompact_car', 'taxi', 'taxicab', 'tourer', 'touring_car', 'two-seater', 'used-car', 'waggon', 'wagon']

[Synset('motor_vehicle.n.01')]
2
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

[Synset('entity.n.01')]

3. More Lexical Relations

  • part_meronyms(): parts, e.g. the parts of a tree are its trunk, crown, and so on.
  • substance_meronyms(): substances something is made of, e.g. the substance of a tree includes heartwood and sapwood.
  • member_holonyms(): the whole formed by members, e.g. a collection of trees forms a forest.
  • entailments(): entailment relations between verbs
  • antonyms(): antonyms
  • dir(): list the lexical relations and the other methods defined on a synset

print(wn.synset('tree.n.01').part_meronyms())
print(wn.synset('tree.n.01').substance_meronyms())
print(wn.synset('tree.n.01').member_holonyms())

print(wn.synset('mint.n.04').part_holonyms())
print(wn.synset('mint.n.04').substance_holonyms())

for synset in wn.synsets('mint', wn.NOUN):
    print(synset.name() + ':', synset.definition())
print('-----------' * 4)

# Entailments
print(wn.synset('walk.v.01').entailments())
print(wn.synset('eat.v.01').entailments())
print(wn.synset('tease.v.03').entailments())
print('-----------' * 4)

# Antonyms
print(wn.lemma('supply.n.02.supply').antonyms())
print(wn.lemma('rush.v.01.rush').antonyms())
print(wn.lemma('horizontal.a.01.horizontal').antonyms())
print(wn.lemma('staccato.r.01.staccato').antonyms())

# dir(): list the lexical relations and other methods defined on a synset
print(dir(wn.synset('harmony.n.02')))

輸出結果

[Synset('burl.n.02'), Synset('crown.n.07'), Synset('limb.n.02'), Synset('stump.n.01'), Synset('trunk.n.01')] [Synset('heartwood.n.01'), Synset('sapwood.n.01')] [Synset('forest.n.01')] [Synset('mint.n.02')] [Synset('mint.n.05')] batch.n.02: (often followed by `of') a large number or amount or extent mint.n.02: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers mint.n.03: any member of the mint family of plants mint.n.04: the leaves of a mint plant used fresh or candied mint.n.05: a candy that is flavored with a mint oil mint.n.06: a plant where money is coined by authority of the government -------------------------------------------- [Synset('step.v.01')] [Synset('chew.v.01'), Synset('swallow.v.01')] [Synset('arouse.v.07'), Synset('disappoint.v.01')] -------------------------------------------- [Lemma('demand.n.02.demand')] [Lemma('linger.v.04.linger')] [Lemma('vertical.a.01.vertical'), Lemma('inclined.a.02.inclined')] [Lemma('legato.r.01.legato')] ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_hypernyms', '_definition', '_examples', '_frame_ids', '_hypernyms', '_instance_hypernyms', '_iter_hypernym_lists', '_lemma_names', '_lemma_pointers', '_lemmas', '_lexname', '_max_depth', '_min_depth', '_name', '_needs_root', '_offset', '_pointers', '_pos', '_related', '_shortest_hypernym_paths', '_wordnet_corpus_reader', 'also_sees', 'attributes', 'causes', 'closure', 'common_hypernyms', 'definition', 'entailments', 'examples', 'frame_ids', 'hypernym_distances', 'hypernym_paths', 'hypernyms', 'hyponyms', 'in_region_domains', 'in_topic_domains', 'in_usage_domains', 'instance_hypernyms', 'instance_hyponyms', 
'jcn_similarity', 'lch_similarity', 'lemma_names', 'lemmas', 'lexname', 'lin_similarity', 'lowest_common_hypernyms', 'max_depth', 'member_holonyms', 'member_meronyms', 'min_depth', 'name', 'offset', 'part_holonyms', 'part_meronyms', 'path_similarity', 'pos', 'region_domains', 'res_similarity', 'root_hypernyms', 'shortest_path_distance', 'similar_tos', 'substance_holonyms', 'substance_meronyms', 'topic_domains', 'tree', 'unicode_repr', 'usage_domains', 'verb_groups', 'wup_similarity']

4. Semantic Similarity

right = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
novel = wn.synset('novel.n.01')

# Lowest common hypernyms
print(right.lowest_common_hypernyms(minke))
print(right.lowest_common_hypernyms(orca))
print(right.lowest_common_hypernyms(tortoise))
print(right.lowest_common_hypernyms(novel))

# Quantify each synset by its minimum depth in the hierarchy
print(wn.synset('baleen_whale.n.01').min_depth())
print(wn.synset('whale.n.02').min_depth())
print(wn.synset('vertebrate.n.01').min_depth())
print(wn.synset('entity.n.01').min_depth())

# Similarity based on the shortest path connecting the concepts
# in the hypernym hierarchy: a score in the range 0-1
# (returns -1 if there is no path between the two synsets)
print(right.path_similarity(minke))
print(right.path_similarity(orca))
print(right.path_similarity(tortoise))
print(right.path_similarity(novel))

Output:

[Synset('baleen_whale.n.01')]
[Synset('whale.n.02')]
[Synset('vertebrate.n.01')]
[Synset('entity.n.01')]

14
13
8
0

0.25
0.16666666666666666
0.07692307692307693
0.043478260869565216
