
Natural Language 10: Categorizing and Tagging Words



http://www.tuicool.com/articles/feAfi2

NLTK Reading Notes: Categorizing and Tagging

Date: 2009-10-20 15:58:44 | SuperAngevil's Blog | Original: http://superangevil.wordpress.com/2009/10/20/nltk5/ | Topic: NLTK

0. Questions Addressed in This Chapter

  • (1) What are lexical categories, and how are they used in NLP?
  • (2) What kind of Python data structure is best for storing words and their categories?
  • (3) How can we automatically tag each word of a text?

This chapter also covers some fundamental NLP techniques: sequence labeling, n-gram models, backoff, and evaluation.

In a typical NLP pipeline, the first step is to segment the text stream into semantic units (tokenization, e.g. word segmentation); the second step is part-of-speech tagging (POS tagging).

1. Using a Tagger

>>> import nltk

>>> text = nltk.word_tokenize("And now for something completely different")

>>> nltk.pos_tag(text)

A POS tagger processes a sequence of words, attaches a part-of-speech tag to each word, and returns the result as a list.

2. Tagged Corpora

(1) Representing tagged tokens

>>> tagged_token = nltk.tag.str2tuple('fly/NN')

>>> tagged_token[0] == 'fly'

>>> tagged_token[1] == 'NN'

>>> sent = "The/AT grand/JJ jury/NN commented/VBD"   # example string of word/TAG tokens

>>> [nltk.tag.str2tuple(t) for t in sent.split()]
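nltk.tag.str2tuple essentially splits at the last slash and upper-cases the tag. A minimal pure-Python sketch (a hypothetical helper, not NLTK's actual source):

```python
def str2tuple(s, sep='/'):
    # Split 'word/TAG' at the last separator, upper-casing the tag,
    # so that words containing '/' (e.g. '1/2/CD') still parse.
    word, _, tag = s.rpartition(sep)
    return (word, tag.upper())

print(str2tuple('fly/NN'))
print([str2tuple(t) for t in "The/AT grand/JJ jury/NN".split()])
```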

(2) Reading tagged corpora

NLTK's corpus readers provide a uniform interface for reading tagged corpora, tagged_words():

>>> nltk.corpus.brown.tagged_words()

>>> nltk.corpus.brown.tagged_words(simplify_tags=True)

NLTK also includes Chinese corpora (Unicode-encoded). If a corpus has also been segmented into sentences, it will have a tagged_sents() method.

(3) A Simplified POS Tagset

Tag  Meaning             Examples
ADJ  adjective           new, good, high, special, big, local
ADV  adverb              really, already, still, early, now
CNJ  conjunction         and, or, but, if, while, although
DET  determiner          the, a, some, most, every, no
EX   existential         there, there's
FW   foreign word        dolce, ersatz, esprit, quo, maitre
MOD  modal verb          will, can, would, may, must, should
N    noun                year, home, costs, time, education
NP   proper noun         Alison, Africa, April, Washington
NUM  number              twenty-four, fourth, 1991, 14:24
PRO  pronoun             he, their, her, its, my, I, us
P    preposition         on, of, at, with, by, into, under
TO   the word to         to
UH   interjection        ah, bang, ha, whee, hmpf, oops
V    verb                is, has, get, do, make, see, run
VD   past tense          said, took, told, made, asked
VG   present participle  making, going, playing, working
VN   past participle     given, taken, begun, sung
WH   wh determiner       who, which, when, what, where, how

>>> from nltk.corpus import brown

>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)

>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)

>>> tag_fd.keys()

Nouns: generally refer to people, places, things, or concepts.

Verbs: describe events and actions.

Adjectives and Adverbs: adjectives describe nouns; adverbs describe verbs.

>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)

>>> word_tag_fd = nltk.FreqDist(wsj)

>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]

(5) Using tagged corpora

>>> brown_learned_text = brown.words(categories='learned')

>>> sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often'))

>>> brown_lrnd_tagged = brown.tagged_words(categories='learned', simplify_tags=True)

>>> tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']

>>> fd = nltk.FreqDist(tags)

>>> fd.tabulate()

>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)

>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)

>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)

3. Using Python Dictionaries to Map Words to Properties

In POS tagging every word is paired with a tag, so it is natural to build a mapping from words to their properties.

Python provides collections.defaultdict (NLTK also offers nltk.defaultdict), so that looking up a key that is not in the dictionary returns a default value instead of raising an exception.

Both keys and values can be arbitrarily complex.
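For example, a minimal sketch of such a word-to-tag mapping using collections.defaultdict (toy tags, for illustration only):

```python
from collections import defaultdict

# Map words to POS tags; a word not in the dictionary gets the
# default tag 'NN' instead of raising a KeyError.
pos = defaultdict(lambda: 'NN')
pos.update({'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V'})

print(pos['ideas'])   # 'N'
print(pos['blog'])    # 'NN' (default for an unseen word)
```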

4. Automatic Tagging

>>> from nltk.corpus import brown

>>> brown_tagged_sents = brown.tagged_sents(categories='news')

>>> brown_sents = brown.sents(categories='news')

(1) The Default Tagger

The simplest tagging method assigns the same tag to every token. The "laziest" shortcut is to assign each token the single most likely tag overall:

>>> tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
>>> nltk.FreqDist(tags).max()

Running these two lines shows that the most common tag is 'NN'.

Now create a tagger that tags every token as NN:

>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'

>>> tokens = nltk.word_tokenize(raw)

>>> default_tagger = nltk.DefaultTagger('NN')

>>> default_tagger.tag(tokens)

Of course, this tagger performs poorly in practice:

>>> default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028

The value of a default tagger is that it increases the robustness of a language processing system.
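The behavior of a default tagger can be sketched in a few lines of plain Python (a toy stand-in for nltk.DefaultTagger, not its actual implementation):

```python
class ToyDefaultTagger:
    """Assigns the same fixed tag to every token, like nltk.DefaultTagger."""
    def __init__(self, tag):
        self._tag = tag

    def tag(self, tokens):
        # Pair every token with the single fixed tag.
        return [(token, self._tag) for token in tokens]

tagger = ToyDefaultTagger('NN')
print(tagger.tag(['I', 'do', 'not', 'like', 'green', 'eggs']))
```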

(2) The Regular Expression Tagger

A tagger that assigns tags by matching regular expressions:

>>> patterns = [
...     (r'.*ing$', 'VBG'),               # gerunds
...     (r'.*ed$', 'VBD'),                # simple past
...     (r'.*es$', 'VBZ'),                # 3rd singular present
...     (r'.*ould$', 'MD'),               # modals
...     (r'.*\'s$', 'NN$'),               # possessive nouns
...     (r'.*s$', 'NNS'),                 # plural nouns
...     (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
...     (r'.*', 'NN')                     # nouns (default)
... ]

>>> regexp_tagger = nltk.RegexpTagger(patterns)

>>> regexp_tagger.tag(brown_sents[3])

Compared with the default tagger, the regular expression tagger does somewhat better:

>>> regexp_tagger.evaluate(brown_tagged_sents)
0.20326391789486245
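The matching logic can be sketched without NLTK: for each token, try the (pattern, tag) pairs in order and emit the tag of the first pattern that matches (the patterns are borrowed from the list above, slightly abridged):

```python
import re

PATTERNS = [
    (r'.*ing$', 'VBG'),                 # gerunds
    (r'.*ed$', 'VBD'),                  # simple past
    (r'.*es$', 'VBZ'),                  # 3rd singular present
    (r'.*ould$', 'MD'),                 # modals
    (r'.*s$', 'NNS'),                   # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
    (r'.*', 'NN'),                      # default: noun
]

def regexp_tag(tokens):
    # Return the tag of the first matching pattern for each token;
    # the final catch-all pattern guarantees a match.
    tagged = []
    for token in tokens:
        for pattern, tag in PATTERNS:
            if re.match(pattern, token):
                tagged.append((token, tag))
                break
    return tagged

print(regexp_tag(['running', 'jumped', '42', 'cat']))
# [('running', 'VBG'), ('jumped', 'VBD'), ('42', 'CD'), ('cat', 'NN')]
```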

(3) The Lookup Tagger

Let's find the 100 most frequent words and store their most likely tag. We can then use this information as the model for a "lookup tagger" (called UnigramTagger in NLTK):

>>> fd = nltk.FreqDist(brown.words(categories='news'))

>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

>>> most_freq_words = fd.keys()[:100]

>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)

>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)

>>> baseline_tagger.evaluate(brown_tagged_sents)

0.45578495136941344

This tagger performs better than the previous two.

We first consult the lookup table; if it cannot decide a token's tag, we fall back to the default tagger. This process is called backoff.

How is this implemented? Pass the default tagger as an argument to the lookup tagger:

>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags,
...                                      backoff=nltk.DefaultTagger('NN'))

(4) Evaluation

Evaluation of tools has always been a central theme in NLP. The "best" evaluation is judgment by human language experts; alternatively, we can evaluate against gold standard test data.

5. N-Gram Tagging

(1) Unigram Tagging

A unigram tagger is based on a simple statistical algorithm:

for each token, assign the tag that is most likely for that particular token

For example, the word frequent is always tagged JJ, because JJ is its most frequent tag.

A unigram tagger behaves much like a lookup tagger; the difference is how it is built. A unigram tagger is built through a training process: we pass tagged sentence data to the UnigramTagger constructor:

>>> from nltk.corpus import brown

>>> brown_tagged_sents = brown.tagged_sents(categories='news')

>>> brown_sents = brown.sents(categories='news')

>>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)

>>> unigram_tagger.tag(brown_sents[2007])

>>> unigram_tagger.evaluate(brown_tagged_sents)

0.9349006503968017
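The training process can be sketched in plain Python: count tag frequencies per word, then keep the most frequent tag for each word (a toy stand-in for nltk.UnigramTagger, ignoring backoff and smoothing):

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    # For each word, count how often it occurs with each tag,
    # then keep only the most frequent tag per word.
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def unigram_tag(model, tokens, default=None):
    # Look each token up in the model; unknown words get the default tag.
    return [(t, model.get(t, default)) for t in tokens]

model = train_unigram([[('the', 'AT'), ('dog', 'NN')],
                       [('the', 'AT'), ('bark', 'VB'), ('bark', 'NN'), ('bark', 'VB')]])
print(unigram_tag(model, ['the', 'bark']))   # [('the', 'AT'), ('bark', 'VB')]
```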

(2) Separating training and test data

The data set is typically split 90% for training and 10% for testing:

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]

(3) N-Gram Tagging

A unigram tagger considers only the current token. We can also look back at earlier tokens to decide how to tag the current one. This is the meaning of an n-gram tagger: it considers the current token together with the tags of the n-1 preceding tokens, so the context becomes larger.

A special case of the n-gram tagger is the bigram tagger:

>>> bigram_tagger = nltk.BigramTagger(train_sents)

>>> bigram_tagger.tag(brown_sents[2007])

>>> unseen_sent = brown_sents[4203]

>>> bigram_tagger.tag(unseen_sent)

>>> bigram_tagger.evaluate(test_sents)

0.10276088906608193

Clearly the bigram tagger performs very poorly on sentences it has not seen during training!

As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem and is quite pervasive in NLP. It creates a trade-off between the accuracy and the coverage of our results (the precision/recall trade-off).
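A plain-Python sketch makes the sparse data problem concrete: a bigram model keyed on (previous tag, word) returns nothing for unseen contexts, and the failure cascades because the next context then contains None (toy model, not NLTK's BigramTagger):

```python
from collections import Counter, defaultdict

def train_bigram(tagged_sents):
    # Count tags per (previous tag, word) context; keep the most frequent.
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = '<s>'
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: tags.most_common(1)[0][0] for ctx, tags in counts.items()}

def bigram_tag(model, tokens):
    tagged, prev = [], '<s>'
    for word in tokens:
        tag = model.get((prev, word))   # None for any unseen (prev, word) context
        tagged.append((word, tag))
        prev = tag
    return tagged

model = train_bigram([[('the', 'AT'), ('dog', 'NN'), ('barks', 'VBZ')]])
print(bigram_tag(model, ['the', 'dog', 'barks']))
print(bigram_tag(model, ['a', 'dog']))   # unseen first context: every tag is None
```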

(4) Combining Taggers

To balance accuracy and coverage, one approach is to use a more accurate algorithm, but this generally also requires the algorithm to have high coverage.

Another approach is to combine taggers:

1) First try to tag the token with a bigram tagger.

2) If the bigram tagger cannot find a tag, try the unigram tagger.

3) If the unigram tagger also fails, use the default tagger.

>>> t0 = nltk.DefaultTagger('NN')

>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)

>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)

>>> t2.evaluate(test_sents)

0.84491179108940495
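The t0/t1/t2 backoff chain can be sketched in plain Python with toy dictionary models (illustrative only; NLTK's real taggers are trained from corpora):

```python
class DefaultTagger:
    # Last resort: always answer with a fixed tag.
    def __init__(self, tag):
        self.default = tag
    def choose(self, word, prev):
        return self.default

class UnigramTagger:
    # Look the word up; on a miss, delegate to the backoff tagger.
    def __init__(self, model, backoff):
        self.model, self.backoff = model, backoff
    def choose(self, word, prev):
        return self.model.get(word) or self.backoff.choose(word, prev)

class BigramTagger:
    # Look up (previous tag, word); on a miss, delegate to the backoff tagger.
    def __init__(self, model, backoff):
        self.model, self.backoff = model, backoff
    def choose(self, word, prev):
        return self.model.get((prev, word)) or self.backoff.choose(word, prev)

t0 = DefaultTagger('NN')
t1 = UnigramTagger({'the': 'AT', 'dog': 'NN'}, backoff=t0)
t2 = BigramTagger({('AT', 'dog'): 'NN'}, backoff=t1)

print(t2.choose('dog', 'AT'))    # answered by the bigram model
print(t2.choose('the', None))    # falls back to the unigram model
print(t2.choose('zebra', 'AT'))  # unknown everywhere: default tag
```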

(5) Tagging Unknown Words

Unknown words can be handled with a regular-expression tagger or a default tagger as backoff.

A useful method to tag unknown words based on context is to limit the vocabulary of a tagger to the most frequent n words, and to replace every other word with a special word UNK. During training, a unigram tagger will probably learn that UNK is usually a noun. However, the n-gram taggers will detect contexts in which it has some other tag. For example, if the preceding word is to (tagged TO), then UNK will probably be tagged as a verb.
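The UNK preprocessing step can be sketched as follows (plain Python; the function name is hypothetical):

```python
from collections import Counter

def replace_rare(sents, n):
    # Keep the n most frequent words; replace every other word with 'UNK'.
    freq = Counter(w for sent in sents for w in sent)
    vocab = {w for w, _ in freq.most_common(n)}
    return [[w if w in vocab else 'UNK' for w in sent] for sent in sents]

sents = [['the', 'dog', 'barks'], ['the', 'cat', 'meows'], ['the', 'dog', 'runs']]
print(replace_rare(sents, 2))
```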

(6) Storing Taggers

Training is usually a lengthy process, so it is worth saving a trained tagger for later reuse:

>>> from cPickle import dump

>>> output = open('t2.pkl', 'wb')

>>> dump(t2, output, -1)

>>> output.close()

Loading works as follows:

>>> from cPickle import load

>>> input = open('t2.pkl', 'rb')

>>> tagger = load(input)

>>> input.close()

(7) Performance Limitations

1) We can examine how much ambiguity the tagger faces:

>>> cfd = nltk.ConditionalFreqDist(
...            ((x[1], y[1], z[0]), z[1])
...            for sent in brown_tagged_sents
...            for x, y, z in nltk.trigrams(sent))

>>> ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]

>>> sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()

0.049297702068029296

2) We can study the tagger's errors with a confusion matrix:

>>> test_tags = [tag for sent in brown.sents(categories='editorial')
...              for (word, tag) in t2.tag(sent)]

>>> gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]

>>> print nltk.ConfusionMatrix(gold_tags, test_tags)

Based on such analysis we may decide to modify the tagset.

6. Transformation-Based Tagging

An n-gram tagger's table size grows as n grows. Brill tagging is an alternative, inductive tagging method whose model is only a tiny fraction of the size of an n-gram tagger's.

Brill tagging is a kind of transformation-based learning. The basic idea is to guess the tag of each word, then go back and fix the mistakes. In this way a Brill tagger successively transforms a bad tagging into a better one.

As with n-gram tagging, this is a supervised learning method: we need annotated training data to judge whether the tagger's guesses are correct.

Brill tagging can be understood by analogy with painting. Suppose we are painting a tree in full detail (boughs, branches, twigs, leaves) against a sky-blue background. Rather than painting the tree first and then filling in blue everywhere else, we simply paint the whole canvas blue, then "correct" the tree section by repainting over the blue: begin with broad brush strokes, then fix up the details with successively finer changes.

As an example, consider tagging the following sentence:

The President said he will ask Congress to increase grants to states for vocational rehabilitation

We first tag it with a unigram tagger, then apply the following correction rules: (a) replace NN with VB when the previous word is TO; (b) replace TO with IN when the next tag is NNS. The process runs as follows:

Phrase   to  increase  grants  to  states  for  vocational  rehabilitation
Unigram  TO  NN        NNS     TO  NNS     IN   JJ          NN
Rule 1       VB
Rule 2                         IN
Output   TO  VB        NNS     IN  NNS     IN   JJ          NN
Gold     TO  VB        NNS     IN  NNS     IN   JJ          NN

The rules take the form of templates:

"replace T1 with T2 in the context C"

A typical context is the word itself or the tag of the preceding or following word. Each rule can also carry a score.
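Rule application can be sketched in plain Python; the rule format here is hypothetical, chosen to reproduce Rules 1 and 2 from the example above:

```python
def apply_rule(tagged, t1, t2, context):
    # Replace tag t1 with t2 wherever the context predicate holds.
    out = list(tagged)
    for i, (word, tag) in enumerate(out):
        if tag == t1 and context(out, i):
            out[i] = (word, t2)
    return out

# Rule 1: NN -> VB when the previous word is tagged TO
rule1 = lambda s, i: i > 0 and s[i - 1][1] == 'TO'
# Rule 2: TO -> IN when the next word is tagged NNS
rule2 = lambda s, i: i + 1 < len(s) and s[i + 1][1] == 'NNS'

tagged = [('to', 'TO'), ('increase', 'NN'), ('grants', 'NNS'),
          ('to', 'TO'), ('states', 'NNS')]
tagged = apply_rule(tagged, 'NN', 'VB', rule1)
tagged = apply_rule(tagged, 'TO', 'IN', rule2)
print(tagged)
```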

Brill taggers have another useful property: the rules are linguistically interpretable.

7. How to Determine a Word's Category

In linguistics, we use morphological, syntactic, and semantic clues to determine a word's category.

(1) Morphological Clues

The internal structure of a word may itself give clues to its category. For example, the suffix -ness turns an adjective into a noun (happy -> happiness); similarly for the suffixes -ment, -ing, and so on.
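A toy suffix-based guesser along these lines (the suffix-to-tag table is illustrative, not taken from NLTK):

```python
# Ordered suffix clues: longer, more specific suffixes come first.
SUFFIX_TAGS = [('ness', 'NN'), ('ment', 'NN'), ('ing', 'VBG'), ('ly', 'RB')]

def guess_tag(word, default='NN'):
    # Return the tag of the first matching suffix, else the default.
    for suffix, tag in SUFFIX_TAGS:
        if word.endswith(suffix):
            return tag
    return default

print(guess_tag('happiness'))   # 'NN'
print(guess_tag('running'))     # 'VBG'
```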

(2) Syntactic Clues

We can use information about the typical contexts in which a word of a given class occurs.

(3) Semantic Clues

A word's meaning is also useful evidence for its category. For example, a noun is defined as "the name of a person, place or thing".

(4) New words

(5) Morphology in Part of Speech Tagsets



Reposted from: https://www.cnblogs.com/webRobot/p/6068968.html
