Data Structures for N-Grams
The syntax of an ARPA n-gram file is as follows:
    \data\
    ngram 1=64000
    ngram 2=522530
    ngram 3=173445

    \1-grams:
    -5.24036 'cause -0.2084827
    -4.675221 'em -0.221857
    -4.989297 'n -0.05809768
    -5.365303 'til -0.1855581
    -2.111539 </s> 0.0
    -99 <s> -0.7736475
    -1.128404 <unk> -0.8049794
    -2.271447 a -0.6163939
    -5.174762 a's -0.03869072
    -3.384722 a. -0.1877073
    -5.789208 a.'s 0.0
    -6.000091 aachen 0.0
    -4.707208 aaron -0.2046838
    -5.580914 aaron's -0.06230035
    -5.789208 aarons -0.07077657
    -5.881973 aaronson -0.2173971

For a detailed explanation, see: The ARPA n-gram language model format
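As a rough illustration of how one line of the \1-grams: section breaks down into its fields (log probability, word, and an optional back-off weight), here is a minimal C sketch. The helper name parse_unigram_line and the MAX_WORD_LEN limit are assumptions made for this example; they do not come from the code described in this post.

```c
#include <stdio.h>

#define MAX_WORD_LEN 64  /* assumed upper bound on token length, for this sketch only */

/* Hypothetical helper: parse one line of a \1-grams: section.
 * Returns the number of fields read (3 with a back-off weight, 2 without),
 * or 0 if the line does not contain at least a log probability and a word. */
static int parse_unigram_line(const char *line, double *log_prob,
                              char word[MAX_WORD_LEN], double *log_bo)
{
    int n;

    *log_bo = 0.0;  /* the back-off weight may be absent; default to 0.0 */
    n = sscanf(line, "%lf %63s %lf", log_prob, word, log_bo);
    return n >= 2 ? n : 0;
}
```

Calling parse_unigram_line("-4.707208 aaron -0.2046838", &lp, word, &bo) should fill lp with the log probability, word with "aaron", and bo with the back-off weight. N-grams of higher order would need one word field per position, which this sketch does not handle.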
A complete ARPA LM is made up of many n-gram entries. The data structures for a single entry and for the whole model are described in turn below.
1. The n-gram data structure
The data structure for an n-gram is as follows:
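    typedef struct {
        real log_prob;  /* log probability of the n-gram */
        real log_bo;    /* log back-off weight of the n-gram */
        int *words;     /* indices of the words in the n-gram */
    } ARPALMEntry;

words: the words that make up this n-gram. A 1-gram has only one word; for a 2-gram, words holds the indices of both words.
log_bo: the back-off weight of the n-gram (the last column of an ARPA line).
log_prob: the log probability of the n-gram (the first column of an ARPA line).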
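To show how log_prob and log_bo work together, the sketch below scores a bigram with the standard back-off rule: if the bigram (w1, w2) is listed, use its log_prob; otherwise add the back-off weight of w1 to the unigram log probability of w2. Several things here are assumptions for illustration: `real` is taken to be `float`, words[] is assumed to hold word indices in file order, the unigram array is assumed to be indexed by word ID, and the helpers find_bigram and bigram_log_prob are hypothetical names. A real decoder would use a faster lookup than a linear scan.

```c
#include <stddef.h>

typedef float real;  /* assumption: the post does not show how `real` is defined */

typedef struct {
    real log_prob;  /* log probability of the n-gram */
    real log_bo;    /* log back-off weight of the n-gram */
    int *words;     /* indices of the words in the n-gram */
} ARPALMEntry;

/* Hypothetical helper: linear search for the bigram (w1, w2).
 * Returns NULL when the bigram is not listed. */
static const ARPALMEntry *find_bigram(const ARPALMEntry *bigrams, int n_bigrams,
                                      int w1, int w2)
{
    for (int i = 0; i < n_bigrams; i++) {
        if (bigrams[i].words[0] == w1 && bigrams[i].words[1] == w2)
            return &bigrams[i];
    }
    return NULL;
}

/* Log probability of w2 given w1 under the back-off rule.
 * Assumes unigrams[] is indexed directly by word ID. */
static real bigram_log_prob(const ARPALMEntry *unigrams,
                            const ARPALMEntry *bigrams, int n_bigrams,
                            int w1, int w2)
{
    const ARPALMEntry *e = find_bigram(bigrams, n_bigrams, w1, w2);

    if (e != NULL)
        return e->log_prob;                          /* bigram is listed */
    return unigrams[w1].log_bo + unigrams[w2].log_prob;  /* back off */
}
```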
2. The ARPA-LM data structure
The data structure for the whole n-gram language model, made up of many such entries, has the following members:
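vocab: pointer to the vocabulary used to build the language model. For the vocabulary definition, see: In-memory storage model for the vocabulary
entries: all n-gram entries of the language model, as a two-dimensional array of ARPALMEntry. entries[0] stores the 1-grams, entries[1] stores the 2-grams, and so on.
n_ngrams: an integer array holding, in order, the number of 1-gram, 2-gram, 3-gram, ... entries.
unk_wrd: the word that stands for vocabulary words that are allowed to be absent from the language model.
unk_id: the ID of that word; it is set to the index of the last word in the vocabulary.
n_unk_words: counted after the language model has been read: the number of words that are in the vocabulary but were not used to build the language model. If unk_wrd is not specified, such words are not allowed, which means every word in the vocabulary must be covered by the language model.
unk_words: stores the indices of the words counted by n_unk_words.
words_in_lm: flags, for each word in the vocabulary, whether it appears in the language model.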
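The member descriptions above can be pulled together into a struct sketch. This is a reconstruction from the descriptions only: the member types, the placeholder Vocab type, the `float` definition of `real`, and the helper collect_unk_words are all assumptions made for illustration, not the original declaration. The helper shows one plausible way n_unk_words and unk_words could be derived from words_in_lm after the model has been read.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef float real;  /* assumption: definition of `real` is not shown in the post */

typedef struct {
    real log_prob;
    real log_bo;
    int *words;
} ARPALMEntry;

/* Hypothetical placeholder for the vocabulary type. */
typedef struct {
    int n_words;
    char **words;
} Vocab;

/* Sketch of the model struct; member types are guessed from the descriptions. */
typedef struct {
    Vocab *vocab;           /* vocabulary used to build the model */
    ARPALMEntry **entries;  /* entries[0]: 1-grams, entries[1]: 2-grams, ... */
    int *n_ngrams;          /* number of entries per order */
    char *unk_wrd;          /* unknown-word token, or NULL if not specified */
    int unk_id;             /* ID of the unknown-word token */
    int n_unk_words;        /* vocabulary words missing from the model */
    int *unk_words;         /* indices of those missing words */
    bool *words_in_lm;      /* per vocabulary word: seen in the model? */
} ARPALM;

/* Hypothetical helper: fill n_unk_words and unk_words from words_in_lm.
 * Returns 0 on success, -1 if words are missing although no unk_wrd was
 * specified, or if allocation fails. */
static int collect_unk_words(ARPALM *lm)
{
    int n_words = lm->vocab->n_words;
    int count = 0;

    for (int i = 0; i < n_words; i++) {
        if (!lm->words_in_lm[i])
            count++;
    }
    if (count > 0 && lm->unk_wrd == NULL)
        return -1;  /* every vocabulary word must then be in the model */

    lm->n_unk_words = count;
    lm->unk_words = NULL;
    if (count == 0)
        return 0;

    lm->unk_words = malloc((size_t)count * sizeof *lm->unk_words);
    if (lm->unk_words == NULL)
        return -1;

    for (int i = 0, k = 0; i < n_words; i++) {
        if (!lm->words_in_lm[i])
            lm->unk_words[k++] = i;
    }
    return 0;
}
```

The caller owns lm->unk_words and should release it with free() when the model is destroyed.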
Reposted from: https://www.cnblogs.com/jonky/p/10154115.html