

SentencePiece, subword-nmt, and the BPE algorithm

Published: 2023/12/16

BPE (Byte Pair Encoding) was applied to machine translation in 2016 to address the out-of-vocabulary (OOV) and rare-word problems. Paper: "Neural Machine Translation of Rare Words with Subword Units", ACL 2016.

http://www.sohu.com/a/115373230_465975

tensor2tensor uses BPE; the relevant files are:

data_generators/problem.py

data_generators/translate_ende.py

BPE implementations:

1. Reference: https://plmsmile.github.io/2017/10/19/subword-units/

```python
import re

def process_raw_words(words, endtag='-'):
    """Split each word into its smallest symbols and append an end-of-word tag."""
    vocabs = {}
    for word, count in words.items():
        # insert a space before every letter
        word = re.sub(r'([a-zA-Z])', r' \1', word)
        word += ' ' + endtag
        vocabs[word] = count
    return vocabs

def get_symbol_pairs(vocabs):
    """Collect every adjacent symbol pair (length 2) and count its occurrences.
    Args:
        vocabs: dict of (word, count); each word is already split into symbols.
    Returns:
        pairs: dict of ((symbol1, symbol2), count).
    """
    pairs = dict()
    for word, freq in vocabs.items():
        # symbols inside the word
        symbols = word.split()
        for i in range(len(symbols) - 1):
            p = (symbols[i], symbols[i + 1])
            pairs[p] = pairs.get(p, 0) + freq
    return pairs

def merge_symbols(symbol_pair, vocabs):
    """Replace every occurrence of 'a b' with 'ab' in all words of vocabs.
    Args:
        symbol_pair: (a, b), the two symbols to merge.
        vocabs: dict of (word, count); each word is a space-separated subword string.
    Returns:
        vocabs_new: the vocabulary with 'a b' merged into 'ab'.
    """
    vocabs_new = {}
    raw = ' '.join(symbol_pair)
    merged = ''.join(symbol_pair)
    # escape non-alphanumeric characters
    bigram = re.escape(raw)
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, count in vocabs.items():
        word_new = p.sub(merged, word)
        vocabs_new[word_new] = count
    return vocabs_new

raw_words = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocabs = process_raw_words(raw_words)

num_merges = 10
print(vocabs)
for i in range(num_merges):
    pairs = get_symbol_pairs(vocabs)
    # pick the most frequent pair
    symbol_pair = max(pairs, key=pairs.get)
    vocabs = merge_symbols(symbol_pair, vocabs)
print(vocabs)
```

Output:

```
Before: {"low": 5, "lower": 2, "newest": 6, "widest": 3}
After BPE: {' low-': 5, ' low e r -': 2, ' newest-': 6, ' wi d est-': 3}
```

{"low": 5, "lower": 2, "newest": 6, "widest": 3} gives the original frequency of each word. In the final output, splitting on spaces yields the subword units, which can serve as modeling units; here they are low, e, r, newest, wi, d, est. Any output text can then be mapped through these modeling units into a sequence of unit ids.
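The point of the learned merge list is that it can also segment words never seen in training. A self-contained sketch (the function names `learn_bpe` and `segment` are mine, not from the post) that re-learns the toy merges and then replays them in order on a new word:

```python
import re, collections

def learn_bpe(word_freqs, num_merges=10, end='-'):
    """Learn BPE merges on a toy vocabulary; return the ordered merge list."""
    vocab = {' '.join(w) + ' ' + end: c for w, c in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = collections.defaultdict(int)
        for word, freq in vocab.items():
            syms = word.split()
            for a, b in zip(syms, syms[1:]):
                pairs[a, b] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(best)) + r'(?!\S)')
        vocab = {pattern.sub(''.join(best), w): c for w, c in vocab.items()}
    return merges

def segment(word, merges, end='-'):
    """Split a new word into subwords by applying the merges in learned order."""
    syms = list(word) + [end]
    for a, b in merges:
        i = 0
        while i < len(syms) - 1:
            if syms[i] == a and syms[i + 1] == b:
                syms[i:i + 2] = [a + b]
            else:
                i += 1
    return syms

merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3})
print(segment("lowest", merges))   # → ['low', 'est-']
```

"lowest" never appears in the training words, yet it comes out as two known units, which is exactly how BPE sidesteps the OOV problem.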


2. Reference: "Neural Machine Translation of Rare Words with Subword Units"

Paper walkthrough: http://www.sohu.com/a/115373230_465975

```python
import re, collections

def get_stats(vocab):
    """Count the frequency of every adjacent symbol pair."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Merge the chosen pair in every word of the vocabulary."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

# '</w>' is the end-of-word symbol
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10

for i in range(num_merges):
    pairs = get_stats(vocab)
    # pick the most frequent pair
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print("best ", best)
    print("vocab ", vocab)
```

Personally, I still find SentencePiece the most convenient tool for subword segmentation.

SentencePiece

Reference: https://github.com/google/sentencepiece/tree/master/python

Segmenting into 20k label ids:

```python
import sentencepiece as spm

# train a 20k-piece BPE model
spm.SentencePieceTrainer.Train(
    '--input=/data/yelong/bpe_test/lib.txt '
    '--model_prefix=/data/yelong/bpe_test/bpe '
    '--vocab_size=20000 --model_type=bpe')

sp = spm.SentencePieceProcessor()
sp.Load("/data/yelong/bpe_test/bpe.model")

with open('/data/yelong/bpe_test/wav/train/text.txt', 'a') as fid, \
        open('/data/yelong/bpe_test/wav/train/train.txt') as did:
    for line in did:
        a = line.strip().split()[1:]   # e.g. "TWO COME MUSE MIGRATE"
        aa = ' '.join(a)
        listid = sp.EncodeAsIds(aa)
        strid = ' '.join(str(t) for t in listid)
        b = line.strip().split()[0]    # utterance id
        fid.write(b + ' ' + strid + '\n')
```
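The write loop above simply turns each `utt-id WORD WORD …` line into `utt-id id id …`. A minimal sketch of that transformation with a stand-in encoder in place of `sp.EncodeAsIds` (the fake word-to-id table is made up for illustration; a real model encodes subword pieces, not whole words):

```python
def rewrite_line(line, encode_ids):
    """Replace the words after the utterance id with their label ids."""
    utt_id, *words = line.strip().split()
    ids = encode_ids(' '.join(words))
    return utt_id + ' ' + ' '.join(str(i) for i in ids)

# stand-in for sp.EncodeAsIds: a fake word-to-id lookup
fake_ids = {'TWO': 11, 'COME': 42, 'MUSE': 7, 'MIGRATE': 305}
encode = lambda s: [fake_ids[w] for w in s.split()]

print(rewrite_line('utt001 TWO COME MUSE MIGRATE', encode))
# → utt001 11 42 7 305
```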

Training produces two files, bpe.model and bpe.vocab.

bpe.vocab:

```
<unk>	0
<s>	0
</s>	0
▁T	-0
HE	-1
▁A	-2
▁THE	-3
IN	-4
▁S	-5
▁W	-6
```

This is a piece-to-score mapping; the right-hand column is not an id. model_type can be unigram (the default), bpe, char, or word, and with unigram the right-hand column comes out as a float, so it cannot be an id.

So in the nabu config, the alphabet should not have been written as 0-19996 (19996 is just the number on the last line of bpe.vocab); it should be 0-19999.
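A quick way to see why: in a SentencePiece .vocab file, the id of a piece is its 0-based line number, not the score printed in the second column. A sketch using a few sample lines (hard-coded here so it runs standalone):

```python
# a few lines in the format of bpe.vocab: piece <TAB> score
sample_vocab = """<unk>\t0
<s>\t0
</s>\t0
▁T\t-0
HE\t-1
▁A\t-2"""

# id = 0-based line number; the second column is a score, not the id
id_to_piece = {i: line.split('\t')[0]
               for i, line in enumerate(sample_vocab.splitlines())}
print(id_to_piece[4])    # → HE
print(len(id_to_piece))  # 6 pieces → valid ids are 0..5
```

By the same logic, a 20000-line bpe.vocab covers ids 0 through 19999.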

Verified: every id from 0 to 19999 has a corresponding piece. Verification method:

```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("/data/yelong/bpe_test/bpe.model")
>>> for i in range(20000):
...     sp.IdToPiece(i)
```

Every id prints a piece. (An invalid id would raise an error and exit.)
