NLP - sentencepiece

Published: 2023/12/16

Contents

- I. About sentencepiece
- II. Installation
  - 1. Python module
  - 2. Building and installing the SentencePiece command-line tools from C++ source
  - 3. Building and installing with vcpkg
  - 4. Downloading and installing SentencePiece from signed release wheels
- III. Command-line usage
  - 1. Training a model
  - 2. Encoding raw text into sentence pieces/ids
  - 3. Decoding sentence pieces/ids into raw text
  - 4. End-to-end example
  - 5. Exporting the vocabulary list
  - 6. Redefining special meta tokens
  - 7. Vocabulary restriction
- IV. Python usage

I. About sentencepiece

github : https://github.com/google/sentencepiece
Paper, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing": https://aclanthology.org/D18-2012.pdf

Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training.

SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]) with the extension of direct training from raw sentences.

SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

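In short: a character sequence that recurs often enough is treated as a single unit.
The resulting units can be coarser-grained than those produced by conventional word segmentation.
Training relies mainly on statistical indicators, such as frequency of occurrence and left/right connectivity, together with perplexity, to arrive at the final vocabulary.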
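
The idea that frequently recurring sequences become single units can be sketched with a toy BPE-style merge loop. This is only an illustration under simplifying assumptions: the corpus, the function names and the word-level pre-splitting are hypothetical, and SentencePiece itself trains on raw sentences with a more elaborate procedure.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest weighted count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties are resolved in favour of the pair seen first.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical toy corpus: word (as a tuple of characters) -> frequency.
corpus = {tuple("lower"): 5, tuple("low"): 7, tuple("newest"): 3}
merges = []
for _ in range(2):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)
print(corpus)
```

After two merges the frequent sequence "low" has become a single symbol, while the rarer word "newest" is still spelled out character by character.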

Related tutorial:

燭之文, "sentencepiece: principles and practice"
https://www.jianshu.com/p/d36c3e06fb98

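II. Installation

SentencePiece has two parts: training a model and using a model.
The training part is implemented in C++ and can be compiled into a binary executable; training produces a model file and a vocabulary file.
The usage part can be driven either through the binary programs or from Python. The vocabulary file produced by training is plain text and editable, so it can be read and used from any language.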

1. Python module

SentencePiece provides a Python wrapper that supports both training and segmentation.
You can install the Python binary package with:

% pip install sentencepiece

For more detail, see the Python module documentation in the GitHub repository.

2. Building and installing the SentencePiece command-line tools from C++ source

The following tools and libraries are required:

cmake
C++11 compiler
gperftools library (optional, 10-40% performance improvement can be obtained.)

On Ubuntu, the build tools can be installed with apt-get:

sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

Then build and install the command-line tools as follows:

% git clone https://github.com/google/sentencepiece.git
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v

On macOS, replace the last line with sudo update_dyld_shared_cache.

3. Building and installing with vcpkg

vcpkg : https://github.com/Microsoft/vcpkg

You can download and install sentencepiece using the vcpkg dependency manager:

% git clone https://github.com/Microsoft/vcpkg.git
% cd vcpkg
% ./bootstrap-vcpkg.sh
% ./vcpkg integrate install
% ./vcpkg install sentencepiece

The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors.
If the version is out of date, please create an issue or pull request on the vcpkg repository: https://github.com/Microsoft/vcpkg

4. Downloading and installing SentencePiece from signed release wheels

You can download the wheel from the GitHub releases page: https://github.com/google/sentencepiece/releases/latest

SLSA3 signatures are generated during the release process using OpenSSF's slsa-framework/slsa-github-generator:
https://github.com/slsa-framework/slsa-github-generator

To verify a release binary:
1. Install the verification tool: https://github.com/slsa-framework/slsa-verifier#installation
2. Download the provenance file attestation.intoto.jsonl from https://github.com/google/sentencepiece/releases/latest
3. Run the verifier, then install the verified wheel:

% slsa-verifier -artifact-path <the-wheel> -provenance attestation.intoto.jsonl -source github.com/google/sentencepiece -tag <the-tag>
% pip install wheel_file.whl

III. Command-line usage

1. Training a model

Training syntax:

spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>

For example:

spm_train --input='../corpus.txt' --model_prefix='../mypiece' --vocab_size=8000 --character_coverage=1 --model_type='bpe'

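Parameters

--input: the raw corpus file to train on. No tokenization, normalization or other preprocessing is needed;
by default, SentencePiece normalizes the input with Unicode NFKC.
Multiple files can be passed as a comma-separated list.
--model_prefix: output model name prefix.
Two files are generated: <model_name>.model and <model_name>.vocab (the vocabulary).
--vocab_size: size of the trained vocabulary, e.g. 8000, 16000, or 32000.
Larger values make training slower. The author reports that values that are too small (<4000) may fail to train on some corpora, although the end-to-end example further down uses 1000.
--character_coverage: the fraction of characters covered by the model. For languages with a rich character set, such as Chinese or Japanese, 0.9995 is a good value; for languages with a small character set, use 1.0.
--model_type: the model type. Choose from unigram (default), bpe, char, or word. With the word type, the input sentences must be pretokenized.

Other options:

--max_sentence_length: maximum sentence length, default 4192. The length is measured in bytes, so a Chinese character encoded in UTF-8 typically counts as 3.
--max_sentencepiece_length: maximum length of a sentence piece, default 16.
--seed_sentencepiece_size: size of the seed sentence piece set, default 1,000,000.
--num_threads: number of threads, default 16.
--use_all_vocab: use all tokens as the vocabulary; only valid for word/char models.
--input_sentence_size: maximum number of sentences the trainer loads, default 0.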
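
To make --character_coverage more concrete, here is a minimal sketch of one plausible reading of it: keep the most frequent characters until they account for the requested fraction of all character occurrences. The function name and the sample lines are hypothetical, and the exact procedure inside SentencePiece may differ.

```python
from collections import Counter


def covered_characters(lines, coverage):
    """Return the most frequent characters whose cumulative share of all
    (non-whitespace) character occurrences reaches `coverage`."""
    counts = Counter(ch for line in lines for ch in line if not ch.isspace())
    total = sum(counts.values())
    kept, running = [], 0
    for ch, n in counts.most_common():
        if total == 0 or running / total >= coverage:
            break
        kept.append(ch)
        running += n
    return kept


sample = ["aaaa", "aaab", "aabc"]  # 'a' x9, 'b' x2, 'c' x1
print(covered_characters(sample, 0.9))  # the rare character 'c' is dropped
print(covered_characters(sample, 1.0))  # every character is kept
```

Under this reading, a coverage below 1.0 lets very rare characters fall outside the vocabulary, which matches the advice to use 0.9995 for languages with large character sets.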

2. Encoding raw text into sentence pieces/ids

% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output

Use the --extra_options flag to insert the BOS/EOS markers or to reverse the input sequence.

% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)
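
The effect of these options on a piece sequence can be imitated in a few lines of Python. This is a sketch, not the tool's implementation: the function name is hypothetical, and applying the colon-separated options from left to right is an assumption based on the examples above.

```python
def apply_extra_options(pieces, extra_options):
    """Imitate --extra_options on an already segmented piece list."""
    result = list(pieces)
    for option in filter(None, extra_options.split(":")):
        if option == "reverse":
            result.reverse()
        elif option == "bos":
            result.insert(0, "<s>")      # beginning-of-sentence marker
        elif option == "eos":
            result.append("</s>")        # end-of-sentence marker
        else:
            raise ValueError(f"unsupported option: {option}")
    return result


pieces = ["▁I", "▁saw", "▁a", "▁girl"]
print(apply_extra_options(pieces, "eos"))
print(apply_extra_options(pieces, "reverse:bos:eos"))
```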

SentencePiece supports n-best segmentation and segmentation sampling through the --output_format=(nbest|sample)_(piece|id) flags.

% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output
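
As a rough illustration of what sampling with --alpha means, the sketch below draws one segmentation from a list of candidates with probability proportional to P(segmentation)^alpha, following the subword-regularization idea. The candidate list, the log probabilities and the function names are hypothetical; how spm_encode samples internally is not shown in this article.

```python
import math
import random


def sampling_weights(log_probs, alpha):
    """Normalize exp(alpha * log_prob) over the candidates.
    alpha = 0 gives a uniform distribution; larger alpha favours the best one."""
    scaled = [alpha * lp for lp in log_probs]
    peak = max(scaled)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_segmentation(candidates, log_probs, alpha, rng):
    """Draw one candidate segmentation according to the smoothed weights."""
    weights = sampling_weights(log_probs, alpha)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical n-best list for the word "telescope".
candidates = [["▁tele", "scope"], ["▁te", "le", "scope"], ["▁t", "e", "le", "s", "c", "o", "pe"]]
log_probs = [-4.0, -6.0, -12.0]
rng = random.Random(0)                       # seeded for repeatability
print(sampling_weights(log_probs, 0.5))
print(sample_segmentation(candidates, log_probs, 0.5, rng))
```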

3. Decoding sentence pieces/ids into raw text

% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output

Use the --extra_options flag to decode text that was encoded in reverse order.

spm_decode --extra_options=reverse < input > output

4. End-to-end example

% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.

You can find that the original input sentence is restored from the vocabulary id sequence.
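
The round trip works because whitespace is kept inside the pieces as the meta symbol ▁ (U+2581). A minimal sketch of piece-level detokenization, assuming that a single leading space comes from the prefix added at encoding time (the function name is hypothetical):

```python
META = "\u2581"  # the "▁" symbol that stands for a space


def detokenize(pieces):
    """Concatenate pieces and turn the meta symbol back into spaces."""
    text = "".join(pieces).replace(META, " ")
    # Drop the single space that precedes the first word.
    return text[1:] if text.startswith(" ") else text


pieces = ["▁I", "▁saw", "▁a", "▁girl", "▁with", "▁a", "▁",
          "te", "le", "s", "c", "o", "pe", "."]
print(detokenize(pieces))
```

No language-specific rules are involved, which is what makes the segmentation reversible.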

5. Exporting the vocabulary list

spm_export_vocab --model=<model_file> --output=<output file>

<output file> stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.
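
Because the exported file is plain text, it can be loaded with a few lines of Python. The sketch below assumes one tab-separated "piece, log probability" pair per line and ids that start at 0 on the first line; both the layout details and the sample content are assumptions to check against your own exported file.

```python
import io


def load_vocab(lines):
    """Map each piece to its id (the 0-based line number) and its score."""
    piece_to_id, piece_to_score = {}, {}
    for line_no, line in enumerate(lines):
        piece, _, score = line.rstrip("\n").partition("\t")
        piece_to_id[piece] = line_no
        piece_to_score[piece] = float(score) if score else 0.0
    return piece_to_id, piece_to_score


# Hypothetical file content; in practice pass open(path, encoding="utf-8").
sample = io.StringIO("<unk>\t0\n<s>\t0\n</s>\t0\n▁a\t-2.5\n▁the\t-3.1\n")
ids, scores = load_vocab(sample)
print(ids["▁a"], scores["▁a"])
```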

6. Redefining special meta tokens

By default, SentencePiece uses the Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens with ids 0, 1 and 2 respectively.
These ids can be redefined at training time:

spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...

Setting an id to -1, e.g. bos_id=-1, disables that special token. Note that the unknown id cannot be disabled. An id for padding (<pad>) can be defined with --pad_id=3.

To define ids for other special tokens, see "Use custom symbols":

https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
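
The rules above can be checked before launching a training run. The helper below is a hypothetical sketch: it encodes the two rules stated here (-1 disables a token, the unknown id cannot be disabled) plus one extra assumption, namely that enabled tokens need distinct ids.

```python
def validate_special_ids(unk_id=0, bos_id=1, eos_id=2, pad_id=-1):
    """Return the enabled special tokens as {name: id}, or raise ValueError."""
    if unk_id < 0:
        raise ValueError("the unknown id cannot be disabled")
    ids = {"unk": unk_id, "bos": bos_id, "eos": eos_id, "pad": pad_id}
    enabled = {name: value for name, value in ids.items() if value >= 0}
    # Assumption: two enabled tokens must not share the same id.
    if len(set(enabled.values())) != len(enabled):
        raise ValueError("enabled special tokens must have distinct ids")
    return enabled


def to_flags(unk_id=0, bos_id=1, eos_id=2, pad_id=-1):
    """Build the matching spm_train flags after validation."""
    validate_special_ids(unk_id, bos_id, eos_id, pad_id)
    return [f"--unk_id={unk_id}", f"--bos_id={bos_id}",
            f"--eos_id={eos_id}", f"--pad_id={pad_id}"]


print(validate_special_ids(unk_id=5, bos_id=0, eos_id=1, pad_id=3))
print(to_flags(unk_id=5, bos_id=0, eos_id=1, pad_id=3))
```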

7. Vocabulary restriction

spm_encode accepts a --vocabulary and a --vocabulary_threshold option so that it only produces symbols that also appear in the vocabulary (with at least the given frequency). The technique is described on the subword-nmt page, and the usage is basically the same as in subword-nmt.

Assuming that L1 and L2 are the two languages (source/target), train a shared spm model and obtain the resulting vocabulary for each language:

% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2

The shuffle command is used just in case, because spm_train loads the first 10M lines of the corpus by default.

Then segment the train/test corpus with the --vocabulary option:

% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
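
The effect of the threshold can be pictured with a small filter. This sketch assumes the generated vocabulary file holds one tab-separated "piece, frequency" pair per line, and it falls back to single characters for pieces below the threshold; the real tool re-segments such pieces with the model, so the fallback here is a deliberate simplification. All names and sample data are hypothetical.

```python
def load_allowed_pieces(lines, threshold):
    """Keep the pieces whose frequency is at least `threshold`."""
    allowed = set()
    for line in lines:
        piece, _, freq = line.rstrip("\n").partition("\t")
        if piece and freq and int(freq) >= threshold:
            allowed.add(piece)
    return allowed


def restrict(pieces, allowed):
    """Keep allowed pieces; split the others into single characters."""
    out = []
    for piece in pieces:
        if piece in allowed:
            out.append(piece)
        else:
            out.extend(piece)  # simplification: character-level fallback
    return out


vocab_lines = ["▁the\t120\n", "▁tele\t3\n", "scope\t60\n"]
allowed = load_allowed_pieces(vocab_lines, threshold=50)
print(restrict(["▁the", "▁tele", "scope"], allowed))
```

Pieces that are rare in one language are broken into smaller units, so the downstream model does not see symbols it met only a handful of times during training.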

IV. Python usage

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
text = "食材上不會有這樣的糾結"  # sample Chinese sentence to segment
sp.Load("/tmp/test.model")  # path to a previously trained model
print(sp.EncodeAsPieces(text))
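
Training can also be driven from Python instead of spm_train. The sketch below mirrors the command-line flags from section III with keyword arguments; the helper names, file paths and default values are illustrative, and sentencepiece is imported lazily so that the argument builder can be used without the package installed.

```python
def build_train_kwargs(corpus_path, model_prefix, vocab_size=8000,
                       model_type="unigram", character_coverage=1.0):
    """Collect the training options that correspond to the spm_train flags."""
    return {
        "input": corpus_path,
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
        "model_type": model_type,
        "character_coverage": character_coverage,
    }


def train_and_round_trip(corpus_path, model_prefix, text, **overrides):
    """Train a model, then encode `text` to pieces and ids and decode it back."""
    import sentencepiece as spm  # requires: pip install sentencepiece

    spm.SentencePieceTrainer.train(
        **build_train_kwargs(corpus_path, model_prefix, **overrides))
    sp = spm.SentencePieceProcessor(model_file=f"{model_prefix}.model")
    pieces = sp.encode(text, out_type=str)
    ids = sp.encode(text, out_type=int)
    return pieces, ids, sp.decode(ids)


print(build_train_kwargs("corpus.txt", "mypiece", model_type="bpe"))
```

Calling train_and_round_trip("corpus.txt", "mypiece", "some text", vocab_size=1000) would write mypiece.model and mypiece.vocab to the working directory, assuming corpus.txt exists and is large enough for the requested vocabulary size.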

伊織, 2022-11-01 (the weather has turned cold for the second time)
