Overview of Tokenization Algorithms in NLP
This article is an overview of tokenization algorithms, ranging from word-level through character-level to subword-level tokenization, with emphasis on BPE, Unigram LM, WordPiece and SentencePiece. It is meant to be readable by experts and beginners alike. If any concept or explanation is unclear, please contact me and I will be happy to clarify whatever is needed.
What is tokenization?
Tokenization is one of the first steps in NLP: the task of splitting a sequence of text into units with semantic meaning. These units are called tokens, and the difficulty in tokenization lies in how to get the ideal split, so that all the tokens in the text have the correct meaning and no tokens are left out.
In most languages, text is composed of words separated by whitespace, where individual words have a semantic meaning. We will see later what happens with languages that use symbols, where a single symbol can have a much more complex meaning than a word. For now we can work with English. As an example:
- Raw text: I ate a burger, and it was good.
- Tokenized text: ['I', 'ate', 'a', 'burger', ',', 'and', 'it', 'was', 'good', '.']
'Burger' is a type of food, 'and' is a conjunction, 'good' is a positive adjective, and so on. By tokenizing this way, each element has a meaning, and by joining the meanings of all the tokens we can understand the meaning of the whole sentence. The punctuation marks get their own tokens as well: the comma separates clauses and the period signals the end of the sentence. Here is an alternate tokenization:
- Tokenized text: ['I ate', 'a', 'bur', 'ger', 'an', 'd it', 'wa', 's good', '.']
For the multiword unit 'I ate', we can simply combine the meanings of 'I' and 'ate'; the subword units 'bur' and 'ger' have no meaning separately, but by joining them we arrive at the familiar word and can understand what it means.
But what do we do with 'd it'? What meaning does this have? As humans and speakers of English, we can deduce that 'it' is a pronoun and that the letter 'd' belongs to the previous word. But following this tokenization, the previous token 'an' already has a meaning in English, the article 'an', which is very different from 'and'. How do we deal with this? You might be thinking: stick with words, and give punctuation marks their own tokens. This is the most common way of tokenizing, called word level tokenization.
詞級標(biāo)記 (Word level tokenization)
It consists simply of splitting a sentence on whitespace and punctuation marks. There are plenty of Python libraries that do this, including NLTK, SpaCy, Keras and Gensim, or you can write a custom regex.
Splitting on whitespace can also break up an element that should be regarded as a single token, for example, New York. This is problematic, mostly with names, borrowed foreign phrases, and compounds that are sometimes written as multiple words.
What about words like 'don't', or contractions like 'John's'? Is it better to obtain the single token 'don't', or the pair 'do' and 'n't'? What if there is a typo in the text and 'burger' turns into 'birger'? We as humans can see that it was a typo, replace the word with 'burger' and continue, but machines can't. Should the typo affect the complete NLP pipeline?
Another drawback of word level tokenization is the huge vocabulary size it creates. Each token is saved into a token vocabulary, and if the vocabulary is built from all the unique words found in all the input text, it becomes enormous, which produces memory and performance problems later on. A current state-of-the-art deep learning architecture, Transformer XL, has a vocabulary size of 267,735. To solve the problem of the big vocabulary size, we can think of creating tokens from characters instead of words, which is called character level tokenization.
字符級標(biāo)記 (Character level tokenization)
First introduced by Karpathy in 2015, character level tokenization splits the text into characters instead of words; for example, 'smarter' becomes s-m-a-r-t-e-r. The vocabulary size is dramatically reduced to the number of characters in the language, 26 for English plus the special characters. Misspellings and rare words are handled better because they are broken down into characters, and these characters are already known in the vocabulary.
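As a minimal sketch (the toy text below is illustrative), character level tokenization and the vocabulary it induces look like this:

```python
text = "smarter models need smart data"  # hypothetical toy corpus

# Character level tokenization: every character, including spaces, becomes a token.
char_tokens = list(text)

# The vocabulary is just the set of distinct characters, far smaller than a word vocabulary.
vocab = sorted(set(char_tokens))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
ids = [char_to_id[ch] for ch in char_tokens]

print(len(char_tokens), len(vocab))  # the sequence is long, the vocabulary is tiny
```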
Tokenizing sequences at the character level has shown some impressive results. Radford et al. (2017) from OpenAI showed that character level models can capture the semantic properties of text. Kalchbrenner et al. (2016) from DeepMind and Lee et al. (2017) both demonstrated translation at the character level. These are particularly impressive results, as the task of translation captures the semantic understanding of the underlying text.
Reducing the vocabulary size comes with a tradeoff in sequence length. With each word split into all of its characters, the tokenized sequence is much longer than the initial text; the word 'smarter' alone turns into 7 different tokens. Additionally, the main goal of tokenization is not achieved, because characters, at least in English, have no semantic meaning on their own. Only when characters are joined together do they acquire a meaning. As a middle ground between word and character tokenization, subword tokenization produces subword units, smaller than words but bigger than single characters.
子字級標(biāo)記 (Subword level tokenization)
[Figure: example of subword tokenization]

Subword level tokenization leaves the most common words intact and decomposes rare words into meaningful subword units. If 'unfriendly' were labelled as a rare word, it would be decomposed into 'un-friend-ly', which are all meaningful units: 'un' signals negation, 'friend' is a noun, and 'ly' turns the word into an adverb. The challenge is how to make that segmentation: how do we get 'un-friend-ly' and not 'unfr-ien-dly'?
As of 2020, the state-of-the-art deep learning architectures, based on Transformers, use subword level tokenization. BERT produces the following tokenization for this example:
- Raw text: I have a new GPU.
- Tokenized text: ['i', 'have', 'a', 'new', 'gp', '##u', '.']
Words present in the vocabulary are tokenized as whole words, but 'GPU' is not found in the vocabulary and is treated as a rare word. The algorithm then decides to segment it into 'gp-u'. The ## before 'u' shows that this subword belongs to the same word as the previous subword. BPE, Unigram LM, WordPiece and SentencePiece are the most common subword tokenization algorithms. They will be explained only briefly because this is an introductory post; if you are interested in deeper descriptions, let me know and I will write more detailed posts for each of them.
BPE
Introduced by Sennrich et al. in 2015, BPE iteratively merges the most frequently occurring characters or character sequences. This is roughly how the algorithm works:

- Split the words of a corpus into characters and count the frequency of each word.
- Count the frequency of every adjacent pair of symbols.
- Merge the most frequent pair into a new symbol and add it to the vocabulary.
- Repeat the counting and merging until a fixed number of merge operations, or the desired vocabulary size, is reached.
BPE is a greedy and deterministic algorithm and cannot provide multiple segmentations. That is, for a given text, the tokenized output is always the same. A more detailed explanation of how BPE works will come in a later article, or you can also find it in many other articles.
Unigram LM
Unigram language modelling (Kudo, 2018) is based on the assumption that all subword occurrences are independent, so the probability of a subword sequence is the product of the occurrence probabilities of its subwords. These are the steps of the algorithm:

- Start with a large seed vocabulary, for instance all pretokenized words and their most frequent substrings.
- With the vocabulary fixed, optimize the subword occurrence probabilities using the EM algorithm.
- For each subword, compute how much the corpus likelihood would drop if that subword were removed.
- Discard the subwords with the smallest loss, always keeping single characters so that every word can still be segmented.
- Repeat the previous steps until the vocabulary shrinks to the desired size.
Kudo argues that the unigram LM model is more flexible than BPE because it is based on a probabilistic LM and can output multiple segmentations together with their probabilities. Instead of starting from a set of base symbols and learning merges with some rule, as BPE or WordPiece do, it starts from a large vocabulary (for instance, all pretokenized words and the most common substrings) that it reduces progressively.
單詞集 (WordPiece)
WordPiece (Schuster and Nakajima, 2012) was initially used to solve a Japanese and Korean voice search problem, and is currently known for being used in BERT, but the precise tokenization algorithm and/or code has not been made public. It is similar to BPE in many ways, except that it forms a new subword based on likelihood, not on the next highest-frequency pair. These are the steps of the algorithm:

- Initialize the vocabulary with all the characters in the training data.
- Build a language model on the training data using the current vocabulary.
- Generate a new subword unit by combining two units of the current vocabulary, choosing the combination that most increases the likelihood of the training data when added to the model.
- Repeat until the desired vocabulary size is reached or the likelihood increase falls below a threshold.
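A toy sketch of the merge criterion, using a commonly cited simplification in which the score of a pair is its frequency divided by the product of the frequencies of its parts (the corpus is hypothetical, and this is only an approximation of the unpublished original):

```python
from collections import Counter

# Hypothetical corpus, already split into characters, with '##' marking word-internal pieces.
words = [["h", "##u", "##g"], ["p", "##u", "##g"], ["h", "##u", "##g", "##s"]]
freqs = [10, 5, 4]

pair_counts, unit_counts = Counter(), Counter()
for word, f in zip(words, freqs):
    for unit in word:
        unit_counts[unit] += f
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += f

# WordPiece-style score: pair frequency normalized by the frequencies of its parts,
# so merges are picked by how much they help the likelihood, not by raw count alone.
scores = {p: c / (unit_counts[p[0]] * unit_counts[p[1]]) for p, c in pair_counts.items()}
best = max(scores, key=scores.get)
print(best, scores[best])  # the pair that would be merged first
```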
SentencePiece
All the tokenization methods so far have required some form of pretokenization, which constitutes a problem because not all languages use spaces to separate words, and some languages are written with symbols. SentencePiece is equipped to accept pretokenization for specific languages; you can find the open source software on GitHub. For example, XLM uses SentencePiece and adds specific pretokenizers for Chinese, Japanese and Thai.
SentencePiece is conceptually similar to BPE, but it does not use the greedy encoding strategy, achieving higher quality tokenization. SentencePiece sees ambiguity in character grouping as a source of regularization for the model during training. This makes training much slower, because there are more parameters to optimize, and it discouraged Google from using it in BERT; they opted for WordPiece instead.
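A short usage sketch with the open source sentencepiece package (assuming it is installed and that 'corpus.txt' is an existing plain-text training file; the file names and parameters are illustrative):

```python
import sentencepiece as spm

# Train a subword model directly on raw text, with no pretokenization required.
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # hypothetical raw training corpus
    model_prefix="example",   # writes example.model and example.vocab
    vocab_size=8000,
    model_type="unigram",     # "bpe" is also supported
)

# Load the trained model and tokenize a sentence into subword strings.
sp = spm.SentencePieceProcessor(model_file="example.model")
print(sp.encode("I have a new GPU.", out_type=str))
```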
結(jié)論 (Conclusion)
Historically, tokenization methods have evolved from the word level to the character level and, lately, to the subword level. This was a quick overview of tokenization methods; I hope the text is readable and understandable. Follow me on Twitter for more NLP information, or ask me any questions there :)
翻譯自: https://towardsdatascience.com/overview-of-nlp-tokenization-algorithms-c41a7d5ec4f9