The Illustrated Transformer (Translation)

In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud’s recommendation to use The Transformer as a reference model to use their Cloud TPU offering. So let’s try to break the model apart and look at how it functions.

The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard’s NLP group created a guide annotating the paper with PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one to hopefully make it easier to understand for people without in-depth knowledge of the subject matter.

A High-Level Look

Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.

Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.

The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:

The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
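
To make "independently applied to each position" concrete, here is a minimal NumPy sketch of a position-wise feed-forward layer. The 512 model size, the 2048 hidden size, and the ReLU come from the paper rather than from this post, and the random weights are placeholders for learned parameters.

```python
import numpy as np

d_model, d_ff, seq_len = 512, 2048, 3   # sizes from the paper; seq_len is arbitrary

# Toy weights; in a real model these are learned during training.
W1 = np.random.randn(d_model, d_ff) * 0.02
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * 0.02
b2 = np.zeros(d_model)

def feed_forward(x):
    """Apply the same two-layer network to every position (row) of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between the two projections

x = np.random.randn(seq_len, d_model)   # one row per position
out = feed_forward(x)                   # each row is processed independently
print(out.shape)                        # (3, 512)
```

Because each row goes through the same two matrix multiplications with no interaction between rows, these per-position paths can run in parallel.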

The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).

Bringing The Tensors Into The Picture

Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.

As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.

Each word is embedded into a vector of size 512. We'll represent those vectors with these simple boxes.

The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512 – in the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. The size of this list is a hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.
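
As a rough illustration of this lookup step, here is a toy sketch; the three-word vocabulary and the random embedding table are invented, and only the 512-dimensional vector size comes from the text.

```python
import numpy as np

vocab = {"je": 0, "suis": 1, "étudiant": 2}              # toy vocabulary
d_model = 512
embedding_table = np.random.randn(len(vocab), d_model)   # learned in a real model

sentence = ["je", "suis", "étudiant"]
x = np.stack([embedding_table[vocab[w]] for w in sentence])  # one 512-d row per word
print(x.shape)  # (3, 512)
```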

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.

Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.

Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.

Now We’re Encoding!

As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.

The word at each position passes through a self-attention process. Then, they each pass through a feed-forward neural network -- the exact same network with each vector flowing through it separately.

Self-Attention at a High Level

Don’t be fooled by me throwing around the word “self-attention” like it’s a concept everyone should be familiar with. I had personally never come across the concept until reading the Attention is All You Need paper. Let us distill how it works.

Say the following sentence is an input sentence we want to translate:

“The animal didn't cross the street because it was too tired”

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.

When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.

As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.

As we are encoding the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism was focusing on "The Animal", and baked a part of its representation into the encoding of "it".

Be sure to check out the Tensor2Tensor notebook where you can load a Transformer model, and examine it using this interactive visualization.

Self-Attention in Detail

Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented – using matrices.

The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller, this is an architecture choice to make the computation of multiheaded attention (mostly) constant.

Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
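
Here is a hedged sketch of that first step for a single word vector; WQ, WK, and WV are random stand-ins for the trained matrices, and only the 512 and 64 dimensions follow the text.

```python
import numpy as np

d_model, d_k = 512, 64
x1 = np.random.randn(d_model)             # embedding of one word

# Placeholder projection matrices; trained jointly with the rest of the model in practice.
WQ = np.random.randn(d_model, d_k) * 0.02
WK = np.random.randn(d_model, d_k) * 0.02
WV = np.random.randn(d_model, d_k) * 0.02

q1, k1, v1 = x1 @ WQ, x1 @ WK, x1 @ WV    # query, key, and value vectors for this word
print(q1.shape, k1.shape, v1.shape)       # (64,) (64,) (64,)
```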

What are the “query”, “key”, and “value” vectors?

They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.

The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.

This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.

The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).

That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing. So let’s look at that now that we’ve seen the intuition of the calculation on the word level.
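
Before moving to the matrix form, here is a small NumPy sketch of steps two through six for one position, using two made-up positions and random q/k/v vectors.

```python
import numpy as np

d_k = 64
rng = np.random.default_rng(0)
q1 = rng.normal(size=d_k)                        # query for position #1
K = rng.normal(size=(2, d_k))                    # keys k1, k2 for both positions
V = rng.normal(size=(2, d_k))                    # values v1, v2 for both positions

scores = K @ q1                                  # step 2: dot products q1·k1, q1·k2
scores = scores / np.sqrt(d_k)                   # step 3: divide by sqrt(64) = 8
weights = np.exp(scores) / np.exp(scores).sum()  # step 4: softmax
z1 = weights @ V                                 # steps 5 and 6: weight the values and sum
print(weights, z1.shape)                         # two attention weights, one (64,) output
```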

Matrix Calculation of Self-Attention

The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).

Every row in the X matrix corresponds to a word in the input sentence. We again see the difference in size of the embedding vector (512, or 4 boxes in the figure), and the q/k/v vectors (64, or 3 boxes in the figure).

Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.

The self-attention calculation in matrix form
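
In other words, Z = softmax(QK^T / sqrt(d_k)) V. A self-contained sketch of the matrix form, again with random inputs and placeholder weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract the max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, WQ, WK, WV):
    Q, K, V = X @ WQ, X @ WK, X @ WV
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V       # Z = softmax(QK^T / sqrt(d_k)) V

seq_len, d_model, d_k = 3, 512, 64
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))              # one row per input word
WQ, WK, WV = (rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(3))
Z = self_attention(X, WQ, WK, WV)
print(Z.shape)   # (3, 64): one output row per position
```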

The Beast With Many Heads

The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves the performance of the attention layer in two ways:

1. It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, it would be useful to know which word “it” refers to.

2. It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

With multi-headed attention, we maintain separate Q/K/V weight matrices for each head resulting in different Q/K/V matrices. As we did before, we multiply X by the WQ/WK/WV matrices to produce Q/K/V matrices.

If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.

This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.

How do we do that? We concatenate the matrices, then multiply them by an additional weight matrix WO.
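
A rough sketch of that concatenate-and-project step with eight heads; every weight matrix here is a random placeholder rather than a trained parameter.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, WQ, WK, WV):
    Q, K, V = X @ WQ, X @ WK, X @ WV
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

n_heads, d_model, d_k, seq_len = 8, 512, 64, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))

# One set of projection matrices per head (randomly initialized here).
heads = []
for _ in range(n_heads):
    WQ, WK, WV = (rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(3))
    heads.append(attention(X, WQ, WK, WV))           # each head produces a (3, 64) Z

Z = np.concatenate(heads, axis=-1)                   # (3, 512) after concatenation
WO = rng.normal(size=(n_heads * d_k, d_model)) * 0.02
out = Z @ WO                                         # condensed back to one (3, 512) matrix
print(out.shape)
```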

That’s pretty much all there is to multi-headed self-attention. It’s quite a handful of matrices, I realize. Let me try to put them all in one visual so we can look at them in one place.

Now that we have touched upon attention heads, let’s revisit our example from before to see where the different attention heads are focusing as we encode the word “it” in our example sentence:

As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired" -- in a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".

If we add all the attention heads to the picture, however, things can be harder to interpret:

Representing The Order of The Sequence Using Positional Encoding

One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.

To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.

To give the model a sense of the order of the words, we add positional encoding vectors -- the values of which follow a specific pattern.

If we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:

A real example of positional encoding with a toy embedding size of 4

What might this pattern look like?

In the following figure, each row corresponds to the positional encoding of a vector. So the first row would be the vector we’d add to the embedding of the first word in an input sequence. Each row contains 512 values – each with a value between 1 and -1. We’ve color-coded them so the pattern is visible.

A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That's because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They're then concatenated to form each of the positional encoding vectors.

The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).
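
A sketch of the sinusoidal encoding from section 3.5 of the paper, where PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Note that this version interleaves sine and cosine as in the paper, whereas the Tensor2Tensor figure described above concatenates the two halves.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
    angles = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even indices: sine
    pe[:, 1::2] = np.cos(angles)                      # odd indices: cosine
    return pe

pe = positional_encoding(20, 512)
print(pe.shape)   # (20, 512); these rows are added to the word embeddings
```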

The Residuals

One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.

If we’re to visualize the vectors and the layer-norm operation associated with self attention, it would look like this:
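
A minimal sketch of one such "add & norm" step; the layer-norm below is the bare formula without the learned gain and bias, and the lambda stands in for a real sub-layer such as self-attention or the feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # residual connection around the sub-layer, followed by layer normalization
    return layer_norm(x + sublayer(x))

x = np.random.randn(3, 512)                  # 3 positions, model size 512
out = add_and_norm(x, lambda h: h * 0.5)     # stand-in for a real sub-layer
print(out.shape)                             # (3, 512)
```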

This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:

The Decoder Side

Now that we’ve covered most of the concepts on the encoder side, we basically know how the components of decoders work as well. But let’s take a look at how they work together.

The encoders start by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer which helps the decoder focus on appropriate places in the input sequence:

After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element from the output sequence (the English translation sentence in this case).

The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

The self attention layers in the decoder operate in a slightly different way than the one in the encoder:

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
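
A small sketch of that masking step: positions above the diagonal get -inf added to their scores, so the softmax assigns them zero weight. The scores themselves are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.random.randn(seq_len, seq_len)          # placeholder QK^T / sqrt(d_k) scores

mask = np.triu(np.ones((seq_len, seq_len)), k=1)    # 1s above the diagonal mark the future
scores = np.where(mask == 1, -np.inf, scores)       # block attention to future positions

weights = softmax(scores)
print(np.round(weights, 2))   # the upper triangle is all zeros: each position attends
                              # only to itself and to earlier positions
```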

The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.

The Final Linear and Softmax Layer

The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer which is followed by a Softmax Layer.

The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.

Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.

The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
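
A sketch of these last two layers with a made-up 10,000-word vocabulary; the decoder output and the projection matrix are random, and only the shapes follow the text.

```python
import numpy as np

d_model, vocab_size = 512, 10_000
rng = np.random.default_rng(0)

decoder_output = rng.normal(size=d_model)                  # vector for one time step
W_vocab = rng.normal(size=(d_model, vocab_size)) * 0.02    # the final Linear layer

logits = decoder_output @ W_vocab                          # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                       # softmax turns scores into probabilities

predicted_id = int(np.argmax(probs))                       # pick the most probable cell
print(predicted_id, float(probs[predicted_id]))
```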

This figure starts from the bottom with the vector produced as the output of the decoder stack. It is then turned into an output word.

Recap Of Training

Now that we’ve covered the entire forward-pass process through a trained Transformer, it would be useful to glance at the intuition of training the model.

During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.

To visualize this, let’s assume our output vocabulary only contains six words (“a”, “am”, “i”, “thanks”, “student”, and “<eos>” (short for ‘end of sentence’)).

The output vocabulary of our model is created in the preprocessing phase before we even begin training.

Once we define our output vocabulary, we can use a vector of the same width to indicate each word in our vocabulary. This is also known as one-hot encoding. So for example, we can indicate the word “am” using the following vector:

Example: one-hot encoding of our output vocabulary
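
For instance, with the toy vocabulary above (assuming this ordering of the six words), the one-hot vector for "am" could be built like this:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # toy output vocabulary
one_hot_am = np.zeros(len(vocab))
one_hot_am[vocab.index("am")] = 1.0
print(one_hot_am)   # [0. 1. 0. 0. 0. 0.]
```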

The Loss Function

Say we are training our model. Say it’s our first step in the training phase, and we’re training it on a simple example – translating “merci” into “thanks”.

What this means, is that we want the output to be a probability distribution indicating the word “thanks”. But since this model is not yet trained, that’s unlikely to happen just yet.

Since the model's parameters (weights) are all initialized randomly, the (untrained) model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output, then tweak all the model's weights using backpropagation to make the output closer to the desired output.

How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback–Leibler divergence.
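
A tiny sketch of that comparison using cross-entropy; the model distribution here is invented to look like an untrained output.

```python
import numpy as np

target = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])   # one-hot target for "am"
model = np.array([0.1, 0.4, 0.2, 0.1, 0.1, 0.1])    # made-up model output

def cross_entropy(p_target, p_model, eps=1e-12):
    """Penalize the model for putting low probability on the target word."""
    return -np.sum(p_target * np.log(p_model + eps))

print(cross_entropy(target, model))   # lower is better; this is what backpropagation reduces
```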

But note that this is an oversimplified example. More realistically, we’ll use a sentence longer than one word. For example – input: “je suis étudiant” and expected output: “i am a student”. What this really means, is that we want our model to successively output probability distributions where:

  • Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 3,000 or 10,000)
  • The first probability distribution has the highest probability at the cell associated with the word “i”
  • The second probability distribution has the highest probability at the cell associated with the word “am”
  • And so on, until the fifth output distribution indicates the ‘<end of sentence>’ symbol, which also has a cell associated with it from the 10,000 element vocabulary.

After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:

Hopefully upon training, the model would output the right translation we expect. Of course it's no real indication if this phrase was part of the training dataset (see: cross validation). Notice that every position gets a little bit of probability even if it's unlikely to be the output of that time step -- that's a very useful property of softmax which helps the training process.

Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That’s one way to do it (called greedy decoding). Another way to do it would be to hold on to, say, the top two words (say, ‘I’ and ‘a’ for example), then in the next step, run the model twice: once assuming the first output position was the word ‘I’, and another time assuming the first output position was the word ‘a’, and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3…etc. This method is called “beam search”, where in our example, beam_size was two (because we compared the results after calculating the beams for positions #1 and #2), and top_beams is also two (since we kept two words). These are both hyperparameters that you can experiment with.
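
A toy illustration of the difference at a single step; the probabilities are invented, and a real beam search would also re-run the model once for each kept hypothesis at the next step.

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
probs = np.array([0.30, 0.05, 0.45, 0.05, 0.10, 0.05])   # invented output distribution

# Greedy decoding: keep only the single most probable word.
greedy_word = vocab[int(np.argmax(probs))]                # "i"

# Beam search with beam_size = 2: keep the two best partial hypotheses instead.
beam_size = 2
top2 = np.argsort(probs)[::-1][:beam_size]
beam = [(vocab[i], float(probs[i])) for i in top2]        # [("i", 0.45), ("a", 0.30)]

print(greedy_word, beam)
```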

Go Forth And Transform

I hope you’ve found this a useful place to start to break the ice with the major concepts of the Transformer. If you want to go deeper, I’d suggest these next steps:

  • Read the Attention Is All You Need paper, the Transformer blog post (Transformer: A Novel Neural Network Architecture for Language Understanding), and the Tensor2Tensor announcement.
  • Watch Łukasz Kaiser’s talk walking through the model and its details.
  • Play with the Jupyter Notebook provided as part of the Tensor2Tensor repo.
  • Explore the Tensor2Tensor repo.

Acknowledgements

Thanks to Illia Polosukhin, Jakob Uszkoreit, Llion Jones, Lukasz Kaiser, Niki Parmar, and Noam Shazeer for providing feedback on earlier versions of this post.

Please hit me up on Twitter for any corrections or feedback.

Written on June 27, 2018

Original article: https://jalammar.github.io/illustrated-transformer/

The original post includes figures, which were lost when the article was pasted here.
