
The Illustrated Transformer (read the original; many translations are wrong)


In the previous post, we looked at attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud's recommendation to use The Transformer as a reference model to use their Cloud TPU offering. So let's try to break the model apart and look at how it functions.

The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as part of the Tensor2Tensor package. Harvard's NLP group created a guide annotating the paper with a PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one, to hopefully make it easier to understand for people without in-depth knowledge of the subject matter.

A High-Level Look

Let's begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language and output its translation in another.

Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.


The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:


The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
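As a rough sketch of that last point (not the authors' code), the position-wise feed-forward network is two linear transformations with a ReLU in between, and the exact same weights are applied to every position's vector independently. The 512/2048 dimensions follow the paper; the random weights below are just stand-ins.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: the same weights applied independently to each position's vector."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU, then a second linear layer

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 512))                    # 10 positions, 512-dim vector each
W1, b1 = rng.normal(size=(512, 2048)), np.zeros(2048)
W2, b2 = rng.normal(size=(2048, 512)), np.zeros(512)
out = feed_forward(x, W1, b1, W2, b2)
print(out.shape)                                  # (10, 512): one output vector per position
```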

The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).

Bringing The Tensors Into The Picture

Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.

As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.

The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512 – in the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. The size of this list is a hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.


Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.

Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.

Now We’re Encoding!

As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.

Self-Attention at a High Level

Don’t be fooled by me throwing around the word “self-attention” like it’s a concept everyone should be familiar with. I had personally never come across the concept until reading the Attention is All You Need paper. Let us distill how it works.

Say the following sentence is an input sentence we want to translate:

”The animal didn't cross the street because it was too tired”

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.

When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.

As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.


Be sure to check out the Tensor2Tensor notebook where you can load a Transformer model, and examine it using this interactive visualization.

Self-Attention in Detail

Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented – using matrices.

The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller; this is an architecture choice to make the computation of multiheaded attention (mostly) constant.


What are the “query”, “key”, and “value” vectors?

They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.

The second step in calculating self-attention is to?calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.

The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
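To make the six steps concrete, here is a minimal NumPy sketch for a two-word input (“Thinking”, “Machines”). The embeddings and the WQ/WK/WV matrices are random stand-ins rather than trained weights; only the sequence of operations is the point.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
x1, x2 = rng.normal(size=d_model), rng.normal(size=d_model)     # embeddings for "Thinking", "Machines"
WQ, WK, WV = (rng.normal(size=(d_model, d_k)) for _ in range(3))

# Step 1: create a query, key, and value vector for each word
q1, k1, v1 = x1 @ WQ, x1 @ WK, x1 @ WV
q2, k2, v2 = x2 @ WQ, x2 @ WK, x2 @ WV

# Steps 2-4: score q1 against every key, scale by sqrt(d_k), softmax
scores = np.array([q1 @ k1, q1 @ k2]) / np.sqrt(d_k)
weights = softmax(scores)

# Steps 5-6: weight the value vectors and sum them up
z1 = weights[0] * v1 + weights[1] * v2    # self-attention output for position #1
print(z1.shape)                           # (64,)
```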

Matrix Calculation of Self-Attention

The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).

Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.
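That condensed formula is softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch of the matrix form, again with random stand-in weights and toy shapes:

```python
import numpy as np

def self_attention(X, WQ, WK, WV):
    """Scaled dot-product self-attention over a whole sequence at once."""
    Q, K, V = X @ WQ, X @ WK, X @ WV
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # steps 2-3 for every pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # step 4: row-wise softmax
    return weights @ V                                  # steps 5-6: weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 512))                           # 3 words, embeddings packed into a matrix X
WQ, WK, WV = (rng.normal(size=(512, 64)) for _ in range(3))
Z = self_attention(X, WQ, WK, WV)
print(Z.shape)                                          # (3, 64): one output row per input word
```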

The Beast With Many Heads

The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves the performance of the attention layer in two ways:

  • It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, it would be useful to know which word “it” refers to.

  • It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.

This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.

How do we do that? We concat the matrices then multiply them by an additional weights matrix WO.

That’s pretty much all there is to multi-headed self-attention. It’s quite a handful of matrices, I realize. Let me try to put them all in one visual so we can look at them in one place.
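Alongside that visual, here is one way to sketch the multi-headed wiring in NumPy: eight heads, each with its own (here randomly initialized) WQ/WK/WV, whose outputs are concatenated and projected by WO. The head count and sizes follow the paper; everything else is illustrative.

```python
import numpy as np

def attention(X, WQ, WK, WV):
    Q, K, V = X @ WQ, X @ WK, X @ WV
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n_heads, d_model, d_k = 8, 512, 64
X = rng.normal(size=(3, d_model))                        # 3 words
heads = []
for h in range(n_heads):                                 # each head has its own WQ/WK/WV set
    WQ, WK, WV = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X, WQ, WK, WV))               # eight different Z matrices

Z = np.concatenate(heads, axis=-1)                       # concat along the feature axis: (3, 8 * 64)
WO = rng.normal(size=(n_heads * d_k, d_model))
out = Z @ WO                                             # the single matrix the feed-forward layer expects
print(out.shape)                                         # (3, 512)
```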

Now that we have touched upon attention heads, let’s revisit our example from before to see where the different attention heads are focusing as we encode the word “it” in our example sentence:

If we add all the attention heads to the picture, however, things can be harder to interpret:

Representing The Order of The Sequence Using Positional Encoding

One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.

To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.

If we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:

What might this pattern look like?

In the following figure, each row corresponds to the positional encoding of a vector. So the first row would be the vector we’d add to the embedding of the first word in an input sequence. Each row contains 512 values – each with a value between 1 and -1. We’ve color-coded them so the pattern is visible.

The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).
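A sketch of that sinusoidal pattern, following the section 3.5 formulas (sine on even dimensions, cosine on odd ones). Note that Tensor2Tensor's get_timing_signal_1d() arranges the two signals a bit differently, so treat this as illustrative rather than a copy of that code.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings as described in section 3.5 of the paper."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions get cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# The encoding is simply added to the word embeddings before the bottom encoder:
#   x = embedding + pe[:seq_len]
print(pe.shape, pe.min(), pe.max())                    # (50, 512), values within [-1, 1]
```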

The Residuals

One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.

If we’re to visualize the vectors and the layer-norm operation associated with self attention, it would look like this:
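A minimal sketch of that Add & Norm step, LayerNorm(x + Sublayer(x)); the learnable gain and bias of layer normalization are omitted here to keep it short.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 512))
out = add_and_norm(x, lambda v: v * 0.5)   # stand-in for self-attention or the FFN
print(out.shape)                           # (3, 512)
```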

This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:

The Decoder Side

Now that we’ve covered most of the concepts on the encoder side, we basically know how the components of decoders work as well. But let’s take a look at how they work together.

The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

The self attention layers in the decoder operate in a slightly different way than the one in the encoder:

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
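A sketch of that masking trick: set the scores for future positions to -inf, and the softmax then assigns them exactly zero weight (random scores, toy sequence length of 4).

```python
import numpy as np

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))                    # raw attention scores in the decoder

mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)    # True above the diagonal = future positions
scores[mask] = -np.inf                                          # block attention to words not yet generated

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)                  # softmax: masked entries become 0
print(np.round(weights, 2))                                     # lower-triangular attention pattern
```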

The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
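A sketch of that layer: queries are projected from the decoder state, while keys and values are projected from the encoder stack's output. The projection matrices and sequence lengths here are made-up stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
encoder_output = rng.normal(size=(5, d_model))      # 5 source positions from the top encoder
decoder_state = rng.normal(size=(3, d_model))       # 3 target positions from the decoder layer below

WQ, WK, WV = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q = decoder_state @ WQ                              # Queries from the decoder
K, V = encoder_output @ WK, encoder_output @ WV     # Keys and Values from the encoder output

Z = softmax(Q @ K.T / np.sqrt(d_k)) @ V             # each target position attends over the source positions
print(Z.shape)                                      # (3, 64)
```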

The Final Linear and Softmax Layer

The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer which is followed by a Softmax Layer.

The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.

Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.

The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
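A sketch of this last stage with a toy six-word vocabulary; the projection weights are random stand-ins, so the chosen word is meaningless here, but the shapes and the argmax step are the point.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # toy output vocabulary
d_model = 512

decoder_out = rng.normal(size=d_model)                   # vector from the top of the decoder stack
W, b = rng.normal(size=(d_model, len(vocab))), np.zeros(len(vocab))

logits = decoder_out @ W + b                             # Linear layer: one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                     # softmax: all positive, sums to 1.0
print(vocab[int(np.argmax(probs))])                      # word emitted at this time step
```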

Recap Of Training

Now that we’ve covered the entire forward-pass process through a trained Transformer, it would be useful to glance at the intuition of training the model.

During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.

To visualize this, let’s assume our output vocabulary only contains six words (“a”, “am”, “i”, “thanks”, “student”, and “<eos>” (short for ‘end of sentence’)).

Once we define our output vocabulary, we can use a vector of the same width to indicate each word in our vocabulary. This is also known as one-hot encoding. So for example, we can indicate the word “am” using the following vector:
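For instance, with that toy vocabulary (in an arbitrary order), a minimal sketch of the one-hot vector for “am”:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
one_hot_am = np.zeros(len(vocab))
one_hot_am[vocab.index("am")] = 1.0   # a 1 in the "am" cell, 0 everywhere else
print(one_hot_am)                     # [0. 1. 0. 0. 0. 0.]
```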

Following this recap, let’s discuss the model’s loss function – the metric we are optimizing during the training phase to lead up to a trained and hopefully amazingly accurate model.

The Loss Function

Say we are training our model. Say it’s our first step in the training phase, and we’re training it on a simple example – translating “merci” into “thanks”.

What this means, is that we want the output to be a probability distribution indicating the word “thanks”. But since this model is not yet trained, that’s unlikely to happen just yet.

How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback–Leibler divergence.
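"Subtract one from the other" is deliberately loose; a common concrete measure is cross-entropy. A tiny sketch with made-up predicted probabilities for the target word “thanks”:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
target = np.array([0, 0, 0, 1, 0, 0], dtype=float)      # one-hot distribution for "thanks"
predicted = np.array([0.1, 0.2, 0.3, 0.1, 0.2, 0.1])    # untrained model: mass spread everywhere

cross_entropy = -np.sum(target * np.log(predicted))     # only the "thanks" cell contributes
print(cross_entropy)                                    # about 2.30; training pushes this down
```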

But note that this is an oversimplified example. More realistically, we’ll use a sentence longer than one word. For example – input: “je suis étudiant” and expected output: “i am a student”. What this really means, is that we want our model to successively output probability distributions where:

  • Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 3,000 or 10,000)
  • The first probability distribution has the highest probability at the cell associated with the word “i”
  • The second probability distribution has the highest probability at the cell associated with the word “am”
  • And so on, until the fifth output distribution indicates the ‘<end of sentence>’ symbol, which also has a cell associated with it from the 10,000 element vocabulary.

After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:

Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That’s one way to do it (called greedy decoding). Another way to do it would be to hold on to, say, the top two words (say, ‘I’ and ‘a’ for example), then in the next step, run the model twice: once assuming the first output position was the word ‘I’, and another time assuming the first output position was the word ‘me’, and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3…etc. This method is called “beam search”, where in our example, beam_size was two (because we compared the results after calculating the beams for positions #1 and #2), and top_beams is also two (since we kept two words). These are both hyperparameters that you can experiment with.
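A very reduced sketch of the difference, on one step's hypothetical probabilities; real beam search would re-run the model once for every kept candidate, which is omitted here.

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
step_probs = np.array([0.05, 0.10, 0.60, 0.05, 0.15, 0.05])   # hypothetical distribution for position #1

# Greedy decoding: keep only the single most probable word at each step.
print(vocab[int(np.argmax(step_probs))])                      # "i"

# Beam search with beam_size = 2: keep the two best candidates instead,
# then (not shown) run the model once per kept candidate for position #2
# and retain whichever partial translations produce less total error.
top2 = np.argsort(step_probs)[-2:][::-1]
print([vocab[i] for i in top2])                               # ["i", "student"]
```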

Go Forth And Transform

I hope you’ve found this a useful place to start to break the ice with the major concepts of the Transformer. If you want to go deeper, I’d suggest these next steps:

  • Read the Attention Is All You Need paper, the Transformer blog post (Transformer: A Novel Neural Network Architecture for Language Understanding), and the Tensor2Tensor announcement.
  • Watch Łukasz Kaiser’s talk walking through the model and its details
  • Play with the Jupyter Notebook provided as part of the Tensor2Tensor repo
  • Explore the Tensor2Tensor repo.
