Transformers
This is a three-part series covering Transformers, BERT, and a hands-on Kaggle challenge (Google QUEST Q&A Labeling) to see Transformers in action (top 4.4% on the leaderboard). In this part (1/3) we look at how Transformers work and how they became state-of-the-art in many modern natural language processing tasks.
The Transformer is a deep learning model proposed in the paper Attention Is All You Need by researchers at Google and the University of Toronto in 2017, used primarily in the field of natural language processing (NLP).
Like recurrent neural networks (RNNs), Transformers are designed to handle sequential data, such as natural language, for tasks such as translation and text summarization. However, unlike RNNs, Transformers do not require that the sequential data be processed in order. For example, if the input is a natural language sentence, the Transformer does not need to process the beginning of it before the end. Because of this, the Transformer allows much more parallelization than RNNs and therefore shorter training times.
Transformers are built around the attention mechanism, which was originally introduced to help neural machine translation models handle long source sentences.
Sounds cool, right? Let's take a look under the hood and see how things work.
Transformers are based on an encoder-decoder architecture. The encoder is a stack of encoding layers that process the input iteratively, one layer after another, and the decoder is a stack of decoding layers that do the same thing to the output of the encoder.
So, when we pass a sentence into a transformer, it is embedded and passed into the stack of encoders. The output from the final encoder is then passed into each decoder block in the decoder stack. The decoder stack then generates the output.
All the encoder blocks in the transformer are identical in structure (they do not share weights) and, similarly, all the decoder blocks are identical in structure.
Image source: http://jalammar.github.io/illustrated-transformer/
This was a very high-level view of a transformer, and on its own it probably doesn't explain why transformers are so effective in modern NLP tasks. Don't worry: to make things clearer, we will now go through the internals of an encoder and a decoder cell.
Encoder
The encoder has two parts: a self-attention layer and a feed-forward neural network.
Image source: http://jalammar.github.io/illustrated-transformer/
The encoder's inputs first flow through a self-attention layer, a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. Basically, for each input word x, the self-attention layer generates a vector Z, and it takes all the input words (x1, x2, x3, …, xn) into account when generating Z. I'll come to why it takes every input word's embedding into account, and how it generates Z, later in this blog; for now, just remember these brief high-level summaries of the subcomponents of an encoder.
The outputs of the self-attention layer are fed to a feed-forward neural network. The feed-forward network generates an output for each input Z, and that output is passed into the next encoder block's self-attention layer, and so on.
Now that we have an idea of what is inside an encoder, let's understand the tensor operations happening inside each component.
First comes the input:
We know that transformers are used for NLP tasks, so the data we deal with is usually a corpus of sentences. Since machine learning algorithms are all about matrix operations, we first need to convert the human-readable sentences into a machine-readable format (numbers). To do this, we use 'word embeddings'. This step is simple: each word in a sentence is represented as an n-dimensional vector (n is usually 512). In the original Transformer these embeddings are learned together with the rest of the model, although pretrained embeddings such as GloVe can also be used. There is also something called positional encoding that is applied to these embeddings, but I'll come to it later. Once we have the embedding for each input word, we pass all of them simultaneously to the self-attention layer.
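To make the input step concrete, here is a minimal sketch of an embedding lookup in NumPy. The toy vocabulary and the random embedding table are assumptions made only for illustration; in a real model the table is learned (or loaded from pretrained vectors) and the vocabulary comes from a tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; a real model would use a proper tokenizer.
vocab = {"the": 0, "animal": 1, "was": 2, "tired": 3}
d_model = 512  # embedding size used in this article

# One row per vocabulary entry. Random here; learned or pretrained in practice.
embedding_table = rng.normal(size=(len(vocab), d_model))

sentence = ["the", "animal", "was", "tired"]
token_ids = [vocab[word] for word in sentence]

# Row lookup turns n words into the (n, d_model) input matrix x.
x = embedding_table[token_ids]
print(x.shape)  # (4, 512)
```

Every word becomes one row of x, which is why the whole sentence can be handed to the self-attention layer at once.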
The training parameters of the self-attention layer:
Different layers have different trainable parameters: a Dense layer has weights and a bias, and a Convolutional layer has kernels. Similarly, the self-attention layer has 4 trainable parameters:
- Query matrix: Wq
- Key matrix: Wk
- Value matrix: Wv
- Output matrix: Wo (this is not the output itself but a trainable parameter used to generate the final output Z of the self-attention layer).
The first 3 trainable parameters have a special purpose. They are used to compute 3 new matrices from the input x:
- Query: Q
- Key: K
- Value: V
which are later used to generate the output Z. Each is obtained by multiplying the input by the corresponding weight matrix: Q = x·Wq, K = x·Wk, V = x·Wv.
Some points to keep in mind:
- The input tensor x has n rows and m columns, where n is the number of input words and m is the vector size of each word, i.e. 512.
- The tensors Q, K, V and the per-head output Z have n rows and dk columns, where n is the number of input words and dk is 64.
The values of m and dk are not arbitrary; they are the values the researchers who designed this architecture found to work well (with 8 attention heads, dk = m / 8 = 64).
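The shapes above are easy to check with a few lines of NumPy. This is only a sketch: the input and the weight matrices are random stand-ins (in a trained model Wq, Wk and Wv are learned), and the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, d_k = 11, 512, 64          # words, embedding size, per-head size

x = rng.normal(size=(n, m))      # stand-in for the embedded sentence

# Trainable projection matrices (random here, learned during training).
W_q = rng.normal(size=(m, d_k))
W_k = rng.normal(size=(m, d_k))
W_v = rng.normal(size=(m, d_k))

Q = x @ W_q   # one query vector per word
K = x @ W_k   # one key vector per word
V = x @ W_v   # one value vector per word

print(Q.shape, K.shape, V.shape)  # (11, 64) (11, 64) (11, 64)
```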
Image source: http://jalammar.github.io/illustrated-transformer/
After calculating Q, K and V as described above, the self-attention layer calculates the scores, a vector for each of the input words.
Dot-product attention:
The next step in the self-attention layer is to calculate the score vector corresponding to each input word. This score calculation is one of the most crucial steps that brings the attention mechanism to life (well… not literally). The score vector has size n, where n is the number of input words, and each element of this vector is a number that tells how much the word it corresponds to contributes to the current word.
Let's consider an example to get the intuition:
"The animal didn't cross the street because it was too tired"
In the above sentence, the word it refers to the animal and not the street. For us this is pretty simple to grasp, because we know how grammar works and have developed a sense that it is more likely to refer to animal than to words like cross or street. It is not simple for a machine without attention. A transformer picks up this sense of grammar during training, but the fact that, for a given word, it considers all the words in the input and can then select the ones it thinks contribute the most is what the attention mechanism is about.
For the above sentence, the score vector generated for the word it will have 11 numbers, each corresponding to a word in the input sentence. For a well-trained model, this score vector will have larger numbers at positions 2 and 8, because the words at positions 2 (animal) and 8 (it) contribute the most to it. It may look something like: [2, 60, 4, 5, 3, 8, 5, 90, 7, 6, 3]
Notice that the values at positions 2 and 8 are greater than the values at other positions.
Image source: http://jalammar.github.io/illustrated-transformer/
Let's see how these scores are generated in the self-attention layer. So far, for each word, we have Q, K and V vectors. To generate the score vector, we use something called dot-product attention, where we take the dot product between the Q and K vectors. Q is the query vector of the word for which we are calculating the scores (in the above example, the word it), whereas there are n K vectors, one key vector for each input word. So, if we want to generate the scores for the word it:
We take the query vector of it: Q
We take the key vectors of all the input words: K1, K2, K3, …, Kn.
We take the dot product between Q and each of the K vectors and obtain n scores.
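The three steps above can be sketched for a single word as follows. The query and key vectors are random stand-ins, so the resulting scores carry no meaning; the point is only the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 11, 64

Q = rng.normal(size=(n, d_k))   # stand-ins for the real query vectors
K = rng.normal(size=(n, d_k))   # stand-ins for the real key vectors

it_index = 7                    # "it" is the 8th word (0-based index 7)
q_it = Q[it_index]              # step 1: the query vector of "it"

# Steps 2 and 3: one dot product per key vector gives n raw scores.
scores = K @ q_it

print(scores.shape)  # (11,)
```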
After calculating the scores, we scale them by dividing by the square root of dk, the column dimension of the vectors Q, K, V. The creators of the transformer introduced this step because, for large dk, the dot products can grow large in magnitude and push the softmax into regions where its gradients are extremely small; scaling by sqrt(dk) gave better results.
After scaling the score vector, we pass it through the softmax function, so that the outputs are all positive, preserve the ordering of the original scores, and sum to 1.
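As a small worked example, we can apply the scaling and the softmax to the illustrative score vector from earlier. The numbers are the made-up ones from this article, not the output of a real model, and the softmax helper is my own minimal version.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

d_k = 64
raw_scores = np.array([2, 60, 4, 5, 3, 8, 5, 90, 7, 6, 3], dtype=float)

scaled = raw_scores / np.sqrt(d_k)   # divide every score by 8
weights = softmax(scaled)

print(np.round(weights, 3))
print(int(np.argmax(weights)) + 1)   # 8 (1-based position of the largest weight)
```

Note that softmax keeps the ordering but not the proportions: the two largest raw scores have a ratio of 90/60 = 1.5, while their weights have a ratio of exp((90 - 60) / 8), roughly 42.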
Once we have the 'softmaxed' scores ready, we multiply each score by the value vector V corresponding to it, which gives n weighted value vectors: [V1, V2, V3, …, Vn]. To obtain the output Z of the self-attention layer, we simply add up all n weighted value vectors.
Image source: http://jalammar.github.io/illustrated-transformer/
The above diagrams illustrate the steps of the self-attention layer.
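Putting the steps together, a single attention head can be sketched in matrix form, processing all words at once instead of one query at a time. This is a minimal NumPy sketch with random inputs; the function name self_attention is my own choice, not something from the paper.

```python
import numpy as np

def softmax(z):
    """Row-wise, numerically stable softmax."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n): every query against every key
    weights = softmax(scores)         # each row sums to 1
    Z = weights @ V                   # weighted sum of the value vectors
    return Z, weights

rng = np.random.default_rng(0)
n, d_k = 11, 64
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

Z, weights = self_attention(Q, K, V)
print(Z.shape, weights.shape)  # (11, 64) (11, 11)
```

Row i of weights is the softmaxed score vector for word i, and row i of Z is the sum of all value vectors weighted by that row.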
Multi-head Attention:
Now we know how an attention head works and how amazing it is, but there is a catch. A single attention head can sometimes miss some of the input words that contribute most to the word in the spotlight. In the example before, the attention head may sometimes fail to pay attention to the word animal while processing the word it, and this may cause problems. To tackle this issue, instead of a single attention head we use multiple attention heads, each working in a similar manner but with its own weight matrices. This helps reduce the error or miscalculation of any single attention head. This is referred to as multi-head attention.
The scores from 2 different attention heads are represented in orange and green. We can see how one attention head pays more attention to words like the, animal, cross, whereas the other pays more attention to words like street, was, tired.
In transformers, multi-head attention typically uses 8 attention heads. Now notice that the output of a single attention head has 64 dimensions, so with multi-head attention we get 8 such 64-dimensional vectors as output for each word.
Image source: http://jalammar.github.io/illustrated-transformer/
It turns out that the final trainable parameter, the output matrix Wo that I mentioned before, comes into play here. In the final step of the self-attention layer, all the outputs [Z0, Z1, Z2, …, Z7] are concatenated into a vector of 8 × 64 = 512 dimensions and multiplied by Wo, so that the final output Z has the same 512 dimensions as the input embedding.
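Here is a rough sketch of the multi-head step, assuming 8 heads of size 64 and a Wo of shape 512 × 512 as in the original paper. All weights are random stand-ins (scaled down by an arbitrary 0.05 just to keep the numbers small), and a real implementation would usually compute all heads in one batched operation rather than a Python loop.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
n, d_model, n_heads = 11, 512, 8
d_k = d_model // n_heads                      # 64

x = rng.normal(size=(n, d_model))             # stand-in for the embedded input

heads = []
for _ in range(n_heads):
    # Each head has its own Wq, Wk, Wv (random here, learned in practice).
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) * 0.05 for _ in range(3))
    heads.append(attention(x @ W_q, x @ W_k, x @ W_v))   # (n, 64)

concat = np.concatenate(heads, axis=-1)       # (n, 8 * 64) = (n, 512)
W_o = rng.normal(size=(n_heads * d_k, d_model)) * 0.05
Z = concat @ W_o                              # (n, 512)

print(heads[0].shape, concat.shape, Z.shape)  # (11, 64) (11, 512) (11, 512)
```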
Below is the diagram to show all the steps discussed above:
Image source: http://jalammar.github.io/illustrated-transformer/
Positional encoding:
Remember that in the "First comes the input" section I mentioned positional encoding. Let's see what it is and how it helps. The problem with our awesome transformer so far is that it does not take the position of the input words into account. Unlike an RNN, where timesteps denote which word comes before and after, in a transformer the words are fed in simultaneously, so we need some kind of positional encoding that defines which word comes after which.
Positional encoding comes to our rescue by giving the input embeddings a sense of position. We first generate a position embedding for each of the input words, and these position embeddings are then added to the word embeddings of the respective words to produce embeddings with a time signal.
Many methods have been proposed for generating the positional embeddings, such as one-hot encoded vectors or binary encoding, but the approach the researchers chose uses sine and cosine functions of the position at different frequencies: sine for the even embedding dimensions and cosine for the odd ones.
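For reference, the sinusoidal encoding from the Attention Is All You Need paper is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), where pos is the position of the word and i indexes pairs of embedding dimensions. Below is a minimal NumPy sketch of those two formulas; the function name is my own, and it assumes d_model is even.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding; assumes d_model is even."""
    positions = np.arange(max_len)[:, None]         # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # the values 2i, shape (1, d_model/2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=128)
print(pe.shape)  # (50, 128)
```

The resulting matrix is simply added to the word embeddings, so both must have the same width.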
Image by Author
When we plot the 128-dimensional positional encoding for a sentence with a maximum length of 50, it looks something like:
Each row represents the embedding vector (Image by Author)
Residual connections:
Finally, there is one more improvement added to the encoders, known as residual connections or skip connections, which allow the output from a previous layer to bypass the layers in between. This helps in deep networks with many hidden layers: if any layer in between is not of much use or is not learning much, the skip connection lets the signal bypass that layer. Another thing to note is that after the residual connection is added, the result is normalized (layer normalization).
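The "add, then normalize" step can be sketched like this. It is a simplified version: real layer normalization also has a learned gain and bias, which I leave out here, and the helper names are my own.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection around `sublayer`, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(11, 512))

# A sublayer that has learned nothing (outputs zeros) is effectively skipped:
# the input passes through the residual path unchanged, then gets normalized.
out = add_and_norm(x, lambda h: np.zeros_like(h))
print(out.shape)  # (11, 512)
```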
Decoder
A decoder is very similar to the encoder. Like the encoder, it has a self-attention layer and a feed-forward network, but it also has an additional block, known as Encoder-Decoder Attention, sandwiched between the two. The Encoder-Decoder Attention layer works just like multi-headed self-attention, except that it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack. The remaining 2 layers work essentially the same as those in the encoder cell, except that the decoder's self-attention is masked so that a position cannot attend to later positions in the output.
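A single head of Encoder-Decoder Attention might look like the sketch below. The only change from self-attention is where Q, K and V come from. The sequence lengths, the random inputs and the 0.05 weight scale are all assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_src, n_tgt, d_model, d_k = 11, 4, 512, 64

encoder_output = rng.normal(size=(n_src, d_model))   # from the encoder stack
decoder_hidden = rng.normal(size=(n_tgt, d_model))   # from the decoder layer below

W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) * 0.05 for _ in range(3))

Q = decoder_hidden @ W_q    # queries come from the decoder
K = encoder_output @ W_k    # keys come from the encoder output
V = encoder_output @ W_v    # values come from the encoder output

weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n_tgt, n_src)
Z = weights @ V                             # (n_tgt, d_k)

print(weights.shape, Z.shape)  # (4, 11) (4, 64)
```

Each output position gets its own distribution over the input words, which is how the decoder decides which part of the source sentence to look at.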
The input to the decoder stack is sequential, unlike the simultaneous input to the encoder stack. The first output word is passed into the decoder as an input, from which it generates the second output word; this output is again passed as an input to the decoder, which then generates the third output word, and so on.
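The feedback loop described above can be sketched as a greedy decoding loop. Everything here is hypothetical: toy_decoder is a stand-in I made up so the loop has something to call (it just favours the next word of a fixed sentence), and the tiny vocabulary and the start/end markers are assumptions. A real decoder stack would compute its scores from the encoder output and the words generated so far.

```python
import numpy as np

VOCAB = ["<start>", "i", "am", "a", "student", "<end>"]

def toy_decoder(encoder_output, generated_ids):
    """Hypothetical stand-in for a trained decoder stack.

    Returns one score per vocabulary word. It ignores the encoder output and
    simply favours the word that follows the last generated one.
    """
    scores = np.zeros(len(VOCAB))
    next_id = min(generated_ids[-1] + 1, len(VOCAB) - 1)
    scores[next_id] = 1.0
    return scores

def greedy_decode(encoder_output, max_len=10):
    generated = [VOCAB.index("<start>")]
    for _ in range(max_len):
        scores = toy_decoder(encoder_output, generated)
        next_id = int(np.argmax(scores))      # pick the highest-scoring word
        generated.append(next_id)             # feed it back in on the next step
        if VOCAB[next_id] == "<end>":
            break
    return [VOCAB[i] for i in generated]

encoder_output = np.zeros((3, 512))           # placeholder encoder output
result = greedy_decode(encoder_output)
print(result)  # ['<start>', 'i', 'am', 'a', 'student', '<end>']
```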
The output of the decoder stack is passed into a linear layer with softmax activation, which produces a probability for every word in the vocabulary; the word with the highest probability is the prediction.
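As a sketch of this last layer, assuming a tiny hypothetical vocabulary of 10 words and random weights (the names W_vocab and b_vocab are mine):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10                 # tiny hypothetical vocabulary

decoder_output = rng.normal(size=d_model)     # decoder vector for one position
W_vocab = rng.normal(size=(d_model, vocab_size)) * 0.05
b_vocab = np.zeros(vocab_size)

logits = decoder_output @ W_vocab + b_vocab   # linear layer: one score per word
probs = softmax(logits)                       # scores -> probabilities
predicted_id = int(np.argmax(probs))          # index of the most likely word

print(probs.shape)  # (10,)
```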
Image source: http://jalammar.github.io/illustrated-transformer/
Once the transformer predicts a word using forward propagation, the prediction is compared with the actual label using a loss function like cross-entropy, and then all the trainable parameters are updated using back-propagation. Well, this is one simplified way of understanding how learning happens in transformers. There are variations, like taking the complete output sentence to calculate the loss. To know more, you can check out the amazing blog on the Transformer by Jay Alammar (http://jalammar.github.io/illustrated-transformer/).
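For a single predicted word, the cross-entropy loss is just the negative log of the probability the model gave to the correct word. A tiny worked example with made-up probabilities over a 3-word vocabulary:

```python
import numpy as np

def cross_entropy(probs, target_id):
    """Cross-entropy for one prediction: -log(probability of the true word)."""
    return -np.log(probs[target_id])

probs = np.array([0.1, 0.7, 0.2])   # made-up model output over 3 words

good = cross_entropy(probs, target_id=1)   # the model favoured the right word
bad = cross_entropy(probs, target_id=0)    # the model favoured a wrong word

print(round(float(good), 4), round(float(bad), 4))  # 0.3567 2.3026
```

The loss is small when the correct word gets high probability and grows quickly when it does not, which is the signal back-propagation uses to update the weights.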
With this, we have come to the end of this blog. Hope the read was pleasant. I would like to thank all the creators for creating the awesome content I referred to for writing this blog.
Reference links:
Applied AI Course: https://www.appliedaicourse.com/
https://arxiv.org/abs/1706.03762
http://jalammar.github.io/illustrated-transformer/
http://primo.ai/index.php?title=Transformer
https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)
https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
Final note
Thank you for reading the blog. I hope it was useful for those of you aspiring to do projects or learn some new concepts in NLP.
In part 2/3 we will go through BERT (Bidirectional Encoder Representations from Transformers).
In part 3/3 we will go through a hands-on Kaggle challenge, Google QUEST Q&A Labeling, to see Transformers in action (top 4.4% on the leaderboard).
Find me on LinkedIn: www.linkedin.com/in/sarthak-vajpayee
Peace!
Translated from: https://towardsdatascience.com/transformers-state-of-the-art-natural-language-processing-1d84c4c7462b