

[Deep Learning] Natural Language Processing --- Self-Attention (II): Animation and Code Walkthrough


1. Self-Attention: Animated Walkthrough

Step 1: Prepare inputs

For this tutorial, we start with 3 inputs, each with dimension 4.

Input 1: [1, 0, 1, 0]
Input 2: [0, 2, 0, 2]
Input 3: [1, 1, 1, 1]

Step 2: Initialise weights

Every input must have three representations. These representations are called key (orange), query (red), and value (purple). For this example, let's say we want these representations to have a dimension of 3. Because every input has a dimension of 4, each set of weights must have a shape of 4×3.

(The dimension of value is also the dimension of the output.)


In order to obtain these representations, every input (green) is multiplied with a set of weights for keys, a set of weights for querys (I know that's not the right spelling), and a set of weights for values. In our example, we 'initialise' the three sets of weights as follows.


Weights for key:

[[0, 0, 1],
 [1, 1, 0],
 [0, 1, 0],
 [1, 1, 0]]

Weights for query:

[[1, 0, 1],
 [1, 0, 0],
 [0, 0, 1],
 [0, 1, 1]]

Weights for value:

[[0, 2, 0],
 [0, 3, 0],
 [1, 0, 3],
 [1, 1, 0]]

PS: In a neural network setting, these weights are usually small numbers, initialised randomly using an appropriate random distribution like Gaussian, Xavier and Kaiming distributions.
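As a rough sketch of what such random initialisation might look like in TensorFlow (this is not part of the walkthrough, which uses hand-picked integer weights; the choice of initializer and seed here is illustrative only):

import tensorflow as tf

# Hypothetical random initialisation of the three 4x3 weight matrices
# using a Xavier/Glorot initializer instead of the hand-picked values above.
init = tf.keras.initializers.GlorotUniform(seed=0)
w_key   = tf.Variable(init(shape=(4, 3)), name="w_key")
w_query = tf.Variable(init(shape=(4, 3)), name="w_query")
w_value = tf.Variable(init(shape=(4, 3)), name="w_value")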


Step 3: Derive key, query and value

Now that we have the three sets of weights, let's actually obtain the key, query and value representations for every input.

Key representation for Input 1:

               [0, 0, 1]
[1, 0, 1, 0] x [1, 1, 0] = [0, 1, 1]
               [0, 1, 0]
               [1, 1, 0]

Use the same set of weights to get the key representation for Input 2:

               [0, 0, 1]
[0, 2, 0, 2] x [1, 1, 0] = [4, 4, 0]
               [0, 1, 0]
               [1, 1, 0]

Use the same set of weights to get the key representation for Input 3:

               [0, 0, 1]
[1, 1, 1, 1] x [1, 1, 0] = [2, 3, 1]
               [0, 1, 0]
               [1, 1, 0]


1. A faster way is to vectorise the above key operations:

               [0, 0, 1]
[1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
[0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]


2. Let's do the same to obtain the value representations for every input:

               [0, 2, 0]
[1, 0, 1, 0]   [0, 3, 0]   [1, 2, 3]
[0, 2, 0, 2] x [1, 0, 3] = [2, 8, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 6, 3]


3. Finally, the query representations:

               [1, 0, 1]
[1, 0, 1, 0]   [1, 0, 0]   [1, 0, 2]
[0, 2, 0, 2] x [0, 0, 1] = [2, 2, 2]
[1, 1, 1, 1]   [0, 1, 1]   [2, 1, 3]

PS: In practice, a bias vector may be added to the product of the matrix multiplication.
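For illustration only (assuming the x and w_key / w_query / w_value tensors defined in the code section of Part 2 below; the bias terms here are hypothetical and are not used anywhere else in this example):

# Hypothetical per-projection bias vectors of dimension 3, added after each projection.
b_key   = tf.zeros(3)
b_query = tf.zeros(3)
b_value = tf.zeros(3)

keys   = x @ w_key   + b_key
querys = x @ w_query + b_query
values = x @ w_value + b_value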


Step 4: Calculate attention scores for Input 1

To obtain attention scores, we start off by taking a dot product between Input 1's query (red) and all keys (orange), including its own. Since there are 3 key representations (because we have 3 inputs), we obtain 3 attention scores (blue).

            [0, 4, 2]
[1, 0, 2] x [1, 4, 3] = [2, 4, 4]
            [1, 0, 1]

Note that we only use the query from Input 1 here. Later we'll repeat the same step for the other querys.

PS: The above operation is known as dot product attention, one of several score functions. Other score functions include scaled dot product and additive/concat.
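As a small self-contained sketch (reusing the key and query values derived in Step 3; the division by √d_k is the only difference from the plain dot product used in this walkthrough):

import tensorflow as tf

querys = tf.constant([[1., 0., 2.], [2., 2., 2.], [2., 1., 3.]])
keys   = tf.constant([[0., 1., 1.], [4., 4., 0.], [2., 3., 1.]])

d_k = 3.0                                      # key/query dimension
scores = querys @ tf.transpose(keys)           # plain dot-product scores
scaled = scores / tf.sqrt(d_k)                 # scaled dot-product scores
print(tf.nn.softmax(scaled, axis=-1))          # attention weights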


Step 5: Calculate softmax

Take the softmax across these attention scores (blue).

softmax([2, 4, 4]) ≈ [0.0, 0.5, 0.5]
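These values are rounded for readability; a quick check of the exact values (a throwaway snippet, not from the original post):

import tensorflow as tf

scores = tf.constant([2.0, 4.0, 4.0])
print(tf.nn.softmax(scores))   # ≈ [0.0634, 0.4683, 0.4683]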

Step 6: Multiply scores with values

The softmaxed attention scores for each input (blue) are multiplied by their corresponding values (purple). This results in 3 alignment vectors (yellow). In this tutorial, we'll refer to them as weighted values.

1: 0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
2: 0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
3: 0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]

Step 7: Sum weighted values to get Output 1

Take all the weighted values (yellow) and sum them element-wise:

  [0.0, 0.0, 0.0]
+ [1.0, 4.0, 0.0]
+ [1.0, 3.0, 1.5]
-----------------
= [2.0, 7.0, 1.5]

The resulting vector [2.0, 7.0, 1.5] (dark green) is Output 1, which is based on the query representation from Input 1 interacting with all the keys, including its own.


Step 8: Repeat for Input 2 & Input 3

The dimensions of query and key must match, because the two are combined with a dot product. The dimension of value, however, may differ from that of query and key.

The resulting output will consequently follow the dimension of value.
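A minimal sketch of this point (the shapes here are hypothetical and unrelated to the walkthrough's hand-picked weights): keeping the key/query dimension at 3 but giving value a dimension of 5 produces outputs of dimension 5.

import tensorflow as tf

x = tf.random.normal((3, 4))           # 3 inputs of dimension 4
w_key   = tf.random.normal((4, 3))     # key dimension d_k = 3
w_query = tf.random.normal((4, 3))     # query dimension must match d_k
w_value = tf.random.normal((4, 5))     # value dimension d_v = 5 may differ

scores  = (x @ w_query) @ tf.transpose(x @ w_key)         # shape (3, 3)
outputs = tf.nn.softmax(scores, axis=-1) @ (x @ w_value)
print(outputs.shape)                   # (3, 5) -- the output follows the value dimension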


2. Self-Attention: Code Walkthrough

Step 1: Prepare the input X

import tensorflow as tf

x = [[1, 0, 1, 0],  # Input 1
     [0, 2, 0, 2],  # Input 2
     [1, 1, 1, 1]]  # Input 3
x = tf.Variable(x, dtype=tf.float32)

Step 2: Initialise the weights W

The weights are usually initialised with a random distribution such as Gaussian, Xavier or Kaiming; this initialisation is done once before training starts.

w_key = [[0, 0, 1],
         [1, 1, 0],
         [0, 1, 0],
         [1, 1, 0]]
w_query = [[1, 0, 1],
           [1, 0, 0],
           [0, 0, 1],
           [0, 1, 1]]
w_value = [[0, 2, 0],
           [0, 3, 0],
           [1, 0, 3],
           [1, 1, 0]]

w_key = tf.Variable(w_key, dtype=tf.float32)
w_query = tf.Variable(w_query, dtype=tf.float32)
w_value = tf.Variable(w_value, dtype=tf.float32)

Step 3: Compute K, Q and V


keys = x @ w_key
querys = x @ w_query
values = x @ w_value

print(keys)
# [[0., 1., 1.],
#  [4., 4., 0.],
#  [2., 3., 1.]]

print(querys)
# [[1., 0., 2.],
#  [2., 2., 2.],
#  [2., 1., 3.]]

print(values)
# [[1., 2., 3.],
#  [2., 8., 0.],
#  [2., 6., 3.]]

Step 4: Compute the attention scores

The attention scores are obtained by taking the dot product of Q with the transpose of K.

attn_scores = querys @ tf.transpose(keys, perm=[1, 0])
print(attn_scores)
# [[ 2.,  4.,  4.],   # attention scores from Query 1
#  [ 4., 16., 12.],   # attention scores from Query 2
#  [ 4., 12., 10.]]   # attention scores from Query 3

Step 5: Compute the softmax

The first variant below does not divide the scores by √d_k before the softmax:

attn_scores_softmax = tf.nn.softmax(attn_scores)
print(attn_scores_softmax)
# [[6.3379e-02, 4.6831e-01, 4.6831e-01],
#  [6.0337e-06, 9.8201e-01, 1.7986e-02],
#  [2.9539e-04, 8.8054e-01, 1.1917e-01]]

# For readability, approximate the above as follows
attn_scores_softmax = [[0.0, 0.5, 0.5],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.9, 0.1]]
attn_scores_softmax = tf.Variable(attn_scores_softmax)
print(attn_scores_softmax)

The second variant divides the scores by √d_k (here √3 ≈ 1.7) before the softmax:


attn_scores = attn_scores / 1.7
print(attn_scores)

# For readability, approximate the scaled scores
attn_scores = [[1.2, 2.4, 2.4],
               [2.4, 9.4, 7.1],
               [2.4, 7.1, 5.9]]
attn_scores = tf.Variable(attn_scores, dtype=tf.float32)
print(attn_scores)

attn_scores_softmax = tf.nn.softmax(attn_scores)
print(attn_scores_softmax)

# For readability, approximate the softmaxed scores
attn_scores_softmax = [[0.1, 0.4, 0.4],
                       [0.0, 0.9, 0.0],
                       [0.0, 0.7, 0.2]]
attn_scores_softmax = tf.Variable(attn_scores_softmax, dtype=tf.float32)
print(attn_scores_softmax)


Steps 6 and 7 computed together in matrix form (using the unscaled softmax scores from the first variant):

print(attn_scores_softmax)
print(values)
outputs = tf.matmul(attn_scores_softmax, values)
print(outputs)

<tf.Variable 'Variable:0' shape=(3, 3) dtype=float32, numpy=
array([[0. , 0.5, 0.5],
       [0. , 1. , 0. ],
       [0. , 0.9, 0.1]], dtype=float32)>
tf.Tensor(
[[1. 2. 3.]
 [2. 8. 0.]
 [2. 6. 3.]], shape=(3, 3), dtype=float32)
tf.Tensor(
[[2.        7.        1.5      ]
 [2.        8.        0.       ]
 [2.        7.7999997 0.3      ]], shape=(3, 3), dtype=float32)

(The original post also showed, as an image, the outputs computed from the √d_k-scaled scores; that figure is not reproduced here.)

Step 6: Multiply scores with values

weighted_values = values[:, None] * tf.transpose(attn_scores_softmax, perm=[1, 0])[:, :, None]
print(weighted_values)
# [[[0.0000, 0.0000, 0.0000],
#   [0.0000, 0.0000, 0.0000],
#   [0.0000, 0.0000, 0.0000]],
#
#  [[1.0000, 4.0000, 0.0000],
#   [2.0000, 8.0000, 0.0000],
#   [1.8000, 7.2000, 0.0000]],
#
#  [[1.0000, 3.0000, 1.5000],
#   [0.0000, 0.0000, 0.0000],
#   [0.2000, 0.6000, 0.3000]]]

Step 7: Sum weighted values

outputs = tf.reduce_sum(weighted_values, axis=0)
print(outputs)
# [[2.0000, 7.0000, 1.5000],   # Output 1
#  [2.0000, 8.0000, 0.0000],   # Output 2
#  [2.0000, 7.8000, 0.3000]]   # Output 3
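Putting the whole walkthrough into a single function (a sketch under the same assumptions and variable names as the code above, not taken from the original post):

import tensorflow as tf

def self_attention(x, w_key, w_query, w_value, scale=False):
    # Single-head self-attention, following the steps above.
    keys   = x @ w_key
    querys = x @ w_query
    values = x @ w_value
    scores = querys @ tf.transpose(keys, perm=[1, 0])
    if scale:
        # Optional scaled dot-product variant: divide by sqrt(d_k).
        scores = scores / tf.sqrt(tf.cast(tf.shape(keys)[-1], tf.float32))
    weights = tf.nn.softmax(scores, axis=-1)
    return weights @ values

print(self_attention(x, w_key, w_query, w_value))
# ≈ [[2.0, 7.0, 1.5], [2.0, 8.0, 0.0], [2.0, 7.8, 0.3]]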

