Graph Attention Network (GAT): Model Walkthrough and Implementation (TensorFlow 2.0)
In previous articles, we covered two of the "three musketeers" of graph neural networks (GCN, GraphSAGE, and GAT).
This post covers the third: GAT (Graph Attention Network). GAT uses an attention mechanism to take a weighted sum over each node's neighbors and, like attention mechanisms in general, proceeds in two steps: computing the attention coefficients, then taking the weighted sum.
The Attention Mechanism in GAT
First, look at the input and output of each layer:
$$\text{input}: \mathbf{h}=\left\{\vec{h}_{1}, \vec{h}_{2}, \ldots, \vec{h}_{N}\right\}, \quad \vec{h}_{i} \in \mathbb{R}^{F}$$
$$\text{output}: \mathbf{h}^{\prime}=\left\{\vec{h}_{1}^{\prime}, \vec{h}_{2}^{\prime}, \ldots, \vec{h}_{N}^{\prime}\right\}, \quad \vec{h}_{i}^{\prime} \in \mathbb{R}^{F^{\prime}}$$
1. Computing the attention coefficients
First, compute a similarity score between node $i$ and each of its neighbors $j \in \mathcal{N}_i$:
$$e_{ij}=a\left(\mathbf{W} \vec{h}_{i}, \mathbf{W} \vec{h}_{j}\right)$$
In $\left(\mathbf{W} \vec{h}_{i}, \mathbf{W} \vec{h}_{j}\right)$, the features $h_i$ and $h_j$ share the parameter matrix $\mathbf{W}$: both are linearly transformed by $\mathbf{W}$, and $(\cdot, \cdot)$ here denotes concatenating the two vectors. The outer $a$ is a single-layer feedforward network (with LeakyReLU activation) whose output is a scalar.
With the raw scores in hand, the next step is normalization. The authors use softmax, so the full computation of the attention coefficients is:
$$\begin{aligned} \alpha_{ij} &= \operatorname{softmax}_{j}\left(e_{ij}\right) \\ &= \frac{\exp\left(e_{ij}\right)}{\sum_{k \in \mathcal{N}_{i}} \exp\left(e_{ik}\right)} \\ &= \frac{\exp\left(\operatorname{LeakyReLU}\left(\vec{\mathbf{a}}^{T}\left[\mathbf{W} \vec{h}_{i} \,\|\, \mathbf{W} \vec{h}_{j}\right]\right)\right)}{\sum_{k \in \mathcal{N}_{i}} \exp\left(\operatorname{LeakyReLU}\left(\vec{\mathbf{a}}^{T}\left[\mathbf{W} \vec{h}_{i} \,\|\, \mathbf{W} \vec{h}_{k}\right]\right)\right)} \end{aligned}$$
where $\|$ denotes vector concatenation. The figure below illustrates this computation:
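To make the formulas concrete, here is a minimal NumPy sketch. All sizes, the random features, and the adjacency matrix are illustrative, not from the article. It computes every $e_{ij}$ with a single broadcast add (the same trick the Conv1D-based implementation later in this post uses) and masks non-neighbors with a large negative bias before the softmax:

```python
import numpy as np

rng = np.random.default_rng(0)

N, F, F_out = 4, 5, 8               # toy sizes: nodes, input dim, output dim
h = rng.normal(size=(N, F))         # node features
W = rng.normal(size=(F, F_out))     # shared linear transform
a = rng.normal(size=(2 * F_out,))   # attention vector a, split into two halves

adj = np.array([[1, 1, 0, 0],       # toy adjacency matrix (with self-loops)
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]])

Wh = h @ W                          # (N, F_out): W h_i for every node i
# a^T [Wh_i || Wh_j] splits into a_1^T Wh_i + a_2^T Wh_j,
# so all N*N pairwise scores come from one broadcast add
f1 = Wh @ a[:F_out]                 # (N,) contribution of node i
f2 = Wh @ a[F_out:]                 # (N,) contribution of node j
e = f1[:, None] + f2[None, :]       # (N, N): e_ij for every pair
e = np.where(e > 0, e, 0.2 * e)     # LeakyReLU with slope 0.2

# mask non-neighbors with a large negative number before softmax
bias = np.where(adj > 0, 0.0, -1e9)
alpha = np.exp(e + bias)
alpha /= alpha.sum(axis=1, keepdims=True)   # softmax over each node's neighbors
```

After normalization each row of `alpha` sums to 1 and is zero outside $\mathcal{N}_i$, which is exactly what the `bias_mat` trick in the implementation achieves.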
2. Weighted sum
With the attention coefficients computed, the neighbors' features are aggregated by a weighted sum:
$$\vec{h}_{i}^{\prime}=\sigma\left(\sum_{j \in \mathcal{N}_{i}} \alpha_{ij} \mathbf{W} \vec{h}_{j}\right)$$
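In matrix form, this weighted sum for all nodes at once is just the coefficient matrix times the transformed features. A small NumPy sketch with random stand-in values and hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
N, F_out = 4, 8
Wh = rng.normal(size=(N, F_out))          # stand-in for the transformed features W h_j
alpha = rng.random(size=(N, N))           # stand-in attention coefficients
alpha /= alpha.sum(axis=1, keepdims=True) # rows normalized like a softmax output

def elu(x):
    # sigma: ELU nonlinearity (the activation the TF code below also uses)
    return np.where(x > 0, x, np.exp(x) - 1)

# h'_i = sigma( sum_j alpha_ij * W h_j )  ->  one matmul covers every i
h_out = elu(alpha @ Wh)                   # shape (N, F_out)
```

This is precisely the `tf.matmul(coefs, seq_fts)` step in the single-head implementation further down.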
To improve learning, the authors additionally use "multi-head attention", i.e. $K$ independent attention mechanisms. Their outputs can be combined in two ways when aggregating over the neighbors.
One option is concatenation, which makes the aggregated feature dimension $K$ times larger:
$$\vec{h}_{i}^{\prime}=\big\|_{k=1}^{K}\, \sigma\left(\sum_{j \in \mathcal{N}_{i}} \alpha_{ij}^{k} \mathbf{W}^{k} \vec{h}_{j}\right)$$
The other option is to average the results of the $K$ attention heads:
$$\vec{h}_{i}^{\prime}=\sigma\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_{i}} \alpha_{ij}^{k} \mathbf{W}^{k} \vec{h}_{j}\right)$$
The figure below helps visualize this computation:
Although multiple heads are used, the different heads can be computed in parallel.
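The shape difference between the two combination schemes can be checked with a toy NumPy example (the per-head outputs are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)
N, F_out, K = 4, 8, 3                  # toy sizes: nodes, per-head output dim, heads
# pretend each head has already produced its aggregated output
head_outputs = [rng.normal(size=(N, F_out)) for _ in range(K)]

# hidden layers: concatenate -> feature dim grows to K * F_out
h_concat = np.concatenate(head_outputs, axis=-1)   # shape (N, K * F_out)

# output layer: average -> feature dim stays F_out
h_avg = np.mean(head_outputs, axis=0)              # shape (N, F_out)
```

Note that in the formulas the placement of $\sigma$ also differs: with concatenation each head is passed through $\sigma$ before being joined, while with averaging $\sigma$ is applied once, after the mean. The paper uses concatenation for hidden layers and averaging for the output layer.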
Advantages of GAT

- The attention computation is parallelizable across node-neighbor pairs, with no expensive matrix operations such as the eigendecompositions used by spectral methods.
- Attention assigns different importance to different neighbors, which can also aid interpretability.
- Only local neighborhoods are needed rather than the full graph structure up front, so GAT applies directly to inductive settings, including graphs unseen during training.
Code Implementation (TensorFlow 2.0)
Parameter configuration:
```python
# training params
batch_size = 1
nb_epochs = 100000
patience = 100       # early-stopping patience
lr = 0.005           # learning rate
l2_coef = 0.0005     # L2 regularization coefficient
hid_units = [8]      # hidden units per layer inside each attention head
n_heads = [8, 1]     # number of attention heads in each layer
residual = False
# ...
```

Single Attention Head implementation:
```python
import tensorflow as tf

class attn_head(tf.keras.layers.Layer):
    def __init__(self, hidden_dim, nb_nodes=None, in_drop=0.0, coef_drop=0.0,
                 activation=tf.nn.elu, residual=False):
        super(attn_head, self).__init__()
        self.activation = activation
        self.residual = residual
        self.in_dropout = tf.keras.layers.Dropout(in_drop)
        self.coef_dropout = tf.keras.layers.Dropout(coef_drop)
        self.conv_no_bias = tf.keras.layers.Conv1D(hidden_dim, 1, use_bias=False)
        self.conv_f1 = tf.keras.layers.Conv1D(1, 1)
        self.conv_f2 = tf.keras.layers.Conv1D(1, 1)
        self.conv_residual = tf.keras.layers.Conv1D(hidden_dim, 1)
        self.bias_zero = tf.Variable(tf.zeros(hidden_dim))

    def __call__(self, seq, bias_mat, training):
        # seq: input node features
        seq = self.in_dropout(seq, training=training)
        # hidden_dim 1-D convolutions with kernel size 1: each filter applies
        # the same weights at every node, so this is exactly the linear
        # transform Wh from the formula
        # seq_fts.shape: (num_graph, num_nodes, hidden_dim)
        seq_fts = self.conv_no_bias(seq)
        # a 1x1 convolution over the hidden_dim channel with shared parameters
        # is equivalent to a single-output dense layer
        # f_1.shape: (num_graph, num_nodes, 1)
        f_1 = self.conv_f1(seq_fts)
        # a second single-output dense layer
        f_2 = self.conv_f2(seq_fts)
        # broadcasting: (num_graph, num_nodes, 1) + (num_graph, 1, num_nodes)
        # logits.shape: (num_graph, num_nodes, num_nodes)
        # this computes e_ij for all node pairs at once
        logits = f_1 + tf.transpose(f_2, [0, 2, 1])
        # attention coefficients alpha_ij for the neighbors:
        # bias_mat holds a large negative number for non-neighbors,
        # which becomes 0 after the softmax
        # coefs.shape: (num_graph, num_nodes, num_nodes)
        coefs = tf.nn.softmax(tf.nn.leaky_relu(logits) + bias_mat)
        # dropout
        coefs = self.coef_dropout(coefs, training=training)
        seq_fts = self.in_dropout(seq_fts, training=training)
        # compute [alpha_ij] x Wh
        # vals.shape: (num_graph, num_nodes, hidden_dim)
        vals = tf.matmul(coefs, seq_fts)
        vals = tf.cast(vals, dtype=tf.float32)
        # finally add a bias
        ret = vals + self.bias_zero
        # residual connection
        if self.residual:
            if seq.shape[-1] != ret.shape[-1]:
                ret = ret + self.conv_residual(seq)
            else:
                ret = ret + seq
        # return h' = sigma([alpha_ij] x Wh)
        # shape: (num_graph, num_nodes, hidden_dim)
        return self.activation(ret)
```

Multi-head attention code:
```python
class inference(tf.keras.layers.Layer):
    def __init__(self, n_heads, hid_units, nb_classes, nb_nodes, Sparse,
                 ffd_drop=0.0, attn_drop=0.0, activation=tf.nn.elu, residual=False):
        super(inference, self).__init__()
        # choose_attn_head picks the sparse or dense attn_head variant
        attned_head = choose_attn_head(Sparse)
        self.attns = []
        self.sec_attns = []
        self.final_attns = []
        self.final_sum = n_heads[-1]
        # build n_heads[0] attention heads for the first layer
        for i in range(n_heads[0]):
            self.attns.append(attned_head(hidden_dim=hid_units[0], nb_nodes=nb_nodes,
                                          in_drop=ffd_drop, coef_drop=attn_drop,
                                          activation=activation, residual=residual))
        # hid_units gives the hidden size of each layer inside an attention head;
        # with hid_units = [8] there is a single hidden layer,
        # so the loop below does not execute
        for i in range(1, len(hid_units)):
            sec_attns = []
            for j in range(n_heads[i]):
                sec_attns.append(attned_head(hidden_dim=hid_units[i], nb_nodes=nb_nodes,
                                             in_drop=ffd_drop, coef_drop=attn_drop,
                                             activation=activation, residual=residual))
            self.sec_attns.append(sec_attns)
        # output layer
        for i in range(n_heads[-1]):
            self.final_attns.append(attned_head(hidden_dim=nb_classes, nb_nodes=nb_nodes,
                                                in_drop=ffd_drop, coef_drop=attn_drop,
                                                activation=lambda x: x, residual=residual))

    def __call__(self, inputs, bias_mat, training):
        first_attn = []
        out = []
        # run the n_heads[0] first-layer attention heads
        for indiv_attn in self.attns:
            first_attn.append(indiv_attn(seq=inputs, bias_mat=bias_mat, training=training))
        # h_1.shape: (num_graph, num_nodes, hidden_dim * n_heads[0])
        h_1 = tf.concat(first_attn, axis=-1)
        # if the attention stack has more layers, evaluate them in turn
        for sec_attns in self.sec_attns:
            next_attn = []
            for indiv_attn in sec_attns:
                next_attn.append(indiv_attn(seq=h_1, bias_mat=bias_mat, training=training))
            h_1 = tf.concat(next_attn, axis=-1)
        # final prediction from the output heads
        for indiv_attn in self.final_attns:
            out.append(indiv_attn(seq=h_1, bias_mat=bias_mat, training=training))
        # average the n_heads[-1] output heads
        # logits.shape: (num_graph, num_nodes, nb_classes)
        logits = tf.add_n(out) / self.final_sum
        return logits
```

The GAT model:
```python
class GAT(tf.keras.Model):
    def __init__(self, hid_units, n_heads, nb_classes, nb_nodes, Sparse,
                 ffd_drop=0.0, attn_drop=0.0, activation=tf.nn.elu, residual=False):
        super(GAT, self).__init__()
        '''
        hid_units:  hidden units per layer in each attention head
        n_heads:    number of attention heads in each layer
        nb_classes: number of classes (7 for Cora)
        nb_nodes:   number of nodes (2708 for Cora)
        activation: activation function
        residual:   whether to use residual connections
        '''
        self.hid_units = hid_units      # [8]
        self.n_heads = n_heads          # [8, 1]
        self.nb_classes = nb_classes
        self.nb_nodes = nb_nodes
        self.activation = activation
        self.residual = residual
        self.inferencing = inference(n_heads, hid_units, nb_classes, nb_nodes,
                                     Sparse=Sparse, ffd_drop=ffd_drop,
                                     attn_drop=attn_drop, activation=activation,
                                     residual=residual)

    def masked_softmax_cross_entropy(self, logits, labels, mask):
        loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels)
        mask = tf.cast(mask, dtype=tf.float32)
        mask /= tf.reduce_mean(mask)
        loss *= mask
        return tf.reduce_mean(loss)

    def masked_accuracy(self, logits, labels, mask):
        correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
        accuracy_all = tf.cast(correct_prediction, tf.float32)
        mask = tf.cast(mask, dtype=tf.float32)
        mask /= tf.reduce_mean(mask)
        accuracy_all *= mask
        return tf.reduce_mean(accuracy_all)

    def __call__(self, inputs, training, bias_mat, lbl_in, msk_in):
        # logits.shape: (num_graph, num_nodes, nb_classes)
        logits = self.inferencing(inputs=inputs, bias_mat=bias_mat, training=training)
        log_resh = tf.reshape(logits, [-1, self.nb_classes])
        lab_resh = tf.reshape(lbl_in, [-1, self.nb_classes])
        msk_resh = tf.reshape(msk_in, [-1])
        loss = self.masked_softmax_cross_entropy(log_resh, lab_resh, msk_resh)
        lossL2 = tf.add_n([tf.nn.l2_loss(v) for v in self.trainable_variables
                           if v.name not in ['bias', 'gamma', 'b', 'g', 'beta']]) * l2_coef
        loss = loss + lossL2
        accuracy = self.masked_accuracy(log_resh, lab_resh, msk_resh)
        return logits, accuracy, loss
```

Full code: github.com/zxxwin/Graph-Attention-Networks-tensorflow2.0
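The `mask /= tf.reduce_mean(mask)` rescaling in `masked_softmax_cross_entropy` can look cryptic. This NumPy sketch (toy logits and labels, not the Cora data) shows that it makes the mean over all nodes equal the plain mean over just the masked nodes:

```python
import numpy as np

rng = np.random.default_rng(3)
N, C = 6, 3                                      # toy sizes: nodes, classes
logits = rng.normal(size=(N, C))
labels = np.eye(C)[rng.integers(0, C, size=N)]   # random one-hot labels
mask = np.array([1, 1, 0, 0, 1, 0], dtype=np.float64)  # only 3 training nodes

# per-node softmax cross-entropy (numerically stable log-softmax)
z = logits - logits.max(axis=1, keepdims=True)
log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
loss = -(labels * log_probs).sum(axis=1)         # shape (N,)

# dividing the mask by its mean rescales it so that averaging over all
# N nodes gives the same value as averaging over the masked nodes only
scaled_mask = mask / mask.mean()
masked_loss = (loss * scaled_mask).mean()
```

The same rescaling is applied in `masked_accuracy`, so both metrics are computed only over the nodes selected by the train/validation/test mask.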