Multi-head Selection and Deep Biaffine Attention for Relation Extraction
Contents: Preface; Multi-head Selection (1. Joint entity recognition and relation extraction as a multi-head selection problem; 2. BERT-Based Multi-Head Selection for Joint Entity-Relation Extraction; 3. Implementation and model code); Deep Biaffine Attention (1. Deep Biaffine Attention for Neural Dependency Parsing; 2. Named Entity Recognition as Dependency Parsing); Biaffine vs. Multi-head; Summary
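Preface
I have recently been digging into the article "One Person, Three Leaderboards Topped! Lessons from Winning an Information Extraction Competition" (一人之力,刷爆三路榜單!信息抽取競賽奪冠經驗分享), which gives a brief overview of decoding schemes for relation extraction. In an earlier post I implemented a pointer tagging network and optimized it (see "An Improved Cascaded Pointer Network That Raised My Model's F1 by Nearly 4%"). This time I went back to the original papers behind the multi-head selection and Biaffine relation matrix approaches mentioned in that article, then reproduced and adapted them for a real relation extraction task.
This post mainly covers the following papers:
Multi-head selection and relation extraction: Joint entity recognition and relation extraction as a multi-head selection problem
BERT version of multi-head selection: BERT-Based Multi-Head Selection for Joint Entity-Relation Extraction
Biaffine attention: Deep Biaffine Attention for Neural Dependency Parsing
A newer paper that builds a span matrix with the Biaffine mechanism: Named Entity Recognition as Dependency Parsing
This post only records my own thinking and implementation and gives a simple reading of the core of each paper. It may contain mistakes; if you are interested, I recommend reading the original papers.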
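Multi-head Selection
1. Joint entity recognition and relation extraction as a multi-head selection problem
This architecture performs entity recognition and relation extraction jointly. A CRF on top of the hidden layer first extracts the entity labels. The label information is then embedded and passed, together with the hidden states, to the sigmoid layer, where it interacts with the features of the other entities to extract relations. The entity extraction part is straightforward. For the multi-head selection part, the paper gives a few core formulas.
They can be read as follows. The probability score that Token(j) is the end of the Subject and Token(i) is the end of the Object is obtained like this: the Z of Token(j) and of Token(i) (hidden state plus label embedding) are linearly transformed by U and W respectively and added together with the bias b; the sum is then transformed once more as a whole by V, and passing the result through a sigmoid turns it into the probability score. Hold on, though: the dimensions of each matrix in the multi-relation case are discussed later in this post.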
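The scoring rule described above can be written out as a formula. This is a reconstruction from the description in this post; the element-wise activation f is taken from the original paper and is not mentioned in the text above, so treat its exact placement as an assumption.

```latex
\begin{aligned}
s^{(r)}(z_j, z_i, r_k) &= V^{(r)} \, f\!\left(U^{(r)} z_j + W^{(r)} z_i + b^{(r)}\right) \\
\Pr(\text{head} = w_j,\ \text{label} = r_k \mid w_i) &= \sigma\!\left(s^{(r)}(z_j, z_i, r_k)\right)
\end{aligned}
```

Here z_j and z_i are the per-token inputs (hidden state concatenated with the label embedding), and σ is the sigmoid. Because every relation r_k gets its own independent sigmoid, one token pair can hold several relations at once.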
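2. BERT-Based Multi-Head Selection for Joint Entity-Relation Extraction
BERT + Multi-head Selection. Besides introducing BERT as the encoder, the other contributions are:
Soft label embedding. During training we can feed the embedding of the gold entity labels to the Multi-head Selection layer. At inference time, to make use of the scores that the softmax in the CRF part assigns to each label, and to account for possibly wrong predicted labels, the authors weight each label embedding by its softmax probability and pass the weighted result to the Multi-head Selection layer.
A sentence-level relation classification task is introduced to guide feature learning, using CLS to obtain a fixed-dimension feature as shown in the paper's figure. (I did not try this improvement, simply because I have no data for it, so I remain skeptical. The decoding pressure of two tasks already rests on one encoder, and now a sentence classification task is added on top. Does this not put more strain on the model? The experimental results in the paper show that adding Global Predicate Prediction alone brings no clear improvement, and the larger gain from combining several strategies is not necessarily due to this method.)
"Soft label embedding of the ith token hi is feed into two separate fully connected layers to get the subject representation hsi and object representation hoi. where f(·) means neural network, ri,j is the relation when the ith token is subject and the jth token is object, rj,i is the relation when the jth token is subject and the ith token is object." (Quoted from the paper. The multi-head selection method is the same as above: the constructed token features go through two different fully connected layers, and a network f then outputs the relation score of the pair.)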
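The soft label embedding idea from this section can be sketched without any framework. This is a minimal illustration under my own assumptions, not the paper's code: the function names and toy numbers are hypothetical, and a real model would do the same thing with a batched matrix multiplication.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of label scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_embedding(label_scores, label_embeddings):
    """Probability-weighted sum of label embeddings for one token.

    label_scores: one score per entity label (hypothetical tagger output).
    label_embeddings: one embedding vector per entity label.
    """
    probs = softmax(label_scores)
    dim = len(label_embeddings[0])
    return [
        sum(p * emb[d] for p, emb in zip(probs, label_embeddings))
        for d in range(dim)
    ]


# Toy values (assumed): 3 labels with 2-dimensional embeddings.
label_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
confident = soft_label_embedding([10.0, 0.0, 0.0], label_embeddings)
uncertain = soft_label_embedding([0.0, 0.0, 0.0], label_embeddings)
```

When the tagger is confident, the result is close to the embedding of the predicted label. When it is unsure, the result is a blend of all label embeddings, so a wrong hard decision does not get passed on at full strength.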
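3. Implementation and model code
Taking triple extraction as an example, how should multi-head selection be understood and applied? After some searching I found a figure in one of Xi Dajie's (夕大姐) articles. It has little to do with multi-head selection itself, but it illustrates the Multi-head Selection flow quite vividly:
The tokens of a sentence form N^2 possible SO combinations in total. Assume every token has already gone through a linear transformation or a fully connected layer. As in the figure, for each City we treat it as S and consider whether every other token can form an SO relation with it, so we compute the score of every token paired with City after the V transformation.
The implementation is also simple: build an N × N × hidden_size matrix of token combinations (N × N × 2·hidden_size in the code that follows, since the S and O features are concatenated) and pass it through a Dense layer with P_num units (the total number of relation types) to obtain the score of every token pair on every relation.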
import tensorflow as tf

# Encode is the encoder output, shape (batch, MAX_LEN, dim).
# Pass the token encodings through two separate dense layers to get the
# Subject and Object token sequences, i.e. U*zj and W*zi in the formula.
Subject = tf.keras.layers.Dense(hidden_size)(Encode)
Object = tf.keras.layers.Dense(hidden_size)(Encode)

Subject = tf.expand_dims(Subject, 1)  # (batch, 1, MAX_LEN, hidden_size)
Object = tf.expand_dims(Object, 2)    # (batch, MAX_LEN, 1, hidden_size)
Subject = tf.tile(Subject, multiples=(1, MAX_LEN, 1, 1))
Object = tf.tile(Object, multiples=(1, 1, MAX_LEN, 1))
concat_SO = tf.keras.layers.Concatenate(axis=-1)([Subject, Object])

# Pass the combined U*zj and W*zi through the dense layer V,
# V.shape = (2*hidden_size, P_num).
# With V split into two halves V1 and V2 this computes V1*U*zj + V2*W*zi + b,
# which plays the role of V(U*zj + W*zi + b) in the formula.
output_logist = tf.keras.layers.Dense(P_num, activation='sigmoid')(concat_SO)
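Building the samples: we need a labeled matrix of shape [MAX_LEN, MAX_LEN, P_num] as the Y value for the Multi-Head Selection output.
On labeling the samples:
An entity spans several characters, and entities may be nested. For example, in "我是歌手的主演是阿瓜" (roughly "the star of 我是歌手 [the show I Am a Singer] is 阿瓜"), the relations to extract are "the star of 我是歌手 is 阿瓜" and "阿瓜 is a 歌手 (singer)".
We do not need to mark all four tokens of "我是歌手" and both tokens of "阿瓜" as 1 for P = 主演 (starring). Since the entity extraction part already identifies the start and end of each entity, it is enough to set the P = 主演 score for S "手" and O "瓜" to 1, i.e. [3, 9, index of 主演] = 1 in this example. If 阿瓜 is also a singer, then [9, 3, index of 職業 (occupation)] = 1. We label the end character rather than the start character because nested entities of different types very rarely share the same end character.
That is the core idea and code of the Multi-head Selection model.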
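The labeling scheme above can be sketched in plain Python. This is a minimal illustration under my own assumptions: the helper names `build_relation_labels` and `decode_relations`, the relation-to-index mapping and the 0.5 threshold are all hypothetical, and a real pipeline would still map each end index back to the full entity span found by the entity extraction step.

```python
def build_relation_labels(max_len, relation2id, triples):
    """Build the [MAX_LEN, MAX_LEN, P_num] target as nested lists.

    triples: (subject_end_index, relation_name, object_end_index) tuples.
    Only the end tokens of S and O are marked, as described in the text.
    """
    p_num = len(relation2id)
    labels = [[[0] * p_num for _ in range(max_len)] for _ in range(max_len)]
    for s_end, relation, o_end in triples:
        labels[s_end][o_end][relation2id[relation]] = 1
    return labels


def decode_relations(scores, id2relation, threshold=0.5):
    """Read (subject_end, relation, object_end) back from a score tensor."""
    found = []
    for s_end, row in enumerate(scores):
        for o_end, cell in enumerate(row):
            for p, score in enumerate(cell):
                if score > threshold:
                    found.append((s_end, id2relation[p], o_end))
    return found


sentence = "我是歌手的主演是阿瓜"
relation2id = {"主演": 0, "職業": 1}  # assumed mapping, for illustration only
id2relation = {v: k for k, v in relation2id.items()}

show_end = sentence.index("我是歌手") + len("我是歌手") - 1   # index of "手"
person_end = sentence.index("阿瓜") + len("阿瓜") - 1          # index of "瓜"

triples = [(show_end, "主演", person_end), (person_end, "職業", show_end)]
labels = build_relation_labels(len(sentence), relation2id, triples)
decoded = decode_relations(labels, id2relation)
```

Decoding the gold labels returns the same two triples, which is a quick sanity check that the indexing convention [S end, O end, relation] is applied consistently on both sides.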
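The complete model code follows. Differences from the paper:
BERT replaces the LSTM encoder of the original paper.
The CRF is replaced with pointer tagging, and entity type information is introduced.
The hard entity label is concatenated with the entity's end_token before entering the Multi-head Selection layer. This was my own flash of inspiration: since we want the model to recognize the end_token of S and the end_token of O in the Multi-head Selection layer, we give valid entity label information only to these two tokens and encode the type of every other token as 0. Experiments confirm that this works better than passing the entity type embedding for all S and O tokens.
The soft labels described above are not used; they could be implemented with a custom layer.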
import tensorflow as tf
from transformers import TFBertModel


def build_model(pretrained_path, config, MAX_LEN, Cs_num, cs_em_size, R_num):
    ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    cs = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)  # entity type ids

    config.output_hidden_states = True
    bert_model = TFBertModel.from_pretrained(pretrained_path, config=config, from_pt=True)
    # The tuple unpacking assumes a transformers version whose model call
    # returns a plain tuple (sequence output, pooled output, hidden states).
    x, _, hidden_states = bert_model(ids, attention_mask=att)
    layer_1 = hidden_states[-1]

    # Pointer tagging for entity starts and ends, one channel per entity type.
    start_logits = tf.keras.layers.Dense(Cs_num, activation='sigmoid')(layer_1)
    start_logits = tf.keras.layers.Lambda(lambda x: x ** 2, name='s_start')(start_logits)

    end_logits = tf.keras.layers.Dense(Cs_num, activation='sigmoid')(layer_1)
    end_logits = tf.keras.layers.Lambda(lambda x: x ** 2, name='s_end')(end_logits)

    # Hard entity labels, embedded and concatenated with the token encoding.
    cs_emb = tf.keras.layers.Embedding(Cs_num, cs_em_size)(cs)
    concat_cs = tf.keras.layers.Concatenate(axis=-1)([layer_1, cs_emb])

    f1 = tf.keras.layers.Dense(128)(concat_cs)
    f2 = tf.keras.layers.Dense(128)(concat_cs)

    f1 = tf.expand_dims(f1, 1)
    f2 = tf.expand_dims(f2, 2)

    f1 = tf.tile(f1, multiples=(1, MAX_LEN, 1, 1))
    f2 = tf.tile(f2, multiples=(1, 1, MAX_LEN, 1))

    # Multi-head selection: score every token pair on every relation.
    concat_f = tf.keras.layers.Concatenate(axis=-1)([f1, f2])
    output_logist = tf.keras.layers.Dense(128, activation='relu')(concat_f)
    output_logist = tf.keras.layers.Dense(R_num, activation='sigmoid')(output_logist)
    output_logist = tf.keras.layers.Lambda(lambda x: x ** 4, name='relation')(output_logist)

    model = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[start_logits, end_logits, output_logist])
    model_2 = tf.keras.models.Model(inputs=[ids, att], outputs=[start_logits, end_logits])
    model_3 = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[output_logist])
    return model, model_2, model_3
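Model diagram: my ugly hand-drawn draft, just for fun.
Model performance: with no other processing at all, F1 reaches 79.157 on the 2019 Baidu triple extraction dataset. The model structure is also much simpler than cascaded sequence labeling and easier to digest.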
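Deep Biaffine Attention
1. Deep Biaffine Attention for Neural Dependency Parsing
This paper applies Biaffine to dependency parsing, which has a lot in common with relation extraction. The difference is that in dependency arc prediction there is only one kind of relation, whereas relation extraction involves multiple relation classes.
The paper uses a biaffine attention mechanism rather than a traditional MLP-based attention mechanism with a single affine classifier, or a bilinear classifier. The Multi-Head Selection above is precisely a relation classifier made up of several linear classifiers. Now we want to build a Biaffine Attention matrix that directly computes the attention between tokens for a given relation class.
(Here the head and dep of dependency parsing are simply referred to as the S and O of relation extraction.)
The token hidden states encoded by the BiLSTM are passed through two MLPs to obtain the feature representations of S and O. On this step the paper specifically notes: "Applying smaller MLPs to the recurrent output states before the biaffine classifier has the advantage of stripping away information not relevant to the current decision. That is, every top recurrent state ri will need to carry enough information to identify word i’s head, find all its dependents, exclude all its non-dependents, assign itself the correct label, and assign all its dependents their correct labels, as well as transfer any relevant information to the recurrent states of words before and after it. Thus ri necessarily contains significantly more information than is needed to compute any individual score, and training on this superfluous information needlessly reduces parsing speed and increases the risk of overfitting. Reducing dimensionality and applying a nonlinearity addresses both of these problems."
In other words, the output states of the LSTM layer must carry enough information to identify a word's head, find its dependents, exclude its non-dependents, assign dependency labels to itself and to all its dependents, and also pass any other relevant information to the units before and after it. Training on this superfluous information reduces speed and risks overfitting. Reducing the dimensionality of the LSTM output with an MLP and applying a nonlinearity addresses both problems.
Put simply, we want the MLP to reduce the high-dimensional, information-rich hidden state to a low-dimensional feature that can hold only the relation dependency information. This speeds up training on the one hand and curbs overfitting on the other.
Finally, we build a U (Biaffine) matrix to compute the dependency score between tokens, and introduce u to compute the prior probability of the head, which produces the bias b. Let the sequence length be d and the hidden_size after MLP compression be k. My ugly hand-drawn draft explains how the dimensions change through the matrix multiplications.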
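The dimension bookkeeping can also be written out as equations. This is a sketch using the symbols introduced above (d for sequence length, k for the compressed hidden size); the names S, O and R (number of relation classes) are my own shorthand, and the first line follows the arc scorer of the biaffine parsing paper.

```latex
\begin{aligned}
S, O &\in \mathbb{R}^{d \times k}, \qquad U \in \mathbb{R}^{k \times k}, \qquad u \in \mathbb{R}^{k} \\
\text{Score} &= \underbrace{S\,U\,O^{\top}}_{(d \times k)(k \times k)(k \times d)\;=\;d \times d}
 \;+\; \underbrace{(S\,u)\,\mathbf{1}_d^{\top}}_{(d \times k)(k \times 1)(1 \times d)\;=\;d \times d} \\
\text{Score}_{x,y} &= S_x^{\top} U\, O_y + S_x^{\top} u
\end{aligned}
```

For relation extraction with R relation classes, U becomes a tensor in R^{k × R × k} and the result has shape d × d × R, which matches the weight shape (in_size, out_size, in_size) used in the layer that follows.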
Biaffine implementation:
class Biaffine(tf.keras.layers.Layer):
    def __init__(self, in_size, out_size, bias_x=False, bias_y=False):
        super(Biaffine, self).__init__()
        self.bias_x = bias_x
        self.bias_y = bias_y
        self.U = self.add_weight(
            name='weight1',
            shape=(in_size + int(bias_x), out_size, in_size + int(bias_y)),
            trainable=True)

    def call(self, input1, input2):
        # Appending a column of ones folds a bias term into U.
        if self.bias_x:
            input1 = tf.concat((input1, tf.ones_like(input1[..., :1])), axis=-1)
        if self.bias_y:
            input2 = tf.concat((input2, tf.ones_like(input2[..., :1])), axis=-1)
        # (batch, x, i) * (i, o, j) * (batch, y, j) -> (batch, x, y, o)
        logits_1 = tf.einsum('bxi,ioj,byj->bxyo', input1, self.U, input2)
        return logits_1
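To make the einsum 'bxi,ioj,byj->bxyo' concrete, here is a framework-free sketch of the same contraction for a single sentence (no batch axis). The function name and the toy numbers are my own; only the index pattern is taken from the layer above.

```python
def biaffine_scores(subj, obj, U, bias_x=False, bias_y=False):
    """Pure-Python version of einsum('xi,ioj,yj->xyo') with optional bias.

    subj: [x][i] features of the first input (one row per token)
    obj:  [y][j] features of the second input
    U:    [i][o][j] biaffine weight tensor
    Returns scores[x][y][o].
    """
    if bias_x:
        subj = [row + [1.0] for row in subj]
    if bias_y:
        obj = [row + [1.0] for row in obj]
    out_size = len(U[0])
    return [
        [
            [
                sum(s[i] * U[i][o][j] * t[j]
                    for i in range(len(s)) for j in range(len(t)))
                for o in range(out_size)
            ]
            for t in obj
        ]
        for s in subj
    ]


# Toy values (assumed): 2 tokens, in_size = 2, out_size = 1.
# With bias_y=True the weight has shape 2 x 1 x 3.
subj = [[1.0, 0.0], [0.0, 2.0]]
obj = [[1.0, 1.0], [0.5, 0.0]]
U = [
    [[1.0, 0.0, 0.5]],   # i = 0
    [[0.0, 1.0, -1.0]],  # i = 1
]
scores = biaffine_scores(subj, obj, U, bias_y=True)
# scores == [[[1.5], [1.0]], [[0.0], [-2.0]]]
```

The last column of U (index j = 2) multiplies the appended 1, so it acts as a term that depends only on the first input. That is the role of the head prior u described earlier.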
Complete model code:
def build_model(pretrained_path, config, MAX_LEN, Cs_num, cs_em_size, R_num):
    ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    cs = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)

    config.output_hidden_states = True
    bert_model = TFBertModel.from_pretrained(pretrained_path, config=config, from_pt=True)
    x, _, hidden_states = bert_model(ids, attention_mask=att)
    layer_1 = hidden_states[-1]

    start_logits = tf.keras.layers.Dense(Cs_num, activation='sigmoid')(layer_1)
    start_logits = tf.keras.layers.Lambda(lambda x: x ** 2, name='s_start')(start_logits)

    end_logits = tf.keras.layers.Dense(Cs_num, activation='sigmoid')(layer_1)
    end_logits = tf.keras.layers.Lambda(lambda x: x ** 2, name='s_end')(end_logits)

    cs_emb = tf.keras.layers.Embedding(Cs_num, cs_em_size)(cs)
    concat_cs = tf.keras.layers.Concatenate(axis=-1)([layer_1, cs_emb])

    f1 = tf.keras.layers.Dense(128, activation='relu')(concat_cs)
    f2 = tf.keras.layers.Dense(128, activation='relu')(concat_cs)

    Biaffine_layer = Biaffine(128, R_num, bias_y=True)
    output_logist = Biaffine_layer(f1, f2)
    output_logist = tf.keras.layers.Activation('sigmoid')(output_logist)
    output_logist = tf.keras.layers.Lambda(lambda x: x ** 4, name='relation')(output_logist)

    model = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[start_logits, end_logits, output_logist])
    model_2 = tf.keras.models.Model(inputs=[ids, att], outputs=[start_logits, end_logits])
    model_3 = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[output_logist])
    return model, model_2, model_3
Model performance: F1 = 0.7964, about 0.005 higher than Multi-head Selection (0.79157).
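2. Named Entity Recognition as Dependency Parsing
"After obtaining the word representations from the BiLSTM, we apply two separate FFNNs to create different representations (hs/he) for the start/end of the spans. Using different representations for the start/end of the spans allow the system to learn to identify the start/end of the spans separately. This improves accuracy compared to the model which directly uses the outputs of the LSTM since the context of the start and end of the entity are different. Finally, we employ a biaffine model over the sentence to create a l×l×c scoring tensor(rm), where l is the length of the sentence and c is the number of NER categories + 1(for non-entity). We compute the score for a span i by:
$$h_s(i) = \mathrm{FFNN}_s(x_{s_i}), \qquad h_e(i) = \mathrm{FFNN}_e(x_{e_i})$$
$$r_m(i) = h_s(i)^{\top} U_m\, h_e(i) + W_m \left(h_s(i) \oplus h_e(i)\right) + b_m$$
where si and ei are the start and end indices of the span i, Um is a d × c × d tensor, Wm is a 2d × c matrix and bm is the bias."
On top of features extracted with BERT, fastText and char embeddings, the paper captures word representations with a BiLSTM and then, likewise, uses two separate fully connected networks to represent the start and end of an entity. This is more accurate than predicting the start and end directly from the encoder output, since the two carry different information.
These two sets of features are then fed to our protagonist, the Biaffine matrix. In this task we must identify not only the start and end of each entity but also its type C, so the target is an L × L × C result tensor (L is the sequence length, C the number of entity types).
Focus on the formula. The shapes and meanings of Um, Wm and bm are:
Um: tensor of shape d × c × d. Models the posterior probability of the entity type given that hs(i) is the start and he(i) is the end.
Wm: matrix of shape 2d × c. Models separately the posterior probability of the entity type given hs(i) as the start, or he(i) as the end.
bm: vector of shape c. Models the prior probability of entity type c.
For an entity with start hs(i), end he(i) and type C, the score is P(start is hs(i) & end is he(i), C) + P(start is hs(i), C) + P(end is he(i), C) + P(C).
Based on the formula we can write a new Biaffine layer: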
class Biaffine_2(tf.keras.layers.Layer):
    def __init__(self, in_size, out_size, MAX_LEN):
        super(Biaffine_2, self).__init__()
        # Bilinear term Um: (d, c, d)
        self.w1 = self.add_weight(
            name='weight1',
            shape=(in_size, out_size, in_size),
            trainable=True)
        # Linear term Wm with the bias bm folded in: (2d + 1, c)
        self.w2 = self.add_weight(
            name='weight2',
            shape=(2 * in_size + 1, out_size),
            trainable=True)
        self.MAX_LEN = MAX_LEN

    def call(self, input1, input2):
        f1 = tf.expand_dims(input1, 2)
        f2 = tf.expand_dims(input2, 1)
        f1 = tf.tile(f1, multiples=(1, 1, self.MAX_LEN, 1))
        f2 = tf.tile(f2, multiples=(1, self.MAX_LEN, 1, 1))
        concat_f1f2 = tf.concat((f1, f2), axis=-1)
        # Append a constant 1 so the last row of w2 acts as the bias.
        concat_f1f2 = tf.concat((concat_f1f2, tf.ones_like(concat_f1f2[..., :1])), axis=-1)

        logits_1 = tf.einsum('bxi,ioj,byj->bxyo', input1, self.w1, input2)
        logits_2 = tf.einsum('bijy,yo->bijo', concat_f1f2, self.w2)
        return logits_1 + logits_2
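The three terms of the span score (bilinear, linear, bias) can be checked on a single start/end pair in plain Python. This is a sketch with hypothetical names and toy numbers. It also shows that folding the bias into the linear weight, as the layer above does with its appended column of ones, gives the same result.

```python
def span_score(h_s, h_e, U, W, b):
    """Score of one (start, end) pair for every class.

    h_s, h_e: length-d vectors; U: [d][c][d]; W: [2d][c]; b: [c].
    """
    d, c = len(h_s), len(b)
    concat = h_s + h_e
    scores = []
    for k in range(c):
        bilinear = sum(h_s[i] * U[i][k][j] * h_e[j]
                       for i in range(d) for j in range(d))
        linear = sum(concat[m] * W[m][k] for m in range(2 * d))
        scores.append(bilinear + linear + b[k])
    return scores


def span_score_folded(h_s, h_e, U, W_ext):
    """Same score with the bias stored as the last row of W_ext: [2d + 1][c]."""
    zero_bias = [0.0] * len(W_ext[0])
    base = span_score(h_s, h_e, U, W_ext[:-1], zero_bias)
    return [s + w for s, w in zip(base, W_ext[-1])]


# Toy values (assumed): d = 2 features, c = 2 classes.
h_s = [1.0, 2.0]
h_e = [0.5, -1.0]
U = [
    [[1.0, 0.0], [0.0, 1.0]],  # U[0][k][j]
    [[0.0, 1.0], [1.0, 0.0]],  # U[1][k][j]
]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
b = [0.25, -0.5]

scores = span_score(h_s, h_e, U, W, b)
folded = span_score_folded(h_s, h_e, U, W + [b])
# scores == [0.25, 2.0]
```

Keeping the bias as an extra row of the linear weight means one einsum covers both the W term and the b term, at the cost of materializing the tiled and concatenated feature tensor.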
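Attempt 1
The best-performing model setup among my attempts so far (to be updated):
Entity labels are a fairly strong feature; knowing the entity labels of S and O is mostly enough to determine their relation, so the labels are fed directly into the Biaffine matrix.
The encodings of the last two BERT layers go into the Biaffine computation to produce the relation matrix.
An extra fully connected layer is added to the entity extraction head to separate the entity extraction and relation extraction tasks to an appropriate degree.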
def build_model_3(pretrained_path, config, MAX_LEN, Cs_num, cs_em_size, R_num):
    ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    cs = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)

    config.output_hidden_states = True
    bert_model = TFBertModel.from_pretrained(pretrained_path, config=config, from_pt=True)
    x, _, hidden_states = bert_model(ids, attention_mask=att)
    layer_1 = hidden_states[-1]
    layer_2 = hidden_states[-2]

    # Entity heads get their own dense layer before the pointer outputs.
    start_logits = tf.keras.layers.Dense(256, activation='relu')(layer_1)
    start_logits = tf.keras.layers.Dense(Cs_num, activation='sigmoid')(start_logits)
    start_logits = tf.keras.layers.Lambda(lambda x: x ** 2, name='s_start')(start_logits)

    end_logits = tf.keras.layers.Dense(256, activation='relu')(layer_1)
    end_logits = tf.keras.layers.Dense(Cs_num, activation='sigmoid')(end_logits)
    end_logits = tf.keras.layers.Lambda(lambda x: x ** 2, name='s_end')(end_logits)

    cs_emb = tf.keras.layers.Embedding(Cs_num, cs_em_size)(cs)
    # The relation branch uses the last two BERT layers.
    concat_cs = tf.keras.layers.Concatenate(axis=-1)([layer_1, layer_2])

    f1 = tf.keras.layers.Dropout(0.2)(concat_cs)
    f1 = tf.keras.layers.Dense(256, activation='relu')(f1)
    f1 = tf.keras.layers.Dense(128, activation='relu')(f1)
    f1 = tf.keras.layers.Concatenate(axis=-1)([f1, cs_emb])

    f2 = tf.keras.layers.Dropout(0.2)(concat_cs)
    f2 = tf.keras.layers.Dense(256, activation='relu')(f2)
    f2 = tf.keras.layers.Dense(128, activation='relu')(f2)
    f2 = tf.keras.layers.Concatenate(axis=-1)([f2, cs_emb])

    Biaffine_layer = Biaffine_2(128 + cs_em_size, R_num, MAX_LEN)
    output_logist = Biaffine_layer(f1, f2)
    output_logist = tf.keras.layers.Activation('sigmoid')(output_logist)
    output_logist = tf.keras.layers.Lambda(lambda x: x ** 4, name='relation')(output_logist)

    model = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[start_logits, end_logits, output_logist])
    model_2 = tf.keras.models.Model(inputs=[ids, att], outputs=[start_logits, end_logits])
    model_3 = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[output_logist])
    return model, model_2, model_3
F1 = 0.7986, about 0.0022 higher than the Biaffine baseline above (0.7964).
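Attempt 2
The entity extraction part uses a Biaffine matrix instead of sequence labeling, with a softmax-activated output.
The last two BERT hidden layers are used to encode both the entity matrix and the relation matrix.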
def build_model(pretrained_path, config, MAX_LEN, Cs_num, cs_em_size, R_num):
    ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    cs = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)

    config.output_hidden_states = True
    bert_model = TFBertModel.from_pretrained(pretrained_path, config=config, from_pt=True)
    x, pooling, hidden_states = bert_model(ids, attention_mask=att)
    layer_1 = hidden_states[-1]
    layer_2 = hidden_states[-2]
    concat_cs = tf.keras.layers.Concatenate(axis=-1)([layer_1, layer_2])

    # Entity extraction: a span matrix of shape (MAX_LEN, MAX_LEN, Cs_num).
    start_logits = tf.keras.layers.Dense(128, activation='relu')(concat_cs)
    end_logits = tf.keras.layers.Dense(128, activation='relu')(concat_cs)
    S_Biaffine_layer = Biaffine_2(128, Cs_num, MAX_LEN)
    S_logits = S_Biaffine_layer(start_logits, end_logits)
    S_output_logist = tf.keras.layers.Activation('softmax', name='s_token')(S_logits)

    # Relation extraction: separate S and O projections.
    S_logits = tf.keras.layers.Dense(256, activation='relu')(concat_cs)
    S_logits = tf.keras.layers.Dropout(0.2)(S_logits)
    S_logits = tf.keras.layers.Dense(128, activation='relu')(S_logits)

    O_logits = tf.keras.layers.Dense(256, activation='relu')(concat_cs)
    O_logits = tf.keras.layers.Dropout(0.2)(O_logits)
    O_logits = tf.keras.layers.Dense(128, activation='relu')(O_logits)

    cs_emb = tf.keras.layers.Embedding(Cs_num, cs_em_size)(cs)
    f1 = tf.keras.layers.Concatenate(axis=-1)([S_logits, cs_emb])
    f2 = tf.keras.layers.Concatenate(axis=-1)([O_logits, cs_emb])

    Biaffine_layer = Biaffine_2(128 + cs_em_size, R_num, MAX_LEN)
    output_logist = Biaffine_layer(f1, f2)
    output_logist = tf.keras.layers.Activation('sigmoid')(output_logist)
    output_logist = tf.keras.layers.Lambda(lambda x: x ** 4, name='relation')(output_logist)

    model = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[S_output_logist, output_logist])
    model_2 = tf.keras.models.Model(inputs=[ids, att], outputs=[S_output_logist])
    model_3 = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[output_logist])
    return model, model_2, model_3
F1: 0.8016
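The post does not show how the softmax span matrix from attempt 2 is turned back into entities, so here is one possible decoding, purely as a sketch. It assumes class 0 means "not an entity" (following the "+1 for non-entity" convention quoted from the paper) and simply keeps every valid span whose best class is not 0. The function name and the toy probabilities are hypothetical, and no overlap resolution is attempted.

```python
def decode_spans(span_probs, id2type, none_id=0):
    """Turn an L x L x C probability tensor into (start, end, type, prob).

    span_probs[s][e] is a list of class probabilities for the span s..e.
    Cells with end < start are ignored because they are not valid spans.
    """
    spans = []
    for s, row in enumerate(span_probs):
        for e, probs in enumerate(row):
            if e < s:
                continue
            best = max(range(len(probs)), key=probs.__getitem__)
            if best != none_id:
                spans.append((s, e, id2type[best], probs[best]))
    # Most confident spans first.
    return sorted(spans, key=lambda item: item[3], reverse=True)


# Toy values (assumed): 3 tokens, classes 0 = none, 1 = PER, 2 = WORK.
L = 3
span_probs = [[[0.9, 0.05, 0.05] for _ in range(L)] for _ in range(L)]
span_probs[0][1] = [0.1, 0.1, 0.8]   # tokens 0..1 look like a WORK
span_probs[2][2] = [0.2, 0.7, 0.1]   # token 2 looks like a PER
span_probs[2][0] = [0.0, 1.0, 0.0]   # end < start, must be ignored
entities = decode_spans(span_probs, {1: "PER", 2: "WORK"})
```

Sorting by confidence makes it easy to add a later step that drops lower-ranked spans conflicting with higher-ranked ones, if nested entities should be restricted.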
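Biaffine vs. Multi-head
Same computational cost: N^2.
More parameters: the Biaffine attention matrix has more parameters, and compared with Multi-head Selection it can capture cross interactions between the S and O features, whereas Multi-head Selection only combines them through a simple linear MLP transformation. (My ugly hand-drawn draft illustrates this.)
Separate prior probabilities for start and end are added.
Model results comparison: biaffine + biaffine 8 0.8016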
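The "more parameters" point can be made concrete with a quick count for the two scoring heads used in this post. This is a back-of-the-envelope sketch: it counts only the layers that sit after the f1 and f2 projections, and the relation count of 50 is a hypothetical value chosen for illustration, not the size of the actual dataset schema.

```python
def dense_params(in_dim, out_dim):
    """Weights plus biases of one fully connected layer."""
    return in_dim * out_dim + out_dim


def multi_head_params(hidden, r_num):
    """Dense(hidden, relu) on the concatenated pair, then Dense(r_num)."""
    return dense_params(2 * hidden, hidden) + dense_params(hidden, r_num)


def biaffine_params(hidden, r_num, bias_x=False, bias_y=True):
    """Size of the biaffine tensor U: (in + bias_x, out, in + bias_y)."""
    return (hidden + int(bias_x)) * r_num * (hidden + int(bias_y))


R_NUM = 50  # hypothetical number of relation types
multi_head = multi_head_params(128, R_NUM)
biaffine = biaffine_params(128, R_NUM)
# multi_head == 39346, biaffine == 825600
```

With these sizes the biaffine head is roughly twenty times larger, because every relation gets its own 128 × 129 interaction matrix instead of a single output row.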
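Summary
For relation extraction, Biaffine is a better baseline than the basic Multi-head, and there is still plenty of room for optimization on top of it, for example how to construct the Biaffine matrix better and how to produce more useful information without overfitting. In my experiments the Biaffine method never managed to surpass my earlier improved cascaded pointer network, which frustrates me (mainly because this differs from other people's results). Apart from the batch_size being too small and the resulting poor convergence, the training behavior suggests a possible cause: entity extraction requires the model to distinguish entity types, which makes that task converge poorly. I tried several networks and all were hard to fit. It may also be related to the dataset. There are still some problems in how I use the method, and I will keep following it.
In fact, the research community has already questioned the effectiveness of this kind of shared encoding, and quite a few experiments show that entity extraction and relation extraction perform better with separate encoders than with a shared one. See: Two are Better than One: Joint Entity and Relation Extraction with Table-Sequence Encoders
This post did not try a CRF for entity extraction and used pointer tagging directly, simply under the influence of most entity extraction articles (which assume pointer tagging beats CRF). I will compare the two methods later.
This series on relation extraction may pause here for now. When I come across important papers in this area I will share them right away.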