Multi-head Selection and Deep Biaffine Attention for Relation Extraction

Contents: Preface · Multi-head Selection (1. Joint entity recognition and relation extraction as a multi-head selection problem; 2. BERT-Based Multi-Head Selection for Joint Entity-Relation Extraction; 3. Implementation and model code) · Deep Biaffine Attention (1. Deep Biaffine Attention for Neural Dependency Parsing; 2. Named Entity Recognition as Dependency Parsing) · Biaffine vs. Multi-head · Conclusion
Preface
I have recently been digging into the competition write-up "一人之力,刷爆三路榜單!信息抽取競賽奪冠經(jīng)驗(yàn)分享", which gives a brief overview of the decoding schemes used for relation extraction. In an earlier post I already implemented a pointer-tagging network and optimized it (see "改良后的層疊式指針網(wǎng)絡(luò),讓我的模型F1提升近4%"), so this time I went through the original papers on multi-head selection and the Biaffine relation matrix mentioned in that write-up, and reproduced and adapted them on an actual relation extraction task.
This post covers the following papers:

Multi-head selection for joint relation extraction: Joint entity recognition and relation extraction as a multi-head selection problem

BERT-based multi-head selection: BERT-Based Multi-Head Selection for Joint Entity-Relation Extraction

Biaffine attention: Deep Biaffine Attention for Neural Dependency Parsing

A newer paper that uses the biaffine mechanism to build a span matrix: Named Entity Recognition as Dependency Parsing

This post only records my own thoughts and implementations and gives a brief reading of the core part of each paper; it may contain mistakes, so interested readers should go straight to the originals.
Multi-head Selection
1. Joint entity recognition and relation extraction as a multi-head selection problem
This architecture models entity recognition and relation extraction jointly: a CRF on top of the hidden layer extracts the entity labels, and the label embeddings are concatenated with the hidden states and passed to a sigmoid layer, where each token interacts with the features of the other tokens to extract relations. The entity extraction part is easy to follow; for the multi-head selection part, the key is the paper's scoring formula (the formula images did not survive here, so a reconstruction follows below).

My reading: the score for "token j is the end of the Subject and token i is the end of the Object" is obtained by passing z_j and z_i (hidden state plus label embedding) through the linear maps U and W respectively, adding the results together with a bias b, applying one more linear map V, and squashing the output through a sigmoid to turn it into a probability. (Hold on — the dimensions of these matrices in the multi-relation case are discussed later in this post.)
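In the paper's notation this is roughly (my reconstruction; please check the original for the exact definition):

$$ s^{(r)}(z_j, z_i) = V^{(r)} f\big(U^{(r)} z_j + W^{(r)} z_i + b^{(r)}\big), \qquad P\big(\text{head}=w_j,\ \text{relation}=r \mid w_i\big) = \sigma\big(s^{(r)}(z_j, z_i)\big) $$

where f(·) is an elementwise activation and σ is the sigmoid.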
2. BERT-Based Multi-Head Selection for Joint Entity-Relation Extraction
BERT + Multi-head Selection: besides introducing BERT as the encoding layer, the other novelties are as follows.
During training we can feed the ground-truth entity label embeddings into the Multi-head Selection layer, but at inference time, in order to exploit the scores the CRF/softmax produces over every label and to be robust to wrongly predicted labels, the authors instead feed a soft label embedding: the label embeddings weighted by the softmax distribution.
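A minimal sketch of what this soft label embedding could look like (my own reading of the idea, not the paper's code; label_probs and label_emb_matrix are placeholder names):

import tensorflow as tf

num_labels, label_emb_size = 10, 32      # hypothetical sizes
label_emb_matrix = tf.Variable(tf.random.normal((num_labels, label_emb_size)))

def soft_label_embedding(label_probs):
    # label_probs: [batch, seq_len, num_labels] softmax scores from the entity tagger
    # returns:     [batch, seq_len, label_emb_size], the probability-weighted label embedding
    return tf.einsum('bln,ne->ble', label_probs, label_emb_matrix)

At training time label_probs can simply be the one-hot gold labels, which recovers the hard label embedding.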
A sentence-level relation classification task is added to guide feature learning, using the CLS vector to obtain a fixed-size feature, as in the paper's figure. (I did not try this modification — not least because I lack the data! — and I remain skeptical: the shared encoder already carries the decoding burden of two tasks, and stacking a sentence classification task on top only adds to that pressure. The paper's own experiments show that adding Global Predicate Prediction alone brings no clear improvement, and the larger gain from combining several strategies cannot necessarily be credited to this method.)
"Soft label embedding of the ith token hi is feed into two separate fully connected layers to get the subject representation hsi and object representation hoi. where f(·) means neural network, ri,j is the relation when the ith token is subject and the jth token is object, rj,i is the relation when the jth token is subject and the ith token is object." (Quoted from the paper. The multi-head selection step is the same as above: the constructed token features go through two separate fully connected layers and then through a network f that outputs the relation score for each ordered pair.)
3. Implementation and model code
Taking triple (SPO) extraction as an example, how should we understand and apply multi-head selection? Hard work pays off: I found a figure in one of 夕大姐's articles that, although not really about multi-head selection, illustrates the Multi-head Selection workflow quite vividly (the figure is not reproduced here).

For all the tokens in a sentence there are N² possible (S, O) combinations. Suppose every token has already been through a linear transform or fully connected layer; then, as in the figure, taking a token such as "City" as S, we have to consider whether every other token can form an (S, O) pair with it, so we compute the score of every token paired with "City" after the V transform.

The implementation is also very simple: build an N×N×hidden_size matrix of token-pair features and pass it through a Dense layer with P_num units (the total number of relation types) to obtain the score of every token pair on every relation.
# Two separate fully connected layers over the token encodings give the
# Subject and Object representations, i.e. U*zj and W*zi in the formula above.
Subject = tf.keras.layers.Dense(hidden_size)(Encode)
Object = tf.keras.layers.Dense(hidden_size)(Encode)

# Broadcast the two sequences into an N x N grid of (Subject, Object) token pairs.
Subject = tf.expand_dims(Subject, 1)
Object = tf.expand_dims(Object, 2)
Subject = tf.tile(Subject, multiples=(1, MAX_LEN, 1, 1))
Object = tf.tile(Object, multiples=(1, 1, MAX_LEN, 1))
concat_SO = tf.keras.layers.Concatenate(axis=-1)([Subject, Object])

# The combined U*zj and W*zi go through the V fully connected layer,
# V.shape = (2*hidden_size, P_num), corresponding to
# V*(U*zj) + V*(W*zi) = V*(U*zj + W*zi + b) in the formula.
output_logist = tf.keras.layers.Dense(P_num, activation='sigmoid')(concat_SO)
Sample construction: we need a labeled [MAX_LEN, MAX_LEN, P_num] matrix as the target Y of the Multi-Head Selection output.

On labeling the samples:

An entity can span several characters and entities can be nested. Take "我是歌手的主演是阿瓜": the relations to extract are (我是歌手, 主演, 阿瓜) and "阿瓜 is a 歌手".

We do not need to mark every one of the four tokens of "我是歌手" against both tokens of "阿瓜" as 1 on the 主演 relation. Since the entity extraction part already identifies the head and tail of each entity, it is enough to set the score of S = "手" and O = "瓜" on the 主演 relation to 1, i.e. [3, 9, index of 主演] = 1. If 阿瓜 is also a singer, we additionally set [9, 3, index of 職業(yè)] = 1. We label the tail character rather than the head because, for nested entities of different types, it is very rare for their tail characters to coincide.
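A minimal sketch of how such a target could be built (my own helper, assuming the triples have already been mapped to token indices as (s_end, p_index, o_end)):

import numpy as np

def build_relation_target(spo_index_list, MAX_LEN, P_num):
    # spo_index_list: list of (s_end, p_index, o_end) tuples, e.g. [(3, index_of_主演, 9)]
    y = np.zeros((MAX_LEN, MAX_LEN, P_num), dtype=np.float32)
    for s_end, p_index, o_end in spo_index_list:
        y[s_end, o_end, p_index] = 1.0   # only the end tokens of S and O are marked
    return y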
That covers the core idea and code of the Multi-head Selection model.
A few notes on my implementation: BERT replaces the LSTM encoder of the original paper; the CRF is replaced with pointer tagging, and entity type information is introduced. The entity's hard label is concatenated with the features of its end token and fed into the Multi-head Selection layer — this is my own little flash of inspiration: since in the Multi-head Selection layer we want the model to recognize the end token of S and the end token of O, we only feed valid entity label information at those two positions and encode the type of every other token as 0. Experiments confirm this works better than giving every token of S and O its entity type embedding. I did not use the soft label described above; it can be implemented with a custom layer.

The full model code is as follows:
def build_model(pretrained_path, config, MAX_LEN, Cs_num, cs_em_size, R_num):
    ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    cs = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)  # entity-type ids (non-zero only at the S/O end tokens)

    config.output_hidden_states = True
    bert_model = TFBertModel.from_pretrained(pretrained_path, config=config, from_pt=True)
    x, _, hidden_states = bert_model(ids, attention_mask=att)
    layer_1 = hidden_states[-1]

    # Pointer-tagging heads for entity start / end (sigmoid scores, squared)
    start_logits = tf.keras.layers.Dense(Cs_num, activation='sigmoid')(layer_1)
    start_logits = tf.keras.layers.Lambda(lambda x: x ** 2, name='s_start')(start_logits)
    end_logits = tf.keras.layers.Dense(Cs_num, activation='sigmoid')(layer_1)
    end_logits = tf.keras.layers.Lambda(lambda x: x ** 2, name='s_end')(end_logits)

    # Entity-type embedding concatenated onto the BERT output
    cs_emb = tf.keras.layers.Embedding(Cs_num, cs_em_size)(cs)
    concat_cs = tf.keras.layers.Concatenate(axis=-1)([layer_1, cs_emb])

    # Multi-head selection: project to S / O features, build the N x N pair grid, score every relation
    f1 = tf.keras.layers.Dense(128)(concat_cs)
    f2 = tf.keras.layers.Dense(128)(concat_cs)
    f1 = tf.expand_dims(f1, 1)
    f2 = tf.expand_dims(f2, 2)
    f1 = tf.tile(f1, multiples=(1, MAX_LEN, 1, 1))
    f2 = tf.tile(f2, multiples=(1, 1, MAX_LEN, 1))
    concat_f = tf.keras.layers.Concatenate(axis=-1)([f1, f2])
    output_logist = tf.keras.layers.Dense(128, activation='relu')(concat_f)
    output_logist = tf.keras.layers.Dense(R_num, activation='sigmoid')(output_logist)
    output_logist = tf.keras.layers.Lambda(lambda x: x ** 4, name='relation')(output_logist)

    # model: joint training; model_2: entity extraction only; model_3: relation extraction only
    model = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[start_logits, end_logits, output_logist])
    model_2 = tf.keras.models.Model(inputs=[ids, att], outputs=[start_logits, end_logits])
    model_3 = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[output_logist])
    return model, model_2, model_3
My ugly hand-drawn sketch of the model is omitted here — it was only for a laugh anyway. Model performance: with no other tricks at all, F1 reaches 79.157 on the 2019 Baidu triple extraction dataset, and the architecture is much simpler than cascade sequence labeling and easier to understand.
Deep Biaffine Attention
1. Deep Biaffine Attention for Neural Dependency Parsing
This paper applies biaffine attention to dependency parsing, which maps directly onto relation extraction — the difference being that arc prediction in dependency parsing involves a single relation type, whereas relation extraction has to score multiple relation classes.

The paper uses a deep biaffine attention mechanism rather than the traditional MLP-based affine classifier or a plain bilinear classifier; the Multi-head Selection method above is exactly a relation classifier built out of linear layers. Here, instead, we want a biaffine attention matrix that directly computes, for every relation class, the attention between every pair of tokens.

(Below I simply call the head and dependent of dependency parsing the S and O of relation extraction.)

The BiLSTM-encoded token hidden states go through two MLPs to obtain the S and O representations. The paper makes a point of this step: "Applying smaller MLPs to the recurrent output states before the biaffine classifier has the advantage of stripping away information not relevant to the current decision. That is, every top recurrent state ri will need to carry enough information to identify word i's head, find all its dependents, exclude all its non-dependents, assign itself the correct label, and assign all its dependents their correct labels, as well as transfer any relevant information to the recurrent states of words before and after it. Thus ri necessarily contains significantly more information than is needed to compute any individual score, and training on this superfluous information needlessly reduces parsing speed and increases the risk of overfitting. Reducing dimensionality and applying a nonlinearity addresses both of these problems." In other words, the LSTM output state has to carry enough information to identify the word's head, find its dependents, exclude non-dependents, assign the correct label to itself and to all its dependents, and pass any relevant information to the recurrent states before and after it; training on all this unnecessary information slows training down and risks overfitting, and applying an MLP to reduce the dimensionality of the LSTM outputs before the biaffine transform solves both problems. Put simply, we compress the high-dimensional, information-rich hidden state into a low-dimensional feature that only holds the dependency-relevant information, which both speeds up training and curbs overfitting.

Finally, a biaffine tensor U computes the dependency score between every pair of tokens, and a vector u models the prior probability of a token being a head, playing the role of the bias b. Let the sequence length be d and the MLP-compressed hidden size be k; my ugly hand-drawn walkthrough of the matrix shapes is omitted here, but the formula below captures it.
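For reference, the arc-scoring part of the paper can be written as (my transcription; see the paper for the exact notation):

$$ s_i^{(\text{arc})} = H^{(\text{arc-dep})}\, U^{(1)}\, h_i^{(\text{arc-head})} + H^{(\text{arc-dep})}\, u^{(2)} $$

with H^(arc-dep) of shape d×k (the dependent-side MLP outputs stacked for all d tokens), U^(1) of shape k×k and u^(2) of shape k, so every token i gets a length-d vector of head scores. For relation extraction the biaffine tensor simply gains a relation dimension, k×R×k, as in the code below.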
Biaffine implementation code:
class Biaffine(tf.keras.layers.Layer):
    def __init__(self, in_size, out_size, bias_x=False, bias_y=False):
        super(Biaffine, self).__init__()
        self.bias_x = bias_x
        self.bias_y = bias_y
        self.U = self.add_weight(
            name='weight1',
            shape=(in_size + int(bias_x), out_size, in_size + int(bias_y)),
            trainable=True)

    def call(self, input1, input2):
        if self.bias_x:
            input1 = tf.concat((input1, tf.ones_like(input1[..., :1])), axis=-1)
        if self.bias_y:
            input2 = tf.concat((input2, tf.ones_like(input2[..., :1])), axis=-1)
        logits_1 = tf.einsum('bxi,ioj,byj->bxyo', input1, self.U, input2)
        return logits_1
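A quick shape check for this layer (a hypothetical example; the batch size, sequence length, feature size and number of relations are all placeholder values):

f1 = tf.random.normal((2, 64, 128))        # subject-side features: [batch, MAX_LEN, hidden]
f2 = tf.random.normal((2, 64, 128))        # object-side features
biaffine = Biaffine(128, 3, bias_y=True)   # 3 relation classes
scores = biaffine(f1, f2)                  # shape (2, 64, 64, 3): one score per (S, O) token pair per relation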
Full model code:
def build_model(pretrained_path, config, MAX_LEN, Cs_num, cs_em_size, R_num):
    ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    cs = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)

    config.output_hidden_states = True
    bert_model = TFBertModel.from_pretrained(pretrained_path, config=config, from_pt=True)
    x, _, hidden_states = bert_model(ids, attention_mask=att)
    layer_1 = hidden_states[-1]

    # Entity start / end pointer-tagging heads
    start_logits = tf.keras.layers.Dense(Cs_num, activation='sigmoid')(layer_1)
    start_logits = tf.keras.layers.Lambda(lambda x: x ** 2, name='s_start')(start_logits)
    end_logits = tf.keras.layers.Dense(Cs_num, activation='sigmoid')(layer_1)
    end_logits = tf.keras.layers.Lambda(lambda x: x ** 2, name='s_end')(end_logits)

    # Entity-type embedding + BERT output as input to the S / O projections
    cs_emb = tf.keras.layers.Embedding(Cs_num, cs_em_size)(cs)
    concat_cs = tf.keras.layers.Concatenate(axis=-1)([layer_1, cs_emb])
    f1 = tf.keras.layers.Dense(128, activation='relu')(concat_cs)
    f2 = tf.keras.layers.Dense(128, activation='relu')(concat_cs)

    # Biaffine scoring of every (S, O) token pair over R_num relations
    Biaffine_layer = Biaffine(128, R_num, bias_y=True)
    output_logist = Biaffine_layer(f1, f2)
    output_logist = tf.keras.layers.Activation('sigmoid')(output_logist)
    output_logist = tf.keras.layers.Lambda(lambda x: x ** 4, name='relation')(output_logist)

    model = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[start_logits, end_logits, output_logist])
    model_2 = tf.keras.models.Model(inputs=[ids, att], outputs=[start_logits, end_logits])
    model_3 = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[output_logist])
    return model, model_2, model_3
Model performance: F1 = 0.7964, roughly 0.005 higher than the Multi-head Selection model above.
2. Named Entity Recognition as Dependency Parsing
"After obtaining the word representations from the BiLSTM, we apply two separate FFNNs to create different representations (hs/he) for the start/end of the spans. Using different representations for the start/end of the spans allow the system to learn to identify the start/end of the spans separately. This improves accuracy compared to the model which directly uses the outputs of the LSTM since the context of the start and end of the entity are different. Finally, we employ a biaffine model over the sentence to create a l×l×c scoring tensor (rm), where l is the length of the sentence and c is the number of NER categories + 1 (for non-entity). We compute the score for a span i by: ... where si and ei are the start and end indices of the span i, Um is a d × c × d tensor, Wm is a 2d × c matrix and bm is the bias." (Quoted from the paper.)

On top of BERT, fastText and character embeddings, the paper uses a BiLSTM to capture word representations and then, just as before, two separate fully connected layers to represent the start and end of a span; this is more accurate than predicting the start and end directly from the encoder output, since the contexts of an entity's start and end carry different information. These two feature sets are then fed to our protagonist, the Biaffine matrix. In this task we must identify not only the start and end of each entity but also its category, so the goal is an L×L×C scoring tensor (L is the sequence length, C the number of entity categories). The key is the formula, where the shapes and roles of Um, Wm and bm are: Um, a d×c×d tensor, models the posterior probability of the entity category given both the start hs(i) and the end he(i); Wm, a 2d×c matrix, models the posterior probability of the category given the start hs(i) or the end he(i) on its own; bm, of shape c, models the prior probability of category c. So for an entity with start hs(i), end he(i) and category C, the score decomposes as P(start & end, C) + P(start, C) + P(end, C) + P(C). The formula and a matching Biaffine layer follow below.
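The span-scoring formula referenced in the quote (my reconstruction from the shapes listed above; see the paper for the exact form):

$$ r_m(i) = h_s(s_i)^{\top} U_m\, h_e(e_i) + W_m \big( h_s(s_i) \oplus h_e(e_i) \big) + b_m $$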
class Biaffine_2(tf.keras.layers.Layer):
    def __init__(self, in_size, out_size, MAX_LEN):
        super(Biaffine_2, self).__init__()
        self.w1 = self.add_weight(name='weight1', shape=(in_size, out_size, in_size), trainable=True)
        self.w2 = self.add_weight(name='weight2', shape=(2 * in_size + 1, out_size), trainable=True)
        self.MAX_LEN = MAX_LEN

    def call(self, input1, input2):
        f1 = tf.expand_dims(input1, 2)
        f2 = tf.expand_dims(input2, 1)
        f1 = tf.tile(f1, multiples=(1, 1, self.MAX_LEN, 1))
        f2 = tf.tile(f2, multiples=(1, self.MAX_LEN, 1, 1))
        concat_f1f2 = tf.concat((f1, f2), axis=-1)
        concat_f1f2 = tf.concat((concat_f1f2, tf.ones_like(concat_f1f2[..., :1])), axis=-1)
        logits_1 = tf.einsum('bxi,ioj,byj->bxyo', input1, self.w1, input2)
        logits_2 = tf.einsum('bijy,yo->bijo', concat_f1f2, self.w2)
        return logits_1 + logits_2
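And a quick shape check for Biaffine_2 (again with placeholder sizes):

h_start = tf.random.normal((2, 64, 128))                     # start-of-span features: [batch, MAX_LEN, in_size]
h_end = tf.random.normal((2, 64, 128))                       # end-of-span features
biaffine2 = Biaffine_2(in_size=128, out_size=5, MAX_LEN=64)  # 5 output classes
span_scores = biaffine2(h_start, h_end)                      # shape (2, 64, 64, 5)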
Attempt 1
The best-performing configuration among my attempts so far (still being updated):

Entity labels are a fairly strong feature — knowing the entity types of S and O already goes a long way towards determining their relation — so the labels are injected directly into the Biaffine matrix. The last two BERT layers are fed into the Biaffine computation to obtain the relation matrix. An extra fully connected layer is added to the entity extraction head to partially decouple the entity extraction and relation extraction tasks.
def build_model_3(pretrained_path, config, MAX_LEN, Cs_num, cs_em_size, R_num):
    ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    cs = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)

    config.output_hidden_states = True
    bert_model = TFBertModel.from_pretrained(pretrained_path, config=config, from_pt=True)
    x, _, hidden_states = bert_model(ids, attention_mask=att)
    layer_1 = hidden_states[-1]
    layer_2 = hidden_states[-2]

    # Entity extraction head with an extra Dense layer to partially decouple it from relation extraction
    start_logits = tf.keras.layers.Dense(256, activation='relu')(layer_1)
    start_logits = tf.keras.layers.Dense(Cs_num, activation='sigmoid')(start_logits)
    start_logits = tf.keras.layers.Lambda(lambda x: x ** 2, name='s_start')(start_logits)
    end_logits = tf.keras.layers.Dense(256, activation='relu')(layer_1)
    end_logits = tf.keras.layers.Dense(Cs_num, activation='sigmoid')(end_logits)
    end_logits = tf.keras.layers.Lambda(lambda x: x ** 2, name='s_end')(end_logits)

    # Last two BERT layers feed the relation branch; entity-type embedding is injected into both Biaffine inputs
    cs_emb = tf.keras.layers.Embedding(Cs_num, cs_em_size)(cs)
    concat_cs = tf.keras.layers.Concatenate(axis=-1)([layer_1, layer_2])

    f1 = tf.keras.layers.Dropout(0.2)(concat_cs)
    f1 = tf.keras.layers.Dense(256, activation='relu')(f1)
    f1 = tf.keras.layers.Dense(128, activation='relu')(f1)
    f1 = tf.keras.layers.Concatenate(axis=-1)([f1, cs_emb])

    f2 = tf.keras.layers.Dropout(0.2)(concat_cs)
    f2 = tf.keras.layers.Dense(256, activation='relu')(f2)
    f2 = tf.keras.layers.Dense(128, activation='relu')(f2)
    f2 = tf.keras.layers.Concatenate(axis=-1)([f2, cs_emb])

    Biaffine_layer = Biaffine_2(128 + cs_em_size, R_num, MAX_LEN)
    output_logist = Biaffine_layer(f1, f2)
    output_logist = tf.keras.layers.Activation('sigmoid')(output_logist)
    output_logist = tf.keras.layers.Lambda(lambda x: x ** 4, name='relation')(output_logist)

    model = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[start_logits, end_logits, output_logist])
    model_2 = tf.keras.models.Model(inputs=[ids, att], outputs=[start_logits, end_logits])
    model_3 = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[output_logist])
    return model, model_2, model_3
F1 = 0.7986, an improvement of 0.022 over the baseline.
Attempt 2
The entity extraction part replaces sequence labeling with a Biaffine span matrix (softmax activation on the output), and the last two BERT hidden layers encode both the entity matrix and the relation matrix.
def build_model(pretrained_path, config, MAX_LEN, Cs_num, cs_em_size, R_num):
    ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    cs = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)

    config.output_hidden_states = True
    bert_model = TFBertModel.from_pretrained(pretrained_path, config=config, from_pt=True)
    x, pooling, hidden_states = bert_model(ids, attention_mask=att)
    layer_1 = hidden_states[-1]
    layer_2 = hidden_states[-2]
    concat_cs = tf.keras.layers.Concatenate(axis=-1)([layer_1, layer_2])

    # Entity extraction as a Biaffine span matrix (softmax over entity categories)
    start_logits = tf.keras.layers.Dense(128, activation='relu')(concat_cs)
    end_logits = tf.keras.layers.Dense(128, activation='relu')(concat_cs)
    S_Biaffine_layer = Biaffine_2(128, Cs_num, MAX_LEN)
    S_logits = S_Biaffine_layer(start_logits, end_logits)
    S_output_logist = tf.keras.layers.Activation('softmax', name='s_token')(S_logits)

    # Separate S / O feature stacks for the relation branch
    S_logits = tf.keras.layers.Dense(256, activation='relu')(concat_cs)
    S_logits = tf.keras.layers.Dropout(0.2)(S_logits)
    S_logits = tf.keras.layers.Dense(128, activation='relu')(S_logits)
    O_logits = tf.keras.layers.Dense(256, activation='relu')(concat_cs)
    O_logits = tf.keras.layers.Dropout(0.2)(O_logits)
    O_logits = tf.keras.layers.Dense(128, activation='relu')(O_logits)

    # Entity-type embedding injected into both Biaffine inputs
    cs_emb = tf.keras.layers.Embedding(Cs_num, cs_em_size)(cs)
    f1 = tf.keras.layers.Concatenate(axis=-1)([S_logits, cs_emb])
    f2 = tf.keras.layers.Concatenate(axis=-1)([O_logits, cs_emb])

    Biaffine_layer = Biaffine_2(128 + cs_em_size, R_num, MAX_LEN)
    output_logist = Biaffine_layer(f1, f2)
    output_logist = tf.keras.layers.Activation('sigmoid')(output_logist)
    output_logist = tf.keras.layers.Lambda(lambda x: x ** 4, name='relation')(output_logist)

    model = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[S_output_logist, output_logist])
    model_2 = tf.keras.models.Model(inputs=[ids, att], outputs=[S_output_logist])
    model_3 = tf.keras.models.Model(inputs=[ids, att, cs], outputs=[output_logist])
    return model, model_2, model_3
F1 = 0.8016
Biaffine vs. Multi-head
Same computational cost: both methods score all N² token pairs.

More parameters: the Biaffine attention matrix has more parameters and, unlike Multi-head Selection, can capture the multiplicative interaction between the S and O features, whereas Multi-head Selection combines them only through a simple linear/MLP transform (my ugly hand-drawn sketch is omitted here).

Separate prior terms for start and end are added.

Results comparison across models (only one row of the original table survived extraction): biaffine + biaffine — 8 — 0.8016.
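As a rough back-of-the-envelope estimate (my own, with k the reduced feature size and R the number of relations): the final scoring layer of the Multi-head Selection code maps the 2k-dimensional concatenation of the S and O features to the relation scores, so its pair-scoring weights are on the order of 2k·R and the two features only ever interact additively, whereas the biaffine tensor alone holds k·R·k weights and scores the multiplicative interaction f1ᵀ·U·f2 between them directly.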
Conclusion
For relation extraction, a Biaffine baseline beats the basic Multi-head approach, and on top of that there is still plenty of room for optimization — for example, how to construct the Biaffine matrix better and how to feed it more useful information without overfitting. In my experiments, however, the Biaffine approach never managed to beat the improved cascade pointer network from my earlier post, which frustrates me quite a bit (mainly because it disagrees with other people's results). Setting aside a small batch size and poor convergence, my reading of the training behaviour is that asking the model to also discriminate entity categories during entity extraction makes that sub-task converge poorly — several network variants struggled to fit it — and it may also be dataset-dependent. There are still some open issues in practice, and I will keep following up.

In fact the research community has already questioned the value of this kind of shared encoding, and quite a few experiments show that entity extraction and relation extraction do better with separate encoders than with a shared one: Two are Better than One: Joint Entity and Relation Extraction with Table-Sequence Encoders.

This post did not try CRF for entity extraction and went straight to pointer tagging, mostly under the influence of most entity extraction write-ups (which take pointer tagging to be better than CRF by default); I will compare the two methods later.

The relation extraction series will probably pause here for now; whenever I see important new papers on the topic I will share them right away.