Bert Source Code (II): The Model
- Model training, evaluation, and prediction flow
- Bert model
- Transformer model
- Bert model
- Bert model code walkthrough
Model Training, Evaluation, and Prediction Flow
Similar to the pretraining flow described in the previous article.
Bert Model
First, a few diagrams:
Transformer Model
The diagrams above show the Transformer architecture. The Transformer is a Seq2Seq model composed of an encoder and a decoder. The encoder stacks Nx identical blocks, each consisting of a multi-head self-attention layer followed by a feed-forward layer. The decoder also stacks Nx blocks, each consisting of a masked multi-head self-attention layer over the output sequence, an encoder-decoder multi-head attention layer, and a feed-forward layer, in series. Both encoder and decoder wrap each sub-layer with a residual (ResNet-style) shortcut connection and layer normalization. The input and output sides each pass Inputs and Outputs through an embedding layer, to which a position embedding is added to restore position information: unlike an RNN, attention puts every pair of positions effectively at distance 1, which avoids the RNN's long-range dependency problem but discards order, so position embeddings compensate for the lost distance information.
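As a rough illustration of multi-head self-attention, here is a minimal NumPy sketch. It uses identity projections in place of the learned Q/K/V weight matrices, so it only demonstrates the split-heads / scaled-dot-product / concatenate shape flow, not the BERT implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads):
    """x: [seq_len, hidden]. Learned projections are omitted for brevity."""
    seq_len, hidden = x.shape
    d_k = hidden // num_heads
    # Split the hidden dimension into heads: [num_heads, seq_len, d_k]
    heads = x.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    # Scaled dot-product attention, computed independently per head
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_k)
    context = softmax(scores) @ heads            # [num_heads, seq_len, d_k]
    # Concatenate the heads back together: [seq_len, hidden]
    return context.transpose(1, 0, 2).reshape(seq_len, hidden)

x = np.random.randn(5, 8)
out = multi_head_self_attention(x, num_heads=2)
print(out.shape)  # (5, 8)
```

Note that the output has the same shape as the input, which is what allows the residual connection `output + input` in each block.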
Bert Model
On top of the input embedding, the BERT model adds a token embedding, a segment embedding, and a position embedding. It introduces the deep bidirectional masked LM and next sentence prediction objectives (both innovations were covered in detail, together with their code, in the previous article).
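Conceptually, the three embeddings are simply summed per position. A hypothetical NumPy sketch with made-up table sizes (the real tables are learned variables, created in `embedding_postprocessor` below):

```python
import numpy as np

# Hypothetical, tiny table sizes for illustration only
vocab_size, type_vocab_size, max_pos, hidden = 100, 2, 16, 8
rng = np.random.default_rng(0)
token_table    = rng.normal(size=(vocab_size, hidden))       # word embeddings
segment_table  = rng.normal(size=(type_vocab_size, hidden))  # segment A/B embeddings
position_table = rng.normal(size=(max_pos, hidden))          # position embeddings

input_ids      = np.array([2, 7, 9, 3])   # token ids for one sequence
token_type_ids = np.array([0, 0, 1, 1])   # first two tokens in segment A, rest in B

# BERT's input representation: element-wise sum of the three lookups
embeddings = (token_table[input_ids]
              + segment_table[token_type_ids]
              + position_table[:len(input_ids)])
print(embeddings.shape)  # (4, 8): [seq_length, hidden]
```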
Bert Model Code Walkthrough
Let's walk through the code step by step:
```python
with tf.variable_scope(scope, default_name="bert"):
  with tf.variable_scope("embeddings"):
    # Input embedding:
    # Perform embedding lookup on the word ids.
    (self.embedding_output, self.embedding_table) = embedding_lookup(
        input_ids=input_ids,
        vocab_size=config.vocab_size,
        embedding_size=config.hidden_size,
        initializer_range=config.initializer_range,
        word_embedding_name="word_embeddings",
        use_one_hot_embeddings=use_one_hot_embeddings)

    # Add positional embeddings and token type embeddings, then layer
    # normalize and perform dropout.
    # 1. Add the position embedding and token (segment) embedding.
    self.embedding_output = embedding_postprocessor(
        input_tensor=self.embedding_output,
        use_token_type=True,
        token_type_ids=token_type_ids,
        token_type_vocab_size=config.type_vocab_size,
        token_type_embedding_name="token_type_embeddings",
        use_position_embeddings=True,
        position_embedding_name="position_embeddings",
        initializer_range=config.initializer_range,
        max_position_embeddings=config.max_position_embeddings,
        dropout_prob=config.hidden_dropout_prob)

  with tf.variable_scope("encoder"):
    # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
    # mask of shape [batch_size, seq_length, seq_length] which is used
    # for the attention scores.
    attention_mask = create_attention_mask_from_input_mask(
        input_ids, input_mask)

    # Run the stacked transformer.
    # `sequence_output` shape = [batch_size, seq_length, hidden_size].
    # 2. Build the transformer model.
    self.all_encoder_layers = transformer_model(
        input_tensor=self.embedding_output,
        attention_mask=attention_mask,
        hidden_size=config.hidden_size,
        num_hidden_layers=config.num_hidden_layers,
        num_attention_heads=config.num_attention_heads,
        intermediate_size=config.intermediate_size,
        intermediate_act_fn=get_activation(config.hidden_act),
        hidden_dropout_prob=config.hidden_dropout_prob,
        attention_probs_dropout_prob=config.attention_probs_dropout_prob,
        initializer_range=config.initializer_range,
        do_return_all_layers=True)

  # `sequence_output` is the final hidden layer, used for the masked LM.
  self.sequence_output = self.all_encoder_layers[-1]
  # The "pooler" converts the encoded sequence tensor of shape
  # [batch_size, seq_length, hidden_size] to a tensor of shape
  # [batch_size, hidden_size]. This is necessary for segment-level
  # (or segment-pair-level) classification tasks where we need a fixed
  # dimensional representation of the segment.
  # `pooled_output` is used for sentence classification; it is built from the
  # embedding of the first token [CLS] in the final hidden layer.
  with tf.variable_scope("pooler"):
    # We "pool" the model by simply taking the hidden state corresponding
    # to the first token. We assume that this has been pre-trained.
    first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
    self.pooled_output = tf.layers.dense(
        first_token_tensor,
        config.hidden_size,
        activation=tf.tanh,
        kernel_initializer=create_initializer(config.initializer_range))
```

How the extra embeddings are added to the input:
```python
def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,
      embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  # Add the token type (segment) embedding.
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  # Add the position embedding.
  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ..., seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  # Apply layer normalization and dropout.
  output = layer_norm_and_dropout(output, dropout_prob)
  return output
```

The transformer model:
```python
def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  # Stack Nx basic blocks, where Nx = num_hidden_layers.
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      # Attention layer.
      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        # Apply dropout, then the residual shortcut connection and LN.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      # Project up to `intermediate_size`.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      # Project back to `hidden_size`, then add the shortcut connection and LN.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      # Reshape back to the original `input_shape`.
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
```

The attention layer:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q$ is the query, $K$ the key, and $V$ the value, with $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$, and $d_k$ the dimension of the queries and keys. Here $QK^T$ computes the pairwise relations between all positions in the sequence, and dividing by $\sqrt{d_k}$ keeps the dot products from growing too large.
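A tiny worked example of the formula, with toy values for Q, K, and V (n = 2 queries, m = 3 key/value pairs):

```python
import numpy as np

Q = np.array([[1.0, 0.0], [0.0, 1.0]])              # [n, d_k] with n=2, d_k=2
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # [m, d_k] with m=3
V = np.array([[1.0], [2.0], [3.0]])                 # [m, d_v] with d_v=1

scores = Q @ K.T / np.sqrt(K.shape[1])  # [n, m]: scaled pairwise similarities
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax
output = weights @ V                    # [n, d_v]: weighted average of values

print(output.shape)  # (2, 1)
assert np.allclose(weights.sum(axis=1), 1.0)  # each softmax row sums to 1
```

Each output row is a convex combination of the value vectors, weighted by how similar that query is to each key.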