Bert Source Code (II): The Model
- Model training, evaluation, and prediction flow
- Bert model
- Transformer model
- Bert model
- Bert model code walkthrough
Model Training, Evaluation, and Prediction Flow
Similar to the pretraining flow described in the previous article.
Bert Model
First, a few diagrams:
Transformer Model
The diagrams above show the Transformer architecture. The Transformer is a Seq2Seq model composed of an encoder and a decoder. The encoder stacks Nx identical blocks, each consisting of a multi-head self-attention layer followed by a feed-forward layer. The decoder also stacks Nx blocks, each consisting of a masked multi-head self-attention layer over the output sequence, an encoder-decoder multi-head attention layer, and a feed-forward layer, in series. Both encoder and decoder wrap each sub-layer with a residual (ResNet-style) shortcut connection and layer normalization. The input and output sides each pass Inputs and Outputs through an embedding layer, to which a position embedding is added to restore position information: unlike an RNN, attention puts every pair of positions effectively at distance 1, which avoids the RNN's long-range dependency problem but discards order, so position embeddings compensate for the lost distance information.
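As a rough illustration of multi-head self-attention, here is a minimal NumPy sketch. It uses identity projections in place of the learned Q/K/V weight matrices, so it only demonstrates the split-heads / scaled-dot-product / concatenate shape flow, not the BERT implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads):
    """x: [seq_len, hidden]. Learned projections are omitted for brevity."""
    seq_len, hidden = x.shape
    d_k = hidden // num_heads
    # Split the hidden dimension into heads: [num_heads, seq_len, d_k]
    heads = x.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    # Scaled dot-product attention, computed independently per head
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_k)
    context = softmax(scores) @ heads            # [num_heads, seq_len, d_k]
    # Concatenate the heads back together: [seq_len, hidden]
    return context.transpose(1, 0, 2).reshape(seq_len, hidden)

x = np.random.randn(5, 8)
out = multi_head_self_attention(x, num_heads=2)
print(out.shape)  # (5, 8)
```

Note that the output has the same shape as the input, which is what allows the residual connection `output + input` in each block.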
Bert Model
On top of the input embedding, the BERT model adds a token embedding, a segment embedding, and a position embedding. It introduces the deep bidirectional masked LM and next sentence prediction objectives (both innovations were covered in detail, together with their code, in the previous article).
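Conceptually, the three embeddings are simply summed per position. A hypothetical NumPy sketch with made-up table sizes (the real tables are learned variables, created in `embedding_postprocessor` below):

```python
import numpy as np

# Hypothetical, tiny table sizes for illustration only
vocab_size, type_vocab_size, max_pos, hidden = 100, 2, 16, 8
rng = np.random.default_rng(0)
token_table    = rng.normal(size=(vocab_size, hidden))       # word embeddings
segment_table  = rng.normal(size=(type_vocab_size, hidden))  # segment A/B embeddings
position_table = rng.normal(size=(max_pos, hidden))          # position embeddings

input_ids      = np.array([2, 7, 9, 3])   # token ids for one sequence
token_type_ids = np.array([0, 0, 1, 1])   # first two tokens in segment A, rest in B

# BERT's input representation: element-wise sum of the three lookups
embeddings = (token_table[input_ids]
              + segment_table[token_type_ids]
              + position_table[:len(input_ids)])
print(embeddings.shape)  # (4, 8): [seq_length, hidden]
```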
Bert Model Code Walkthrough
Let's walk through the code step by step:
```python
with tf.variable_scope(scope, default_name="bert"):
  with tf.variable_scope("embeddings"):
    # Input embedding:
    # Perform embedding lookup on the word ids.
    (self.embedding_output, self.embedding_table) = embedding_lookup(
        input_ids=input_ids,
        vocab_size=config.vocab_size,
        embedding_size=config.hidden_size,
        initializer_range=config.initializer_range,
        word_embedding_name="word_embeddings",
        use_one_hot_embeddings=use_one_hot_embeddings)

    # Add positional embeddings and token type embeddings, then layer
    # normalize and perform dropout.
    # 1. Add the position embedding and token (segment) embedding.
    self.embedding_output = embedding_postprocessor(
        input_tensor=self.embedding_output,
        use_token_type=True,
        token_type_ids=token_type_ids,
        token_type_vocab_size=config.type_vocab_size,
        token_type_embedding_name="token_type_embeddings",
        use_position_embeddings=True,
        position_embedding_name="position_embeddings",
        initializer_range=config.initializer_range,
        max_position_embeddings=config.max_position_embeddings,
        dropout_prob=config.hidden_dropout_prob)

  with tf.variable_scope("encoder"):
    # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
    # mask of shape [batch_size, seq_length, seq_length] which is used
    # for the attention scores.
    attention_mask = create_attention_mask_from_input_mask(
        input_ids, input_mask)

    # Run the stacked transformer.
    # `sequence_output` shape = [batch_size, seq_length, hidden_size].
    # 2. Build the transformer model.
    self.all_encoder_layers = transformer_model(
        input_tensor=self.embedding_output,
        attention_mask=attention_mask,
        hidden_size=config.hidden_size,
        num_hidden_layers=config.num_hidden_layers,
        num_attention_heads=config.num_attention_heads,
        intermediate_size=config.intermediate_size,
        intermediate_act_fn=get_activation(config.hidden_act),
        hidden_dropout_prob=config.hidden_dropout_prob,
        attention_probs_dropout_prob=config.attention_probs_dropout_prob,
        initializer_range=config.initializer_range,
        do_return_all_layers=True)

  # `sequence_output` is the final hidden layer, used for the masked LM.
  self.sequence_output = self.all_encoder_layers[-1]
  # The "pooler" converts the encoded sequence tensor of shape
  # [batch_size, seq_length, hidden_size] to a tensor of shape
  # [batch_size, hidden_size]. This is necessary for segment-level
  # (or segment-pair-level) classification tasks where we need a fixed
  # dimensional representation of the segment.
  # `pooled_output` is used for sentence classification; it is built from the
  # embedding of the first token [CLS] in the final hidden layer.
  with tf.variable_scope("pooler"):
    # We "pool" the model by simply taking the hidden state corresponding
    # to the first token. We assume that this has been pre-trained.
    first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
    self.pooled_output = tf.layers.dense(
        first_token_tensor,
        config.hidden_size,
        activation=tf.tanh,
        kernel_initializer=create_initializer(config.initializer_range))
```

How the extra embeddings are added to the input:
```python
def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,
      embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  # Add the token type (segment) embedding.
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  # Add the position embedding.
  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ..., seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  # Apply layer normalization and dropout.
  output = layer_norm_and_dropout(output, dropout_prob)
  return output
```

The transformer model:
```python
def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  # Stack Nx basic blocks, where Nx = num_hidden_layers.
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      # Attention layer.
      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        # Apply dropout, then the residual shortcut connection and LN.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      # Project up to `intermediate_size`.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      # Project back to `hidden_size`, then add the shortcut connection and LN.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      # Reshape back to the original `input_shape`.
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
```

The attention layer:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q$ is the query, $K$ the key, and $V$ the value, with $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$, and $d_k$ the dimension of the queries and keys. Here $QK^T$ computes the pairwise relations between all positions in the sequence, and dividing by $\sqrt{d_k}$ keeps the dot products from growing too large.
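A tiny worked example of the formula, with toy values for Q, K, and V (n = 2 queries, m = 3 key/value pairs):

```python
import numpy as np

Q = np.array([[1.0, 0.0], [0.0, 1.0]])              # [n, d_k] with n=2, d_k=2
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # [m, d_k] with m=3
V = np.array([[1.0], [2.0], [3.0]])                 # [m, d_v] with d_v=1

scores = Q @ K.T / np.sqrt(K.shape[1])  # [n, m]: scaled pairwise similarities
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax
output = weights @ V                    # [n, d_v]: weighted average of values

print(output.shape)  # (2, 1)
assert np.allclose(weights.sum(axis=1), 1.0)  # each softmax row sums to 1
```

Each output row is a convex combination of the value vectors, weighted by how similar that query is to each key.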