日韩性视频-久久久蜜桃-www中文字幕-在线中文字幕av-亚洲欧美一区二区三区四区-撸久久-香蕉视频一区-久久无码精品丰满人妻-国产高潮av-激情福利社-日韩av网址大全-国产精品久久999-日本五十路在线-性欧美在线-久久99精品波多结衣一区-男女午夜免费视频-黑人极品ⅴideos精品欧美棵-人人妻人人澡人人爽精品欧美一区-日韩一区在线看-欧美a级在线免费观看

歡迎訪問 生活随笔!

生活随笔

當前位置: 首頁 > 编程资源 > 编程问答 >内容正文

编程问答

Bert源代码(二)模型

發布時間:2023/12/16 编程问答 42 豆豆
生活随笔 收集整理的這篇文章主要介紹了 Bert源代码(二)模型 小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

Bert源代碼(二)模型

  • 模型訓練、評估和預測流程
  • Bert模型
    • Transformer模型
    • Bert模型
  • Bert模型代碼解析
  • 參考文獻

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12 export GLUE_DIR=/path/to/glue python run_classifier.py \--task_name=MRPC \--do_train=true \--do_eval=true \--data_dir=$GLUE_DIR/MRPC \--vocab_file=$BERT_BASE_DIR/vocab.txt \--bert_config_file=$BERT_BASE_DIR/bert_config.json \--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \--max_seq_length=128 \--train_batch_size=32 \--learning_rate=2e-5 \--num_train_epochs=3.0 \--output_dir=/tmp/mrpc_output/

模型訓練、評估和預測流程

和上一篇文章的預訓練類似。

  • 使用ColaProcessor,MnliProcessor,MrpcProcessor,XnliProcessor四種processor處理輸入相應輸入格式數據,獲得當前句text_a和下一句text_b,再通過FullTokenizer來token化。最后通過TFRecordWriter將數據保存為tfrecord文件。
  • 定義input_fn函數file_based_input_fn_builder
  • 定義RunConfig配置run_config
  • 創建model_fn:從傳入的init_checkpoint初始化bert模型、AdamWeightDecayOptimizer,返回TPUEstimatorSpec
  • 結合model_fn、run_config和input_fn使用TPUEstimator進行訓練、評估、預測
  • Bert模型

    先上幾張圖:

    Transformer模型




    上面的圖展示了Transformer結構:Transformer是一種Seq2Seq模型,由encoder和decoder組成。encoder端由Nx個基礎單元疊加,每個基礎單元為:Multi-head的Self-Attention和Feed Forward串聯;decoder端由Nx個基礎單元疊加,每個基礎單元為:輸出端的Masked Multi-head Self-Attention和encoder-decoder的Multi-head Attention和Feed Forward串聯。當然encoder和decoder端都會加入短連接Resnet和Normalization。輸入輸出端都在原來Inputs和Outputs上接一層Embedding層,再加入Position的Embedding來補償位置信息(相對于RNN而言,距離始終為1,避免了RNN長距離依賴的問題。用Position Embedding來補償距離信息。)

    Bert模型

    Bert模型在Input Embedding基礎上加入Token Embedding、Segment Embedding、Position Embedding。引入了Deep Bidirectional Masked Lm和Next Prediction Sentence(這兩個創新點在上一篇文章中已經結合代碼詳細介紹了)。

    Bert模型代碼解析

    我們來逐步解析代碼:

    with tf.variable_scope(scope, default_name="bert"):with tf.variable_scope("embeddings"):# 輸入Embedding# Perform embedding lookup on the word ids.(self.embedding_output, self.embedding_table) = embedding_lookup(input_ids=input_ids,vocab_size=config.vocab_size,embedding_size=config.hidden_size,initializer_range=config.initializer_range,word_embedding_name="word_embeddings",use_one_hot_embeddings=use_one_hot_embeddings)# Add positional embeddings and token type embeddings, then layer# normalize and perform dropout.# 1. 添加Position Embedding和Token Embeddingself.embedding_output = embedding_postprocessor(input_tensor=self.embedding_output,use_token_type=True,token_type_ids=token_type_ids,token_type_vocab_size=config.type_vocab_size,token_type_embedding_name="token_type_embeddings",use_position_embeddings=True,position_embedding_name="position_embeddings",initializer_range=config.initializer_range,max_position_embeddings=config.max_position_embeddings,dropout_prob=config.hidden_dropout_prob)with tf.variable_scope("encoder"):# This converts a 2D mask of shape [batch_size, seq_length] to a 3D# mask of shape [batch_size, seq_length, seq_length] which is used# for the attention scores.attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask)# Run the stacked transformer.# `sequence_output` shape = [batch_size, seq_length, hidden_size].# 2. 創建transformer模型self.all_encoder_layers = transformer_model(input_tensor=self.embedding_output,attention_mask=attention_mask,hidden_size=config.hidden_size,num_hidden_layers=config.num_hidden_layers,num_attention_heads=config.num_attention_heads,intermediate_size=config.intermediate_size,intermediate_act_fn=get_activation(config.hidden_act),hidden_dropout_prob=config.hidden_dropout_prob,attention_probs_dropout_prob=config.attention_probs_dropout_prob,initializer_range=config.initializer_range,do_return_all_layers=True)# 用于masked lm的sequence_output,即final hidden layerself.sequence_output = self.all_encoder_layers[-1]# The "pooler" converts the encoded sequence tensor of shape# [batch_size, seq_length, hidden_size] to a tensor of shape# [batch_size, hidden_size]. This is necessary for segment-level# (or segment-pair-level) classification tasks where we need a fixed# dimensional representation of the segment.# 用于句子分類的pooled_output,使用final hidden layer的第一個token[CLS]的embeding向量with tf.variable_scope("pooler"):# We "pool" the model by simply taking the hidden state corresponding# to the first token. We assume that this has been pre-trainedfirst_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)self.pooled_output = tf.layers.dense(first_token_tensor,config.hidden_size,activation=tf.tanh,kernel_initializer=create_initializer(config.initializer_range))

    輸入加入其他embedding解析:

    def embedding_postprocessor(input_tensor,use_token_type=False,token_type_ids=None,token_type_vocab_size=16,token_type_embedding_name="token_type_embeddings",use_position_embeddings=True,position_embedding_name="position_embeddings",initializer_range=0.02,max_position_embeddings=512,dropout_prob=0.1):"""Performs various post-processing on a word embedding tensor.Args:input_tensor: float Tensor of shape [batch_size, seq_length,embedding_size].use_token_type: bool. Whether to add embeddings for `token_type_ids`.token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].Must be specified if `use_token_type` is True.token_type_vocab_size: int. The vocabulary size of `token_type_ids`.token_type_embedding_name: string. The name of the embedding table variablefor token type ids.use_position_embeddings: bool. Whether to add position embeddings for theposition of each token in the sequence.position_embedding_name: string. The name of the embedding table variablefor positional embeddings.initializer_range: float. Range of the weight initialization.max_position_embeddings: int. Maximum sequence length that might ever beused with this model. This can be longer than the sequence length ofinput_tensor, but cannot be shorter.dropout_prob: float. Dropout probability applied to the final output tensor.Returns:float tensor with same shape as `input_tensor`.Raises:ValueError: One of the tensor shapes or input values is invalid."""input_shape = get_shape_list(input_tensor, expected_rank=3)batch_size = input_shape[0]seq_length = input_shape[1]width = input_shape[2]output = input_tensor# 添加token embeddingif use_token_type:if token_type_ids is None:raise ValueError("`token_type_ids` must be specified if""`use_token_type` is True.")token_type_table = tf.get_variable(name=token_type_embedding_name,shape=[token_type_vocab_size, width],initializer=create_initializer(initializer_range))# This vocab will be small so we always do one-hot here, since it is always# faster for a small vocabulary.flat_token_type_ids = tf.reshape(token_type_ids, [-1])one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)token_type_embeddings = tf.reshape(token_type_embeddings,[batch_size, seq_length, width])output += token_type_embeddings# 添加position embeddingif use_position_embeddings:assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)with tf.control_dependencies([assert_op]):full_position_embeddings = tf.get_variable(name=position_embedding_name,shape=[max_position_embeddings, width],initializer=create_initializer(initializer_range))# Since the position embedding table is a learned variable, we create it# using a (long) sequence length `max_position_embeddings`. The actual# sequence length might be shorter than this, for faster training of# tasks that do not have long sequences.## So `full_position_embeddings` is effectively an embedding table# for position [0, 1, 2, ..., max_position_embeddings-1], and the current# sequence has positions [0, 1, 2, ... seq_length-1], so we can just# perform a slice.position_embeddings = tf.slice(full_position_embeddings, [0, 0],[seq_length, -1])num_dims = len(output.shape.as_list())# Only the last two dimensions are relevant (`seq_length` and `width`), so# we broadcast among the first dimensions, which is typically just# the batch size.position_broadcast_shape = []for _ in range(num_dims - 2):position_broadcast_shape.append(1)position_broadcast_shape.extend([seq_length, width])position_embeddings = tf.reshape(position_embeddings,position_broadcast_shape)output += position_embeddings# 加入LN和dropoutoutput = layer_norm_and_dropout(output, dropout_prob)return output

    transformer模型解析:

    def transformer_model(input_tensor,attention_mask=None,hidden_size=768,num_hidden_layers=12,num_attention_heads=12,intermediate_size=3072,intermediate_act_fn=gelu,hidden_dropout_prob=0.1,attention_probs_dropout_prob=0.1,initializer_range=0.02,do_return_all_layers=False):if hidden_size % num_attention_heads != 0:raise ValueError("The hidden size (%d) is not a multiple of the number of attention ""heads (%d)" % (hidden_size, num_attention_heads))attention_head_size = int(hidden_size / num_attention_heads)input_shape = get_shape_list(input_tensor, expected_rank=3)batch_size = input_shape[0]seq_length = input_shape[1]input_width = input_shape[2]# The Transformer performs sum residuals on all layers so the input needs# to be the same as the hidden size.if input_width != hidden_size:raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %(input_width, hidden_size))# We keep the representation as a 2D tensor to avoid re-shaping it back and# forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on# the GPU/CPU but may not be free on the TPU, so we want to minimize them to# help the optimizer.prev_output = reshape_to_matrix(input_tensor)all_layer_outputs = []# 定義基礎單元疊加數量Nx:num_hidden_layersfor layer_idx in range(num_hidden_layers):with tf.variable_scope("layer_%d" % layer_idx):layer_input = prev_output# Attention層with tf.variable_scope("attention"):attention_heads = []with tf.variable_scope("self"):attention_head = attention_layer(from_tensor=layer_input,to_tensor=layer_input,attention_mask=attention_mask,num_attention_heads=num_attention_heads,size_per_head=attention_head_size,attention_probs_dropout_prob=attention_probs_dropout_prob,initializer_range=initializer_range,do_return_2d_tensor=True,batch_size=batch_size,from_seq_length=seq_length,to_seq_length=seq_length)attention_heads.append(attention_head)attention_output = Noneif len(attention_heads) == 1:attention_output = attention_heads[0]else:# In the case where we have other sequences, we just concatenate# them to the self-attention head before the projection.attention_output = tf.concat(attention_heads, axis=-1)# Run a linear projection of `hidden_size` then add a residual# with `layer_input`.# dropout并加入短鏈接ResNet和LNwith tf.variable_scope("output"):attention_output = tf.layers.dense(attention_output,hidden_size,kernel_initializer=create_initializer(initializer_range))attention_output = dropout(attention_output, hidden_dropout_prob)attention_output = layer_norm(attention_output + layer_input)# The activation is only applied to the "intermediate" hidden layer.# 映射到intermediate_size大小with tf.variable_scope("intermediate"):intermediate_output = tf.layers.dense(attention_output,intermediate_size,activation=intermediate_act_fn,kernel_initializer=create_initializer(initializer_range))# Down-project back to `hidden_size` then add the residual.# 映射回hidden_size大小,并加入短連接ResNet和LNwith tf.variable_scope("output"):layer_output = tf.layers.dense(intermediate_output,hidden_size,kernel_initializer=create_initializer(initializer_range))layer_output = dropout(layer_output, hidden_dropout_prob)layer_output = layer_norm(layer_output + attention_output)prev_output = layer_outputall_layer_outputs.append(layer_output)if do_return_all_layers:final_outputs = []for layer_output in all_layer_outputs:# reshap回input_shape尺寸final_output = reshape_from_matrix(layer_output, input_shape)final_outputs.append(final_output)return final_outputselse:final_output = reshape_from_matrix(prev_output, input_shape)return final_output

    Attention層:
    Attention(Q,K,V)=softmax(QKTdk)VAttention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})VAttention(Q,K,V)=softmax(dk??QKT?)V
    其中Q為query, K為key, V為value, Q∈Rn×?dkQ∈R^{n\text{ × }d_k}QRn?×?dk? dk,K∈Rm×?dk,V∈Rm×?dvd_k, K∈R^{m\text{ × }d_k}, V∈R^{m\text{ × }d_v}dk?,KRm?×?dk?,VRm?×?dv?dkd_kdk?為query和key的維度。?其中,除以dkd_kdk?是為了防止點積過大,QKTQK^TQKT是計算序列中各個位置的相互關系。

    def attention_layer(from_tensor,to_tensor,attention_mask=None,num_attention_heads=1,size_per_head=512,query_act=None,key_act=None,value_act=None,attention_probs_dropout_prob=0.0,initializer_range=0.02,do_return_2d_tensor=False,batch_size=None,from_seq_length=None,to_seq_length=None):def transpose_for_scores(input_tensor, batch_size, num_attention_heads,seq_length, width):output_tensor = tf.reshape(input_tensor, [batch_size, seq_length, num_attention_heads, width])output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])return output_tensorfrom_shape = get_shape_list(from_tensor, expected_rank=[2, 3])to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])if len(from_shape) != len(to_shape):raise ValueError("The rank of `from_tensor` must match the rank of `to_tensor`.")if len(from_shape) == 3:batch_size = from_shape[0]from_seq_length = from_shape[1]to_seq_length = to_shape[1]elif len(from_shape) == 2:if (batch_size is None or from_seq_length is None or to_seq_length is None):raise ValueError("When passing in rank 2 tensors to attention_layer, the values ""for `batch_size`, `from_seq_length`, and `to_seq_length` ""must all be specified.")# Scalar dimensions referenced here:# B = batch size (number of sequences)# F = `from_tensor` sequence length, 輸入tensor# T = `to_tensor` sequence length, 輸出tensor# N = `num_attention_heads`, Attention頭數目# H = `size_per_head`, 每個Attention頭的大小from_tensor_2d = reshape_to_matrix(from_tensor)to_tensor_2d = reshape_to_matrix(to_tensor)# `query_layer` = [B*F, N*H]# Q, [B*F, N*H], 將from_tensor映射為num_attention_heads * size_per_head大小query_layer = tf.layers.dense(from_tensor_2d,num_attention_heads * size_per_head,activation=query_act,name="query",kernel_initializer=create_initializer(initializer_range))# `key_layer` = [B*T, N*H]# K, [B*T, N*H], 將to_tensor映射為num_attention_heads * size_per_head大小key_layer = tf.layers.dense(to_tensor_2d,num_attention_heads * size_per_head,activation=key_act,name="key",kernel_initializer=create_initializer(initializer_range))# `value_layer` = [B*T, N*H]# V, [B*T, N*H], 將to_tensor映射為num_attention_heads * size_per_head大小value_layer = tf.layers.dense(to_tensor_2d,num_attention_heads * size_per_head,activation=value_act,name="value",kernel_initializer=create_initializer(initializer_range))# `query_layer` = [B, N, F, H]# 重排列維度query_layer = transpose_for_scores(query_layer, batch_size,num_attention_heads, from_seq_length,size_per_head)# `key_layer` = [B, N, T, H]key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,to_seq_length, size_per_head)# Take the dot product between "query" and "key" to get the raw# attention scores.# `attention_scores` = [B, N, F, T]# 計算QK/sqrt(dk)attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)attention_scores = tf.multiply(attention_scores,1.0 / math.sqrt(float(size_per_head)))if attention_mask is not None:# `attention_mask` = [B, 1, F, T]attention_mask = tf.expand_dims(attention_mask, axis=[1])# Since attention_mask is 1.0 for positions we want to attend and 0.0 for# masked positions, this operation will create a tensor which is 0.0 for# positions we want to attend and -10000.0 for masked positions.# 對于attention_mask為1的加零維持原樣;# 對于attention_mask為0的加入-10000令最終該位置的softmax結果為0。adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0# Since we are adding it to the raw scores before the softmax, this is# effectively the same as removing these entirely.attention_scores += adder# Normalize the attention scores to probabilities.# `attention_probs` = [B, N, F, T]# 計算softmaxattention_probs = tf.nn.softmax(attention_scores)# This is actually dropping out entire tokens to attend to, which might# seem a bit unusual, but is taken from the original Transformer paper.# 加入dropoutattention_probs = dropout(attention_probs, attention_probs_dropout_prob)# `value_layer` = [B, T, N, H]value_layer = tf.reshape(value_layer,[batch_size, to_seq_length, num_attention_heads, size_per_head])# `value_layer` = [B, N, T, H]value_layer = tf.transpose(value_layer, [0, 2, 1, 3])# `context_layer` = [B, N, F, H]# softmax結果乘以V, [B, N, F, T] x [B, N, T, H] = [B, N, F, H]context_layer = tf.matmul(attention_probs, value_layer)# `context_layer` = [B, F, N, H]# 重排列維度context_layer = tf.transpose(context_layer, [0, 2, 1, 3])if do_return_2d_tensor:# `context_layer` = [B*F, N*H]context_layer = tf.reshape(context_layer,[batch_size * from_seq_length, num_attention_heads * size_per_head])else:# `context_layer` = [B, F, N*H]# 將num_attention_heads個head連接在一起context_layer = tf.reshape(context_layer,[batch_size, from_seq_length, num_attention_heads * size_per_head])return context_layer

    參考文獻

  • Attention Is All You Need
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • 總結

    以上是生活随笔為你收集整理的Bert源代码(二)模型的全部內容,希望文章能夠幫你解決所遇到的問題。

    如果覺得生活随笔網站內容還不錯,歡迎將生活随笔推薦給好友。