TF LSTM: Building an LSTM Network for Natural Language Modeling on the PTB Dataset with TensorFlow
Contents
About the PTB Dataset
Code Implementation
About the PTB Dataset
The PTB (Penn Treebank) dataset is currently the most widely used dataset in language-model research. It ships as three files:
ptb.test.txt    # test set
ptb.train.txt   # training set
ptb.valid.txt   # validation set
The data in these three files has already been preprocessed: it contains 10,000 distinct words, a sentence-end marker (a newline in the raw text), and a special token that marks rare words.
To make the PTB dataset easier to work with, TensorFlow provides two helper functions for preprocessing the data. The first, ptb_raw_data, reads the raw PTB files and converts each word into a word ID; the second, ptb_producer, batches the resulting ID sequence into input/target pairs. Both appear in the reader code below.
The training data contains 929,589 words in total, concatenated into one very long sequence. Special markers in this sequence indicate where each sentence ends; in this dataset, the sentence-end marker has word ID 2.
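As a quick sanity check, here is a minimal sketch of calling ptb_raw_data (it assumes the reader code below has been saved as reader.py, and the data_path value is a hypothetical placeholder you would adjust to your setup):

from reader import ptb_raw_data

data_path = 'simple-examples/data'  # hypothetical path; adjust to your setup
train_data, valid_data, test_data, vocab = ptb_raw_data(data_path)
print(len(train_data))  # 929589 -- the training words as one long ID sequence
print(vocab)            # 10000  -- vocabulary size
print(train_data[:10])  # the first few word IDs; ID 2 marks a sentence end (<eos>)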
Dataset download: TF's PTB dataset, i.e. Tomas Mikolov's simple-examples.tgz (http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz, as referenced in the reader docstring below). Other versions of the dataset may not match and will cause errors.
Code Implementation
This code uses a 2-layer LSTM network with 200 hidden units per layer. Input sequences are truncated to a length of 32 during training, and Dropout and gradient clipping are used to control overfitting and exploding gradients. After training for just 3 epochs, the test perplexity drops to about 210; training for more epochs would lower it further. The code is organized as two files: reader.py (the data-reading utilities, shown first) and the training script that imports from it.
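For readers unfamiliar with the metric, a brief aside (my gloss, not from the original post): perplexity is the exponential of the average per-word cross-entropy, so a test perplexity of 210 means the model is, on average, roughly as uncertain as a uniform choice among 210 words at each position. A toy computation with made-up loss values:

import numpy as np

per_word_loss = np.array([5.2, 5.4, 5.1])  # hypothetical per-word cross-entropies (nats)
perplexity = np.exp(per_word_loss.mean())
print(perplexity)  # about 187 for these made-up values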
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import os
import sys

import tensorflow as tf

Py3 = sys.version_info[0] == 3


def _read_words(filename):
    with tf.gfile.GFile(filename, "r") as f:
        if Py3:
            return f.read().replace("\n", "<eos>").split()
        else:
            return f.read().decode("utf-8").replace("\n", "<eos>").split()


def _build_vocab(filename):
    data = _read_words(filename)
    counter = collections.Counter(data)
    count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
    words, _ = list(zip(*count_pairs))
    word_to_id = dict(zip(words, range(len(words))))
    return word_to_id


def _file_to_word_ids(filename, word_to_id):
    data = _read_words(filename)
    return [word_to_id[word] for word in data if word in word_to_id]


def ptb_raw_data(data_path=None):
    """Load PTB raw data from data directory "data_path".

    Reads PTB text files, converts strings to integer ids,
    and performs mini-batching of the inputs.

    The PTB dataset comes from Tomas Mikolov's webpage:
    http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz

    Args:
      data_path: string path to the directory where simple-examples.tgz has
        been extracted.

    Returns:
      tuple (train_data, valid_data, test_data, vocabulary)
      where each of the data objects can be passed to PTBIterator.
    """
    train_path = os.path.join(data_path, "ptb.train.txt")
    valid_path = os.path.join(data_path, "ptb.valid.txt")
    test_path = os.path.join(data_path, "ptb.test.txt")

    word_to_id = _build_vocab(train_path)
    train_data = _file_to_word_ids(train_path, word_to_id)
    valid_data = _file_to_word_ids(valid_path, word_to_id)
    test_data = _file_to_word_ids(test_path, word_to_id)
    vocabulary = len(word_to_id)
    return train_data, valid_data, test_data, vocabulary


def ptb_producer(raw_data, batch_size, num_steps, name=None):
    """Iterate on the raw PTB data.

    This chunks up raw_data into batches of examples and returns Tensors that
    are drawn from these batches.

    Args:
      raw_data: one of the raw data outputs from ptb_raw_data.
      batch_size: int, the batch size.
      num_steps: int, the number of unrolls.
      name: the name of this operation (optional).

    Returns:
      A pair of Tensors, each shaped [batch_size, num_steps]. The second element
      of the tuple is the same data time-shifted to the right by one.

    Raises:
      tf.errors.InvalidArgumentError: if batch_size or num_steps are too high.
    """
    with tf.name_scope(name, "PTBProducer", [raw_data, batch_size, num_steps]):
        raw_data = tf.convert_to_tensor(raw_data, name="raw_data", dtype=tf.int32)

        data_len = tf.size(raw_data)
        batch_len = data_len // batch_size
        data = tf.reshape(raw_data[0: batch_size * batch_len],
                          [batch_size, batch_len])

        epoch_size = (batch_len - 1) // num_steps
        assertion = tf.assert_positive(
            epoch_size,
            message="epoch_size == 0, decrease batch_size or num_steps")
        with tf.control_dependencies([assertion]):
            epoch_size = tf.identity(epoch_size, name="epoch_size")

        i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
        x = tf.strided_slice(data, [0, i * num_steps],
                             [batch_size, (i + 1) * num_steps])
        x.set_shape([batch_size, num_steps])
        y = tf.strided_slice(data, [0, i * num_steps + 1],
                             [batch_size, (i + 1) * num_steps + 1])
        y.set_shape([batch_size, num_steps])
        return x, y
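To see what ptb_producer returns, here is a minimal sketch on a toy sequence (again assuming the code above is saved as reader.py; the queue-runner boilerplate is required because it uses TF 1.x input queues):

import tensorflow as tf
from reader import ptb_producer

toy_data = list(range(20))                    # a toy "word ID" sequence
x, y = ptb_producer(toy_data, batch_size=2, num_steps=3)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    x_val, y_val = sess.run([x, y])
    print(x_val)  # shape (2, 3): [[0 1 2], [10 11 12]]
    print(y_val)  # the same data shifted right by one: [[1 2 3], [11 12 13]]
    coord.request_stop()
    coord.join(threads)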
The training script below imports the reader utilities above (saved as reader.py):

from reader import *
import tensorflow as tf
import numpy as np

# path to the extracted simple-examples/data directory; adjust to your setup
data_path = 'F:/File_Python/Python_daydayup/data/simple-examples/data'
# number of hidden units per layer, and number of LSTM layers
hidden_size = 200
num_layers = 2
# vocabulary size
vocab_size = 10000

learning_rate = 1.0
train_batch_size = 16
# truncated sequence length used during training
train_num_step = 32

# no truncation is needed at test time; the test data is one very long sequence
eval_batch_size = 1
eval_num_step = 1
num_epoch = 3

# probability that a unit is kept (not dropped) by Dropout
keep_prob = 0.5

# parameter used to control gradient explosion (clipping threshold)
max_grad_norm = 5
# the PTBModel class describes the model
class PTBModel(object):
    def __init__(self, is_training, batch_size, num_steps):
        # record the batch size and truncated sequence length
        self.batch_size = batch_size
        self.num_steps = num_steps

        # input layer, with shape batch_size x num_steps
        self.input_data = tf.placeholder(tf.int32, [batch_size, num_steps])
        # expected outputs
        self.targets = tf.placeholder(tf.int32, [batch_size, num_steps])

        # use LSTM cells as the recurrent unit, in a deep RNN with Dropout
        lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
        if is_training:
            lstm_cell = tf.nn.rnn_cell.DropoutWrapper(lstm_cell, output_keep_prob=keep_prob)
        cell = tf.nn.rnn_cell.MultiRNNCell([lstm_cell] * num_layers)

        # initialize the state to zeros
        self.initial_state = cell.zero_state(batch_size, tf.float32)

        # map word IDs to word vectors; the embedding matrix is vocab_size x hidden_size
        embedding = tf.get_variable('embedding', [vocab_size, hidden_size])
        # convert a batch of word IDs to word vectors; the converted input
        # has shape batch_size x num_steps x hidden_size
        inputs = tf.nn.embedding_lookup(embedding, self.input_data)

        # apply Dropout only during training
        if is_training:
            inputs = tf.nn.dropout(inputs, keep_prob)

        # output list: collect the LSTM outputs at each time step first,
        # then feed them through a fully connected layer for the final output
        outputs = []
        # `state` stores the LSTM state across batches, initially zero
        state = self.initial_state
        with tf.variable_scope('RNN'):
            for time_step in range(num_steps):
                if time_step > 0:
                    tf.get_variable_scope().reuse_variables()
                # feed the current time step's input and the previous state into the LSTM
                cell_output, state = cell(inputs[:, time_step, :], state)
                # append the current output to the output list
                outputs.append(cell_output)

        # expand the output list to shape [batch, hidden * num_steps],
        # then reshape to [batch * num_steps, hidden]
        output = tf.reshape(tf.concat(outputs, 1), [-1, hidden_size])

        # feed the LSTM output through a fully connected layer to get the predictions.
        # At each time step the result is a tensor of length vocab_size which, after
        # the softmax layer, gives the probability of each word at the next position
        weight = tf.get_variable('weight', [hidden_size, vocab_size])
        bias = tf.get_variable('bias', [vocab_size])
        logits = tf.matmul(output, weight) + bias

        # cross-entropy loss: the sum of cross-entropies over a sequence
        loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(
            [logits],                          # predictions
            [tf.reshape(self.targets, [-1])],  # expected outputs, flattened from [batch_size, num_steps] to 1-D
            [tf.ones([batch_size * num_steps], dtype=tf.float32)])  # loss weights; all ones means every batch and time step counts equally

        # average loss per batch
        self.cost = tf.reduce_sum(loss) / batch_size
        self.final_state = state

        # define backpropagation only when training
        if not is_training:
            return
        trainable_variable = tf.trainable_variables()
        # clip gradients to control gradient explosion
        grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, trainable_variable), max_grad_norm)
        # to use Adam instead, switch to tf.train.AdamOptimizer(learning_rate)
        # and lower the learning rate to around 0.001
        optimizer = tf.train.GradientDescentOptimizer(learning_rate)
        # define the training step
        self.train_op = optimizer.apply_gradients(zip(grads, trainable_variable))
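One caveat from my own experience, not from the original post: on TensorFlow 1.2 and later, building MultiRNNCell from the same cell object (`[lstm_cell] * num_layers`) raises an error about reusing the same cell, because each layer must create its own variables. A commonly used fix is a per-layer cell factory, sketched below as a drop-in replacement for the two cell-construction lines above:

def make_lstm_cell(is_training):
    # build a fresh cell (with its own variables) for each layer
    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
    if is_training:
        cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=keep_prob)
    return cell

cell = tf.nn.rnn_cell.MultiRNNCell(
    [make_lstm_cell(is_training) for _ in range(num_layers)])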
def run_epoch(session, model, data, train_op, output_log, epoch_size):
    total_costs = 0.0
    iters = 0
    state = session.run(model.initial_state)
    # train or evaluate the model on the given data
    for step in range(epoch_size):
        x, y = session.run(data)
        # run train_op on the current batch and compute the loss; the cross-entropy
        # measures the probability of the next word being the given word
        cost, state, _ = session.run(
            [model.cost, model.final_state, train_op],
            {model.input_data: x, model.targets: y, model.initial_state: state})
        # summing the costs over time steps and batches gives the log form of the
        # perplexity; exponentiating that average yields the perplexity itself
        total_costs += cost
        iters += model.num_steps

        # print logs only during training
        if output_log and step % 100 == 0:
            print("After %d steps, perplexity is %.3f" % (step, np.exp(total_costs / iters)))
    return np.exp(total_costs / iters)
def main():
    train_data, valid_data, test_data, _ = ptb_raw_data(data_path)

    # compute the number of steps per epoch
    train_data_len = len(train_data)
    train_batch_len = train_data_len // train_batch_size
    train_epoch_size = (train_batch_len - 1) // train_num_step

    valid_data_len = len(valid_data)
    valid_batch_len = valid_data_len // eval_batch_size
    valid_epoch_size = (valid_batch_len - 1) // eval_num_step

    test_data_len = len(test_data)
    test_batch_len = test_data_len // eval_batch_size
    test_epoch_size = (test_batch_len - 1) // eval_num_step

    initializer = tf.random_uniform_initializer(-0.05, 0.05)
    with tf.variable_scope("language_model", reuse=None, initializer=initializer):
        train_model = PTBModel(True, train_batch_size, train_num_step)
    with tf.variable_scope("language_model", reuse=True, initializer=initializer):
        eval_model = PTBModel(False, eval_batch_size, eval_num_step)

    # train the model
    with tf.Session() as session:
        tf.global_variables_initializer().run()

        train_queue = ptb_producer(train_data, train_model.batch_size, train_model.num_steps)
        eval_queue = ptb_producer(valid_data, eval_model.batch_size, eval_model.num_steps)
        test_queue = ptb_producer(test_data, eval_model.batch_size, eval_model.num_steps)

        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=session, coord=coord)

        for i in range(num_epoch):
            print("In iteration: %d" % (i + 1))
            run_epoch(session, train_model, train_queue, train_model.train_op, True, train_epoch_size)
            valid_perplexity = run_epoch(session, eval_model, eval_queue, tf.no_op(), False, valid_epoch_size)
            print("Epoch: %d Validation Perplexity: %.3f" % (i + 1, valid_perplexity))

        test_perplexity = run_epoch(session, eval_model, test_queue, tf.no_op(), False, test_epoch_size)
        print("Test Perplexity: %.3f" % test_perplexity)

        coord.request_stop()
        coord.join(threads)


if __name__ == "__main__":
    main()
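A final practical note of mine, not from the original post: this script is written against TensorFlow 1.x APIs (tf.placeholder, queue runners such as tf.train.range_input_producer, and tf.contrib.legacy_seq2seq were all removed or relocated in TensorFlow 2.x), so it should be run on a 1.x release such as 1.15, the last of that line. A quick guard you could place at the top of the script:

# sanity check: this script requires TensorFlow 1.x (tf.contrib is gone in 2.x)
import tensorflow as tf
assert tf.__version__.startswith('1.'), (
    "TF 1.x required, found %s" % tf.__version__)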
Summary
With a 2-layer LSTM (200 hidden units per layer) trained on PTB for 3 epochs, this setup reaches a test perplexity of around 210; longer training lowers it further.