Building a Chatbot with TensorFlow
Contents
0. Preface
1. Training corpus
2. Data preprocessing
3. Converting words to vectors
4. Training
5. Chatbot - verifying the results
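0. Preface
I'm not a machine learning specialist. Three months ago I started catching up on neural networks, convolution and a whole pile of other basic concepts. Damn, it really is a bit complicated. But once I had worked through the basic math, I went back to the TensorFlow API and the Python code and found that, stumbling along, I could actually read it, and even understand a little of the meaning behind it.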
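0x1: Model categories
1. Retrieval-based models vs. generative models
Retrieval-based models have a predefined "repository" containing many responses, together with heuristic rules that pick a suitable response given the input question and its context. These heuristics may be as simple as rule-based expression matching, or as complex as an ensemble of machine learning classifiers. A retrieval-based model does not generate new text; it can only pick a reasonably suitable response out of the predefined repository.
Generative models do not rely on a predefined repository; they produce a new response. The classic generative models are based on machine translation techniques, except that instead of translating one language into another they "translate" a question into a response.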
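To make the retrieval-based side concrete, here is a minimal sketch. It assumes a tiny hand-written repository and uses word-overlap (Jaccard) similarity as the heuristic. The `REPOSITORY` data and the `jaccard` / `retrieve_response` helpers are hypothetical, invented for illustration, and do not come from any library.

```python
# Hypothetical repository of (question, response) pairs, made up for illustration.
REPOSITORY = [
    ("how do I reset my password", "Click 'Forgot password' on the login page."),
    ("what are your opening hours", "We are open from 9am to 6pm."),
    ("where is my order", "You can track your order under 'My orders'."),
]


def jaccard(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / float(len(set_a | set_b))


def retrieve_response(query, repository=REPOSITORY):
    """Return the stored response whose question overlaps most with the query."""
    _, best_response = max(repository, key=lambda pair: jaccard(query, pair[0]))
    return best_response


print(retrieve_response("I forgot my password, how do I reset it"))
```

Note that the function can only ever return one of the three stored responses, which is exactly the limitation described above. A generative model would instead build the answer token by token.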
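2. Long-conversation models vs. short-conversation models
A short conversation is a single-turn, question-and-answer exchange: when the machine receives a question from the user, it returns a suitable answer. A long conversation, by contrast, is a multi-turn back-and-forth, such as two friends exchanging views on some topic. In this setting both parties (one of which may be the chatbot) have to remember what has already been said, and that is one of the differences from the short-conversation setting. At present, customer-service bot systems are usually long-conversation models.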
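3. Open-domain models vs. closed-domain models
In an open-domain setting the user can say anything at all; the input does not have to be a query with a specific purpose or intent. Conversations on social networks such as Twitter and Reddit are typical open-domain situations. Because the number of possible topics is unlimited, and because some common-sense knowledge is needed as a basis for the chat, building a chatbot of this kind is relatively hard.
A closed-domain setting, also called goal-driven, is one where the system is dedicated to solving problems in a specific field, so the number of possible questions and answers is relatively limited. Technical support systems and shopping assistants are examples of closed-domain models. We don't require these systems to discuss politics; they only need to solve our problem as effectively as possible. Users can still ask such a system something completely off-topic, but the system is equally free to give an off-topic reply ;)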
Relevant Link:
http://naturali.io/deeplearning/chatbot/introduction/2016/04/28/chatbot-part1.html
http://blog.topspeedsnail.com/archives/10735/comment-page-1#comment-1161
http://blog.csdn.net/malefactor/article/details/51901115
1. Training corpus
    wget https://raw.githubusercontent.com/rustch3n/dgk_lost_conv/master/dgk_shooter_min.conv.zip
Unzip it:
    unzip dgk_shooter_min.conv.zip
Relevant Link:
https://github.com/rustch3n/dgk_lost_conv
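2. Data preprocessing
The base corpus we get hold of is typically a set of movie dialogue lines, or the Ubuntu Dialog Corpus, but in almost every case we have to go through the following main steps: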
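1. Tokenization (tokenized)
2. Stemming of English words (stemmed)
3. Grouping the inflected forms of English words (lemmatized), for example singular and plural
4. In addition, proper nouns such as person names, place names, organization names, URLs and file-system paths can be uniformly replaced with type identifiers
In this corpus, M marks an utterance and E marks a separator. When we hit an M, we add the current line to a temporary conversation. When we hit an E, there was a break or the speakers changed, so we add the whole temporary conversation to the overall convs set in one go, one conversation at a time. You can think of it as the director shouting "cut" on a film set.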
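    convs = []  # conversation set
    with open(conv_path, encoding="utf8") as f:
        one_conv = []  # a complete conversation
        for line in f:
            # drop the newline and the '/' separators between characters
            line = line.strip('\n').replace('/', '')
            if line == '':
                continue
            if line[0] == 'E':
                # separator: store the conversation collected so far
                if one_conv:
                    convs.append(one_conv)
                one_conv = []
            elif line[0] == 'M':
                # utterance: keep the text after the "M " prefix
                one_conv.append(line.split(' ')[1])
Because the scenario is a chatbot, and film and TV dialogue also proceeds one line per speaker in turn, two special cases have to be ignored here: a conversation that contains only a question or only an answer, and a conversation where the number of questions and answers does not match, i.e. the last speaker asks something and never gets a reply.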
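The two special cases can be handled while splitting each conversation into question/answer pairs. The sketch below is one possible way to do it, assuming `convs` has the list-of-lists shape built by the loop above. `split_ask_answer` is a hypothetical helper name, and the sample data is made up.

```python
def split_ask_answer(convs):
    """Split conversations into aligned question and answer lists.

    Lines at even positions are treated as questions, and the line that
    follows each of them as its answer.
    """
    asks, answers = [], []
    for conv in convs:
        if len(conv) < 2:
            # only a question or only an answer: skip the conversation
            continue
        if len(conv) % 2 != 0:
            # the last question never got an answer: drop it
            conv = conv[:-1]
        for i in range(0, len(conv), 2):
            asks.append(conv[i])
            answers.append(conv[i + 1])
    return asks, answers


# Made-up sample conversations, for illustration only.
sample_convs = [["hello", "hi", "how are you"], ["anyone there"], ["a", "b", "c", "d"]]
asks, answers = split_ask_answer(sample_convs)
print(asks)
print(answers)
```

Pairing by position is a simplification: it assumes the speakers strictly alternate, which movie subtitles do not always guarantee.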
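3. Converting words to vectors
One reason image recognition and speech recognition were the first fields where deep learning achieved major successes is that their raw input data already carries strong correlation between samples. For example, the distribution of pixel weights is basically consistent across different images of the same kind of object. This is essentially the same mechanism the human brain uses to recognize objects of the same kind, the ability we usually call "drawing inferences from one instance". In the same way, the more text we have studied, the better we can handle it, and even create new combinations of usage and write elegant prose.
The input data in NLP or semantic understanding, that is dialogues or corpora, often lacks this strong correlation. So we need to bring in a conceptual model: word vectors (as in word2vec) or sequence-level representations (as in seq2seq). Simply put, the vocabulary of the corpus is mapped into a vector space, where the arrangement of the vectors is determined by grammar and by meaning in context. For example, "中國 -> 人" (the character 人, "person", is very likely to come right after 中國, "China"), or "你今年幾歲了 -> 我 ** 歲了" ("How old are you this year?" -> "I am ** years old").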
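As a rough illustration of what "mapped into a vector space" buys us, the sketch below compares words by the cosine of the angle between their vectors. The three-dimensional `toy_vectors` are made up by hand for this example; real word vectors are learned from a corpus and have far more dimensions. `cosine_similarity` is a hypothetical helper name.

```python
import math

# Hand-made toy vectors, for illustration only (not learned from any data).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


related = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
unrelated = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(round(related, 3), round(unrelated, 3))
```

Words that are used in similar contexts end up with a similarity close to 1, while unrelated words score much lower.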
0x1: Tokenization and word encoding
Split every conversation file in the training set into individual characters, forming a vocabulary table (word table).
    # START_VOCABULART (sic) is defined elsewhere in the original script; it is
    # assumed here to hold the special tokens for PAD_ID, GO_ID, EOS_ID and UNK_ID.
    def gen_vocabulary_file(input_file, output_file):
        vocabulary = {}
        with open(input_file) as f:
            counter = 0
            for line in f:
                counter += 1
                # character-level tokens
                tokens = [word for word in line.strip()]
                for word in tokens:
                    if word in vocabulary:
                        vocabulary[word] += 1
                    else:
                        vocabulary[word] = 1
        # special tokens first, then characters by descending frequency
        vocabulary_list = START_VOCABULART + sorted(vocabulary, key=vocabulary.get, reverse=True)
        # keep only the 10000 most frequent entries
        if len(vocabulary_list) > 10000:
            vocabulary_list = vocabulary_list[:10000]
        print(input_file + " phrase table size:", len(vocabulary_list))
        with open(output_file, "w") as ff:
            for word in vocabulary_list:
                ff.write(word + "\n")
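Once tokenization is done, the words need to be encoded as numbers to make the later vector-space processing easier. The core idea relied on here is the following.
The conversations in our training corpus are strongly related to one another, so the words in the vocabulary derived from them are logically related as well. We therefore only need to number the words in the vocabulary's native order (in the code above, descending frequency), and the resulting [word, id] table is a vocabulary with vector-space relationships.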
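    def convert_conversation_to_vector(input_file, vocabulary_file, output_file):
        tmp_vocab = []
        with open(vocabulary_file, "r") as f:
            tmp_vocab.extend(f.readlines())
        tmp_vocab = [line.strip() for line in tmp_vocab]
        # map each word to its line number in the vocabulary file
        vocab = dict([(x, y) for (y, x) in enumerate(tmp_vocab)])
        for item in vocab:
            print(item)
        # the part that encodes input_file into output_file is not shown in this listing
So the vocabulary obtained from the training corpus serves as the basis for vectorizing both the training conversations and the test conversations. Our goal is to map the questions and answers of every conversation (training set and test set alike) into vector space.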
    土 968
The character "土" sits at position 968 in the training-set vocabulary, so we assign it the code 968.
0x2: Converting conversations to vectors
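The original author trimmed the vocabulary and kept only the first 5000 entries. Thinking it over carefully, though, I feel the root of the problem is still that the training corpus is not rich enough to cover every conversational scenario.
This step yields a set of ask/answer sentence sequence vectors. For the training set, we build the mapping between each ask and its answer.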
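A minimal sketch of this encoding step, under some stated assumptions: the vocabulary is built in memory instead of being read from a file, the spellings of the four special tokens are guesses (only their ids 0 to 3 appear in this post), and `build_vocab` / `sentence_to_ids` are hypothetical helper names. English sample strings are used for readability; character-level encoding works the same way for Chinese text.

```python
from collections import Counter

# Special tokens take ids 0-3, matching PAD_ID, GO_ID, EOS_ID and UNK_ID in the
# scripts of this post. The token spellings themselves are assumptions.
SPECIAL_TOKENS = ["__PAD__", "__GO__", "__EOS__", "__UNK__"]
UNK_ID = 3


def build_vocab(sentences, max_size=10000):
    """Map each character to an id, most frequent characters first."""
    counts = Counter(ch for sentence in sentences for ch in sentence)
    ordered = SPECIAL_TOKENS + [ch for ch, _ in counts.most_common()]
    return {token: idx for idx, token in enumerate(ordered[:max_size])}


def sentence_to_ids(sentence, vocab):
    """Replace every character by its id; unseen characters become UNK_ID."""
    return [vocab.get(ch, UNK_ID) for ch in sentence]


vocab = build_vocab(["hello", "help"])
ids = sentence_to_ids("helix", vocab)
# One sentence becomes one line of space-separated ids, the format read_data expects.
print(" ".join(str(i) for i in ids))
```

The characters "i" and "x" never occur in the sample corpus, so both are encoded as `UNK_ID`. This is the same effect a trimmed vocabulary has on rare characters.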
4. Training
0x1: Sequence-to-sequence basics
A basic sequence-to-sequence model, as introduced in Cho et al., 2014, consists of two recurrent neural networks (RNNs): an encoder that processes the input and a decoder that generates the output. This basic architecture is depicted below.
Each box in the picture above represents a cell of the RNN, most commonly a GRU cell or an LSTM cell. Encoder and decoder can share weights or, as is more common, use a different set of parameters. Multi-layer cells have been successfully used in sequence-to-sequence models too.
In the basic model depicted above, every input has to be encoded into a fixed-size state vector, as that is the only thing passed to the decoder. To allow the decoder more direct access to the input, an attention mechanism was introduced in Bahdanau et al., 2014; suffice it to say that it allows the decoder to peek into the input at every decoding step. A multi-layer sequence-to-sequence network with LSTM cells and attention mechanism in the decoder looks like this.
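To give a feel for what "peeking into the input" means, here is a small pure-Python sketch of one additive attention step in the spirit of Bahdanau et al. It is not the TensorFlow implementation: the function names, the tiny 2-dimensional states and the identity weight matrices are all invented for illustration, and a real model learns `w_enc`, `w_dec` and `v` during training.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(encoder_states, decoder_state, w_enc, w_dec, v):
    """Return (attention weights, context vector) for one decoding step."""
    projected_dec = matvec(w_dec, decoder_state)
    scores = []
    for h in encoder_states:
        projected_enc = matvec(w_enc, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_enc, projected_dec)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    size = len(encoder_states[0])
    # context = weighted average of the encoder states
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(size)]
    return weights, context


# Toy numbers, made up for illustration.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention(encoder_states, decoder_state, identity, identity, [1.0, 1.0])
print(weights, context)
```

The decoder no longer depends on a single fixed-size vector: at every step it receives a fresh `context` that emphasizes the encoder positions with the highest weights.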
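0x2: Training process
Feed the ask/answer training set into the neural network, which learns by backpropagation (BP), and use the ask/answer test vector set to evaluate it. The network has three layers, TensorFlow adjusts the weight parameters automatically, and the result is a model that maps an ask to a so far unknown answer ("ask -> ?").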
    # -*- coding: utf-8 -*-
    import tensorflow as tf  # 0.12
    from tensorflow.models.rnn.translate import seq2seq_model
    import os
    import numpy as np
    import math

    PAD_ID = 0
    GO_ID = 1
    EOS_ID = 2
    UNK_ID = 3

    # ask/answer conversation vector files
    train_ask_vec_file = 'train_ask.vec'
    train_answer_vec_file = 'train_answer.vec'
    test_ask_vec_file = 'test_ask.vec'
    test_answer_vec_file = 'test_answer.vec'

    # word table 6000
    vocabulary_ask_size = 6000
    vocabulary_answer_size = 6000

    buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
    layer_size = 256
    num_layers = 3
    batch_size = 64


    # read the encoder-side and decoder-side .vec data into memory, grouped by bucket
    def read_data(source_path, target_path, max_size=None):
        data_set = [[] for _ in buckets]
        with tf.gfile.GFile(source_path, mode="r") as source_file:
            with tf.gfile.GFile(target_path, mode="r") as target_file:
                source, target = source_file.readline(), target_file.readline()
                counter = 0
                while source and target and (not max_size or counter < max_size):
                    counter += 1
                    source_ids = [int(x) for x in source.split()]
                    target_ids = [int(x) for x in target.split()]
                    target_ids.append(EOS_ID)
                    # put the pair into the first bucket that is large enough
                    for bucket_id, (source_size, target_size) in enumerate(buckets):
                        if len(source_ids) < source_size and len(target_ids) < target_size:
                            data_set[bucket_id].append([source_ids, target_ids])
                            break
                    source, target = source_file.readline(), target_file.readline()
        return data_set


    if __name__ == '__main__':
        model = seq2seq_model.Seq2SeqModel(source_vocab_size=vocabulary_ask_size,
                                           target_vocab_size=vocabulary_answer_size,
                                           buckets=buckets, size=layer_size, num_layers=num_layers,
                                           max_gradient_norm=5.0,
                                           batch_size=batch_size, learning_rate=0.5,
                                           learning_rate_decay_factor=0.97,
                                           forward_only=False)

        config = tf.ConfigProto()
        config.gpu_options.allocator_type = 'BFC'  # BFC allocator, to help avoid out-of-memory errors

        with tf.Session(config=config) as sess:
            # restore the previous training run, if there is one
            ckpt = tf.train.get_checkpoint_state('.')
            if ckpt is not None:
                print(ckpt.model_checkpoint_path)
                model.saver.restore(sess, ckpt.model_checkpoint_path)
            else:
                sess.run(tf.global_variables_initializer())

            train_set = read_data(train_ask_vec_file, train_answer_vec_file)
            test_set = read_data(test_ask_vec_file, test_answer_vec_file)

            # cumulative share of the training data held by each bucket
            train_bucket_sizes = [len(train_set[b]) for b in range(len(buckets))]
            train_total_size = float(sum(train_bucket_sizes))
            train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                                   for i in range(len(train_bucket_sizes))]

            loss = 0.0
            total_step = 0
            previous_losses = []
            # keep training; save the model every 500 steps
            while True:
                # pick a bucket with probability proportional to its size
                random_number_01 = np.random.random_sample()
                bucket_id = min([i for i in range(len(train_buckets_scale))
                                 if train_buckets_scale[i] > random_number_01])

                encoder_inputs, decoder_inputs, target_weights = model.get_batch(train_set, bucket_id)
                _, step_loss, _ = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket_id, False)

                loss += step_loss / 500
                total_step += 1

                print(total_step)
                if total_step % 500 == 0:
                    print(model.global_step.eval(), model.learning_rate.eval(), loss)

                    # if the loss is worse than the last 3 recorded losses, decay the learning rate
                    if len(previous_losses) > 2 and loss > max(previous_losses[-3:]):
                        sess.run(model.learning_rate_decay_op)
                    previous_losses.append(loss)
                    # save the model
                    checkpoint_path = "chatbot_seq2seq.ckpt"
                    model.saver.save(sess, checkpoint_path, global_step=model.global_step)
                    loss = 0.0
                    # evaluate the model on the test data set (forward pass only)
                    for bucket_id in range(len(buckets)):
                        if len(test_set[bucket_id]) == 0:
                            continue
                        encoder_inputs, decoder_inputs, target_weights = model.get_batch(test_set, bucket_id)
                        _, eval_loss, _ = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket_id, True)
                        eval_ppx = math.exp(eval_loss) if eval_loss < 300 else float('inf')
                        print(bucket_id, eval_ppx)
Relevant Link:
https://www.tensorflow.org/tutorials/seq2seq
http://suriyadeepan.github.io/2016-06-28-easy-seq2seq/
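The training script uses buckets in two places: `read_data` files each ask/answer pair under the first bucket that is large enough, and the training loop picks a bucket at random in proportion to how many pairs it holds. The standalone sketch below isolates both steps with the same `buckets` list. The helper names `find_bucket`, `cumulative_scale` and `sample_bucket` and the bucket sizes `[60, 30, 10, 0]` are made up for illustration.

```python
import random

buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]


def find_bucket(source_len, target_len):
    """Index of the first bucket that fits the pair, or None if none does."""
    for bucket_id, (source_size, target_size) in enumerate(buckets):
        if source_len < source_size and target_len < target_size:
            return bucket_id
    return None


def cumulative_scale(bucket_sizes):
    """Cumulative share of the data held by buckets 0..i, for each i."""
    total = float(sum(bucket_sizes))
    return [sum(bucket_sizes[:i + 1]) / total for i in range(len(bucket_sizes))]


def sample_bucket(scale, rng=random):
    """Pick a bucket id with probability proportional to its size."""
    r = rng.random()
    return min(i for i in range(len(scale)) if scale[i] > r)


print(find_bucket(3, 4))     # short pair
print(find_bucket(12, 8))    # source too long for the first two buckets
print(find_bucket(45, 10))   # source too long for every bucket
scale = cumulative_scale([60, 30, 10, 0])
print(scale)
print(sample_bucket(scale))
```

A pair that fits no bucket is silently dropped by `read_data`, so very long sentences never reach the model. An empty bucket contributes nothing to the cumulative scale and is therefore never sampled.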
5. Chatbot - verifying the results
    # -*- coding: utf-8 -*-
    import tensorflow as tf  # 0.12
    from tensorflow.models.rnn.translate import seq2seq_model
    import os
    import sys
    import locale
    import numpy as np

    PAD_ID = 0
    GO_ID = 1
    EOS_ID = 2
    UNK_ID = 3

    train_ask_vocabulary_file = "train_ask_vocabulary.vec"
    train_answer_vocabulary_file = "train_answer_vocabulary.vec"


    def read_vocabulary(input_file):
        tmp_vocab = []
        with open(input_file, "r") as f:
            tmp_vocab.extend(f.readlines())
        tmp_vocab = [line.strip() for line in tmp_vocab]
        vocab = dict([(x, y) for (y, x) in enumerate(tmp_vocab)])
        # vocab maps word -> id, tmp_vocab maps id -> word
        return vocab, tmp_vocab


    if __name__ == '__main__':
        vocab_en, _, = read_vocabulary(train_ask_vocabulary_file)
        _, vocab_de, = read_vocabulary(train_answer_vocabulary_file)

        # word table 6000
        vocabulary_ask_size = 6000
        vocabulary_answer_size = 6000

        buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
        layer_size = 256
        num_layers = 3
        batch_size = 1

        model = seq2seq_model.Seq2SeqModel(source_vocab_size=vocabulary_ask_size,
                                           target_vocab_size=vocabulary_answer_size,
                                           buckets=buckets, size=layer_size, num_layers=num_layers,
                                           max_gradient_norm=5.0,
                                           batch_size=batch_size, learning_rate=0.5,
                                           learning_rate_decay_factor=0.99,
                                           forward_only=True)
        model.batch_size = 1

        with tf.Session() as sess:
            # restore the last trained model
            ckpt = tf.train.get_checkpoint_state('.')
            if ckpt is not None:
                print(ckpt.model_checkpoint_path)
                model.saver.restore(sess, ckpt.model_checkpoint_path)
            else:
                print("model not found")
                sys.exit(1)  # nothing to chat with without a trained model

            while True:
                # Python 2 input handling; on Python 3 use input('me > ').strip() instead
                input_string = raw_input('me > ').decode(sys.stdin.encoding or locale.getpreferredencoding(True)).strip()
                # quit
                if input_string == 'quit':
                    exit()

                # convert the user's input to a vector of ids
                input_string_vec = []
                for words in input_string.strip():
                    input_string_vec.append(vocab_en.get(words, UNK_ID))
                # smallest bucket that fits; raises ValueError for inputs of 40 or more characters
                bucket_id = min([b for b in range(len(buckets)) if buckets[b][0] > len(input_string_vec)])
                encoder_inputs, decoder_inputs, target_weights = model.get_batch(
                    {bucket_id: [(input_string_vec, [])]}, bucket_id)
                _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket_id, True)
                # greedy decoding: most likely id at each step, cut at the first EOS
                outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
                if EOS_ID in outputs:
                    outputs = outputs[:outputs.index(EOS_ID)]

                response = "".join([tf.compat.as_str(vocab_de[output]) for output in outputs])
                print('AI > ' + response)
Neural networks really do depend heavily on training with samples. In my experiments, the model only started to show results after about 20000 steps on a GPU; that is when the exchange gradually began to look like a normal human-machine conversation.
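The last step of the chat loop, turning the per-step logits into a reply, is plain greedy decoding. The sketch below reproduces it without TensorFlow or NumPy so it can be run on its own. `greedy_decode` is a hypothetical helper name, and the six-entry vocabulary and the scores are made up for illustration.

```python
EOS_ID = 2


def greedy_decode(step_logits, id_to_token):
    """Pick the highest-scoring id at each step and stop at the first EOS."""
    outputs = [max(range(len(logits)), key=logits.__getitem__) for logits in step_logits]
    if EOS_ID in outputs:
        outputs = outputs[:outputs.index(EOS_ID)]
    return "".join(id_to_token[i] for i in outputs)


# Made-up vocabulary (ids 0-3 are the special tokens) and made-up scores.
id_to_token = ["__PAD__", "__GO__", "__EOS__", "__UNK__", "h", "i"]
step_logits = [
    [0.0, 0.0, 0.1, 0.0, 2.0, 0.3],  # highest score: "h"
    [0.0, 0.0, 0.2, 0.0, 0.1, 1.5],  # highest score: "i"
    [0.0, 0.0, 3.0, 0.0, 0.1, 0.2],  # highest score: EOS, the reply ends here
    [0.0, 0.0, 0.0, 0.0, 5.0, 0.0],  # ignored, comes after EOS
]
print(greedy_decode(step_logits, id_to_token))  # hi
```

Greedy decoding always commits to the single best id at each step. It is simple and fast, but it is one reason an undertrained model tends to produce repetitive or generic replies.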