[Reinforcement Learning] The Nature DQN Algorithm and a Reproduction of Morvan's Code (TensorFlow)

Published: 2023/12/20

DQN (Deep Q-Network) combines deep learning with reinforcement learning. In Q-learning, training proceeds by repeatedly updating the values in a Q-table. When the state space is large, however, a Q-table cannot hold every entry, and this is what motivated DQN. The core idea of DQN is to turn the Q-table update into a function-approximation problem: a fitted function replaces the Q-table and produces the Q values.

I. Principles of the DQN Algorithm

Reinforcement learning algorithms fall into three broad classes: value-based, policy-based, and actor-critic. DQN is the representative value-based algorithm: it has only a value-function network and no policy network.

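In DQN (NIPS 2013), the target Q value is computed as:

$$y_j = \begin{cases} r_j, & \text{if step } j+1 \text{ is terminal} \\ r_j + \gamma \max_{a'} Q(s_{j+1}, a'; \theta), & \text{otherwise} \end{cases}$$

Here the target Q value is computed with the parameters $\theta$ of the very Q network being trained, yet $y_j$ is in turn used to update those same parameters. The two depend on each other in a loop, and this strong coupling between iterations hinders convergence. A revised version, Nature DQN, therefore uses two neural networks with identical architecture to reduce the dependence between computing the target Q value and updating the Q network parameters. What follows is an introduction to Nature DQN (hereafter simply DQN).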

II. Overview of the Nature DQN Structure

Like Q-learning, DQN is off-policy. DQN adds two innovations: experience replay and fixed Q-targets.

Experience replay: the agent's experience at each time step, $e_t = (s_t, a_t, r_t, s_{t+1})$, is stored in a data set $D = \{e_1, \dots, e_N\}$, a replay memory that pools many episodes. In the inner loop of the algorithm, a random sample is drawn from the stored data and used as input to the neural network in order to update its parameters. Experience replay has the following advantages:

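1. Each step of experience can be used in many weight updates, which improves data efficiency;

2. In a game, consecutive samples are strongly correlated, so adjacent samples do not satisfy the independence assumption, and learning directly from consecutive samples is inefficient. Sampling at random from the replay memory breaks these correlations and therefore reduces the variance of the updates.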
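The replay mechanism described above can be sketched as a fixed-capacity buffer with uniform random sampling. This is only an illustration under assumptions: the class name `ReplayBuffer` and its methods are hypothetical and are not the storage scheme of the code reproduced later in this article.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions (s, a, r, s_next)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry once full.
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the temporal
        # correlation between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.store(s=t, a=0, r=0.0, s_next=t + 1)

# Only the three most recent transitions remain (states 2, 3 and 4).
remaining_states = sorted(tr[0] for tr in buffer.buffer)
batch = buffer.sample(batch_size=2)
```

Because the same stored transition can be drawn in many different minibatches, each step of experience contributes to several weight updates.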

Fixed Q-targets: DQN uses two neural networks with identical architecture, Q-target and Q-predict. Q-target uses old parameters, while Q-predict uses the newest ones. The Q-predict parameters are updated from the loss at every training step; after a set number of training steps, the Q-target parameters are copied from Q-predict. This is what fixed Q-targets means.

Why not update the Q-target parameters at every training step? Because the two Q networks in DQN have exactly the same structure, a new problem would arise: the target would change with every parameter update, which easily prevents the parameters from converging. In supervised learning the labels are fixed and do not change as the parameters are updated. Think of Q-target as a mouse and Q-predict as a cat. We want Q-predict to get as close to Q-target as possible, which is like the cat chasing the mouse. If both keep moving, it is very hard for the cat to catch the mouse. If the mouse (Q-target) is held still and only the cat (Q-predict) is allowed to move, catching it becomes much easier.
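The periodic copy can be illustrated with a toy loop in NumPy. Everything here is an assumption made for illustration: the dictionaries `eval_params` and `target_params` stand in for the two networks, the random perturbation stands in for a gradient step, and a sync interval of 3 is chosen only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter containers: one weight matrix per network.
eval_params = {"w": rng.normal(size=(2, 4))}
target_params = {"w": eval_params["w"].copy()}  # start in sync

replace_target_iter = 3  # hypothetical interval: sync every 3 learning steps
sync_steps = []

for step in range(1, 8):
    # Stand-in for a gradient step: only the Q-predict weights change.
    eval_params["w"] -= 0.01 * rng.normal(size=(2, 4))

    if step % replace_target_iter == 0:
        # Hard update: copy the Q-predict weights into Q-target.
        target_params["w"] = eval_params["w"].copy()
        sync_steps.append(step)

# After step 7 the target still holds the weights copied at step 6.
in_sync = np.allclose(eval_params["w"], target_params["w"])
```

Between two copies the target weights stay frozen, so the regression target does not move while Q-predict is being fitted to it.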

III. Algorithm

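Initialize the memory D with capacity N

Initialize the Q network with random weights $\theta$

Initialize the Q-target network with weights $\theta^- = \theta$

Loop over episode = 1, 2, ..., M:

    Initialize the state S

    Loop over step = 1, 2, ..., T:

        Generate an action with the epsilon-greedy policy: with probability epsilon pick a random action, otherwise pick $a = \arg\max_{a} Q(S, a; \theta)$

        Execute the action, receive the reward $r$ and the new state S_

        Store the transition $(S, a, r, S\_)$ in D

        Sample a random minibatch of transitions $(s_j, a_j, r_j, s_{j+1})$ from D

        If step j+1 is terminal, set $y_j = r_j$; otherwise set $y_j = r_j + \gamma \max_{a'} \hat{Q}(s_{j+1}, a'; \theta^-)$

        Take a gradient descent step on $(y_j - Q(s_j, a_j; \theta))^2$ with respect to $\theta$

        Every C steps update the Q-target network: $\theta^- = \theta$

    End For;

End For.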
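The target computation in the inner loop, including the terminal case, can be sketched for a whole minibatch in NumPy. The function name `td_targets` and the boolean `done` flag are assumptions introduced for illustration; `q_next` stands for the Q-target network's outputs on the next states.

```python
import numpy as np


def td_targets(rewards, q_next, done, gamma=0.9):
    """y_j = r_j if terminal, else r_j + gamma * max_a' Q_target(s_{j+1}, a')."""
    rewards = np.asarray(rewards, dtype=float)
    q_next = np.asarray(q_next, dtype=float)
    done = np.asarray(done, dtype=bool)
    bootstrap = gamma * q_next.max(axis=1)
    # Terminal transitions get no bootstrap term.
    return rewards + np.where(done, 0.0, bootstrap)


rewards = [0.0, 1.0, -1.0]
q_next = [[0.5, 2.0], [3.0, 1.0], [0.0, 4.0]]
done = [False, True, True]

y = td_targets(rewards, q_next, done)
# First row bootstraps: 0.0 + 0.9 * 2.0 = 1.8; the terminal rows keep r_j.
```

For the two terminal transitions the next-state Q values are ignored entirely, which matches the "if step j+1 is terminal" branch above.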

[Figure: the algorithm as given in the original paper]

IV. Code Reproduction (Morvan Python)

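Below, a concrete example demonstrates DQN in use. Following Morvan's DQN code, we build a simple shortest-path game on a 4*4 grid with obstacles. The game is very simple: steer the square to the goal without touching the black squares. In each state there are at most four available actions: up, down, left, and right. Entering a black square gives a reward of -1, reaching the goal gives a reward of 1, and every other position gives a reward of 0.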

The core DQN part of the code is explained in detail here; for the environment code, see: https://github.com/MorvanZhou/

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.Session)


class DeepQNetwork:
    def __init__(
            self,
            n_actions,               # size of the action space (4 here)
            n_features,              # dimension of a state
            learning_rate=0.01,
            reward_decay=0.9,
            e_greedy=0.9,
            replace_target_iter=300,
            memory_size=500,         # capacity of the memory bank
            batch_size=32,
            e_greedy_increment=None,
            output_graph=False,      # write the network graph for TensorBoard
    ):
        self.n_actions = n_actions
        self.n_features = n_features
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon_max = e_greedy
        self.replace_target_iter = replace_target_iter
        self.memory_size = memory_size
        self.batch_size = batch_size
        self.epsilon_increment = e_greedy_increment
        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max

        # total learning step
        self.learn_step_counter = 0

        # initialize the experience pool with zeros; each row is [s, a, r, s_]
        self.memory = np.zeros((self.memory_size, n_features * 2 + 2))

        # consist of [target_net, evaluate_net]
        self._build_net()
        t_params = tf.get_collection('target_net_params')
        e_params = tf.get_collection('eval_net_params')
        self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]

        self.sess = tf.Session()

        if output_graph:
            # $ tensorboard --logdir="logs/"
            # tf.train.SummaryWriter soon be deprecated, use following
            tf.summary.FileWriter("logs/", self.sess.graph)

        self.sess.run(tf.global_variables_initializer())
        self.cost_his = []

self.replace_target_op: the operation by which the Q-target network copies its parameters from Q-predict.

Each sample stored in the experience pool (memory bank) consists of: the current state, the chosen action, the reward, and the next state.

In this game, a state is described by two-dimensional coordinates within the 16-cell grid; this corresponds to features in the code.
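The row layout of the memory bank, `n_features * 2 + 2` columns, can be checked with a small NumPy sketch. The value `n_features = 2` and the sample numbers are assumptions chosen to match the two-dimensional coordinate description; the slicing mirrors how a stored row is taken apart again during learning.

```python
import numpy as np

n_features = 2                  # assumed: a state is an (x, y) coordinate pair
row_width = n_features * 2 + 2  # s, a, r, s_

s = np.array([0.0, 1.0])
a, r = 3, -1.0
s_ = np.array([0.0, 2.0])

# One row of the memory bank: [s, a, r, s_]
transition = np.hstack((s, [a, r], s_))

# Taking the row apart again by position.
s_back = transition[:n_features]
a_back = int(transition[n_features])
r_back = transition[n_features + 1]
s_next_back = transition[-n_features:]
```

Note that the action is stored as a float inside the row, so it has to be cast back to an integer before it can be used as an index.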

    def _build_net(self):
        # ------------------ build evaluate_net ------------------
        self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input
        self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss
        with tf.variable_scope('eval_net'):
            # c_names(collections_names) are the collections to store variables
            # config of layers: the first layer has n_l1 = 10 neurons
            c_names, n_l1, w_initializer, b_initializer = \
                ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \
                tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)

            # first layer. collections is used later when assign to target net
            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
                # ReLU activation; output shape is [None, n_l1]
                l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)

            # second layer. collections is used later when assign to target net
            with tf.variable_scope('l2'):
                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
                # output layer, shape [None, n_actions]: the Q-predict values
                self.q_eval = tf.matmul(l1, w2) + b2

        with tf.variable_scope('loss'):
            # mean squared error between Q-target and Q-predict
            self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))
        with tf.variable_scope('train'):
            # gradient descent is carried out with RMSProp
            self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)

        # ------------------ build target_net ------------------
        self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')  # input
        with tf.variable_scope('target_net'):
            # c_names(collections_names) are the collections to store variables
            c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]
            # Q-target and Q-predict have exactly the same structure. The
            # difference is that Q-target uses older parameters, while the
            # Q-predict parameters are updated at every gradient step.

            # first layer. collections is used later when assign to target net
            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
                l1 = tf.nn.relu(tf.matmul(self.s_, w1) + b1)

            # second layer. collections is used later when assign to target net
            with tf.variable_scope('l2'):
                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
                self.q_next = tf.matmul(l1, w2) + b2

The above is the key structure in DQN: two fully connected networks with identical architecture, but with parameters that are not updated in step, take the place of the Q-table and output Q-target and Q-predict. Each network has a single hidden layer with ReLU activation, so the structure is very simple. A CNN could of course be used as the network structure instead.
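To make the shapes concrete, the same two-layer forward pass can be written in plain NumPy. This is a sketch for illustration, not a replacement for the TensorFlow graph: `init_params` and `forward` are hypothetical helpers, `n_features = 2` and `n_actions = 4` are assumed values, and the initializers copy the ones in the code above (normal with standard deviation 0.3 for weights, constant 0.1 for biases).

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_l1, n_actions = 2, 10, 4  # n_l1 = 10 as in the code above


def init_params():
    return {
        "w1": rng.normal(0.0, 0.3, size=(n_features, n_l1)),
        "b1": np.full((1, n_l1), 0.1),
        "w2": rng.normal(0.0, 0.3, size=(n_l1, n_actions)),
        "b2": np.full((1, n_actions), 0.1),
    }


def forward(params, states):
    # Hidden layer with ReLU, shape [batch, n_l1]
    hidden = np.maximum(0.0, states @ params["w1"] + params["b1"])
    # Linear output layer, shape [batch, n_actions]: one Q value per action
    return hidden @ params["w2"] + params["b2"]


params = init_params()
batch = rng.normal(size=(5, n_features))
q_values = forward(params, batch)  # shape (5, 4)
```

Calling `init_params()` twice would give the two networks of the article: same structure, separate parameter sets.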

    def store_transition(self, s, a, r, s_):
        # hasattr returns True if self has the named attribute; if
        # memory_counter does not exist yet, initialize it to 0
        if not hasattr(self, 'memory_counter'):
            self.memory_counter = 0

        transition = np.hstack((s, [a, r], s_))

        # replace the old memory with new memory
        # The total memory size is fixed. Once it is exceeded, the index wraps
        # around (modulo), so the oldest memory is overwritten by the newest.
        index = self.memory_counter % self.memory_size
        # circular buffer: entries are overwritten in first-in, first-out order
        self.memory[index, :] = transition

        self.memory_counter += 1

    def choose_action(self, observation):
        # to have batch dimension when feed into tf placeholder
        # (the observation is one-dimensional when passed in)
        observation = observation[np.newaxis, :]

        # epsilon-greedy: if the random draw is below self.epsilon, act
        # greedily; otherwise pick a random action
        if np.random.uniform() < self.epsilon:
            # forward feed the observation and get q value for every actions
            actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})
            action = np.argmax(actions_value)
        else:
            action = np.random.randint(0, self.n_actions)
        return action

These two functions fill the experience pool and select actions. Next comes replaying the experience in the pool and updating the Q-target network parameters.
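One detail of the action selection is easy to miss: in this code `epsilon` is the probability of acting greedily, which is the reverse of the convention in the algorithm listing, where epsilon is the probability of a random action. A standalone sketch of the same rule follows; the function signature and the explicit `rng` argument are assumptions made for illustration.

```python
import numpy as np


def choose_action(q_values, epsilon, rng):
    """Greedy with probability epsilon, uniformly random otherwise."""
    if rng.uniform() < epsilon:
        return int(np.argmax(q_values))
    return int(rng.integers(0, len(q_values)))


rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2, 0.0])

# epsilon = 1.0: the draw from [0, 1) is always below epsilon, so always greedy.
always_greedy = [choose_action(q_values, 1.0, rng) for _ in range(20)]
# epsilon = 0.0: the draw is never below epsilon, so always random.
always_random = [choose_action(q_values, 0.0, rng) for _ in range(200)]
```

With this convention, increasing `epsilon` over training makes the agent explore less, not more.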

    def learn(self):
        # check to replace target parameters; print a message after each update
        if self.learn_step_counter % self.replace_target_iter == 0:
            self.sess.run(self.replace_target_op)
            print('\ntarget_params_replaced\n')

        # sample batch memory from all memory
        # If the counter exceeds the capacity, the pool is full and we sample
        # from the whole pool. Otherwise the pool is not yet full and we can
        # only sample from the rows stored so far.
        if self.memory_counter > self.memory_size:
            sample_index = np.random.choice(self.memory_size, size=self.batch_size)
        else:
            sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
        batch_memory = self.memory[sample_index, :]

        # replay: feed the sampled batch through both networks
        q_next, q_eval = self.sess.run(
            [self.q_next, self.q_eval],
            feed_dict={
                self.s_: batch_memory[:, -self.n_features:],  # fixed params
                self.s: batch_memory[:, :self.n_features],  # newest params
            })

        # change q_target w.r.t q_eval's action
        q_target = q_eval.copy()

        batch_index = np.arange(self.batch_size, dtype=np.int32)
        eval_act_index = batch_memory[:, self.n_features].astype(int)
        reward = batch_memory[:, self.n_features + 1]

        # write the target value at position [batch_index, eval_act_index]
        q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

        # Morvan's own explanation:
        """
        Suppose this batch holds 2 sampled memories and each memory yields
        values for 3 actions:
        q_eval =
        [[1, 2, 3],
         [4, 5, 6]]

        q_target = q_eval =
        [[1, 2, 3],
         [4, 5, 6]]

        Then q_target is modified at the position of the action actually
        stored in memory:
        q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

        For example:
            memory 0 has a computed q_target of -1 and used action 0;
            memory 1 has a computed q_target of -2 and used action 2:
        q_target =
        [[-1, 2, 3],
         [4, 5, -2]]

        So (q_target - q_eval) becomes:
        [[(-1)-(1), 0, 0],
         [0, 0, (-2)-(6)]]
        """

In the code above, the part marked "change q_target w.r.t q_eval's action" is one of the harder parts of DQN to follow. We have two neural networks. Suppose that, for one memory, action1 was the action taken and Q-predict gives it Q = 3. By the DQN update rule, the Q-target network contributes the largest Q value over the actions of the next state s_, chosen greedily. The action that attains this maximum may well be action0 rather than action1. If the target value were placed at the position of that greedy action, the two vectors would refer to different actions, and the error backpropagated through the network would not correspond to the action actually taken. This part of the code handles that case: whichever action attains the maximum in the Q-target network, the target value is written at the position of the action that was taken in Q-predict. For example:

Q-predict = [0, 3, 0]: action1 was used in this memory, with Q = 3 for action1 and 0 for the others;

Q-target = [1, 0, 0]: for this memory Q = reward + gamma * maxQ(s_) = 1, but the maximum at s_ was attained by action0. The two action positions do not line up, so the Q-target sample should be changed to [0, 1, 0] to match Q-predict.
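The index alignment can be verified with a few lines of NumPy, using the numbers from the two-memory example quoted in the code (actions 0 and 2, target values -1 and -2). The array `td_target` is a stand-in, assumed for illustration, for `reward + gamma * max Q(s_)`.

```python
import numpy as np

q_eval = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
eval_act_index = np.array([0, 2])   # actions stored in the two memories
td_target = np.array([-1.0, -2.0])  # stand-in for reward + gamma * max Q(s_)

# Start from a copy so that untouched actions contribute zero error.
q_target = q_eval.copy()
batch_index = np.arange(q_eval.shape[0])
# One (row, column) pair per memory: (0, 0) and (1, 2).
q_target[batch_index, eval_act_index] = td_target

error = q_target - q_eval
```

Only the entries for the actions actually taken are non-zero in `error`, so the squared-error loss moves Q-predict for those actions alone.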

        # train the eval network (backpropagation)
        _, self.cost = self.sess.run([self._train_op, self.loss],
                                     feed_dict={self.s: batch_memory[:, :self.n_features],
                                                self.q_target: q_target})
        self.cost_his.append(self.cost)

        # increase epsilon, the probability of the greedy choice, up to self.epsilon_max
        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
        self.learn_step_counter += 1

    def plot_cost(self):
        # show the learning curve
        import matplotlib.pyplot as plt
        # np.arange creates an evenly spaced sequence and returns an array
        plt.plot(np.arange(len(self.cost_his)), self.cost_his)
        plt.ylabel('Cost')
        plt.xlabel('training steps')
        plt.show()
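The epsilon schedule at the end of learn() can be isolated as a small function. This sketch is a variant rather than a copy: it clamps with `min`, whereas the one-line expression in the code above can exceed `epsilon_max` for a single step before being reset. The name `anneal_epsilon` and the increment of 0.25 are assumptions for illustration.

```python
def anneal_epsilon(epsilon, increment, epsilon_max):
    """Raise epsilon by a fixed increment without passing epsilon_max."""
    if increment is None:
        # No schedule requested: use the maximum from the start.
        return epsilon_max
    return min(epsilon + increment, epsilon_max)


epsilon, history = 0.0, []
for _ in range(5):
    epsilon = anneal_epsilon(epsilon, increment=0.25, epsilon_max=0.9)
    history.append(epsilon)
# history is 0.25, 0.5, 0.75, 0.9, 0.9
```

Since epsilon here is the probability of the greedy action, the agent starts fully random and becomes greedy 90% of the time once the schedule saturates.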

[Figures: the game in progress and the learning curve after running the code]

[Figure: TensorBoard output]


That is all for this tutorial. Questions are welcome in the comments!
