

TensorFlow 2.0 Deep Reinforcement Learning Guide


In this tutorial, I will showcase the upcoming TensorFlow 2.0 features through deep reinforcement learning (DRL), by implementing an Advantage Actor-Critic (A2C) agent that solves the classic CartPole-v0 environment. While the goal is to show off TensorFlow 2.0, I will do my best to keep the DRL side approachable as well, including a brief overview of the field.

In fact, since the main focus of the 2.0 release is making developers' lives easier, I think it's a great time to get into DRL with TensorFlow: the full source code of the example used in this article is under 150 lines! The code is available here or here.

Setup

Since TensorFlow 2.0 is still in experimental preview, I recommend installing it in a standalone virtual environment. I personally prefer Anaconda, so I'll demonstrate the installation with it:

```
> conda create -n tf2 python=3.6
> source activate tf2
> pip install tf-nightly-2.0-preview # tf-nightly-gpu-2.0-preview for GPU version
```

Let's quickly verify that everything works as expected:

```python
>>> import tensorflow as tf
>>> print(tf.__version__)
1.13.0-dev20190117
>>> print(tf.executing_eagerly())
True
```

Don't worry about the 1.13.x version; it just means that this is an early preview build. What's important to note here is that we're in eager mode by default!

```
>>> print(tf.reduce_sum([1, 2, 3, 4, 5]))
tf.Tensor(15, shape=(), dtype=int32)
```

If you're not yet familiar with eager mode, it essentially means that computations are executed at runtime, rather than through a pre-compiled graph. You can find a good overview in the TensorFlow documentation.

Deep Reinforcement Learning

Generally speaking, reinforcement learning is a high-level framework for solving sequential decision-making problems. An RL agent navigates an environment: it acts based on certain observations and receives rewards in return. Most RL algorithms work by maximizing the sum of rewards an agent collects in a single episode.
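As a minimal sketch of this interaction loop (my own illustration, not code from this article), using the OpenAI Gym API with a random placeholder policy:

```python
import gym

env = gym.make('CartPole-v0')
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    # placeholder for the agent's policy: pick a random action
    action = env.action_space.sample()
    # act on the environment, observe the result, and collect the reward
    obs, reward, done, _ = env.step(action)
    total_reward += reward  # most RL algorithms maximize this sum
```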

The output of an RL algorithm is typically a policy: a function that maps states to actions. A valid policy can be as simple as a hard-coded no-op action, while a stochastic policy is represented as a conditional probability distribution over actions, given some state.
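To make the distinction concrete, here is a tiny illustration of both kinds of policy (my own sketch; the function names are made up):

```python
import numpy as np

def hardcoded_policy(state):
    # a valid but trivial deterministic policy: always pick action 0
    return 0

def stochastic_policy(state, action_probs):
    # a stochastic policy: sample from the conditional distribution pi(a|s)
    return np.random.choice(len(action_probs), p=action_probs)
```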

Actor-Critic Methods

RL algorithms are often grouped by the objective function they optimize. Value-based methods, such as DQN, work by reducing the error of the expected state-action values.

Policy Gradients methods directly optimize the policy itself by adjusting its parameters, typically via gradient descent. Calculating the gradients in full is usually intractable, so instead they are often estimated via Monte Carlo methods.
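For reference, the score-function (REINFORCE) estimator that underlies these methods can be written in standard textbook notation (not taken from this article) as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\big] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, G_i$$

where the expectation is replaced by an average over $N$ sampled trajectories with returns $G$.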

The most popular approach is a hybrid of the two: actor-critic methods, where the agent's policy is optimized through policy gradients, while a value-based method is used as a bootstrap for the expected-value estimates.
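In standard notation, a single-step actor-critic update therefore weights the policy gradient by a bootstrapped error term computed with the critic's value estimate $V$:

$$\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(r_t + \gamma V(s_{t+1}) - V(s_t)\big)$$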

Deep Actor-Critic Methods

While much of the fundamental RL theory was developed for the tabular case, modern RL is almost exclusively done with function approximators, such as artificial neural networks. Specifically, an RL algorithm is considered "deep" if the policy and value functions are approximated with deep neural networks.

Asynchronous Advantage Actor-Critic

Over the years, a number of improvements have been added to address the sample efficiency and stability of the learning process.

First, gradients are weighted with returns: discounted future rewards. This somewhat alleviates the credit assignment problem and resolves theoretical issues with infinite timesteps.
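The discounted return from timestep $t$ is defined in the usual way:

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}, \qquad 0 \le \gamma < 1$$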

Second, an advantage function is used instead of raw returns. The advantage is formed as the difference between the return and some baseline, and can be seen as a measure of how good a given action is compared to some average.
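With the critic's value estimate as the baseline, which is what we will use below, the advantage is simply:

$$A(s_t, a_t) = G_t - V(s_t)$$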

Third, an additional entropy maximization term is added to the objective function, to ensure the agent sufficiently explores various policies. In essence, entropy measures the randomness of a probability distribution; it is maximized by the uniform distribution.
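For a discrete policy, this entropy term has the standard form:

$$H\big(\pi(\cdot \mid s)\big) = -\sum_{a} \pi(a \mid s) \log \pi(a \mid s)$$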

Finally, multiple workers are used in parallel to speed up sample gathering, which also helps decorrelate the samples during training.

Combining all of these changes with deep neural networks, we arrive at the two most popular modern algorithms: the (asynchronous) advantage actor-critic algorithm, A3C or A2C for short. The difference between the two is more technical than theoretical: as the names suggest, it boils down to how the parallel workers estimate their gradients and propagate them to the model.

With this, I will wrap up our tour of DRL methods, since the focus of this blog post is more on the TensorFlow 2.0 features. Don't worry if you're still unsure about the subject; the code examples should make everything clearer. If you want to learn more, one good resource to get started with is Spinning Up in Deep RL.

Advantage Actor-Critic with TensorFlow 2.0

Let's see what the basis of modern DRL algorithms looks like by implementing an actor-critic agent. As described in the previous section, for simplicity's sake we won't implement parallel workers, although most of the code will support them; interested readers can use this as an exercise opportunity.

As a testbed we will use the CartPole-v0 environment. While somewhat simple, it is still a great option to start with. I always rely on it as a sanity check when implementing RL algorithms.
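For reference, CartPole's interface is tiny, which is part of what makes it such a good sanity check (a quick look, assuming gym is installed):

```python
import gym

env = gym.make('CartPole-v0')
print(env.observation_space)  # Box(4,): cart position/velocity, pole angle/velocity
print(env.action_space)       # Discrete(2): push the cart to the left or to the right
```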

Policy and Value via the Keras Model API

First, let's create the policy and value estimation NNs under a single model class:

```python
import numpy as np
import tensorflow as tf
import tensorflow.keras.layers as kl

class ProbabilityDistribution(tf.keras.Model):
    def call(self, logits):
        # sample a random categorical action from given logits
        return tf.squeeze(tf.random.categorical(logits, 1), axis=-1)

class Model(tf.keras.Model):
    def __init__(self, num_actions):
        super().__init__('mlp_policy')
        # no tf.get_variable(), just simple Keras API
        self.hidden1 = kl.Dense(128, activation='relu')
        self.hidden2 = kl.Dense(128, activation='relu')
        self.value = kl.Dense(1, name='value')
        # logits are unnormalized log probabilities
        self.logits = kl.Dense(num_actions, name='policy_logits')
        self.dist = ProbabilityDistribution()

    def call(self, inputs):
        # inputs is a numpy array, convert to Tensor
        x = tf.convert_to_tensor(inputs, dtype=tf.float32)
        # separate hidden layers from the same input tensor
        hidden_logs = self.hidden1(x)
        hidden_vals = self.hidden2(x)
        return self.logits(hidden_logs), self.value(hidden_vals)

    def action_value(self, obs):
        # executes call() under the hood
        logits, value = self.predict(obs)
        action = self.dist.predict(logits)
        # a simpler option, will become clear later why we don't use it
        # action = tf.random.categorical(logits, 1)
        return np.squeeze(action, axis=-1), np.squeeze(value, axis=-1)
```

Let's verify that the model works as expected:

```python
import gym

env = gym.make('CartPole-v0')
model = Model(num_actions=env.action_space.n)

obs = env.reset()
# no feed_dict or tf.Session() needed at all
action, value = model.action_value(obs[None, :])
print(action, value)  # [1] [-0.00145713]
```

Things to note here:

  • Model layers and the execution path are defined separately;
  • There is no "input" layer; the model will accept raw numpy arrays;
  • Two computation paths can be defined in one model via the functional API;
  • A model can contain helper methods such as action sampling;
  • In eager mode everything works directly from raw numpy arrays.

Random Agent

Now we can move on to the fun stuff: the A2CAgent class. First, let's add a test method that runs through a full episode and returns the sum of rewards.

```python
class A2CAgent:
    def __init__(self, model):
        self.model = model

    def test(self, env, render=True):
        obs, done, ep_reward = env.reset(), False, 0
        while not done:
            action, _ = self.model.action_value(obs[None, :])
            obs, reward, done, _ = env.step(action)
            ep_reward += reward
            if render:
                env.render()
        return ep_reward
```

Let's see how much our model scores with randomly initialized weights:

```python
agent = A2CAgent(model)
rewards_sum = agent.test(env)
print("%d out of 200" % rewards_sum)  # 18 out of 200
```

That's still very far from optimal; next up is the training part!

Loss/Objective Function

As I described in the DRL overview section, an agent improves its policy through gradient descent based on some loss (objective) function. In actor-critic we train on three objectives: improving the policy with advantage-weighted gradients plus entropy maximization, and minimizing the value estimation error.

```python
import tensorflow.keras.losses as kls
import tensorflow.keras.optimizers as ko

class A2CAgent:
    def __init__(self, model):
        # hyperparameters for loss terms
        self.params = {'value': 0.5, 'entropy': 0.0001}
        self.model = model
        self.model.compile(
            optimizer=ko.RMSprop(lr=0.0007),
            # define separate losses for policy logits and value estimate
            loss=[self._logits_loss, self._value_loss]
        )

    def test(self, env, render=True):
        # unchanged from previous section
        ...

    def _value_loss(self, returns, value):
        # value loss is typically MSE between value estimates and returns
        return self.params['value'] * kls.mean_squared_error(returns, value)

    def _logits_loss(self, acts_and_advs, logits):
        # a trick to input actions and advantages through same API
        actions, advantages = tf.split(acts_and_advs, 2, axis=-1)
        # polymorphic CE loss function that supports sparse and weighted options
        # from_logits argument ensures transformation into normalized probabilities
        cross_entropy = kls.CategoricalCrossentropy(from_logits=True)
        # policy loss is defined by policy gradients, weighted by advantages
        # note: we only calculate the loss on the actions we've actually taken
        # thus under the hood a sparse version of CE loss will be executed
        actions = tf.cast(actions, tf.int32)
        policy_loss = cross_entropy(actions, logits, sample_weight=advantages)
        # entropy loss can be calculated via CE over itself
        entropy_loss = cross_entropy(logits, logits)
        # here signs are flipped because optimizer minimizes
        return policy_loss - self.params['entropy'] * entropy_loss
```

And we're done with the objective functions! Note how compact the code is: there are almost more comment lines than the code itself.

Agent Training Loop

Finally, there's the training loop itself. It's relatively long, but fairly straightforward: collect samples, calculate returns and advantages, and train the model on them.

```python
class A2CAgent:
    def __init__(self, model):
        # hyperparameters for loss terms
        self.params = {'value': 0.5, 'entropy': 0.0001, 'gamma': 0.99}
        # unchanged from previous section
        ...

    def train(self, env, batch_sz=32, updates=1000):
        # storage helpers for a single batch of data
        actions = np.empty((batch_sz,), dtype=np.int32)
        rewards, dones, values = np.empty((3, batch_sz))
        observations = np.empty((batch_sz,) + env.observation_space.shape)
        # training loop: collect samples, send to optimizer, repeat updates times
        ep_rews = [0.0]
        next_obs = env.reset()
        for update in range(updates):
            for step in range(batch_sz):
                observations[step] = next_obs.copy()
                actions[step], values[step] = self.model.action_value(next_obs[None, :])
                next_obs, rewards[step], dones[step], _ = env.step(actions[step])
                ep_rews[-1] += rewards[step]
                if dones[step]:
                    ep_rews.append(0.0)
                    next_obs = env.reset()
            _, next_value = self.model.action_value(next_obs[None, :])
            returns, advs = self._returns_advantages(rewards, dones, values, next_value)
            # a trick to input actions and advantages through same API
            acts_and_advs = np.concatenate([actions[:, None], advs[:, None]], axis=-1)
            # performs a full training step on the collected batch
            # note: no need to mess around with gradients, Keras API handles it
            losses = self.model.train_on_batch(observations, [acts_and_advs, returns])
        return ep_rews

    def _returns_advantages(self, rewards, dones, values, next_value):
        # next_value is the bootstrap value estimate of a future state (the critic)
        returns = np.append(np.zeros_like(rewards), next_value, axis=-1)
        # returns are calculated as discounted sum of future rewards
        for t in reversed(range(rewards.shape[0])):
            returns[t] = rewards[t] + self.params['gamma'] * returns[t+1] * (1 - dones[t])
        returns = returns[:-1]
        # advantages are returns - baseline, value estimates in our case
        advantages = returns - values
        return returns, advantages

    def test(self, env, render=True):
        # unchanged from previous section
        ...

    def _value_loss(self, returns, value):
        # unchanged from previous section
        ...

    def _logits_loss(self, acts_and_advs, logits):
        # unchanged from previous section
        ...
```

Training and Results

We're now all set to train our single-worker A2C agent on CartPole-v0! The training process shouldn't take more than a couple of minutes. After training is complete, you should see the agent successfully reach the target score of 200.

```python
rewards_history = agent.train(env)
print("Finished training, testing...")
print("%d out of 200" % agent.test(env))  # 200 out of 200
```


In the source code I've included some additional helpers that print out running episode rewards and losses, along with a basic plotter for rewards_history.
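A minimal version of such a plotter could look like this (my own sketch, not the helper from the source code; assumes matplotlib is installed):

```python
import matplotlib.pyplot as plt

# rewards_history is the list of per-episode reward sums returned by train()
plt.plot(rewards_history)
plt.xlabel('Episode')
plt.ylabel('Total reward')
plt.title('A2C on CartPole-v0')
plt.show()
```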

Static Computational Graph

With all of this eager-mode excitement, you might wonder whether static graph execution is still possible. Of course it is! Moreover, it takes just one additional line of code to enable it!

```python
with tf.Graph().as_default():
    print(tf.executing_eagerly())  # False

    model = Model(num_actions=env.action_space.n)
    agent = A2CAgent(model)

    rewards_history = agent.train(env)
    print("Finished training, testing...")
    print("%d out of 200" % agent.test(env))  # 200 out of 200
```

One thing to note is that during static graph execution we can't just have Tensors lying around, which is why we needed the trick with the ProbabilityDistribution model during model definition. In fact, while I was looking for a way to execute in static mode, I discovered one interesting low-level detail about models built via the Keras API.

One More Thing

Remember when I said that TensorFlow runs in eager mode by default, and even proved it with a code snippet? Well, I was wrong.

If you use the Keras API to build and manage your models, it will attempt to compile them into static graphs under the hood. So what you end up getting is the performance of static computational graphs with the flexibility of eager execution.

You can check your model's status via the model.run_eagerly flag. You can also force eager mode by setting this flag to True, although most of the time you probably don't need to; and if Keras detects that there is no way around eager mode, it will back out of static graphs automatically.

To illustrate that it is indeed running as a static graph, here's a simple benchmark:

```python
# create a 100000 samples batch
env = gym.make('CartPole-v0')
obs = np.repeat(env.reset()[None, :], 100000, axis=0)
```

Eager Benchmark

```python
%%time

model = Model(env.action_space.n)
model.run_eagerly = True

print("Eager Execution: ", tf.executing_eagerly())
print("Eager Keras Model:", model.run_eagerly)

_ = model(obs)

######## Results #######
Eager Execution:  True
Eager Keras Model: True
CPU times: user 639 ms, sys: 736 ms, total: 1.38 s
```

Static Benchmark

```
%%time

with tf.Graph().as_default():
    model = Model(env.action_space.n)

    print("Eager Execution: ", tf.executing_eagerly())
    print("Eager Keras Model:", model.run_eagerly)

    _ = model.predict(obs)

######## Results #######
Eager Execution:  False
Eager Keras Model: False
CPU times: user 793 ms, sys: 79.7 ms, total: 873 ms
```

Default Benchmark

```
%%time

model = Model(env.action_space.n)

print("Eager Execution: ", tf.executing_eagerly())
print("Eager Keras Model:", model.run_eagerly)

_ = model.predict(obs)

######## Results #######
Eager Execution:  True
Eager Keras Model: False
CPU times: user 994 ms, sys: 23.1 ms, total: 1.02 s
```

As you can see, eager mode lags behind static mode, and by default our model is indeed executed statically.

Conclusion

Hopefully this article has given you a good grasp of both DRL and TensorFlow 2.0. Note that TensorFlow 2.0 is still only a preview build, not even a release candidate; everything might change. If there's something about TensorFlow you especially dislike, let its developers know!

A lingering question people may have: is TensorFlow better than PyTorch? Maybe; maybe not. Both are great libraries, so it's hard to call one better than the other. If you're familiar with PyTorch, you have probably noticed that TensorFlow 2.0 not only caught up with it, but also avoided some of the pitfalls of the PyTorch API.

In either case, this competition has already brought positive results for developers on both sides, and I'm excited to see what the frameworks will evolve into in the future.


