A Minimal Working Example for Continuous Policy Gradients in TensorFlow 2.0
At the root of all the sophisticated actor-critic algorithms that are designed and applied these days is the vanilla policy gradient algorithm, which essentially is an actor-only algorithm. Nowadays, the actor that learns the decision-making policy is often represented by a neural network. In continuous control problems, this network outputs the relevant distribution parameters to sample appropriate actions.
With so many deep reinforcement learning algorithms in circulation, you’d expect it to be easy to find abundant plug-and-play TensorFlow implementations for a basic actor network in continuous control, but this is hardly the case. Various reasons may exist for this. First, TensorFlow 2.0 was released only in September 2019, differing quite substantially from its predecessor. Second, most implementations focus on discrete action spaces rather than continuous ones. Third, there are many different implementations in circulation, yet some are tailored such that they only work in specific problem settings. It can be a tad frustrating to plow through several hundred lines of code riddled with placeholders and class members, only to find out the approach is not suitable to your problem after all. This article — based on our ResearchGate note [1] — provides a minimal working example that functions in TensorFlow 2.0. We will show that the real magic happens in only three lines of code!
Some mathematical background
In this article, we present a simple and generic implementation for an actor network in the context of the vanilla policy gradient algorithm REINFORCE [2]. In the continuous variant, we usually draw actions from a Gaussian distribution; the goal is to learn an appropriate mean μ and a standard deviation σ. The actor network learns and outputs these parameters.
Let’s formalize this actor network a bit more. Here, the input is the state s or a feature array ?(s), followed by one or more hidden layers that transform the input, with μ and σ as the outputs. Once this output is obtained, an action a is randomly drawn from the corresponding Gaussian distribution. Thus, we have a = μ(s) + σ(s)ξ, where ξ ~ 𝒩(0,1).
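To make this concrete, here is a minimal sketch of drawing such an action in TensorFlow, assuming actor_network is the Keras model constructed later in this article:

```python
import tensorflow as tf

# Draw an action a = mu(s) + sigma(s) * xi, with xi ~ N(0, 1)
mu, sigma = actor_network(state)
xi = tf.random.normal(shape=mu.shape)
action = mu + sigma * xi
```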
After taking our action a, we observe a corresponding reward signal v. Together with some learning rate α, we may update the weights in a direction that improves the expected reward of our policy. The corresponding update rule [2] is based on gradient ascent.
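For a Gaussian policy, the standard REINFORCE updates for the mean and standard deviation parameters read as follows (a sketch in the notation used here, following [2]; the rendering in [1] may differ):

θ_μ ← θ_μ + α v · ((a − μ(s)) / σ(s)²) · ∇_θ μ(s)

θ_σ ← θ_σ + α v · (((a − μ(s))² − σ(s)²) / σ(s)³) · ∇_θ σ(s)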
If we use a linear approximation scheme μ_θ(s) = θ^⊤ ?(s), we may directly apply these update rules to each feature weight. For neural networks, however, it is not as straightforward how we should perform this update.
Neural networks are trained by minimizing a loss function. We often compute the loss as the mean squared error (squaring the difference between the predicted and observed values). For instance, in a critic network the loss could be defined as (r? + Q??? − Q?)², with Q? being the predicted value and r? + Q??? the observed value. After computing the loss, we backpropagate it through the network, computing the partial losses and gradients required to update the network weights.
At first glance, the update equations have little in common with such a loss function. We simply try to improve our policy by moving in a certain direction, but do not have an explicit ‘target’ or ‘true value’ in mind. Indeed, we will need to define a ‘pseudo loss function’ that helps us update the network [3]. The link between the traditional update rules and this loss function becomes clearer when we express the update rule in its generic form.
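In standard notation, this generic form is the familiar policy gradient update:

θ ← θ + α v ∇_θ log π_θ(a|s)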
Transformation into a loss function is fairly straightforward. As the loss is only the input for the backpropagation procedure, we first drop the learning rate α and the gradient ∇_θ. Furthermore, neural networks are updated using gradient descent instead of gradient ascent, so we must add a minus sign. These steps yield the pseudo-loss function.
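Written in the same notation as above, the pseudo-loss reads:

L(θ) = −v log π_θ(a|s)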
Quite similar to the update rule, right? To provide some intuition: recall that the log transformation yields a negative number for all values smaller than 1. If we have an action with a low probability and a high reward, we want to observe a large loss, i.e., a strong signal to update our policy in the direction of that high reward. The loss function does precisely that.
To apply the update for a Gaussian policy, we can simply substitute π_θ with the Gaussian probability density function (pdf). Note that in the continuous domain we work with pdf values rather than actual probabilities. This yields the so-called weighted Gaussian log likelihood loss function.
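Writing the pseudo-loss out with the Gaussian pdf substituted in (again in standard notation; [1] may format it differently) gives:

L(θ) = −v log[ (1 / (σ_θ(s)√(2π))) exp(−(a − μ_θ(s))² / (2σ_θ(s)²)) ]

which, up to an additive constant, equals v (log σ_θ(s) + (a − μ_θ(s))² / (2σ_θ(s)²)).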
TensorFlow 2.0 implementation
Enough mathematics for now; it’s time for the implementation.
We just defined the loss function, but unfortunately we cannot directly apply it in TensorFlow 2.0. When training a neural network, you may be used to something like model.compile(loss='mse', optimizer=opt), followed by model.fit or model.train_on_batch, but this doesn’t work here. First of all, the Gaussian log likelihood loss function is not a default one in TensorFlow 2.0 (it is available in the Theano library, for example [4]), meaning we have to create a custom loss function. More restrictive though: TensorFlow 2.0 requires a loss function to have exactly two arguments, y_true and y_predicted. As we just saw, we have three arguments due to multiplying with the reward. Let’s worry about that later though, and first present our custom Gaussian loss function:
"""Weighted Gaussian log likelihood loss function for RL""" def custom_loss_gaussian(state, action, reward):# Predict mu and sigma with actor networkmu, sigma = actor_network(state)# Compute Gaussian pdf valuepdf_value = tf.exp(-0.5 *((action - mu) / (sigma))**2)* 1 / (sigma * tf.sqrt(2 * np.pi))# Convert pdf value to log probabilitylog_probability = tf.math.log(pdf_value + 1e-5)# Compute weighted lossloss_actor = - reward * log_probabilityreturn loss_actorSo we have the correct loss function now, but we cannot apply it!? Of course we can — otherwise all of this would have been fairly pointless — it’s just slightly different than you might be used to.
This is where the GradientTape functionality comes in, a novel addition in TensorFlow 2.0 [5]. It essentially records your forward steps on a ‘tape’ so that automatic differentiation can be applied. The updating approach consists of three steps [6]. First, inside the tape we make a forward pass through the actor network (which is memorized) and calculate the loss with our custom loss function. Second, with actor_network.trainable_variables we retrieve the trainable weights used during that forward pass. Subsequently, tape.gradient calculates all the gradients for you by simply plugging in the loss value and the trainable variables. Third, with optimizer.apply_gradients we update the network weights, where the optimizer is one of your choosing (e.g., SGD, Adam, RMSprop). In Python, the update steps look as follows:
"""Compute and apply gradients to update network weights""" with tf.GradientTape() as tape:# Compute Gaussian loss with custom loss functionloss_value = custom_loss_gaussian(state, action, reward)# Compute gradients for actor networkgrads = tape.gradient(loss_value, actor_network.trainable_variables)# Apply gradients to update network weightsoptimizer.apply_gradients(zip(grads, actor_network.trainable_variables))So in the end, we only need a few lines of codes to perform the update!
Numerical example
We present a minimal working example for a continuous control problem; the full code can be found on my GitHub. We consider an extremely simple problem, namely a one-shot game with only one state and a trivial optimal policy. The closer we are to the (fixed but unknown) target, the higher our reward. The reward function is formally denoted as R = ζβ / max(ζ, |τ − a|), with β as the maximum reward, τ as the target and ζ as the target range.
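As an illustration, this reward function can be coded as below; the concrete values for the target τ, the maximum reward β and the target range ζ are placeholders, not the settings used in [1]:

```python
def reward(action, target=0.5, max_reward=1.0, target_range=0.1):
    """R = zeta * beta / max(zeta, |tau - a|): reward peaks near the target."""
    return target_range * max_reward / max(target_range, abs(target - action))
```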
To represent the actor we define a dense neural network (using Keras) that takes the fixed state (a tensor with value 1) as input, performs transformations in two hidden layers with ReLUs as activation functions (five neurons per layer), and returns μ and σ as output. We initialize the bias weights such that we start with μ = 0 and σ = 1. For our optimizer, we use Adam with its default learning rate of 0.001.
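A minimal sketch of this setup is shown below. The two hidden layers of five ReLU units follow the description above; the softplus transform that keeps σ positive, the omission of the bias initialization, and the episode count in the loop are simplifications on my part rather than details taken from [1]:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Actor network: fixed dummy state in, mu and sigma out
inputs = layers.Input(shape=(1,))
hidden = layers.Dense(5, activation="relu")(inputs)
hidden = layers.Dense(5, activation="relu")(hidden)
mu = layers.Dense(1, activation="linear")(hidden)
sigma = layers.Dense(1, activation="softplus")(hidden)   # keep sigma positive

actor_network = tf.keras.Model(inputs=inputs, outputs=[mu, sigma])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# Bare-bones training loop tying the pieces together
state = tf.constant([[1.0]])                      # the fixed state
for episode in range(1000):
    mu_out, sigma_out = actor_network(state)
    action = mu_out + sigma_out * tf.random.normal(shape=mu_out.shape)
    r = reward(action.numpy().item())             # reward function sketched above
    with tf.GradientTape() as tape:
        loss_value = custom_loss_gaussian(state, action, r)
    grads = tape.gradient(loss_value, actor_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, actor_network.trainable_variables))
```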
Some sample runs are shown in the figure below. Note that the convergence pattern is in line with our expectations. At first the losses are relatively high, causing μ to move in the direction of higher rewards and σ to increase, allowing for more exploration. Once the target is hit, the observed losses decrease, causing μ to stabilize and σ to drop to nearly 0.
[Figure: evolution of μ over the sample runs (own work by author [1])]

Key points
- The policy gradient method does not work with traditional loss functions; we must define a pseudo-loss to update actor networks. For continuous control, the pseudo-loss function is simply the negative log of the pdf value multiplied by the reward signal.
- Several TensorFlow 2.0 update functions only accept custom loss functions with exactly two arguments. The GradientTape functionality does not have this restriction.
- Actor networks are updated using three steps: (i) define a custom loss function, (ii) compute the gradients for the trainable variables and (iii) apply the gradients to update the weights of the actor network.
This article is partially based on my ResearchGate paper: ‘Implementing Gaussian Actor Networks for Continuous Control in TensorFlow 2.0’, available at https://www.researchgate.net/publication/343714359_Implementing_Gaussian_Actor_Networks_for_Continuous_Control_in_TensorFlow_20
The GitHub code (implemented using Python 3.8 and TensorFlow 2.3) can be found at: www.github.com/woutervanheeswijk/example_continuous_control
[1] Van Heeswijk, W.J.A. (2020) Implementing Gaussian Actor Networks for Continuous Control in TensorFlow 2.0. https://www.researchgate.net/publication/343714359_Implementing_Gaussian_Actor_Networks_for_Continuous_Control_in_TensorFlow_20
[2] Williams, R. J. (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229-256.
[3] Levine, S. (2019) CS 285 at UC Berkeley Deep Reinforcement Learning: Policy Gradients. http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-5.pdf
[4] Theanets 0.7.3 documentation. Gaussian Log Likelihood Function. https://theanets.readthedocs.io/en/stable/api/generated/theanets.losses.GaussianLogLikelihood.html#theanets.losses.GaussianLogLikelihood
[5] Rosebrock, A. (2020) Using TensorFlow and GradientTape to train a Keras model. https://www.tensorflow.org/api_docs/python/tf/GradientTape
[6] Nandan, A. (2020) Actor Critic Method. https://keras.io/examples/rl/actor_critic_cartpole/
翻譯自: https://towardsdatascience.com/a-minimal-working-example-for-continuous-policy-gradients-in-tensorflow-2-0-d3413ec38c6b