Counterfactual Policy Gradients Explained
Among its many challenges, multi-agent reinforcement learning has one obstacle that is often overlooked: credit assignment. To explain this concept, let's first take a look at an example…
Say we have two robots, robot A and robot B, trying to collaboratively push a box into a hole. Both receive a reward of 1 if they push it in and 0 otherwise. Ideally, the two robots would push the box toward the hole at the same time, maximizing the speed and efficiency of the task.
However, suppose that robot A does all the heavy lifting, meaning robot A pushes the box into the hole while robot B stands idly on the sidelines. Even though robot B simply loitered around, both robot A and robot B would receive a reward of 1. In other words, the same behavior is reinforced later on even though robot B executed a suboptimal policy. This is where the issue of credit assignment comes in. In multi-agent systems, we need a way to give "credit" or reward to agents who contribute to the overall goal, not to those who let others do the work.
Okay so what’s the solution? Maybe we only give rewards to agents who contribute to the task itself.
It's Harder than It Seems
It seems like this easy solution may just work, but we have to keep several things in mind.
First, state representation in reinforcement learning might not be expressive enough to properly tailor rewards like this. In other words, we can’t always easily quantify whether an agent contributed to a given task and dole out rewards accordingly.
Second, we don't want to handcraft these rewards, because doing so defeats the purpose of designing multi-agent algorithms. There's a fine line between telling agents how to collaborate and encouraging them to learn how to do so.
One Answer
Counterfactual policy gradients address the credit-assignment problem without explicitly giving the answer away to the agents.
The main idea behind the approach? Train each agent's policy by comparing its actions to the other actions it could have taken. In other words, an agent will ask itself:
“ Would we have gotten more reward if I had chosen a different action?”
By putting this thinking process into mathematics, counterfactual multi-agent (COMA) policy gradients tackle the issue of credit assignment by quantifying how much an agent contributes to completing a task.
The Components
COMA is an actor-critic method that uses centralized learning with decentralized execution. This means we train two networks:
An actor: given a state, outputs an action
A critic: given a state, estimates a value function
In addition, the critic is only used during training and is removed during testing. We can think of the critic as the algorithm’s “training wheels.” We use the critic to guide the actor throughout training and give it advice on how to update and learn its policies. However, we remove the critic when it’s time to execute the actor’s learned policies.
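To make this split concrete, here is a minimal PyTorch sketch of the two networks under my own simplifying assumptions: plain feed-forward layers instead of the recurrent actor used in the paper, and hypothetical names and sizes (Actor, CentralizedCritic, obs_dim, hidden) chosen purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Decentralized actor: maps one agent's own observation to action probabilities."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return F.softmax(self.net(obs), dim=-1)


class CentralizedCritic(nn.Module):
    """Centralized critic: sees the global state plus every agent's action,
    and outputs a joint Q-value. Used only during training."""

    def __init__(self, state_dim, n_agents, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_agents * n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, joint_actions_onehot):
        # joint_actions_onehot: concatenation of each agent's one-hot action.
        return self.net(torch.cat([state, joint_actions_onehot], dim=-1))
```

At execution time only the Actor modules are kept; the critic exists solely to shape their gradients during training.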
For more background on actor-critic methods in general, take a look at Chris Yoon’s in-depth article here:
Let’s start by taking a look at the critic. In this algorithm, we train a network to estimate the joint Q-value across all agents. We’ll discuss the critic’s nuances and how it’s specifically designed later in this article. However, all we need to know now is that we have two copies of the critic network. One is the network we are trying to train and the other is our target network, used for training stability. The target network’s parameters are copied from the training network periodically.
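A small sketch of how those two critic copies could be kept in sync; the critic here is a stand-in module, and the copy interval is an arbitrary placeholder rather than a value taken from the paper.

```python
import copy
import torch.nn as nn

# Stand-in for the centralized critic (any nn.Module with the right inputs works).
critic = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

# The target critic starts out as an exact copy of the training critic.
target_critic = copy.deepcopy(critic)

def maybe_sync_target(step, sync_every=200):
    """Every `sync_every` training steps, copy the training critic's
    parameters into the target critic (used only to compute targets)."""
    if step % sync_every == 0:
        target_critic.load_state_dict(critic.state_dict())
```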
To train the networks, we use on-policy training. Instead of using a one-step or n-step lookahead to determine our target Q-values, we use TD(λ), which mixes n-step returns.
The n-step return and the TD(λ) target are:

G_t^{(n)} = \sum_{l=1}^{n} \gamma^{l-1} r_{t+l} + \gamma^{n} f(s_{t+n})

y_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

where gamma is the discount factor, r denotes the reward at a specific time step, f is our target value function, and lambda is a hyper-parameter. This seemingly infinite-horizon value is calculated using bootstrapped estimates from the target network.
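As a worked example of the formula above, here is one way to compute TD(λ) targets for a finite trajectory in plain NumPy. It uses the standard backward recursion that is equivalent to the truncated mixture of n-step returns; the function name and hyper-parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def td_lambda_targets(rewards, target_values, gamma=0.99, lam=0.8):
    """Compute y_t^lambda for each step of one episode.

    rewards[t]    : reward received after the joint action at step t
    target_values : target critic's estimates f(s_0) ... f(s_T); the final
                    entry bootstraps beyond the last step (use 0 if terminal)
    """
    T = len(rewards)
    targets = np.zeros(T)
    running = target_values[T]  # bootstrap value for the step after the episode ends
    for t in reversed(range(T)):
        # Recursive form: y_t = r_t + gamma * ((1 - lam) * f(s_{t+1}) + lam * y_{t+1})
        running = rewards[t] + gamma * ((1 - lam) * target_values[t + 1] + lam * running)
        targets[t] = running
    return targets
```

Setting lam=0 recovers the one-step TD target, while lam=1 gives the full discounted return bootstrapped only at the end of the trajectory.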
For more information on TD(lambda), Andre Violante’s article provides a fantastic explanation:
Finally, we update the critic's parameters by minimizing the squared error between this TD(λ) target and the critic's current estimate of the joint Q-value:

L_t(\phi) = \left( y_t^{\lambda} - Q_\phi(s_t, \mathbf{u}_t) \right)^2
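Putting the pieces together, a single critic update could look like the sketch below: compute the targets with the target network, then regress the training critic toward them. The function signature is hypothetical and assumes batched tensors for states and joint actions.

```python
import torch
import torch.nn.functional as F

def critic_update(critic, optimizer, states, joint_actions, targets):
    """One gradient step on the squared TD(lambda) error.

    targets : y_t^lambda values, computed with the *target* critic and
              treated as constants here (hence the detach).
    """
    q_values = critic(states, joint_actions).squeeze(-1)
    loss = F.mse_loss(q_values, targets.detach())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```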
The Catch
Now, you may be wondering: this is nothing new! What makes this algorithm special? The beauty of this algorithm lies in how we update the actor networks' parameters.
In COMA, we train a probabilistic policy, meaning each action in a given state is chosen with a specific probability that changes throughout training. In typical actor-critic scenarios, we update the policy using a policy gradient, with the value function as a baseline, giving the advantage actor-critic update:
Naive advantage actor-critic policy update:

g = \mathbb{E}\left[ \nabla_\theta \log \pi_\theta(u_t \mid s_t) \left( Q(s_t, \mathbf{u}_t) - V(s_t) \right) \right]

However, there's a problem here. This fails to address the original issue we were trying to solve: credit assignment. We have no notion of how much any one agent contributes to the task. Instead, all agents are given the same amount of "credit," since our value function estimates the joint value function. As a result, COMA proposes using a different term as our baseline.
To calculate this counterfactual baseline for each agent, we take an expected value over all the actions that agent could take while keeping the actions of all other agents fixed.
Adding the counterfactual baseline to the advantage function estimate:

A^a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'^a} \pi^a(u'^a \mid s) \, Q\big(s, (\mathbf{u}^{-a}, u'^a)\big)

Let's take a step back here and dissect this equation. The first term is just the Q-value associated with the joint state and joint action (all agents). The second term is an expected value. Looking at each individual term in that summation, there are two values being multiplied together. The first is the probability that this agent would have chosen a specific action. The second is the Q-value of taking that action while all other agents keep their actions fixed.
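A minimal sketch of that advantage for one agent with discrete actions is below. It assumes we can query the critic for the joint Q-value of every alternative action of this agent while the other agents' actions stay fixed; the tensor shapes and names are my own illustration, not the paper's code.

```python
import torch

def counterfactual_advantage(q_all_actions, action_probs, taken_action):
    """COMA advantage for a single agent at a single time step.

    q_all_actions : tensor [n_actions], Q(s, (u^{-a}, u'^a)) for every possible
                    action u'^a of this agent, other agents' actions held fixed
    action_probs  : tensor [n_actions], pi^a(u'^a | s) from the agent's current policy
    taken_action  : int, the action the agent actually executed
    """
    q_taken = q_all_actions[taken_action]
    # Counterfactual baseline: expected joint Q-value under the agent's own policy.
    baseline = torch.sum(action_probs * q_all_actions)
    return q_taken - baseline
```

The actor's loss for that step would then be something like -torch.log(action_probs[taken_action]) * advantage.detach(), so that only the policy, and not the critic, is updated through this term.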
Now, why does this work? Intuitively, by using this baseline, the agent knows how much reward this action contributes relative to all other actions it could’ve taken. In doing so, it can better distinguish which actions will better contribute to the overall reward across all agents.
COMA proposes a specific network architecture that helps make computing the baseline more efficient [1]. Furthermore, the algorithm can be extended to continuous action spaces by estimating the expected value with Monte Carlo samples.
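For continuous actions the summation over actions is no longer available, but the same baseline can be approximated by sampling from the agent's policy, as the authors suggest. A rough sketch of that idea; the critic signature and sample count are assumptions for illustration only.

```python
import torch

def monte_carlo_baseline(critic, state, other_agents_actions, policy_dist, n_samples=16):
    """Estimate E_{u'^a ~ pi^a}[ Q(s, (u^{-a}, u'^a)) ] with Monte Carlo samples.

    policy_dist : a torch.distributions object representing agent a's current policy
    """
    q_samples = []
    for _ in range(n_samples):
        sampled_action = policy_dist.sample()  # draw a candidate action for agent a
        q_samples.append(critic(state, other_agents_actions, sampled_action))
    return torch.stack(q_samples).mean()
```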
Results
COMA was tested on StarCraft unit micromanagement, pitted against various central and independent actor-critic variants estimating both Q-values and value functions. The approach was shown to outperform the others significantly. For officially reported results and analysis, check out the original paper [1].
Conclusion
Nobody likes slackers. Neither do robots.
Properly allowing agents to recognize their personal contribution to a task, and optimizing their policies to make the best use of this information, is an essential part of making robots collaborate. In the future, better decentralized approaches may be explored, effectively reducing the learning space exponentially. However, this is easier said than done, as with all problems of this sort. Still, this is a strong milestone toward letting multi-agent systems function at a far higher, more complex level.
From the classic to state-of-the-art, here are related articles discussing both multi-agent and single-agent reinforcement learning:
Source: https://towardsdatascience.com/counterfactual-policy-gradients-explained-40ac91cef6ae