
Twin Delayed DDPG (TD3): A Reinforcement Learning Algorithm

Published: 2024/9/15


Contents

Background
Quick Facts
Key Equations
Exploration vs. Exploitation
Pseudocode
Documentation

Background

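While DDPG can sometimes achieve great performance, it is frequently brittle with respect to hyperparameters and other kinds of tuning. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then breaks the policy, because the policy exploits the errors in the Q-function. Twin Delayed DDPG (TD3) is an algorithm that addresses this issue by introducing three critical tricks: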

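Trick One: Clipped Double-Q Learning. TD3 learns two Q-functions instead of one (hence "twin"), and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.
Trick Two: "Delayed" Policy Updates. TD3 updates the policy (and the target networks) less frequently than the Q-functions. The paper recommends one policy update for every two Q-function updates.
Trick Three: Target Policy Smoothing. TD3 adds noise to the target action, which makes it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action.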

Taken together, these three tricks result in substantially improved performance over baseline DDPG.

Quick Facts

TD3 is an off-policy algorithm.
TD3 can only be used for environments with continuous action spaces.
The Spinning Up implementation of TD3 does not support parallelization.

Key Equations

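TD3 concurrently learns two Q-functions, $Q_{\phi_1}$ and $Q_{\phi_2}$, by mean-square Bellman error minimization, in almost the same way that DDPG learns its single Q-function. To show exactly how TD3 does this and how it differs from normal DDPG, we work from the innermost part of the loss function outwards.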

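1: **Target policy smoothing.** Actions used to form the Q-learning target are based on the target policy, $\mu_{\theta_{\text{targ}}}$, but with clipped noise added on each dimension of the action. After the clipped noise is added, the target action is itself clipped to lie in the valid action range (all valid actions, $a$, satisfy $a_{Low} \leq a \leq a_{High}$). The target actions are thus: $a'(s') = \text{clip}\left(\mu_{\theta_{\text{targ}}}(s') + \text{clip}(\epsilon,-c,c), a_{Low}, a_{High}\right), \;\;\;\;\; \epsilon \sim \mathcal{N}(0, \sigma)$ Target policy smoothing essentially serves as a regularizer for the algorithm. It addresses a particular failure mode that can happen in DDPG: if the Q-function approximator develops an incorrect sharp peak for some actions, the policy will quickly exploit that peak and then show brittle or incorrect behavior. This can be averted by smoothing out the Q-function over similar actions, which is what target policy smoothing is designed to do.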
2: **Clipped double-Q learning.** Both Q-functions use a single target, calculated using whichever of the two target Q-functions gives the smaller value: $y(r,s',d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{i, \text{targ}}}(s', a'(s')),$ and then both are learned by regressing to this target: $L(\phi_1, {\mathcal D}) = \underset{(s,a,r,s',d) \sim {\mathcal D}}{{\mathrm E}}\left[ \Bigg( Q_{\phi_1}(s,a) - y(r,s',d) \Bigg)^2 \right],$ $L(\phi_2, {\mathcal D}) = \underset{(s,a,r,s',d) \sim {\mathcal D}}{{\mathrm E}}\left[ \Bigg( Q_{\phi_2}(s,a) - y(r,s',d) \Bigg)^2 \right].$ Using the smaller Q-value for the target, and regressing towards it, helps fend off overestimation in the Q-function.
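3: The policy is learned just by maximizing $Q_{\phi_1}$: $\max_{\theta} \underset{s \sim {\mathcal D}}{{\mathrm E}}\left[ Q_{\phi_1}(s, \mu_{\theta}(s)) \right],$ which is essentially unchanged from DDPG. However, in TD3 the policy is updated less frequently than the Q-functions. This helps damp the volatility that normally arises in DDPG because of how a policy update changes the target.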
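The first two steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the Spinning Up code: the function name `td3_target`, the callables `mu_targ`, `q1_targ`, `q2_targ`, and the bounds `act_low` / `act_high` are hypothetical stand-ins, and the target is assumed to be the reward plus the discounted minimum of the two target Q-values.

```python
import numpy as np

def td3_target(r, s2, d, mu_targ, q1_targ, q2_targ, rng,
               gamma=0.99, target_noise=0.2, noise_clip=0.5,
               act_low=-1.0, act_high=1.0):
    """Compute the shared Bellman target for a batch of transitions.

    r, d: arrays of shape (batch,) with rewards and done flags (0.0 or 1.0).
    s2: array of next states, shape (batch, obs_dim).
    mu_targ: hypothetical target policy, maps states to actions (batch, act_dim).
    q1_targ, q2_targ: hypothetical target critics, map (states, actions) to (batch,).
    """
    a2 = mu_targ(s2)
    # Target policy smoothing: Gaussian noise, clipped to [-noise_clip, noise_clip].
    eps = rng.normal(0.0, target_noise, size=a2.shape)
    eps = np.clip(eps, -noise_clip, noise_clip)
    # Clip the noisy action back into the valid action range.
    a2 = np.clip(a2 + eps, act_low, act_high)
    # Clipped double-Q: take the elementwise minimum of the two target critics.
    q_min = np.minimum(q1_targ(s2, a2), q2_targ(s2, a2))
    return r + gamma * (1.0 - d) * q_min

def q_loss(q_values, y):
    """Mean-squared Bellman error of one critic against the shared target y."""
    return np.mean((q_values - y) ** 2)
```

Both critics would be regressed towards the same `y`, so `q_loss` would be evaluated once per critic with that critic's predictions for the sampled `(s, a)` pairs.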

Exploration vs. Exploitation

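TD3 trains a deterministic policy in an off-policy way. Because the policy is deterministic, if the agent were to explore on-policy, in the beginning it would probably not try a wide enough variety of actions to find useful learning signals. To make TD3 policies explore better, we add noise to their actions at training time, typically uncorrelated mean-zero Gaussian noise. To obtain higher-quality training data, you may reduce the scale of the noise over the course of training. (We do not do this in our implementation, and keep the noise scale fixed throughout.)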

At test time, to see how well the policy exploits what it has learned, we do not add noise to the actions.

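Our TD3 implementation uses a trick to improve exploration at the start of training. For a fixed number of steps at the beginning (set with the start_steps keyword argument), the agent takes actions sampled from a uniform random distribution over valid actions. After that, it returns to normal TD3 exploration.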
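The exploration scheme described in this section might look roughly like the sketch below. It is an illustration under assumptions, not the library's code: `select_action`, `policy`, `act_low`, `act_high`, and `act_dim` are hypothetical names, and clipping the noisy action back into the valid range is an added assumption that the text above does not state.

```python
import numpy as np

def select_action(policy, obs, t, rng, start_steps=10000, act_noise=0.1,
                  act_low=-1.0, act_high=1.0, act_dim=1):
    """Pick a training-time action at environment step t."""
    if t < start_steps:
        # Early phase: uniform random actions over the valid range.
        return rng.uniform(act_low, act_high, size=act_dim)
    # Later phase: deterministic policy plus uncorrelated mean-zero Gaussian noise.
    a = policy(obs)
    a = a + act_noise * rng.normal(size=a.shape)
    # Assumption: keep the noisy action inside the valid action range.
    return np.clip(a, act_low, act_high)
```

Calling the same function with `act_noise=0.0` and `t >= start_steps` gives the noise-free behavior used at test time.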

Pseudocode
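As a rough sketch of the delayed update schedule only (not the full algorithm), the loop below updates the Q-functions on every step and the policy and target networks once every `policy_delay` steps. The function `run_updates` and its three callbacks are hypothetical placeholders for the real gradient steps, and the choice of updating on steps where `j % policy_delay == 0` is an assumption.

```python
def run_updates(num_updates, update_q, update_policy, update_targets,
                policy_delay=2):
    """Run num_updates gradient steps with delayed policy and target updates."""
    for j in range(num_updates):
        # Both Q-functions are updated on every step.
        update_q(j)
        # The policy and the target networks are updated less often.
        if j % policy_delay == 0:
            update_policy(j)
            update_targets(j)

# Usage: record which steps trigger each kind of update.
q_steps, pi_steps, targ_steps = [], [], []
run_updates(6, q_steps.append, pi_steps.append, targ_steps.append)
```

With `policy_delay=2`, the policy and target networks receive one update for every two Q-function updates, as the paper recommends.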

Documentation

spinup.td3(env_fn, actor_critic=, ac_kwargs={}, seed=0, steps_per_epoch=5000, epochs=100, replay_size=1000000, gamma=0.99, polyak=0.995, pi_lr=0.001, q_lr=0.001, batch_size=100, start_steps=10000, act_noise=0.1, target_noise=0.2, noise_clip=0.5, policy_delay=2, max_ep_len=1000, logger_kwargs={}, save_freq=1)
Parameters:

env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:
ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to TD3.
seed (int) – Seed for random number generators.
steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
epochs (int) – Number of epochs to run and train agent.
replay_size (int) – Maximum length of replay buffer.
gamma (float) – Discount factor. (Always between 0 and 1.)
polyak (float) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to: $\theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + (1-\rho) \theta$ where $\rho$ is polyak. (Always between 0 and 1, usually close to 1.)
pi_lr (float) – Learning rate for policy.
q_lr (float) – Learning rate for Q-networks.
batch_size (int) – Minibatch size for SGD.
start_steps (int) – Number of steps for uniform-random action selection, before running real policy. Helps exploration.
act_noise (float) – Stddev for Gaussian exploration noise added to policy at training time. (At test time, no noise is added.)
target_noise (float) – Stddev for smoothing noise added to target policy.
noise_clip (float) – Limit for absolute value of target policy smoothing noise.
policy_delay (int) – Policy will only be updated once every policy_delay times for each update of the Q-networks.
max_ep_len (int) – Maximum length of trajectory / episode / rollout.
logger_kwargs (dict) – Keyword args for EpochLogger.
save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
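To make the polyak parameter concrete, the following sketch applies the target-network update rule to plain NumPy arrays. The helper `polyak_update` and the list-of-arrays parameter layout are assumptions chosen for illustration, not the library's internal representation.

```python
import numpy as np

def polyak_update(targ_params, main_params, polyak=0.995):
    """Move each target parameter a small step towards its main counterpart."""
    return [polyak * t + (1.0 - polyak) * m
            for t, m in zip(targ_params, main_params)]

# Usage: the target starts at 0 and drifts towards a main value of 1.
targ = [np.zeros(2)]
main = [np.ones(2)]
for _ in range(3):
    targ = polyak_update(targ, main, polyak=0.9)
```

With polyak close to 1, the target networks change slowly, which keeps the regression target for the Q-functions comparatively stable.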
