In this paper, we introduce a new method dubbed Scheduled Auxiliary Control (SAC-X), as a first step towards such an approach. It is based on four main principles:

1. Every state-action pair is paired with a vector of rewards, consisting of (typically sparse) externally provided rewards and (typically sparse) internal auxiliary rewards.
2. Each reward entry has an assigned policy, called intention in the following, which is trained to maximize its corresponding cumulative reward.
3. There is a high-level scheduler which selects and executes the individual intentions with the goal of improving the performance of the agent on the external tasks.
4. Learning is performed off-policy (and asynchronously from policy execution), and the experience is shared between intentions to use information effectively.

Although the approach proposed in this paper is generally applicable to a wider range of problems, we discuss our method in the light of a typical robotics manipulation application with sparse rewards: stacking various objects and cleaning a table.

In short, the method rests on four principles: every state is paired with a vector of (typically sparse) rewards; each reward entry is assigned its own policy, called an intention, trained to maximize its cumulative reward; a high-level scheduler selects and executes individual intentions so as to improve the agent's performance on the external tasks; and learning is off-policy (the value updates use the learned target policy rather than the behavior policy that collected the data), with experience shared between intentions for efficiency. Although the approach applies to a broader range of problems, the paper demonstrates it on typical robot manipulation tasks.

On the benefits of off-policy learning: https://www.zhihu.com/question/57159315
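The four principles can be sketched in miniature. The Python toy below is only an illustration under loose assumptions (a 1-D chain world, tabular Q-learning, a uniform scheduler; the paper itself uses neural networks and a learned scheduler in its SAC-Q variant): it shows a reward vector with a sparse external and a sparse auxiliary entry, one value function per intention, a scheduler picking which intention acts, and off-policy updates from a shared replay buffer.

```python
import random
from collections import defaultdict

# Toy sketch of the four SAC-X principles on a 1-D chain world.
# The environment, names, and hyperparameters are illustrative assumptions,
# not the paper's actual implementation.

N = 7              # states 0..6; actions: 0 = left, 1 = right
GOAL, MID = 6, 3

def step(s, a):
    return max(0, min(N - 1, s + (1 if a == 1 else -1)))

# Principle 1: a vector of sparse rewards (external task + auxiliary).
rewards = {
    "main": lambda s2: 1.0 if s2 == GOAL else 0.0,  # external, sparse
    "aux":  lambda s2: 1.0 if s2 == MID else 0.0,   # auxiliary, sparse
}

# Principle 2: one policy (here a tabular Q-function) per reward entry.
Q = {k: defaultdict(float) for k in rewards}
replay = []  # shared experience buffer

def act(intention, s, eps=0.2):
    if random.random() < eps:
        return random.choice([0, 1])
    return max([0, 1], key=lambda a: Q[intention][(s, a)])

random.seed(0)
for episode in range(300):
    # Principle 3: a scheduler chooses which intention controls the episode.
    # Uniform choice mirrors the paper's SAC-U variant; SAC-Q learns it.
    intention = random.choice(list(rewards))
    s = random.randrange(N)  # random starts stand in for rich exploration
    for _ in range(20):
        a = act(intention, s)
        s2 = step(s, a)
        replay.append((s, a, s2))
        s = s2
    # Principle 4: off-policy learning from shared experience -- every
    # intention updates from transitions regardless of who generated them.
    for (s0, a0, s1) in replay[-20:]:
        for k, r in rewards.items():
            target = r(s1) + 0.9 * max(Q[k][(s1, b)] for b in (0, 1))
            Q[k][(s0, a0)] += 0.5 * (target - Q[k][(s0, a0)])

# Greedy execution of the external-task intention should reach the goal.
s = 0
for _ in range(10):
    s = step(s, act("main", s, eps=0.0))
print(s)
```

Note how the auxiliary intention keeps generating useful transitions even when the external reward is never seen, which is exactly why the shared, off-policy replay matters for sparse-reward tasks.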
Paper: Learning by Playing – Solving Sparse Reward Tasks from Scratch