In this paper, we introduce a new method dubbed Scheduled Auxiliary Control (SAC-X), as a first step towards such an approach. It is based on four main principles:

1. Every state-action pair is paired with a vector of rewards, consisting of (typically sparse) externally provided rewards and (typically sparse) internal auxiliary rewards.
2. Each reward entry has an assigned policy, called intention in the following, which is trained to maximize its corresponding cumulative reward.
3. There is a high-level scheduler which selects and executes the individual intentions with the goal of improving the performance of the agent on the external tasks.
4. Learning is performed off-policy (and asynchronously from policy execution), and the experience is shared between intentions to use information effectively.

Although the approach proposed in this paper is generally applicable to a wider range of problems, we discuss our method in the light of a typical robotics manipulation application with sparse rewards: stacking various objects and cleaning a table.

In short, the method rests on four principles: every state is paired with a vector of (typically sparse) rewards; each reward entry is assigned its own policy, called an intention, trained to maximize its cumulative reward; a high-level scheduler selects and executes individual intentions so as to improve the agent's performance on the external tasks; and learning is off-policy (the value updates use the learned target policy rather than the behavior policy that collected the data), with experience shared between intentions for efficiency. Although the approach applies to a broader range of problems, the paper demonstrates it on typical robot manipulation tasks.

On the benefits of off-policy learning: https://www.zhihu.com/question/57159315
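The four principles can be sketched in miniature. The Python toy below is only an illustration under loose assumptions (a 1-D chain world, tabular Q-learning, a uniform scheduler; the paper itself uses neural networks and a learned scheduler in its SAC-Q variant): it shows a reward vector with a sparse external and a sparse auxiliary entry, one value function per intention, a scheduler picking which intention acts, and off-policy updates from a shared replay buffer.

```python
import random
from collections import defaultdict

# Toy sketch of the four SAC-X principles on a 1-D chain world.
# The environment, names, and hyperparameters are illustrative assumptions,
# not the paper's actual implementation.

N = 7              # states 0..6; actions: 0 = left, 1 = right
GOAL, MID = 6, 3

def step(s, a):
    return max(0, min(N - 1, s + (1 if a == 1 else -1)))

# Principle 1: a vector of sparse rewards (external task + auxiliary).
rewards = {
    "main": lambda s2: 1.0 if s2 == GOAL else 0.0,  # external, sparse
    "aux":  lambda s2: 1.0 if s2 == MID else 0.0,   # auxiliary, sparse
}

# Principle 2: one policy (here a tabular Q-function) per reward entry.
Q = {k: defaultdict(float) for k in rewards}
replay = []  # shared experience buffer

def act(intention, s, eps=0.2):
    if random.random() < eps:
        return random.choice([0, 1])
    return max([0, 1], key=lambda a: Q[intention][(s, a)])

random.seed(0)
for episode in range(300):
    # Principle 3: a scheduler chooses which intention controls the episode.
    # Uniform choice mirrors the paper's SAC-U variant; SAC-Q learns it.
    intention = random.choice(list(rewards))
    s = random.randrange(N)  # random starts stand in for rich exploration
    for _ in range(20):
        a = act(intention, s)
        s2 = step(s, a)
        replay.append((s, a, s2))
        s = s2
    # Principle 4: off-policy learning from shared experience -- every
    # intention updates from transitions regardless of who generated them.
    for (s0, a0, s1) in replay[-20:]:
        for k, r in rewards.items():
            target = r(s1) + 0.9 * max(Q[k][(s1, b)] for b in (0, 1))
            Q[k][(s0, a0)] += 0.5 * (target - Q[k][(s0, a0)])

# Greedy execution of the external-task intention should reach the goal.
s = 0
for _ in range(10):
    s = step(s, act("main", s, eps=0.0))
print(s)
```

Note how the auxiliary intention keeps generating useful transitions even when the external reward is never seen, which is exactly why the shared, off-policy replay matters for sparse-reward tasks.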
Paper: Learning by Playing – Solving Sparse Reward Tasks from Scratch