18 Key Questions in Deep Reinforcement Learning | PaperDaily #30

Published: 2024/10/8

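In an age when fragmented reading floods our attention, fewer and fewer people take the time to follow the exploration and thinking behind each paper.

In this column, you can quickly get the highlights and pain points of each selected paper and keep up with the latest results in AI.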

This is the 30th article of PaperDaily.

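About the author: Wang Lingxiao (community ID @Nevertiree) is an intern at the Institute of Automation, Chinese Academy of Sciences, with research interests in reinforcement learning and multi-agent systems.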

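Over the past two days I read two heavyweight papers, A Brief Survey of Deep Reinforcement Learning and Deep Reinforcement Learning: An Overview. Together they cite more than 200 references to lay out the future directions of reinforcement learning.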

■ Paper | A Brief Survey of Deep Reinforcement Learning

■ Link | http://www.paperweekly.site/papers/922

■ Author | Nevertiree

■ Paper | Deep Reinforcement Learning: An Overview

■ Link | http://www.paperweekly.site/papers/1372

■ Author | Nevertiree

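The original papers summarize the common scientific questions in deep reinforcement learning and list the current solutions and related surveys. Here I organize them and extract the relevant papers. This post selects 18 key questions, covering search spaces, exploration versus exploitation, policy evaluation, memory usage, network design, reward feedback, and other topics.

This post selects 73 papers (27 from 2017 and 21 from 2016). For easier reading, the original titles are placed at the end of the post and can be looked up by citation.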

Question 1: Prediction and Policy Evaluation

prediction, policy evaluation

The fundamentals never change: the Temporal Difference (TD) method remains the core idea of policy evaluation 【Sutton 1988】. The extensions of TD are as famous as TD itself: Q-learning from 1992 and DQN from 2015.

The flaw is that TD learning is prone to over-estimation, for the following reason:

The max operator in standard Q-learning and DQN uses the same values both to select and to evaluate an action. (van Hasselt)

van Hasselt has long worked on the over-estimation problem. He first proposed Double Q-learning 【van Hasselt 2010】 at NIPS, and six years later produced the deep learning version, Double DQN 【van Hasselt 2016a】.
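
To make the over-estimation argument concrete, here is a minimal Python sketch. The function names (`q_learning_target`, `double_q_target`, `demo_overestimation`) and the toy setup, in which every true action value is zero and the estimates are corrupted by independent Gaussian noise, are assumptions made for illustration and are not code from the cited papers.

```python
import random


def q_learning_target(reward, next_q, gamma=0.99):
    """Standard target: the same estimates both select and evaluate the action."""
    return reward + gamma * max(next_q)


def double_q_target(reward, next_q_select, next_q_eval, gamma=0.99):
    """Double Q target: one estimate selects the action, the other evaluates it."""
    best_action = max(range(len(next_q_select)), key=lambda a: next_q_select[a])
    return reward + gamma * next_q_eval[best_action]


def demo_overestimation(n_actions=10, n_trials=5000, noise=1.0, seed=0):
    """Average both targets when all true values are 0 and estimates are noisy."""
    rng = random.Random(seed)
    single_sum = 0.0
    double_sum = 0.0
    for _ in range(n_trials):
        # Two independent noisy estimates of the same (all-zero) action values.
        q_a = [rng.gauss(0.0, noise) for _ in range(n_actions)]
        q_b = [rng.gauss(0.0, noise) for _ in range(n_actions)]
        single_sum += q_learning_target(0.0, q_a, gamma=1.0)
        double_sum += double_q_target(0.0, q_a, q_b, gamma=1.0)
    return single_sum / n_trials, double_sum / n_trials
```

In this toy setting the single-estimator target averages well above zero even though every true value is zero, while the double-estimator target stays close to zero, which is the bias that Double Q-learning is designed to remove.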

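Question 2: Control and Finding the Optimal Policy

control, finding optimal policy

Current solutions fall into three schools. A picture is worth a thousand words:

△ Figure 1: A slide by Prof. Hung-yi Lee of National Taiwan University

1. The most traditional approach is value-based: choose the action with the best value. The classic methods are Q-learning 【Watkins and Dayan 1992】 and SARSA 【Sutton and Barto 2017】.

2. Policy-based methods drew attention later, starting with the REINFORCE algorithm 【Williams 1992】 and followed by policy gradient 【Sutton 2000】.

3. The currently popular Actor-Critic 【Barto et al 1983】 combines the two. David Silver, a student of Sutton and the chief designer of AlphaGo, proposed Deterministic Policy Gradient. It is a policy gradient method on the surface, but much of it is really about actor-critic. This improvement is known as DPG 【Silver 2014】.

△ Figure 2: The mutually reinforcing loop of Actor-Critic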
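
For the value-based school in item 1, the difference between the two classic methods shows up in a single line of the update rule. The sketch below uses a tabular value table stored as a list of lists; the function names and default step sizes are assumptions chosen for illustration.

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy update: bootstrap from the best action in the next state."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy update: bootstrap from the action actually taken next."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])
```

Q-learning evaluates the greedy policy regardless of how actions were chosen, while SARSA evaluates the policy that is actually being followed, so the two differ only in which next-state value enters the target.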

Question 3: Instability and Divergence

Instability and divergence when combining off-policy learning, function approximation, and bootstrapping

As early as 1997, Tsitsiklis showed that when the function approximator is a nonlinear black box such as a neural network, convergence and stability cannot be guaranteed.

The watershed paper on Deep Q-Network 【Mnih et al 2013】 notes that although the results look good, there is no theoretical guarantee behind them (the original cleverly phrases it the other way around):

This suggests that, despite lacking any theoretical convergence guarantees, our method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.

△ Figure 3: DQN, which conquered Atari games

The improvements in DQN rely mainly on two tricks:

1. Experience replay 【Lin 1993】

Perfectly independent and identically distributed data is out of reach, but the correlation between samples should still be reduced as much as possible.

2. Target network 【Mnih 2015】

The estimated network and the target network should not update their parameters at the same time; a separate target network is kept to ensure stability.

Since the network Q being updated is also used in calculating the target value, the Q update is prone to divergence. (This is why a target network is needed.)
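
The two tricks above can be sketched in a few lines of plain Python. This is a simplified illustration, not the DQN implementation: the class and function names are hypothetical, and network parameters are stood in for by a dictionary of floats.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer; uniform sampling breaks up temporal correlation."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Draw transitions uniformly at random, without replacement.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def maybe_sync_target(step, online_params, target_params, sync_every=1000):
    """Copy online parameters into the target network every `sync_every` steps."""
    if step % sync_every == 0:
        target_params.update(online_params)
        return True
    return False
```

Between synchronizations the target parameters stay frozen, so the regression target does not move with every gradient step of the online network.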

The following papers are all on DQN-related topics:

1. An upgraded experience replay: Prioritized Experience Replay 【Schaul 2016】

2. A better exploration strategy 【Osband 2016】

3. Faster DQN training 【He 2017a】

4. Reducing variance and instability through averaging: Averaged-DQN 【Anschel 2017】

Moving beyond DQN:

Dueling DQN 【Wang 2016c】 (ICML 2016 best paper)

Tip: before reading this paper, make sure you know the background of DQN, Double DQN, and Prioritized Experience Replay.

Asynchronous algorithm A3C 【Mnih 2016】
TRPO (Trust Region Policy Optimization) 【Schulman 2015】
Distributed Proximal Policy Optimization 【Heess 2017】

Combining policy gradient and Q-learning 【O'Donoghue 2017, Nachum 2017, Gu 2017, Schulman 2017】
GTD 【Sutton 2009a, Sutton 2009b, Mahmood 2014】
Emphatic-TD 【Sutton 2016】

Question 4: Training Perception and Control Jointly End-to-End

train perception and control jointly end-to-end

The existing solution is Guided Policy Search 【Levine et al 2016a】.

Question 5: Data Efficiency

data/sample efficiency

Existing solutions:

Q-learning and Actor-Critic
actor-critic with experience replay 【Wang et al 2017b】
PGQ, policy gradient and Q-learning 【O'Donoghue et al 2017】
Q-Prop, policy gradient with off-policy critic 【Gu et al 2017】
return-based off-policy control, Retrace 【Munos et al 2016】, Reactor 【Gruslys et al 2017】
learning to learn 【Duan et al 2017, Wang et al 2016a, Lake et al 2015】

Question 6: Reward Function Not Available

reward function not available

Existing solutions largely revolve around imitation learning:

Inverse reinforcement learning by Andrew Ng 【Ng and Russell 2000】
learn from demonstration 【Hester et al 2017】
imitation learning with GANs 【Ho and Ermon 2016, Stadie et al 2017】 (with a TensorFlow implementation [1])
train dialogue policy jointly with reward model 【Su et al 2016b】

Question 7: The Exploration-Exploitation Tradeoff

exploration-exploitation tradeoff

Existing solutions:

unify count-based exploration and intrinsic motivation 【Bellemare et al 2017】
under-appreciated reward exploration 【Nachum et al 2017】
deep exploration via bootstrapped DQN 【Osband et al 2016】
variational information maximizing exploration 【Houthooft et al 2016】
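
As a rough illustration of the count-based idea in the first item, the sketch below adds a bonus that shrinks as a state is visited more often. It is the classic tabular form, with a bonus of beta divided by the square root of the visit count, and not the pseudo-count method of the cited paper; the class name `CountBonus` and the parameter `beta` are assumptions for illustration.

```python
import math
from collections import defaultdict


class CountBonus:
    """Tabular exploration bonus: rarely visited states earn extra reward."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)  # visit count per state

    def augmented_reward(self, state, reward):
        # Record the visit, then add a bonus that decays as 1 / sqrt(count).
        self.counts[state] += 1
        return reward + self.beta / math.sqrt(self.counts[state])
```

The agent is trained on the augmented reward, so novel states look attractive at first and lose their appeal as they become familiar.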

Question 8: Model-Based Learning

model-based learning

Existing solutions:

A classic recommended in Sutton's textbook: Dyna-Q 【Sutton 1990】
combining model-free and model-based methods 【Chebotar et al 2017】
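
A minimal tabular sketch of the Dyna-Q idea follows: each real transition is used for a direct Q-learning update, stored in a learned model, and then replayed through a few simulated planning updates. The function name, the dictionary-based tables, and the assumption of a deterministic environment are simplifications for illustration.

```python
import random


def dyna_q_step(q, model, s, a, r, s_next, n_actions, rng,
                alpha=0.1, gamma=0.95, planning_steps=5):
    """One Dyna-Q step: direct RL update, model learning, then planning."""

    def update(s0, a0, r0, s1):
        # One-step Q-learning update; unseen entries default to 0.
        best_next = max(q.get((s1, b), 0.0) for b in range(n_actions))
        old = q.get((s0, a0), 0.0)
        q[(s0, a0)] = old + alpha * (r0 + gamma * best_next - old)

    update(s, a, r, s_next)        # (1) learn from the real transition
    model[(s, a)] = (r, s_next)    # (2) remember it (deterministic model)
    for _ in range(planning_steps):
        # (3) replay a previously seen state-action pair from the model
        ps, pa = rng.choice(list(model.keys()))
        pr, ps_next = model[(ps, pa)]
        update(ps, pa, pr, ps_next)
```

The planning loop is what makes the method sample-efficient: every real interaction is reused several times through the learned model.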

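Question 9: Model-Free Planning

model-free planning

There are two relatively new solutions:

1. Value Iteration Networks 【Tamar et al 2016】, the paper that won the NIPS 2016 best paper award.

Zhihu has a dedicated explanatory article, Value iteration Network [2], and an interview with the author, "NIPS 2016 best paper author: how to build a new view of reinforcement learning" [3]. A TensorFlow implementation of VIN is available at [4].

△ Figure 4: The Value Iteration Network framework

2. The Predictron method published by Silver of DeepMind 【Silver et al 2016b】, with a TensorFlow implementation [5].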
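
For readers who want a reminder of the classical algorithm that Value Iteration Networks build into a differentiable module, here is a plain tabular sketch. It is ordinary value iteration on a small deterministic MDP, not the network itself; the function name and the list-based encoding of transitions and rewards are assumptions for illustration.

```python
def value_iteration(transitions, rewards, gamma=0.9, n_iters=100):
    """Tabular value iteration for a deterministic MDP.

    transitions[s][a] is the next state after taking action a in state s;
    rewards[s][a] is the immediate reward for that action.
    """
    n_states = len(transitions)
    v = [0.0] * n_states
    for _ in range(n_iters):
        # Bellman optimality backup applied to every state at once.
        v = [
            max(rewards[s][a] + gamma * v[transitions[s][a]]
                for a in range(len(transitions[s])))
            for s in range(n_states)
        ]
    return v
```

For example, with `transitions = [[0, 1], [1]]`, `rewards = [[0.0, 0.0], [1.0]]`, and `gamma=0.5`, the values converge to roughly 1 for state 0 and 2 for state 1.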

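Question 10: Borrowing from Other Fields

focus on salient parts

@Yangqing Jia once said:

PhD students in artificial intelligence at Berkeley take a qualifying exam one year after enrollment that covers: reinforcement learning and robotics; statistics and probabilistic graphical models; computer vision and image processing; speech and natural language processing; kernel methods and their theory; and search, CSP, logic, planning, and so on.

If you really want to work on AI, it is worth getting to know all of them. You do not need to master every one, but you should at least be able to chat with people in front of a poster at a conference without making mistakes.

So a good approach is to draw inspiration from computer vision and natural language processing. For example, the unsupervised auxiliary learning method mentioned below borrows many operations from RNN+LSTM.

Some pointers from CV and NLP: object detection 【Mnih 2014】, machine translation 【Bahdanau 2015】, image captioning 【Xu 2015】, replacing CNNs and RNNs with attention 【Vaswani 2017】, and so on.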

Question 11: Long-Term Data Storage

data storage over long time, separating from computation

The best-known solution is the Differentiable Neural Computer 【Graves et al 2016】, which made a splash in Nature.

Question 12: Training Without Reward

benefit from non-reward training signals in environments

Existing solutions revolve around unsupervised learning:

Horde 【Sutton et al 2011】

Outstanding work:

unsupervised reinforcement and auxiliary learning 【Jaderberg et al 2017】

learn to navigate with unsupervised auxiliary learning 【Mirowski et al 2017】

The famous GANs 【Goodfellow et al 2014】

Question 13: Cross-Domain Learning

learn knowledge from different domains

Existing solutions all revolve around transfer learning 【Taylor and Stone 2009, Pan and Yang 2010, Weiss et al 2016】, for example learning invariant features to transfer skills 【Gupta et al 2017】.

Question 14: Learning from Both Labelled and Unlabelled Data

benefit from both labelled and unlabelled data

Existing solutions all revolve around semi-supervised learning:

【Zhu and Goldberg 2009】
learn with MDPs both with and without reward functions 【Finn et al 2017】
learn with expert trajectories and those that may not be from experts 【Audiffren et al 2015】

Question 15: Representation and Inference with Multi-Level Spatio-Temporal Abstraction

learn, plan, and represent knowledge with spatio-temporal abstraction at multiple levels

Existing solutions:

Hierarchical reinforcement learning 【Barto and Mahadevan 2003】
strategic attentive writer to learn macro-actions 【Vezhnevets et al 2016】
integrate temporal abstraction with intrinsic motivation 【Kulkarni et al 2016】
stochastic neural networks for hierarchical RL 【Florensa et al 2017】
lifelong learning with hierarchical RL 【Tessler et al 2017】

Question 16: Rapid Adaptation to New Tasks

adapt rapidly to new tasks

Existing solutions are basically learning to learn:

a flexible RNN model to handle a family of RL tasks 【Duan et al 2017, Wang et al 2016a】
one/few/zero-shot learning 【Duan et al 2017, Johnson et al 2016, Kaiser et al 2017b, Koch et al 2015, Lake et al 2015, Li and Malik 2017, Ravi and Larochelle 2017, Vinyals et al 2016】

Question 17: Gigantic Search Space

gigantic search space

The existing solution is still Monte Carlo tree search; see the original AlphaGo for details 【Silver et al 2016a】.
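
As a small illustration of how a tree search balances exploration and exploitation when choosing which branch to expand, here is a sketch of the generic UCB1 selection rule used in UCT-style Monte Carlo tree search. It is not AlphaGo's exact selection rule, which also uses a policy-network prior; the function name, the `(visits, total_value)` encoding, and the constant `c` are assumptions for illustration.

```python
import math


def uct_select(children, c=1.4):
    """Pick the child index with the highest UCB1 score.

    children is a list of (visits, total_value) pairs, one per child node.
    """
    total_visits = sum(n for n, _ in children)

    def score(child):
        n, w = child
        if n == 0:
            return float("inf")  # always try unvisited children first
        # Mean value plus an exploration term that shrinks with visits.
        return w / n + c * math.sqrt(math.log(total_visits) / n)

    return max(range(len(children)), key=lambda i: score(children[i]))
```

Children with few visits receive a large exploration term, so the search keeps sampling them until their estimated value is clearly worse than the alternatives.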

Question 18: Neural Network Architecture Design

neural networks architecture design

Existing architecture search methods include 【Baker et al 2017, Zoph and Le 2017】; Zoph's work carries particular weight.

New architectures include 【Kaiser et al 2017a, Silver et al 2016b, Tamar et al 2016, Vaswani et al 2017, Wang et al 2016c】.

References

[1] Anschel, O., Baram, N., and Shimkin, N. (2017). Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In the International Conference on Machine Learning (ICML).

[2] Audiffren, J., Valko, M., Lazaric, A., and Ghavamzadeh, M. (2015). Maximum entropy semi-supervised inverse reinforcement learning. In the International Joint Conference on Artificial Intelligence (IJCAI).

[3] Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. (2017). An actor-critic algorithm for sequence prediction. In the International Conference on Learning Representations (ICLR).

[4] Baker, B., Gupta, O., Naik, N., and Raskar, R. (2017). Designing neural network architectures using reinforcement learning. In the International Conference on Learning Representations (ICLR).

[5] Barto, A. G. and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379.

[6] Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:835–846.

[7] Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., and Munos, R. (2017). The Cramer Distance as a Solution to Biased Wasserstein Gradients. ArXiv e-prints.

[8] Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., and Levine, S. (2017). Combining model-based and model-free updates for trajectory-centric reinforcement learning. In the International Conference on Machine Learning (ICML).

[9] Duan, Y., Andrychowicz, M., Stadie, B. C., Ho, J., Schneider, J., Sutskever, I., Abbeel, P., and Zaremba, W. (2017). One-Shot Imitation Learning. ArXiv e-prints.

[10] Finn, C., Christiano, P., Abbeel, P., and Levine, S. (2016a). A connection between GANs, inverse reinforcement learning, and energy-based models. In NIPS 2016 Workshop on Adversarial Training.

[11] Florensa, C., Duan, Y., and Abbeel, P. (2017). Stochastic neural networks for hierarchical reinforcement learning. In the International Conference on Learning Representations (ICLR).

[12] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In the Annual Conference on Neural Information Processing Systems (NIPS), pages 2672–2680.

[13] Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwinska, A., Colmenarejo, S. G., Grefenstette, E., Ramalho, T., Agapiou, J., Badia, A. P., Hermann, K. M., Zwols, Y., Ostrovski, G., Cain, A., King, H., Summerfield, C., Blunsom, P., Kavukcuoglu, K., and Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538:471–476.

[14] Gruslys, A., Gheshlaghi Azar, M., Bellemare, M. G., and Munos, R. (2017). The Reactor: A Sample-Efficient Actor-Critic Architecture. ArXiv e-prints.

[15] Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. (2017). Q-Prop: Sample-efficient policy gradient with an off-policy critic. In the International Conference on Learning Representations (ICLR).

[16] Gupta, A., Devin, C., Liu, Y., Abbeel, P., and Levine, S. (2017). Learning invariant feature spaces to transfer skills with reinforcement learning. In the International Conference on Learning Representations (ICLR).

[17] He, F. S., Liu, Y., Schwing, A. G., and Peng, J. (2017a). Learning to play in a day: Faster deep reinforcement learning by optimality tightening. In the International Conference on Learning Representations (ICLR).

[18] Heess, N., TB, D., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, A., Riedmiller, M., and Silver, D. (2017). Emergence of Locomotion Behaviours in Rich Environments. ArXiv e-prints.

[19] Hester, T. and Stone, P. (2017). Intrinsically motivated model learning for developing curious robots. Artificial Intelligence, 247:170–86.

[20] Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In the Annual Conference on Neural Information Processing Systems (NIPS).

[21] Houthooft, R., Chen, X., Duan, Y., Schulman, J., Turck, F. D., and Abbeel, P. (2016). Vime: Variational information maximizing exploration. In the Annual Conference on Neural Information Processing Systems (NIPS).

[22] Jaderberg, M., Mnih, V., Czarnecki, W., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2017). Reinforcement learning with unsupervised auxiliary tasks. In the International Conference on Learning Representations (ICLR).

[23] Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viegas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. (2016). Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. ArXiv e-prints.

[24] Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., and Uszkoreit, J. (2017a). One Model To Learn Them All. ArXiv e-prints.

[25] Kaiser, Ł., Nachum, O., Roy, A., and Bengio, S. (2017b). Learning to Remember Rare Events. In the International Conference on Learning Representations (ICLR).

[26] Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In the International Conference on Machine Learning (ICML).

[27] Kulkarni, T. D., Narasimhan, K. R., Saeedi, A., and Tenenbaum, J. B. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In the Annual Conference on Neural Information Processing Systems (NIPS).

[28] Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

[29] Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016a). End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17:1–40.

[30] Li, K. and Malik, J. (2017). Learning to optimize. In the International Conference on Learning Representations (ICLR).

[31] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., & Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning. Computer Science, 8(6), A187.

[32] Lin, L. J. (1993). Reinforcement learning for robots using neural networks.

[33] Mahmood, A. R., van Hasselt, H., and Sutton, R. S. (2014). Weighted importance sampling for off-policy learning with linear function approximation. In the Annual Conference on Neural Information Processing Systems (NIPS).

[34] Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., Kumaran, D., and Hadsell, R. (2017). Learning to navigate in complex environments. In the International Conference on Learning Representations (ICLR).

[35] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

[36] Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. (2014). Recurrent models of visual attention. In the Annual Conference on Neural Information Processing Systems (NIPS).

[37] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

[38] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In the International Conference on Machine Learning (ICML).

[39] Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. (2016). Safe and efficient off-policy reinforcement learning. In the Annual Conference on Neural Information Processing Systems (NIPS).

[40] Nachum, O., Norouzi, M., and Schuurmans, D. (2017). Improving policy gradient by exploring under-appreciated rewards. In the International Conference on Learning Representations (ICLR).

[41] Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the Gap Between Value and Policy Based Reinforcement Learning. ArXiv e-prints.

[42] Ng, A. and Russell, S. (2000). Algorithms for inverse reinforcement learning. In the International Conference on Machine Learning (ICML).

[43] O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2017). PGQ: Combining policy gradient and Q-learning. In the International Conference on Learning Representations (ICLR).

[44] Osband, I., Blundell, C., Pritzel, A., and Roy, B. V. (2016). Deep exploration via bootstrapped DQN. In the Annual Conference on Neural Information Processing Systems (NIPS).

[45] Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

[46] Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. In the International Conference on Learning Representations (ICLR).

[47] Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized experience replay. In the International Conference on Learning Representations (ICLR).

[48] Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015). Trust region policy optimization. In the International Conference on Machine Learning (ICML).

[49] Schulman, J., Abbeel, P., and Chen, X. (2017). Equivalence Between Policy Gradients and Soft Q-Learning. ArXiv e-prints.

[50] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In the International Conference on Machine Learning (ICML), pages 387–395.

[51] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016a). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.

[52] Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., and Degris, T. (2016b). The predictron: End-to-end learning and planning. In NIPS 2016 Deep Reinforcement Learning Workshop.

[53] Stadie, B. C., Abbeel, P., and Sutskever, I. (2017). Third person imitation learning. In the International Conference on Learning Representations (ICLR).

[54] Sutton, R. S. and Barto, A. G. (2017). Reinforcement Learning: An Introduction (2nd Edition, in preparation). MIT Press.

[55] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In the Annual Conference on Neural Information Processing Systems (NIPS).

[56] Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C., and Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In the International Conference on Machine Learning (ICML).

[57] Sutton, R. S., Szepesvari, C., and Maei, H. R. (2009b). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In the Annual Conference on Neural Information Processing Systems (NIPS).

[58] Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).

[59] Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17:1–29.

[60] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In the International Conference on Machine Learning (ICML).

[61] Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. (2016). Value iteration networks. In the Annual Conference on Neural Information Processing Systems (NIPS).

[62] Taylor, M. E. and Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633–1685.

[63] Tessler, C., Givony, S., Zahavy, T., Mankowitz, D. J., and Mannor, S. (2017). A deep hierarchical approach to lifelong learning in Minecraft. In the AAAI Conference on Artificial Intelligence (AAAI).

[64] van Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Processing Systems 23 (NIPS 2010).

[65] van Hasselt, H., Guez, A., and Silver, D. (2016a). Deep reinforcement learning with double Q-learning. In the AAAI Conference on Artificial Intelligence (AAAI).

[66] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. ArXiv e-prints.

[67] Vezhnevets, A. S., Mnih, V., Agapiou, J., Osindero, S., Graves, A., Vinyals, O., and Kavukcuoglu, K. (2016). Strategic attentive writer for learning macro-actions. In the Annual Conference on Neural Information Processing Systems (NIPS).

[68] Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching networks for one shot learning. In the Annual Conference on Neural Information Processing Systems (NIPS).

[69] Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. (2016a). Learning to reinforcement learn. arXiv:1611.05763v1.

[70] Wang, S. I., Liang, P., and Manning, C. D. (2016b). Learning language games through interaction. In the Association for Computational Linguistics annual meeting (ACL).

[71] Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. (2016c). Dueling network architectures for deep reinforcement learning. In the International Conference on Machine Learning (ICML).

[72] Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8:279–292.

[73] Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016). A survey of transfer learning. Journal of Big Data, 3(9).

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256.

[74] Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In the International Conference on Machine Learning (ICML).

[75] Zhu, X. and Goldberg, A. B. (2009). Introduction to semi-supervised learning. Morgan & Claypool.

Zoph, B. and Le, Q. V. (2017). Neural architecture search with reinforcement learning. In the International Conference on Learning Representations (ICLR).

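This article was selected and recommended by PaperWeekly, an AI academic community that currently covers natural language processing, computer vision, artificial intelligence, machine learning, data mining, and information retrieval.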

