

Reinforcement Learning Part 3: Two-Armed Bandit

Published: 2025/5/22


This article is my personal Chinese translation of the reinforcement learning tutorial series that Arthur Juliani wrote and published on Medium. I translated it purely to share knowledge, and discussion is welcome!

Original article

Introduction
Two-Armed Bandit
Policy Gradient

Introduction

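Reinforcement learning gives us not only the ability to teach an intelligent agent how to act, but also lets the agent learn through its own interaction with an environment. By combining the complex representations that deep neural networks can learn with the learning of a goal-driven agent, computers have achieved some remarkable results, such as beating human players at a range of Atari games and defeating the world champion at Go.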

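Learning to build such powerful agents, however, requires a change in thinking for people used to supervised learning. We are no longer simply having an algorithm learn to pair certain stimuli with certain responses. Instead, a reinforcement learning algorithm must let the agent learn the right pairings itself, using observations, rewards, and actions. Because the agent is never told the single "correct" action to take in a given state, this can look a little tricky. In this blog, I will walk you through the entire process of creating and training reinforcement learning agents. The first agents and tasks will be simple so that the concepts stay clear, and we will then work up to more complex tasks and environments.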

Two-Armed Bandit

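The simplest reinforcement learning problem is the n-armed bandit. Essentially, an n-armed bandit consists of n slot machines, each with a different fixed payout probability. The goal is to discover the machine with the best payout and to maximize the total reward by always choosing it. We will simplify the problem further by having only two slot machines to choose from. In fact, the problem is so simple that it is more of a precursor to reinforcement learning than a reinforcement learning problem in its own right, because a typical reinforcement learning task involves the following aspects: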

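Different actions yield different rewards. For example, when looking for treasure in a maze, going left may lead to the treasure, whereas going right may lead to a pit of snakes.
Rewards are delayed over time. This means that even if going left in the maze example above is the right choice, we will not know it until we have made the choice and reached a new state.
The reward for an action depends on the state of the environment. Continuing the maze example, going left may be ideal at one fork in the path but not at others.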

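The n-armed bandit is a good starting point because we do not have to worry about aspects two and three above. We only need to focus on learning which reward each possible action yields and on making sure we always choose the best one. In reinforcement learning terms, this is called learning a policy. We will use a method called policy gradient, in which a simple neural network learns how to choose actions by adjusting its parameters through gradient descent, based on feedback from the environment. There is another family of approaches to reinforcement learning in which the agent learns a value function. Rather than learning the best action in a given state, the agent learns to predict how good it is to be in a given state or to take a given action. Both approaches can produce agents that perform well, but policy gradient is somewhat more direct.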
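For contrast with the policy-gradient approach used in the rest of this post, the value-function idea can be sketched for a bandit as a running average of the rewards seen on each arm. This is only an illustration and is not part of the original tutorial; the function name `update_estimate` and the lists `q` and `n` are hypothetical.

```python
def update_estimate(q, n, action, reward):
    """Incrementally update the running mean reward for one arm."""
    n[action] += 1
    # New mean = old mean + (new sample - old mean) / number of samples.
    q[action] += (reward - q[action]) / n[action]


q = [0.0, 0.0]  # estimated value of each arm (hypothetical two-armed case)
n = [0, 0]      # number of pulls of each arm

# Arm 0 pays +1 then -1, arm 1 pays +1 once.
for action, reward in [(0, 1), (0, -1), (1, 1)]:
    update_estimate(q, n, action, reward)

print(q)  # [0.0, 1.0]
```

An agent built this way would pick the arm with the highest estimate, whereas the policy-gradient agent below adjusts its action weights directly.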

Policy Gradient

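The simplest way to think of a policy gradient network is as a neural network that produces explicit outputs. In the bandit case, we do not need to condition these outputs on any state. Our network therefore consists of just a set of weights, one for each arm that can be pulled, and each weight represents how good our agent thinks it is to pull that arm. If we initialize these weights to 1, our agent will be optimistic about the potential reward of every arm.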

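To update our network, we will simply try the arms using an e-greedy policy (see Part 7 for more on action-selection strategies). This means that most of the time our agent will choose the action with the largest expected value, but occasionally, with probability e, it will act randomly. In this way the agent can try each of the different arms and keep learning about them. Once our agent has taken an action, it receives a reward of either 1 or -1. With this reward, we can update our network using the policy loss equation: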
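The e-greedy selection described above can be sketched in a few lines of plain Python. This is a minimal illustration using only the standard library, not the tutorial's TensorFlow code; the function name `epsilon_greedy` and its parameters are hypothetical.

```python
import random


def epsilon_greedy(weights, epsilon, rng):
    """Pick a random arm with probability epsilon, else the highest-weight arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(weights))
    # max() returns the first maximal index, so ties go to the lowest index.
    return max(range(len(weights)), key=lambda i: weights[i])


rng = random.Random(0)
picks = [epsilon_greedy([1.0, 1.2, 0.9, 1.1], 0.1, rng) for _ in range(1000)]
print("share of picks on the highest-weight arm:", picks.count(1) / len(picks))
```

With e = 0.1, about nine out of ten picks are greedy, and the rest are spread uniformly over all arms, including the greedy one.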

$$\mathrm{Loss} = -\log(\pi) \cdot A$$

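A is the advantage, an essential part of all reinforcement learning algorithms. Intuitively, it describes how much better an action is than some baseline. In future algorithms we will meet more complex baselines to compare rewards against, but for now we assume the baseline is 0, so the advantage can be thought of simply as the reward we receive for each action.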

π is the policy. In this case, it corresponds to the weight of the chosen action.
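To make the update concrete, here is a short worked derivation. It is a sketch that assumes π is taken to be the chosen arm's weight itself, written $w_a$ here, and that the update is plain gradient descent with step size $\alpha$; both symbols are notation introduced only for this illustration.

```latex
\begin{aligned}
\mathrm{Loss} &= -\log(w_a)\,A \\
\frac{\partial\,\mathrm{Loss}}{\partial w_a} &= -\frac{A}{w_a} \\
w_a &\leftarrow w_a - \alpha\,\frac{\partial\,\mathrm{Loss}}{\partial w_a}
      = w_a + \alpha\,\frac{A}{w_a}
\end{aligned}
```

With $w_a = 1$ and $\alpha = 0.001$, a reward of $A = 1$ moves the weight to 1.001, while a reward of $A = -1$ moves it to 0.999.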

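Intuitively, this loss function lets us increase the weight of actions that yield a positive reward and decrease the weight of those that yield a negative reward. In this way the agent becomes more or less likely to pick a given action in the future. By looping through taking actions, receiving rewards, and updating the network, we quickly converge to an agent that can solve the bandit problem. Don't take my word for it, though. Try it out yourself.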

# Simple Reinforcement Learning in Tensorflow Part 1:
# The Multi-armed bandit
# This tutorial contains a simple example of how to build a policy-gradient based
# agent that can solve the multi-armed bandit problem. For more information, see
# this Medium post.
# For more Reinforcement Learning algorithms, including DQN and Model-based learning
# in Tensorflow, see my Github repo, DeepRL-Agents.
# Note: this code uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session).

import tensorflow as tf
import numpy as np

# The Bandits
# Here we define our bandits. For this example we are using a four-armed bandit.
# The pullBandit function generates a random number from a normal distribution with
# a mean of 0. The lower the bandit number, the more likely a positive reward will
# be returned. We want our agent to learn to always choose the bandit that will
# give that positive reward.

# List out our bandits. Currently bandit 4 (index#3) is set to most often provide
# a positive reward.
bandits = [0.2, 0, -0.2, -5]
num_bandits = len(bandits)

def pullBandit(bandit):
    # Get a random number.
    result = np.random.randn(1)
    if result > bandit:
        # Return a positive reward.
        return 1
    else:
        # Return a negative reward.
        return -1

# The Agent
# The code below establishes our simple neural agent. It consists of a set of
# values, one for each of the bandits. Each value is an estimate of the return from
# choosing that bandit. We use a policy gradient method to update the agent by
# moving the value for the selected action toward the received reward.
tf.reset_default_graph()

# These two lines establish the feed-forward part of the network.
# This does the actual choosing.
weights = tf.Variable(tf.ones([num_bandits]))
chosen_action = tf.argmax(weights, 0)

# The next six lines establish the training procedure. We feed the reward and
# chosen action into the network to compute the loss, and use it to update the
# network.
reward_holder = tf.placeholder(shape=[1], dtype=tf.float32)
action_holder = tf.placeholder(shape=[1], dtype=tf.int32)
responsible_weight = tf.slice(weights, action_holder, [1])
loss = -(tf.log(responsible_weight) * reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)

# Training the Agent
# We will train our agent by taking actions in our environment and receiving
# rewards. Using the rewards and actions, we can know how to properly update our
# network in order to more often choose actions that will yield the highest
# rewards over time.
total_episodes = 1000  # Set total number of episodes to train agent on.
total_reward = np.zeros(num_bandits)  # Set scoreboard for bandits to 0.
e = 0.1  # Set the chance of taking a random action.

# tf.global_variables_initializer replaces the deprecated tf.initialize_all_variables.
init = tf.global_variables_initializer()

# Launch the tensorflow graph.
with tf.Session() as sess:
    sess.run(init)
    i = 0
    while i < total_episodes:
        # Choose either a random action or one from our network.
        if np.random.rand(1) < e:
            action = np.random.randint(num_bandits)
        else:
            action = sess.run(chosen_action)

        # Get our reward from picking one of the bandits.
        reward = pullBandit(bandits[action])

        # Update the network.
        _, resp, ww = sess.run([update, responsible_weight, weights],
                               feed_dict={reward_holder: [reward], action_holder: [action]})

        # Update our running tally of scores.
        total_reward[action] += reward
        if i % 50 == 0:
            print("Running reward for the " + str(num_bandits) + " bandits: " + str(total_reward))
        i += 1

print("The agent thinks bandit " + str(np.argmax(ww) + 1) + " is the most promising....")
if np.argmax(ww) == np.argmax(-np.array(bandits)):
    print("...and it was right!")
else:
    print("...and it was wrong!")

Output:

Running reward for the 4 bandits: [ 1. 0. 0. 0.]
Running reward for the 4 bandits: [ 0. -2. -1. 38.]
Running reward for the 4 bandits: [ 0. -4. -2. 83.]
Running reward for the 4 bandits: [ 0. -6. -1. 128.]
Running reward for the 4 bandits: [ 0. -8. 1. 172.]
Running reward for the 4 bandits: [ -1. -9. 2. 219.]
Running reward for the 4 bandits: [ -1. -10. 4. 264.]
Running reward for the 4 bandits: [ 0. -11. 4. 312.]
Running reward for the 4 bandits: [ 2. -10. 4. 357.]
Running reward for the 4 bandits: [ 2. -9. 4. 406.]
Running reward for the 4 bandits: [ 0. -11. 4. 448.]
Running reward for the 4 bandits: [ -1. -10. 3. 495.]
Running reward for the 4 bandits: [ -3. -10. 2. 540.]
Running reward for the 4 bandits: [ -3. -10. 3. 585.]
Running reward for the 4 bandits: [ -3. -8. 3. 629.]
Running reward for the 4 bandits: [ -2. -7. 1. 673.]
Running reward for the 4 bandits: [ -4. -7. 2. 720.]
Running reward for the 4 bandits: [ -4. -7. 3. 769.]
Running reward for the 4 bandits: [ -6. -8. 3. 814.]
Running reward for the 4 bandits: [ -7. -7. 3. 858.]
The agent thinks bandit 4 is the most promising....
...and it was right!
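For readers without a TensorFlow 1.x installation, the same training loop can be approximated with the standard library alone. The sketch below is not the author's code: it replaces the TensorFlow optimizer with a hand-written gradient step for the loss -log(w) * reward, and the names `pull_bandit` and `train`, along with the `seed` parameter, are hypothetical.

```python
import random


def pull_bandit(threshold, rng):
    """Return +1 if a standard-normal draw exceeds the arm's threshold, else -1."""
    return 1 if rng.gauss(0.0, 1.0) > threshold else -1


def train(bandits, total_episodes=1000, epsilon=0.1, learning_rate=0.001, seed=0):
    """Train one weight per arm by gradient descent on -log(w[a]) * reward.

    Returns (weights, total_reward). The gradient is derived by hand:
    d(loss)/d(w[a]) = -reward / w[a].
    """
    rng = random.Random(seed)
    weights = [1.0] * len(bandits)     # optimistic start, as in the post
    total_reward = [0] * len(bandits)  # running scoreboard per arm
    for _ in range(total_episodes):
        # e-greedy choice: explore with probability epsilon, else exploit.
        if rng.random() < epsilon:
            action = rng.randrange(len(bandits))
        else:
            action = max(range(len(weights)), key=lambda i: weights[i])
        reward = pull_bandit(bandits[action], rng)
        # Gradient descent step: w <- w - lr * (-reward / w).
        weights[action] += learning_rate * reward / weights[action]
        total_reward[action] += reward
    return weights, total_reward


weights, total_reward = train([0.2, 0.0, -0.2, -5.0])
best = max(range(len(weights)), key=lambda i: weights[i])
print("Running reward per arm:", total_reward)
print("The agent thinks bandit", best + 1, "is the most promising")
```

Because exploration and rewards are random, the exact tallies depend on the seed, and they will not match the TensorFlow output above number for number.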

Full code on GitHub

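(Update 09/10/2016): I rewrote the iPython notebook for this tutorial. The previous loss function was not very intuitive, so I replaced it with a more standard and interpretable version, which should also be a more useful reference for those interested in applying policy gradient methods to more complex problems.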

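If this post has been valuable to you, please consider donating to help support future tutorials, articles, and implementations. Any contribution is greatly appreciated!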

If you'd like to follow my work on deep learning, AI, and cognitive science, follow me on Medium @Arthur Juliani, or on Twitter @awjliani.

The series of tutorials on simple reinforcement learning with Tensorflow:

Part 0 - Q-Learning Agents

Part 1 - Two-Armed Bandit

Part 1.5 - Contextual Bandits

Part 2 - Policy-Based Agents

Part 3 - Model-Based RL

Part 4 - Deep Q-Networks and Beyond

Part 5 - Visualizing an Agent’s Thoughts and Actions

Part 6 - Partial Observability and Deep Recurrent Q-Networks

Part 7 - Action-Selection Strategies for Exploration

Part 8 - Asynchronous Actor-Critic Agents (A3C)

Reposted from: https://www.cnblogs.com/bluemapleman/p/9276663.html
