
[Hands-on Reinforcement Learning] DDPG + HER

Published: 2024/3/13


The code is adapted from Hands-on Reinforcement Learning (Jupyter notebook version): https://github.com/boyu-ai/Hands-on-RL

If you prefer to open the project in PyCharm, see: https://github.com/zxs-000202/dsx-rl

Theory

Practice

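1. Define a simple environment on a two-dimensional plane. In this 2D grid world each coordinate lies in [0, 5]. At the start of every episode the agent is at (0, 0), and the environment samples a goal from the square 3.5 <= x, y <= 4.5. At each time step the agent chooses a displacement in [-1, 1] along each of the horizontal and vertical directions as its action. When the agent is close enough to the goal (within distance_threshold = 0.15 in the code below), it receives a reward of 0 and the task ends; otherwise the reward is -1. Each trajectory has at most 50 steps.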

    import random

    import numpy as np


    class WorldEnv:
        def __init__(self):
            self.distance_threshold = 0.15
            self.action_bound = 1

        def reset(self):  # reset the environment
            # Sample a goal; its coordinates lie in [3.5, 4.5] x [3.5, 4.5]
            self.goal = np.array([4 + random.uniform(-0.5, 0.5),
                                  4 + random.uniform(-0.5, 0.5)])
            self.state = np.array([0, 0])  # initial state
            self.count = 0
            return np.hstack((self.state, self.goal))  # concatenate position and goal

        def step(self, action):
            # Clip the action to [-1, 1] and keep the agent inside [0, 5] x [0, 5]
            action = np.clip(action, -self.action_bound, self.action_bound)
            x = max(0, min(5, self.state[0] + action[0]))
            y = max(0, min(5, self.state[1] + action[1]))
            self.state = np.array([x, y])
            self.count += 1

            # Sparse reward: 0 once the goal is reached, -1 otherwise
            dis = np.sqrt(np.sum(np.square(self.state - self.goal)))
            reward = -1.0 if dis > self.distance_threshold else 0
            if dis <= self.distance_threshold or self.count == 50:
                done = True
            else:
                done = False

            return np.hstack((self.state, self.goal)), reward, done
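The reward and termination rule inside step can be looked at in isolation. The following is only a minimal sketch and not part of the original code: the helper name sparse_reward_and_done and its max_steps parameter are introduced here for illustration, and the three test positions are made up.

```python
import numpy as np


def sparse_reward_and_done(position, goal, step_count,
                           distance_threshold=0.15, max_steps=50):
    """Hypothetical helper mirroring the reward rule in WorldEnv.step."""
    # Euclidean distance between the agent and the goal
    dis = np.sqrt(np.sum(np.square(np.asarray(position) - np.asarray(goal))))
    reached = dis <= distance_threshold
    reward = 0.0 if reached else -1.0
    # The episode ends on success or when the step budget is used up
    done = bool(reached or step_count == max_steps)
    return reward, done


if __name__ == "__main__":
    # Far from the goal, early in the episode
    print(sparse_reward_and_done([0.0, 0.0], [4.0, 4.0], step_count=1))
    # About 0.07 away from the goal, which is inside the threshold
    print(sparse_reward_and_done([3.95, 4.05], [4.0, 4.0], step_count=7))
    # Still far away, but the 50-step budget is exhausted
    print(sparse_reward_and_done([1.0, 1.0], [4.0, 4.0], step_count=50))
```

Because the reward stays at -1 until the agent lands within 0.15 of the goal, almost every transition collected early in training carries no learning signal, which is the problem HER is meant to address.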

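2. The code below first generates the trajectory of one episode and saves it in traj, an instance of the Trajectory class. It stores the trajectory's state sequence, action sequence, reward sequence, done-flag sequence, and length.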

    class Trajectory:
        ''' Records one complete trajectory '''
        def __init__(self, init_state):
            self.states = [init_state]
            self.actions = []
            self.rewards = []
            self.dones = []
            self.length = 0

        def store_step(self, action, state, reward, done):
            self.actions.append(action)
            self.states.append(state)
            self.rewards.append(reward)
            self.dones.append(done)
            self.length += 1


    # env (a WorldEnv instance) and agent (the DDPG agent) are assumed to exist already
    episode_return = 0
    state = env.reset()  # state is the start position concatenated with the goal position
    traj = Trajectory(state)
    done = False
    while not done:
        action = agent.take_action(state)
        state, reward, done = env.step(action)
        episode_return += reward
        traj.store_step(action, state, reward, done)

After one trajectory has been generated, traj holds the data shown in the figure below:
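The layout of the data in traj can also be checked with a toy example. This is a sketch under assumptions: the goal (4, 4), the three steps and their actions are made-up values, and the Trajectory class is repeated in compact form only so that the snippet runs on its own.

```python
import numpy as np


class Trajectory:
    """Compact copy of the Trajectory class so the snippet runs on its own."""

    def __init__(self, init_state):
        self.states = [init_state]
        self.actions, self.rewards, self.dones = [], [], []
        self.length = 0

    def store_step(self, action, state, reward, done):
        self.actions.append(action)
        self.states.append(state)
        self.rewards.append(reward)
        self.dones.append(done)
        self.length += 1


goal = np.array([4.0, 4.0])                        # made-up goal
traj = Trajectory(np.hstack(([0.0, 0.0], goal)))   # state = position followed by goal
for k in range(1, 4):                              # three made-up steps along the diagonal
    position = np.array([float(k), float(k)])
    traj.store_step(np.array([1.0, 1.0]), np.hstack((position, goal)), -1.0, False)

# states holds one more entry than actions, rewards and dones
print(traj.length, len(traj.states), len(traj.actions))
# every state is a 4-vector: (x, y, goal_x, goal_y)
print(traj.states[-1])
```

Note that states always contains length + 1 entries, because it includes the initial state returned by reset. The sampling code relies on this when it reads traj.states[step_state + 1].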

3. The generated trajectory is then stored in the experience replay buffer.

    replay_buffer.add_trajectory(traj)
    return_list.append(episode_return)

Unlike plain DDPG and DQN, which store individual transitions, the buffer here stores whole trajectories one by one. The sample method also differs considerably from theirs.

    import collections
    import random

    import numpy as np


    class ReplayBuffer_Trajectory:
        ''' Experience replay buffer that stores trajectories '''
        def __init__(self, capacity):
            self.buffer = collections.deque(maxlen=capacity)

        def add_trajectory(self, trajectory):
            self.buffer.append(trajectory)

        def size(self):
            return len(self.buffer)

        def sample(self, batch_size, use_her, dis_threshold=0.15, her_ratio=0.8):
            batch = dict(states=[],
                         actions=[],
                         next_states=[],
                         rewards=[],
                         dones=[])
            for _ in range(batch_size):
                traj = random.sample(self.buffer, 1)[0]
                step_state = np.random.randint(traj.length)
                state = traj.states[step_state]
                next_state = traj.states[step_state + 1]
                action = traj.actions[step_state]
                reward = traj.rewards[step_state]
                done = traj.dones[step_state]

                if use_her and np.random.uniform() <= her_ratio:
                    step_goal = np.random.randint(step_state + 1, traj.length + 1)
                    goal = traj.states[step_goal][:2]  # set the goal with HER's "future" strategy
                    dis = np.sqrt(np.sum(np.square(next_state[:2] - goal)))
                    reward = -1.0 if dis > dis_threshold else 0
                    done = False if dis > dis_threshold else True
                    state = np.hstack((state[:2], goal))
                    next_state = np.hstack((next_state[:2], goal))

                batch['states'].append(state)
                batch['next_states'].append(next_state)
                batch['actions'].append(action)
                batch['rewards'].append(reward)
                batch['dones'].append(done)

            batch['states'] = np.array(batch['states'])
            batch['next_states'] = np.array(batch['next_states'])
            batch['actions'] = np.array(batch['actions'])
            return batch

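4. Updates begin once the replay buffer holds at least 200 trajectories.
First, data is sampled from replay_buffer. Concretely, 256 entries are sampled into batch, and each entry is obtained by drawing one trajectory from replay_buffer and then drawing one transition from that trajectory. When HER is enabled, the transition is relabeled with a certain probability (her_ratio) before it is added to batch. The relabeling works as follows: randomly pick a later step of the same trajectory, which we call step_goal; take the first two numbers of the state at step_goal (the position the agent actually reached there) and use them to replace the last two numbers (the goal) of the original transition's state and next state. The stored transition therefore keeps its original position in the first two entries, while its goal becomes the position reached at the later step. The reward and done flag are then recomputed against this new goal.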

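For example, let red be the position of the original transition and yellow the final goal. We randomly sample a transition that comes after red in the same trajectory; suppose it lands on the green point. Starting from red, reaching yellow is clearly hard. So we pretend for now that green is the goal and rewrite the goal information in the transition's state accordingly. Trained this way, the agent learns to reach the goal position more easily.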
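The relabeling of a single transition can be sketched as a standalone function. This is an illustrative sketch, not the original implementation: the name her_relabel is hypothetical, step_goal is passed in explicitly instead of being drawn at random so that the result is reproducible, and the diagonal trajectory is made up.

```python
import numpy as np


def her_relabel(states, step_state, step_goal, dis_threshold=0.15):
    """Relabel one transition with the position reached at step_goal.

    states: list of 4-vectors (x, y, goal_x, goal_y); it has one more
    entry than the number of steps in the trajectory.
    """
    assert step_state < step_goal < len(states)
    new_goal = states[step_goal][:2]  # a position the agent actually reached later
    state = np.hstack((states[step_state][:2], new_goal))
    next_state = np.hstack((states[step_state + 1][:2], new_goal))
    # Recompute reward and done against the new goal
    dis = np.sqrt(np.sum(np.square(next_state[:2] - new_goal)))
    reward = 0.0 if dis <= dis_threshold else -1.0
    done = bool(dis <= dis_threshold)
    return state, next_state, reward, done


# Made-up trajectory: the agent walks along the diagonal, the real goal is (4, 4)
goal = np.array([4.0, 4.0])
states = [np.hstack((np.array([float(k), float(k)]), goal)) for k in range(4)]

# Goal taken from step 1: the transition 0 -> 1 now counts as a success
print(her_relabel(states, step_state=0, step_goal=1))
# Goal taken from step 3: the same transition is still a failure
print(her_relabel(states, step_state=0, step_goal=3))
```

With the original goal (4, 4) every transition in this toy trajectory has reward -1. After relabeling, some of them have reward 0, so the critic sees successful examples even though the real goal was never reached.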

    for _ in range(n_train):  # n_train = 20, batch_size = 256
        transition_dict = replay_buffer.sample(batch_size, True)
        agent.update(transition_dict)


    class ReplayBuffer_Trajectory:
        ''' Experience replay buffer that stores trajectories '''
        def __init__(self, capacity):
            self.buffer = collections.deque(maxlen=capacity)

        def add_trajectory(self, trajectory):
            self.buffer.append(trajectory)

        def size(self):
            return len(self.buffer)

        def sample(self, batch_size, use_her, dis_threshold=0.15, her_ratio=0.8):
            batch = dict(states=[],
                         actions=[],
                         next_states=[],
                         rewards=[],
                         dones=[])
            for _ in range(batch_size):  # batch_size = 256
                traj = random.sample(self.buffer, 1)[0]      # draw one trajectory from the buffer
                step_state = np.random.randint(traj.length)  # draw one transition from that trajectory
                state = traj.states[step_state]              # state of the sampled transition
                next_state = traj.states[step_state + 1]     # next_state of the sampled transition
                action = traj.actions[step_state]            # action of the sampled transition
                reward = traj.rewards[step_state]            # reward of the sampled transition
                done = traj.dones[step_state]                # done flag of the sampled transition

                if use_her and np.random.uniform() <= her_ratio:
                    # Pick a later step of the same trajectory to supply the new goal
                    step_goal = np.random.randint(step_state + 1, traj.length + 1)
                    # HER "future" strategy: the position reached at that step becomes the goal
                    goal = traj.states[step_goal][:2]
                    dis = np.sqrt(np.sum(np.square(next_state[:2] - goal)))
                    reward = -1.0 if dis > dis_threshold else 0
                    done = False if dis > dis_threshold else True
                    # Concatenate the original position with the chosen goal
                    state = np.hstack((state[:2], goal))
                    # Concatenate the original next position with the chosen goal
                    next_state = np.hstack((next_state[:2], goal))

                batch['states'].append(state)
                batch['next_states'].append(next_state)
                batch['actions'].append(action)
                batch['rewards'].append(reward)
                batch['dones'].append(done)

            batch['states'] = np.array(batch['states'])            # 256 x 4
            batch['next_states'] = np.array(batch['next_states'])  # 256 x 4
            batch['actions'] = np.array(batch['actions'])          # 256 x 2
            return batch
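The update schedule can be made concrete with a small counting sketch. It rests on assumptions that are not spelled out above: one trajectory is added per episode, the n_train updates run after every episode once the buffer holds at least 200 trajectories, and the buffer capacity is never reached. The names count_updates and minimal_episodes, and the episode counts used below, are chosen here purely for illustration.

```python
def count_updates(num_episodes, minimal_episodes=200, n_train=20):
    """Count agent.update calls under the assumed schedule described above."""
    updates = 0
    buffer_size = 0
    for _ in range(num_episodes):
        buffer_size += 1  # one trajectory is added per episode
        if buffer_size >= minimal_episodes:
            updates += n_train  # n_train batches are sampled and used for updates
    return updates


print(count_updates(199))   # buffer still too small, no updates yet
print(count_updates(200))   # first round of n_train updates
print(count_updates(2000))  # (2000 - 199) * 20 updates in total
```

Under these assumptions, each update uses a fresh batch of 256 samples, of which on average her_ratio = 0.8, about 205, are relabeled by HER.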

5. Experiment and results

