Reinforcement Learning for Game Development

Published: 2025/3/15

Value-Based
- Q-Learning (off-policy)
  - Overview
  - Implementation
- Sarsa (on-policy)
  - Overview
  - Implementation
- SarsaLambda (on-policy)
  - Overview
  - Implementation

Value-Based

Q-Learning (Off-Policy)

See also: 莫烦Python (Morvan Python), how Q-Learning works

Overview

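In a given environment, the Player can only find out whether the Action it took in its current State was a good one from the feedback of the environment (Env). Every decision the Player makes receives feedback from Env, and that feedback is used to keep updating the weight of each Action in each State. This repeated updating is how the Player learns.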

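The "actual reward" (the target) is computed as q_target = r + self.gamma * self.q_table.loc[s_, :].max(): the immediate reward plus the discounted weight of the highest-weighted action in the next state.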
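To make the target concrete, here is a single update step worked through with made-up numbers. The Q-values, the reward, and the action names "left" and "right" are assumptions chosen only for illustration; the learning rate and discount factor match the defaults used in this article's code.

```python
# Hypothetical numbers, chosen only to illustrate one update step.
gamma = 0.9   # discount factor (reward_decay)
lr = 0.01     # learning rate

# Q-values of the next state s_ for two hypothetical actions.
q_next = {"left": 0.2, "right": 0.5}
q_predict = 0.1   # current estimate Q(s, a)
r = 1.0           # reward returned by the environment

# Target: immediate reward plus the discounted best value of the next state.
q_target = r + gamma * max(q_next.values())

# Move the estimate a small step toward the target.
q_updated = q_predict + lr * (q_target - q_predict)

print(q_target, q_updated)
```

With a learning rate this small, the estimate moves only 1% of the way toward the target per update, which is why many episodes are needed.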

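Looking at the update rule self.q_table.loc[s, a] += self.lr * (q_target - q_predict), the method has something in common with the backpropagation (BP) algorithm used in AI: both adjust a stored value in proportion to an error term. It is still different from what is usually meant by AI, though, because the agent only fills in a lookup table. The algorithm also has an obvious weakness: the Q-table can need an enormous amount of memory, since the number of States may be very large.

Because the number of States has to stay small, Q-Learning depends heavily on the environment it was trained in. Values are stored per State, so if the environment changes, the learned table no longer applies and the agent cannot adapt.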
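A rough way to see how quickly the table grows is to count its cells. The sketch below assumes a hypothetical grid game in which a State is the player's cell plus one collected/not-collected flag per coin; the function name and the game itself are illustrative and not part of the original code.

```python
def q_table_entries(width, height, n_coins, n_actions):
    """Count Q-table cells for a hypothetical grid game.

    A state is the player's cell plus a collected/not-collected
    flag for each coin, so every extra coin doubles the state count.
    """
    n_states = width * height * 2 ** n_coins
    return n_states * n_actions


for coins in (0, 10, 20):
    print(coins, q_table_entries(10, 10, coins, 4))
```

Even on a 10 x 10 grid with 4 actions, each extra coin doubles the table, so 20 coins already require hundreds of millions of cells.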

Implementation

    """
    Algorithm reference: https://mofanpy.com/tutorials/machine-learning/reinforcement-learning/tabular-q1/
    Idea: greedy
    1. The Q-table stores a weight for every action of every state; the action with the largest weight is chosen.
    2. The Q-table is updated from error = (actual reward - estimated reward).
    """
    import numpy as np
    import pandas as pd


    class QLearningTable:
        def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
            # Available actions
            self.actions = actions
            # Learning rate
            self.lr = learning_rate
            # Discount factor: weight given to the next state's value
            self.gamma = reward_decay
            # Probability of acting greedily
            self.epsilon = e_greedy
            # Decision table: rows are states, columns are actions
            self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)

        def check_state_exit(self, state):
            """Add the state to the Q-table if it is not there yet."""
            if state not in self.q_table.index:
                # DataFrame.append was removed in pandas 2.0, so the row is added via .loc
                self.q_table.loc[state] = [0.0] * len(self.actions)

        def choose_action(self, observation):
            """Choose an action for the current state of the environment."""
            self.check_state_exit(observation)
            # Greedy with probability epsilon, otherwise explore
            if np.random.uniform() < self.epsilon:
                # Greedy: take the action weights of this state
                state_action = self.q_table.loc[observation, :]
                # Pick the largest weight, breaking ties at random
                action = np.random.choice(state_action[state_action == np.max(state_action)].index)
            else:
                # Explore: pick any action at random
                action = np.random.choice(self.actions)
            return action

        def learn(self, s, a, r, s_):
            """
            s: current state
            a: action
            r: reward
            s_: next state
            """
            self.check_state_exit(s_)
            # Predicted value
            q_predict = self.q_table.loc[s, a]
            if s_ != 'terminal':
                # Not the end state: reward plus discounted best next value
                q_target = r + self.gamma * self.q_table.loc[s_, :].max()
            else:
                # End state: nothing follows, so the target is the reward
                q_target = r
            # Update the Q-table
            self.q_table.loc[s, a] += self.lr * (q_target - q_predict)


    # ---- Pseudocode below: Env stands for your own game environment and is not defined here ----
    def main():
        # Create the environment
        env = Env()
        # env.actions lists the actions available in the environment
        RL = QLearningTable(env.actions)
        # Repeat 100 times
        for i in range(100):
            # Reset the environment
            env.reset()
            while True:
                # State of the player
                s = env.player.state()
                # Action the player takes in this state
                a = RL.choose_action(s)
                # The player takes one step
                s_, r, done = env.step(a)
                # Learn (learn() treats s_ == 'terminal' as the end state)
                RL.learn(s, a, r, s_)
                if done:
                    break


    if __name__ == "__main__":
        main()
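Since the main loop above is pseudocode, the following is a minimal runnable sketch of the same training loop. Everything environment-related here is an assumption made for illustration: CorridorEnv is a hypothetical 1-D corridor, the Q-table is a plain dict instead of a DataFrame so the block has no dependencies, and the hyperparameters are picked so that it learns in a few hundred episodes.

```python
import random


class CorridorEnv:
    """Hypothetical 1-D corridor: start in cell 0, reward 1 on reaching the last cell."""
    actions = ["left", "right"]

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        move = 1 if action == "right" else -1
        self.pos = max(0, self.pos + move)   # the wall at cell 0 blocks "left"
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0
        next_state = "terminal" if done else self.pos
        return next_state, reward, done


def train(episodes=200, lr=0.1, gamma=0.9, epsilon=0.9, seed=0):
    rng = random.Random(seed)
    env = CorridorEnv()
    q = {}  # state -> {action: value}

    def row(state):
        # Create the row on first visit, like check_state_exit in the article.
        return q.setdefault(state, {a: 0.0 for a in env.actions})

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            values = row(s)
            if rng.random() < epsilon:
                # Greedy: best-valued action, ties broken at random.
                best = max(values.values())
                a = rng.choice([x for x in env.actions if values[x] == best])
            else:
                # Explore: any action.
                a = rng.choice(env.actions)
            s_, r, done = env.step(a)
            # Q-Learning target: reward plus discounted best next value.
            target = r if done else r + gamma * max(row(s_).values())
            values[a] += lr * (target - values[a])
            s = s_
    return q


q = train()
greedy = {s: max(v, key=v.get) for s, v in q.items()}
print(greedy)
```

The printed mapping shows the greedy action per cell. In the cell next to the goal, "right" ends up valued above "left", because only "right" is ever rewarded directly.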

Sarsa (On-Policy)

See also: 莫烦Python (Morvan Python), how Sarsa works

Overview

Sarsa is similar to Q-Learning. Sarsa takes an Action for the current State and has also already decided which Action it will take in the next State, and it keeps updating the table on that basis. Q-Learning differs in that it chooses an Action for the current State and then uses the largest Action value of the next State as the target. That is a speculative step: the maximum of the next State has not actually been obtained, so the update behaves as if the reward were already in hand.

Main differences:
A. Policy: Sarsa commits in advance to the Action it will take at every step.
B. Learning: because Sarsa really does take the chosen Action at each step, its target is q_target = r + self.gamma * self.q_table.loc[s_, a_].
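The difference between the two targets is easiest to see on the same numbers. In this sketch the Q-values, the reward, and the chosen next action are made up for illustration; the case shown is one where Sarsa's next action happens to be an exploratory, non-greedy one.

```python
gamma = 0.9
r = 0.0

# Hypothetical Q-values of the next state s_.
q_next = {"left": 0.2, "right": 0.5}

# Action Sarsa has already chosen for s_ (here a non-greedy, exploratory pick).
a_ = "left"

# Q-Learning assumes the best next action will be taken.
q_target_qlearning = r + gamma * max(q_next.values())

# Sarsa uses the action it will actually take.
q_target_sarsa = r + gamma * q_next[a_]

print(q_target_qlearning, q_target_sarsa)
```

When the chosen next action is the greedy one, the two targets coincide; they only differ on exploratory steps.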

Implementation

    import numpy as np
    import pandas as pd
    from QLearningTable import QLearningTable


    class SarsaTable(QLearningTable):
        def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
            super().__init__(actions, learning_rate, reward_decay, e_greedy)

        def learn(self, s, a, r, s_, a_, done):
            """
            s: current state
            a: action
            r: reward
            s_: next state
            a_: next action
            done: whether the end state was reached
            """
            self.check_state_exit(s_)
            # Predicted value
            q_predict = self.q_table.loc[s, a]
            if not done:
                # Not the end state: use the action that will actually be taken next
                q_target = r + self.gamma * self.q_table.loc[s_, a_]
            else:
                # End state
                q_target = r
            # Update the Q-table
            self.q_table.loc[s, a] += self.lr * (q_target - q_predict)


    # ---- Pseudocode below: Env stands for your own game environment and is not defined here ----
    def main():
        # Create the environment
        env = Env()
        # env.actions lists the actions available in the environment
        RL = SarsaTable(env.actions)
        # Repeat 100 times
        for i in range(100):
            # Reset the environment
            env.reset()
            # State of the player
            s = env.player.state()
            # Action the player takes in this state
            a = RL.choose_action(s)
            while True:
                # The player takes one step
                s_, r, done = env.step(a)
                # Action the player will take in the next state
                a_ = RL.choose_action(s_)
                # Learn
                RL.learn(s, a, r, s_, a_, done)
                # Carry the next state and action over to the next step
                s = s_
                a = a_
                if done:
                    break


    if __name__ == "__main__":
        main()

SarsaLambda (On-Policy)

See also: 莫烦Python (Morvan Python), how SarsaLambda works

Overview

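Unlike Sarsa, which updates only the most recent state-action pair, SarsaLambda uses the lambda value to update the weights of the whole path that led to the reward. This tends to make the Q-table converge faster.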
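The effect of the trace can be followed by hand on a short path. The sketch below is a simplified illustration under stated assumptions: a hypothetical three-step path with a single action per state, a reward only at the end, a plain dict in place of the DataFrame, and a learning rate of 0.1 chosen so the numbers are easy to read. The order of operations (mark the visited pair, update every pair, then decay) follows this article's learn() method.

```python
gamma, lambda_, lr = 0.9, 0.9, 0.1

# Hypothetical three-step path; only the last step is rewarded.
path = [("s0", "right"), ("s1", "right"), ("s2", "right")]
rewards = [0.0, 0.0, 1.0]

q = {sa: 0.0 for sa in path}
trace = {sa: 0.0 for sa in path}

for i, (sa, r) in enumerate(zip(path, rewards)):
    last = i == len(path) - 1
    # Sarsa target uses the value of the pair that is actually visited next.
    next_q = 0.0 if last else q[path[i + 1]]
    error = r + gamma * next_q - q[sa]

    # Mark the visited pair as fully responsible.
    trace[sa] = 1.0

    # Update every pair in proportion to its trace, then decay the traces.
    for key in q:
        q[key] += lr * trace[key] * error
        trace[key] *= gamma * lambda_

print(q)
```

After this single episode all three pairs have a non-zero value, with earlier steps credited less. Plain Sarsa would have updated only the last pair.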

Implementation

    import numpy as np
    import pandas as pd
    # Module name assumed to match the SarsaTable class defined in the previous section
    from SarsaTable import SarsaTable


    class SarsaLambdaTable(SarsaTable):
        def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9, trace_decay=0.9):
            super().__init__(actions, learning_rate, reward_decay, e_greedy)
            # Backward view: decay rate of the eligibility trace
            self.lambda_ = trace_decay
            # Empty eligibility trace table with the same shape as the Q-table
            self.eligibility_trace = self.q_table.copy()

        def check_state_exit(self, state):
            """Add the state to both tables if it is not there yet."""
            if state not in self.q_table.index:
                # DataFrame.append was removed in pandas 2.0, so the rows are added via .loc
                zeros = [0.0] * len(self.actions)
                self.q_table.loc[state] = zeros
                self.eligibility_trace.loc[state] = zeros

        def learn(self, s, a, r, s_, a_, done):
            """
            s: current state
            a: action
            r: reward
            s_: next state
            a_: next action
            done: whether the end state was reached
            """
            self.check_state_exit(s_)
            # Predicted value
            q_predict = self.q_table.loc[s, a]
            if not done:
                # Not the end state
                q_target = r + self.gamma * self.q_table.loc[s_, a_]
            else:
                # End state
                q_target = r
            # Error
            error = q_target - q_predict
            # Set the visited state-action pair to 1: it is an indispensable step on the way to the reward
            self.eligibility_trace.loc[s, :] *= 0
            self.eligibility_trace.loc[s, a] = 1
            # Update the Q-table; unlike before, every state-action pair is updated
            self.q_table += self.lr * self.eligibility_trace * error
            # Decay the trace over time: the further a step is from the reward, the less indispensable it is
            self.eligibility_trace *= self.gamma * self.lambda_


    # ---- Pseudocode below: Env stands for your own game environment and is not defined here ----
    def main():
        # Create the environment
        env = Env()
        # env.actions lists the actions available in the environment
        RL = SarsaLambdaTable(env.actions)
        # Repeat 100 times
        for i in range(100):
            # Reset the environment
            env.reset()
            # Clear the traces so they do not leak from one episode into the next
            RL.eligibility_trace *= 0
            # State of the player
            s = env.player.state()
            # Action the player takes in this state
            a = RL.choose_action(s)
            while True:
                # The player takes one step
                s_, r, done = env.step(a)
                # Action the player will take in the next state
                a_ = RL.choose_action(s_)
                # Learn
                RL.learn(s, a, r, s_, a_, done)
                # Carry the next state and action over to the next step
                s = s_
                a = a_
                if done:
                    break


    if __name__ == "__main__":
        main()
