

Playing Super Mario with DQN

Published: 2023/12/29

The algorithm

Here is a diagram I drew earlier of the 2015 DQN algorithm:

The point to note when reading the diagram is that the whole algorithm can be seen as two processes running independently of each other:

Playing the game with the value network (play)

Updating the value network (update)
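As a minimal toy sketch of these two processes (my own illustration, not the article's code: `play_one_step`, `update_one_step`, and all numbers are made up), the play side picks actions from the current Q estimates, while the update side moves Q(s,a) toward a bootstrapped target:

```python
import random

def play_one_step(q_values, epsilon):
    # Process 1: epsilon-greedy action choice from the value network's output.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def update_one_step(reward, next_q_values, gamma, done):
    # Process 2: the bootstrapped target that Q(s,a) is regressed toward.
    return reward if done else reward + gamma * max(next_q_values)

action = play_one_step([0.1, 0.5, 0.2], epsilon=0.0)  # greedy: picks index 1
target = update_one_step(reward=1.0, next_q_values=[0.3, 0.9], gamma=0.9, done=False)
print(action, target)  # action == 1, target ≈ 1.81
```

The two functions share nothing but the Q estimates themselves, which is why the diagram can draw them as separate loops.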

Getting started

Tools needed:

As a newcomer to PyTorch, the biggest pitfall I hit this time was the following: if you convert back and forth between ndarray and torch.Tensor in a tangled way, putting tensors on the GPU can become extremely slow, slower than you might imagine possible. This is probably caused by multi-dimensional data being shuttled between the GPU and the CPU. Multi-dimensional data is best created as a Tensor from the very start.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

# if gpu is to be used
use_cuda = torch.cuda.is_available()
FloatTensor = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor
LongTensor = torch.cuda.LongTensor if use_cuda else torch.LongTensor
ByteTensor = torch.cuda.ByteTensor if use_cuda else torch.ByteTensor
Tensor = FloatTensor

In all the code that follows, variables are defined with this Tensor type.
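For instance (a small sketch of the practice just described; the shapes mirror the article's 84×84 frames), the conversion from NumPy happens once at the boundary, and everything downstream stays a tensor:

```python
import numpy as np
import torch

# Convert the ndarray to a tensor once, at the boundary...
frame = np.ones((84, 84), dtype=np.float64)
t = torch.from_numpy(frame).float()
# ...then keep all later operations (stacking, batching) in tensor form,
# instead of bouncing back and forth through NumPy.
stacked = torch.stack([t, t, t, t], dim=0)
print(stacked.shape)  # torch.Size([4, 84, 84])
```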

The value network

In the value-network class, besides defining the network structure I also added some extra functionality:

Network structure and forward pass

Action selection

Network parameter update

Network structure

class dqn_net(nn.Module):
    def __init__(self, ACTION_NUM):
        super(dqn_net, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=4, out_channels=16, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=4, stride=2)
        self.fc1 = nn.Linear(in_features=9*9*32, out_features=256)
        self.fc2 = nn.Linear(in_features=256, out_features=ACTION_NUM)
        self.action_num = ACTION_NUM

    def forward(self, input):
        output = F.relu(self.conv1(input))
        output = F.relu(self.conv2(output))
        output = output.view(-1, 9*9*32)
        output = F.relu(self.fc1(output))
        output = self.fc2(output)
        return output

Action selection

def select_action(self, input):
    '''
    Parameters
    ----------
    input : {Tensor} of shape torch.Size([4,84,84])

    Return
    ------
    action_button , action_onehot : {int} , {Tensor}
    '''
    input = Variable(input.unsqueeze(0))
    output = self.forward(input)
    action_index = output.data.max(1)[1][0]
    # action_button , action_onehot
    if action_index == 0: return 0, Tensor([1,0,0,0,0,0])    # stand still
    elif action_index == 1: return 3, Tensor([0,1,0,0,0,0])  # walk left
    elif action_index == 2: return 7, Tensor([0,0,1,0,0,0])  # walk right
    elif action_index == 3: return 11, Tensor([0,0,0,1,0,0]) # jump in place
    elif action_index == 4: return 4, Tensor([0,0,0,0,1,0])  # jump left
    elif action_index == 5: return 8, Tensor([0,0,0,0,0,1])  # jump right

The action command returned here has two corresponding forms:

1. The action number fed into the game environment, which corresponds to a button combination. The game actually has 6 buttons in total; the original environment defines 14 button combinations, of which I kept only 6, corresponding to 6 actions.

mapping = {
    0: [0, 0, 0, 0, 0, 0],   # NO
    1: [1, 0, 0, 0, 0, 0],   # Up
    2: [0, 1, 0, 0, 0, 0],   # Down
    3: [0, 0, 1, 0, 0, 0],   # Left
    4: [0, 0, 1, 0, 1, 0],   # Left + A
    5: [0, 0, 1, 0, 0, 1],   # Left + B
    6: [0, 0, 1, 0, 1, 1],   # Left + A + B
    7: [0, 0, 0, 1, 0, 0],   # Right
    8: [0, 0, 0, 1, 1, 0],   # Right + A
    9: [0, 0, 0, 1, 0, 1],   # Right + B
    10: [0, 0, 0, 1, 1, 1],  # Right + A + B
    11: [0, 0, 0, 0, 1, 0],  # A
    12: [0, 0, 0, 0, 0, 1],  # B
    13: [0, 0, 0, 0, 1, 1],  # A + B
}

2. A one-hot encoded form matched to the network output: each position stands for one action, and a 1 means that action is executed.
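To see how this one-hot form pairs with the network output (the same trick the `update` method uses later via `output_all = self.forward(obs4_batch) * action_batch`), here is a toy example with made-up Q values:

```python
import torch

# Multiplying the network output by the one-hot action and summing over the
# action dimension picks out Q(s, a) for the chosen action of each sample.
q_out = torch.tensor([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                      [0.5, 0.1, 0.9, 0.2, 0.3, 0.4]])
act_onehot = torch.tensor([[0., 0., 1., 0., 0., 0.],   # action 2 chosen
                           [1., 0., 0., 0., 0., 0.]])  # action 0 chosen
q_sa = (q_out * act_onehot).sum(dim=1)
print(q_sa)  # tensor([3.0000, 0.5000])
```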

Network update

def update(self, samples, loss_func, optim_func, learn_rate, target_net, BATCH_SIZE, GAMMA):
    '''update the value network one step

    Parameters
    ----------
    samples: {namedtuple}
        Transition(obs4=(o1,o2,...),act=(a1,a2,...),
        next_obs4=(no1,no2,...),reward=(r1,r2,...),done=(d1,d2,...))
    loss_func: class
        the loss function of the network, e.g. nn.MSELoss
    optim_func: class
        the optimization function of the network, e.g. optim.SGD
    learn_rate: float
        the learning rate of the optimizer

    Function
    --------
    update the network one step
    '''
    obs4_batch = Variable(torch.cat(samples.obs4))            # ([BATCH,4,84,84])
    next_obs4_batch = Variable(torch.cat(samples.next_obs4))  # ([BATCH,4,84,84])
    action_batch = Variable(torch.cat(samples.act))           # ([BATCH,6])
    done_batch = samples.done                                 # {tuple} of bool, len=BATCH
    reward_batch = torch.cat(samples.reward)                  # ([BATCH,1])

    ### compute the target Q(s,a) value ###
    value_batch = target_net(next_obs4_batch)
    target = Variable(torch.zeros(BATCH_SIZE).type(Tensor))
    for i in range(BATCH_SIZE):
        if done_batch[i] == False:
            target[i] = reward_batch[i] + GAMMA * Tensor.max(value_batch.data[i])
        else:
            target[i] = reward_batch[i]

    ### compute the current net output value ###
    output_all = self.forward(obs4_batch) * action_batch
    output = output_all.sum(dim=1)  # {Variable} containing a FloatTensor

    criterion = loss_func()
    optimizer = optim_func(self.parameters(), lr=learn_rate)
    loss = criterion(output, target)
    optimizer.zero_grad()  # clear old gradients before the backward pass
    loss.backward()
    optimizer.step()

The training samples come in as a namedtuple:

{namedtuple}:

Transition(obs4=(o1,o2,...),act=(a1,a2,...),next_obs4=(no1,no2,...),reward=(r1,r2,...),done=(d1,d2,...))

The training procedure is:

1. Feed the observations obs4 into the network to get the network output.

2. Pick out the output values corresponding to the sampled actions act.

3. Compute the target value from the sampled r and s':

Computing the target requires checking whether the next state s' is terminal.

If it is not terminal, the target is:

target = r + GAMMA * max_a' Q_target(s', a')

If it is terminal, the target is:

target = r

4. The Q values from step 2 and the target values from step 3 are the two inputs to the loss function.
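Here is a small worked example of that target rule with made-up numbers (vectorized with a done-mask instead of the per-sample loop in `update`; the result is the same, this is just illustrative):

```python
import torch

GAMMA = 0.99  # discount factor, as in the article's GAMMA hyperparameter

reward = torch.tensor([1.0, -1.0, 0.5])
next_q = torch.tensor([[0.2, 0.8],   # target-net output for each next state
                       [0.4, 0.1],
                       [0.0, 0.6]])
done = torch.tensor([False, True, False])

# Non-terminal rows get r + GAMMA * max_a' Q_target(s', a');
# terminal rows are masked down to just r.
target = reward + GAMMA * next_q.max(dim=1).values * (~done)
print(target)  # tensor([ 1.7920, -1.0000,  1.0940])
```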

The replay memory

from collections import namedtuple
import random
import numpy as np

class replay_memory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0
        self.Transition = namedtuple('Transition',
                                     ['obs4','act','next_obs4','reward','done'])

    def __len__(self):
        return len(self.memory)

    def add(self, *args):
        '''Add a transition to replay memory

        Parameters
        ----------
        e.g. replay_memory.add(obs4,action,next_obs4,reward,done)
        obs4: {Tensor} of shape torch.Size([4,84,84])
        act: {Tensor} of shape torch.Size([6])
        next_obs4: {Tensor} of shape torch.Size([4,84,84])
        reward: {int}
        done: {bool} whether the next state is the terminal state

        Function
        --------
        the replay_memory keeps the latest samples
        '''
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = self.Transition(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        '''Sample a batch from replay memory

        Parameters
        ----------
        batch_size: int
            How many transitions you want

        Returns
        -------
        batch_zip: {Transition} whose fields are tuples of length batch_size:
            obs4: batch of observations
            act: batch of actions executed w.r.t the observations in obs4
            next_obs4: batch of next observations w.r.t obs4 and act
            reward: batch of rewards received w.r.t obs4 and act
            done: batch of terminal flags
        '''
        batch = random.sample(self.memory, batch_size)
        batch_zip = self.Transition(*zip(*batch))
        return batch_zip

The sampled batch comes back as a namedtuple whose fields are tuples across the batch:

{Transition}
obs4: {tuple} of {Tensor} of shape torch.Size([4,84,84])
act: {tuple} of {Tensor} of shape torch.Size([6])
next_obs4: {tuple} of {Tensor} of shape torch.Size([4,84,84])
reward: {tuple} of {int}
done: {tuple} of {bool}
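The `Transition(*zip(*batch))` line in `sample` is what produces this field-wise layout: it transposes a list of per-step transitions into one namedtuple whose fields are tuples across the batch. A toy illustration with placeholder values:

```python
from collections import namedtuple

Transition = namedtuple('Transition', ['obs4', 'act', 'next_obs4', 'reward', 'done'])

# Two sampled transitions (strings stand in for the real tensors).
batch = [Transition('o1', 'a1', 'n1', 1, False),
         Transition('o2', 'a2', 'n2', 0, True)]

# zip(*batch) transposes rows into columns; Transition(*...) names them again.
batch_zip = Transition(*zip(*batch))
print(batch_zip.reward, batch_zip.done)  # (1, 0) (False, True)
```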

Image preprocessing

import cv2

def ob_process(frame):
    '''
    Parameters
    ----------
    frame: {ndarray} of shape (90,90)

    Returns
    -------
    frame: {Tensor} of shape torch.Size([1,84,84])
    '''
    frame = cv2.resize(frame, (84, 84), interpolation=cv2.INTER_AREA)
    frame = frame.astype('float64')
    frame = torch.from_numpy(frame)
    frame = frame.unsqueeze(0).type(Tensor)
    return frame

The learning process

First, the various initializations:

### initialization ###
action_space = [(0, Tensor([1,0,0,0,0,0])),
                (3, Tensor([0,1,0,0,0,0])),
                (7, Tensor([0,0,1,0,0,0])),
                (11, Tensor([0,0,0,1,0,0])),
                (4, Tensor([0,0,0,0,1,0])),
                (8, Tensor([0,0,0,0,0,1]))]
# (action_button , action_onehot)
# the actions above are: stand still, walk left, walk right, jump, jump left, jump right
value_net = dqn_net(ACTION_NUM)
target_net = dqn_net(ACTION_NUM)
if torch.cuda.is_available():
    value_net.cuda()
    target_net.cuda()
if os.listdir(PATH):
    value_net.load_state_dict(torch.load(PATH))
buffer = replay_memory(REPLAY_MEMORY_CAPACITY)
env.reset()
obs, _, _, _, _, _, _ = env.step(0)
obs = ob_process(obs)
obs4 = torch.cat(([obs, obs, obs, obs]), dim=0)  # {Tensor} of shape torch.Size([4,84,84])
judge_distance = 0
episode_total_reward = 0
epi_total_reward_list = []
# counters #
time_step = 0
update_times = 0
episode_num = 0
history_distance = 200

After that, we enter the following loop and start playing the game:

while episode_num <= MAX_EPISODE:

Choosing an action

### choose an action with epsilon-greedy ###
prob = random.random()
threshold = EPS_END + (EPS_START - EPS_END) * math.exp(-1 * episode_num / EPS_DECAY)
if prob <= threshold:
    action_index = np.random.randint(6)
    action_button = action_space[action_index][0]  # {int}
    action_onehot = action_space[action_index][1]  # {Tensor}
else:
    action_button, action_onehot = value_net.select_action(obs4)
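To get a feel for this decay schedule, here are threshold values at a few episode counts (the EPS_* numbers below are my own example choices, not the article's actual hyperparameters):

```python
import math

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.05, 200  # example values only

def threshold(episode_num):
    # Exploration probability: starts at EPS_START, decays toward EPS_END
    # exponentially in the episode count.
    return EPS_END + (EPS_START - EPS_END) * math.exp(-1 * episode_num / EPS_DECAY)

for ep in (0, 200, 1000):
    print(ep, round(threshold(ep), 3))
# 0 1.0
# 200 0.399
# 1000 0.056
```

So early on almost every action is random, and after a few hundred episodes the agent mostly follows the value network.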

Executing the action

### do one step ###
obs_next, reward, done, _, max_distance, _, now_distance = env.step(action_button)
obs_next = ob_process(obs_next)
obs4_next = torch.cat(([obs4[1:, :, :], obs_next]), dim=0)
buffer.add(obs4.unsqueeze(0), action_onehot.unsqueeze(0), obs4_next.unsqueeze(0), Tensor([reward]).unsqueeze(0), done)
episode_total_reward += reward
if now_distance <= history_distance:
    judge_distance += 1
else:
    judge_distance = 0
    history_distance = max_distance

There is a step here that tracks how far Mario has travelled: if Mario loiters in place for a certain amount of time, that also counts as reaching a terminal state, and a new episode begins.

Moving to the next state

### go to the next state ###
# the terminal check must come first, otherwise the loitering condition
# (judge_distance > 50) could never trigger while done is False
if done == True or judge_distance > 50:
    env.reset()
    obs, _, _, _, _, _, _ = env.step(0)
    obs = ob_process(obs)
    obs4 = torch.cat(([obs, obs, obs, obs]), dim=0)
    episode_num += 1
    history_distance = 200
    epi_total_reward_list.append(episode_total_reward)
    print('episode %d total reward=%.2f' % (episode_num, episode_total_reward))
    episode_total_reward = 0
else:
    obs4 = obs4_next
    time_step += 1

Here we check whether a terminal state has been reached; if it has, we re-initialize everything needed to start the next episode.

Updating the network

### do one step update ###
if len(buffer) == buffer.capacity and time_step % 4 == 0:
    batch_transition = buffer.sample(BATCH_SIZE)
    value_net.update(samples=batch_transition, loss_func=LOSS_FUNCTION,
                     optim_func=OPTIM_METHOD, learn_rate=LEARNING_RATE,
                     target_net=target_net, BATCH_SIZE=BATCH_SIZE,
                     GAMMA=GAMMA)
    update_times += 1
    ### copy value net parameters to target net ###
    if update_times % NET_COPY_STEP == 0:
        target_net.load_state_dict(value_net.state_dict())

That is the whole process. After some training you can see that the agent does improve, but it is still far from "intelligent". This implementation only reproduces the method from the paper; to play Super Mario well, some game-specific analysis would be needed, and other methods could be layered on top.
