by Sterling Osborne, PhD Researcher

如何將強(qiáng)化學(xué)習(xí)應(yīng)用于現(xiàn)實(shí)生活中的計(jì)劃問(wèn)題 (How to apply Reinforcement Learning to real life planning problems)

Recently, I have published some examples where I have created Reinforcement Learning models for some real life problems. For example, using Reinforcement Learning for Meal Planning based on a Set Budget and Personal Preferences.

最近,我發(fā)布了一些示例,其中我針對(duì)一些現(xiàn)實(shí)生活中的問(wèn)題創(chuàng)建了強(qiáng)化學(xué)習(xí)模型。 例如,將強(qiáng)化學(xué)習(xí)用于基于計(jì)劃預(yù)算和個(gè)人偏好的膳食計(jì)劃 。

Reinforcement Learning can be used in this way for a variety of planning problems, including travel plans, budget planning and business strategy. Two advantages of using RL are that it takes into account the probability of outcomes and allows us to control parts of the environment. Therefore, I decided to write a simple example so others may consider how they could start using it to solve some of their day-to-day or work problems.

強(qiáng)化學(xué)習(xí)可以通過(guò)這種方式用于各種計(jì)劃問(wèn)題,包括旅行計(jì)劃,預(yù)算計(jì)劃和業(yè)務(wù)策略。 使用RL的兩個(gè)優(yōu)點(diǎn)是它考慮了結(jié)果的可能性,并允許我們控制環(huán)境的某些部分。 因此,我決定寫(xiě)一個(gè)簡(jiǎn)單的示例,以便其他人可以考慮如何開(kāi)始使用它來(lái)解決他們的一些日常或工作問(wèn)題。

什么是強(qiáng)化學(xué)習(xí)? (What is Reinforcement Learning?)

Reinforcement Learning (RL) is the process of testing which actions are best for each state of an environment, essentially by trial and error. The model starts with a random policy, and each time an action is taken an amount (known as a reward) is fed back to the model. This continues until an end goal is reached, e.g. you win or lose the game, at which point that run (or episode) ends and the game resets.

強(qiáng)化學(xué)習(xí)(RL)是通過(guò)本質(zhì)上的反復(fù)試驗(yàn)來(lái)測(cè)試哪種操作最適合環(huán)境的每個(gè)狀態(tài)的過(guò)程。 該模型引入了一個(gè)隨機(jī)策略來(lái)啟動(dòng),并且每次采取行動(dòng)時(shí),都會(huì)向該模型提供初始金額(稱(chēng)為獎(jiǎng)勵(lì))。 這一直持續(xù)到達(dá)到最終目標(biāo)為止,例如,您贏(yíng)了或輸了游戲,游戲結(jié)束(或情節(jié))并重置了游戲。

As the model goes through more and more episodes, it begins to learn which actions are more likely to lead us to a positive outcome. Therefore it finds the best actions in any given state, known as the optimal policy.

Many of the RL applications online train models on a game or virtual environment where the model is able to interact with the environment repeatedly. For example, you let the model play a simulation of tic-tac-toe over and over so that it observes success and failure of trying different moves.

許多RL應(yīng)用程序在線(xiàn)地在游戲或虛擬環(huán)境中訓(xùn)練模型,其中模型能夠與環(huán)境反復(fù)交互。 例如,您讓模型反復(fù)模擬井字游戲,以便觀(guān)察嘗試不同動(dòng)作的成功和失敗。

In real life, it is likely we do not have the ability to train our model in this way. For example, a recommendation system in online shopping needs a person's feedback to tell us whether it has succeeded or not, and this feedback is limited in availability by how many users interact with the shopping site.

在現(xiàn)實(shí)生活中,我們很可能無(wú)法以這種方式訓(xùn)練模型。 例如,在線(xiàn)購(gòu)物中的推薦系統(tǒng)需要一個(gè)人的反饋來(lái)告訴我們它是否成功,并且基于有多少用戶(hù)與購(gòu)物網(wǎng)站進(jìn)行交互,其可用性受到限制。

Instead, we may have sample data that shows shopping trends over a time period that we can use to create estimated probabilities. Using these, we can create what is known as a Partially Observed Markov Decision Process (POMDP) as a way to generalise the underlying probability distribution.

部分觀(guān)測(cè)的馬爾可夫決策過(guò)程(POMDP) (Partially Observed Markov Decision Processes (POMDPs))

Markov Decision Processes (MDPs) provide a framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. The key feature of MDPs is that they follow the Markov Property; all future states are independent of the past given the present. In other words, the probability of moving into the next state is only dependent on the current state.

馬爾可夫決策過(guò)程(MDP)提供了一個(gè)框架,用于在結(jié)果部分隨機(jī)且部分受決策者控制的情況下對(duì)決策建模。 MDP的關(guān)鍵特征是它們遵循Markov屬性。 所有未來(lái)狀態(tài)都獨(dú)立于過(guò)去給出的當(dāng)前狀態(tài)。 換句話(huà)說(shuō),進(jìn)入下一個(gè)狀態(tài)的概率僅取決于當(dāng)前狀態(tài)。

POMDPs work similarly, except that they are a generalisation of MDPs. In short, this means the model cannot simply interact with the environment directly, but is instead given a probability distribution estimated from what we have observed. We could use value iteration methods on our POMDP, but instead I have decided to use Monte Carlo Learning in this example.

示例環(huán)境 (Example Environment)

Imagine you are back at school (or perhaps still are) and are in a classroom. The teacher has a strict policy on paper waste: any piece of scrap paper must be passed to him at the front of the classroom, and he will place the waste into the bin (trash can).

However, some students in the class care little for the teacher’s rules and would rather save themselves the trouble of passing the paper round the classroom. Instead, these troublesome individuals may choose to throw the scrap paper into the bin from a distance. Now this angers the teacher and those that do this are punished.

This introduces a very basic action-reward concept, and we have an example classroom environment as shown in the following diagram.

Our aim is to find the best instructions for each person so that the paper reaches the teacher and is placed into the bin and avoids being thrown in the bin.

我們的目標(biāo)是為每個(gè)人找到最好的指導(dǎo),以便將紙傳到老師手中并放入垃圾箱中,避免將其扔到垃圾箱中。

狀態(tài)與行動(dòng) (States and Actions)

In our environment, each person can be considered a state, and they have a variety of actions they can take with the scrap paper. They may choose to pass it to an adjacent classmate, hold onto it, or some may choose to throw it into the bin. We can therefore map our environment to a more standard grid layout as shown below.

在我們的環(huán)境中,每個(gè)人都可以被視為一種狀態(tài) ,他們可以對(duì)廢紙采取多種行動(dòng) 。 他們可以選擇將其傳遞給相鄰的同伴,抓住它,也可以選擇將其扔到垃圾箱中。 因此,我們可以將環(huán)境映射到更標(biāo)準(zhǔn)的網(wǎng)格布局,如下所示。

This is purposefully designed so that each person, or state, has four actions: up, down, left or right, and each will have a varied 'real life' outcome based on who took the action. An action that would move the paper into a wall (including the black block in the middle) indicates that the person holds onto the paper. In some cases this action is duplicated, but that is not an issue in our example.

For example, person A’s actions result in:

  • Up = Throw into bin
  • Down = Hold onto paper
  • Left = Pass to person B
  • Right = Hold onto paper
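This kind of deterministic action-to-outcome mapping is easy to encode directly. As a rough sketch (the dictionary name and outcome labels are just illustrative, not from the original article), person A's four actions could be written as a lookup table:

```python
# Hypothetical encoding of person A's four actions and their outcomes.
# Every other person (state) would get a similar table based on the grid.
ACTIONS_A = {
    "up": "throw into bin",      # the risky throw from a distance
    "down": "hold onto paper",   # moving into a wall means holding the paper
    "left": "pass to person B",
    "right": "hold onto paper",  # duplicated outcome, harmless in this example
}

print(ACTIONS_A["left"])  # pass to person B
```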

概率環(huán)境 (Probabilistic Environment)

For now, the decision maker that partly controls the environment is us. We will tell each person which action they should take. This is known as the policy.

The first challenge I face in my learning is understanding that the environment is likely probabilistic and what this means. In a probabilistic environment, when we instruct a state to take an action under our policy, there is a probability associated with whether that instruction is successfully followed. In other words, if we tell person A to pass the paper to person B, they can decide not to follow the instructed action in our policy and instead throw the scrap paper into the bin.

我在學(xué)習(xí)中面臨的第一個(gè)挑戰(zhàn)是了解環(huán)境很可能是概率性的,這意味著什么。 概率環(huán)境是當(dāng)我們指示某個(gè)國(guó)家根據(jù)我們的政策采取行動(dòng)時(shí),是否成功遵循該概率存在相關(guān)性。 換句話(huà)說(shuō),如果我們告訴A人將紙張傳遞給B人,他們可以決定不遵循我們政策中的指示操作,而是將廢紙扔進(jìn)垃圾箱。

Another example is if we are recommending online shopping products there is no guarantee that the person will view each one.

另一個(gè)例子是,如果我們建議使用在線(xiàn)購(gòu)物產(chǎn)品,則不能保證該人會(huì)查看每個(gè)產(chǎn)品。

觀(guān)察到的過(guò)渡概率 (Observed Transitional Probabilities)

To find the observed transitional probabilities, we need to collect some sample data about how the environment acts. Before we collect information, we first introduce an initial policy. To start the process, I have randomly chosen one that looks as though it would lead to a positive outcome.

Now we observe the actions each person takes given this policy. In other words, say we sat at the back of the classroom and simply watched the class, recording the following results for person A:

現(xiàn)在,我們觀(guān)察每個(gè)人在此政策下采取的行動(dòng)。 換句話(huà)說(shuō),假設(shè)我們坐在教室后面,只是觀(guān)察班級(jí),并觀(guān)察到A人的以下結(jié)果:

We see that a paper passed through this person 20 times; 6 times they kept hold of it, 8 times they passed it to person B and another 6 times they threw it in the trash. This means that under our initial policy, the probability of keeping hold or throwing it in the trash for this person is 6/20 = 0.3 and likewise 8/20 = 0.4 to pass to person B. We can observe the rest of the class to collect the following sample data:

Likewise, we then calculate the probabilities to be the following matrix and we could use this to simulate experience. The accuracy of this model will depend greatly on whether the probabilities are true representations of the whole environment. In other words, we need to make sure we have a sample that is large and rich enough in data.
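As a sketch, these observed transition probabilities are just normalised counts. Using person A's 20 observations from above (the dictionary keys are illustrative labels):

```python
# Observed outcomes for person A under the initial policy:
# out of 20 papers, 6 were held, 8 passed to person B, 6 thrown in the bin.
counts_A = {"hold": 6, "pass_to_B": 8, "throw_in_bin": 6}

total = sum(counts_A.values())  # 20 observations in total
probs_A = {outcome: n / total for outcome, n in counts_A.items()}

print(probs_A)  # {'hold': 0.3, 'pass_to_B': 0.4, 'throw_in_bin': 0.3}
```

The full matrix is simply this calculation repeated for every state, and its accuracy depends on the sample being large and rich enough, as noted above.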

同樣,然后我們將概率計(jì)算為以下矩陣,并可以使用它來(lái)模擬經(jīng)驗(yàn)。 該模型的準(zhǔn)確性將在很大程度上取決于概率是否是整個(gè)環(huán)境的真實(shí)表示。 換句話(huà)說(shuō),我們需要確保我們有一個(gè)足夠大且數(shù)據(jù)足夠豐富的樣本。

Multi-Armed Bandits, Episodes, Rewards, Return and Discount Rate

So we have our transition probabilities estimated from the sample data under a POMDP. The next step, before we introduce any models, is to introduce rewards. So far, we have only discussed the outcome of the final step: either the paper gets placed in the bin by the teacher and nets a positive reward, or gets thrown by A or M and nets a negative reward. This final reward that ends the episode is known as the Terminal Reward.

But there is also a third outcome that is less than ideal: the paper continually gets passed around and never reaches the bin (or takes far longer than we would like). Therefore, in summary, we have three final outcomes:

  • Paper gets placed in bin by teacher and nets a positive terminal reward
  • Paper gets thrown in bin by a student and nets a negative terminal reward
  • Paper gets continually passed around room or gets stuck on students for a longer period of time than we would like

To avoid the paper being thrown in the bin we provide this with a large, negative reward, say -1, and because the teacher is pleased with it being placed in the bin this nets a large positive reward, +1. To avoid the outcome where it continually gets passed around the room, we set the reward for all other actions to be a small, negative value, say -0.04.

If we set this to a positive or zero value, then the model may let the paper go round and round, as it would be better to gain small positives than risk getting close to the negative outcome. This number is also very small because the episode will only collect a single terminal reward, yet it could take many steps to end, and we need to ensure that, if the paper is placed in the bin, the positive outcome is not cancelled out.

Please note: the rewards are always relative to one another and I have chosen arbitrary figures, but these can be changed if the results are not as desired.
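As a sketch, the reward scheme above can be captured in one small function. The state name "bin", the boolean flag and the signature are illustrative assumptions, not the article's own code; only the three reward values come from the text:

```python
def reward(next_state: str, thrown: bool = False) -> float:
    """Reward received when the paper moves into next_state.

    +1    if it reaches the bin via the teacher,
    -1    if a student throws it in from a distance,
    -0.04 for every other step, to discourage endless passing.
    """
    if next_state == "bin":
        return -1.0 if thrown else 1.0
    return -0.04

print(reward("bin"), reward("bin", thrown=True), reward("B"))  # 1.0 -1.0 -0.04
```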

請(qǐng)注意:獎(jiǎng)勵(lì)總是彼此相關(guān)的,我選擇了任意數(shù)字,但是如果結(jié)果不理想,可以更改這些數(shù)字。

Although we have inadvertently discussed episodes in the example, we have yet to formally define one. An episode is simply the sequence of actions the paper takes through the classroom until it reaches the bin, the terminal state, which ends the episode. In other examples, such as playing tic-tac-toe, this would be the end of a game where you win or lose.

The paper could in theory start at any state, and this is why we need enough episodes to ensure that every state and action is tested enough that our outcome is not being driven by invalid results. However, on the flip side, the more episodes we introduce the longer the computation time will be, and, depending on the scale of the environment, we may not have unlimited resources to do this.

This is known as the Multi-Armed Bandit problem; with finite time (or other resources), we need to ensure that we test each state-action pair enough that the actions selected in our policy are, in fact, the optimal ones. In other words, we need to validate that the actions that have led us to good outcomes in the past did so not by sheer luck but because they are in fact the correct choice, and likewise for the actions that appear poor. In our example this may seem simple given how few states we have, but imagine if we increased the scale and how this becomes more and more of an issue.

The overall goal of our RL model is to select the actions that maximises the expected cumulative rewards, known as the return. In other words, the Return is simply the total reward obtained for the episode. A simple way to calculate this would be to add up all the rewards, including the terminal reward, in each episode.

A more rigorous approach is to consider the first steps to be more important than later ones in the episode by applying a discount factor, gamma, in the following formula:
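The formula referred to here appeared as an image in the original article; the standard definition of the discounted return that it describes is:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad \text{with discount factor } 0 \le \gamma \le 1 .
```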

一種更嚴(yán)格的方法是,通過(guò)在以下公式中應(yīng)用折扣因子gamma,認(rèn)為第一步比后續(xù)步驟更重要。

In other words, we sum all the rewards but weigh down later steps by a factor of gamma to the power of how many steps it took to reach them.

換句話(huà)說(shuō),我們將所有獎(jiǎng)勵(lì)相加,但是將后續(xù)步驟權(quán)重乘以要達(dá)到這些步驟所需要執(zhí)行的步驟的能力,即伽馬系數(shù)。

If we think about our example, using a discounted return becomes even clearer to imagine as the teacher will reward (or punish accordingly) anyone who was involved in the episode but would scale this based on how far they are from the final outcome.

For example, if the paper passed from A to B to M, who threw it in the bin, M should be punished most, then B for passing it to him, and lastly person A, who is still involved in the final outcome but less so than M or B. This also emphasises that the longer it takes (based on the number of steps) to start in a state and reach the bin, the less it will either be rewarded or punished, but it will accumulate negative rewards for taking more steps.
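This weighting is easy to verify with a quick calculation. A sketch of that A to B to M episode, using the step reward of -0.04, the terminal reward of -1 for the throw, and (purely for illustration) gamma = 0.5:

```python
gamma = 0.5

# Episode: A passes to B (-0.04), B passes to M (-0.04), M throws it in (-1).
episode = [("A", -0.04), ("B", -0.04), ("M", -1.0)]

# Work backwards: the return from each state is its immediate reward
# plus gamma times the return from the state after it.
G = 0.0
returns = {}
for state, r in reversed(episode):
    G = r + gamma * G
    returns[state] = G

# M is punished hardest, then B, and A least:
print(returns)  # approximately {'M': -1.0, 'B': -0.54, 'A': -0.31}
```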

將模型應(yīng)用于我們的示例 (Applying a Model to our Example)

As our example environment is small, we can apply each model, show some of the calculations performed manually, and illustrate the impact of changing parameters.

由于示例環(huán)境很小,因此我們可以應(yīng)用每種方法,并展示一些手動(dòng)執(zhí)行的計(jì)算,并說(shuō)明更改參數(shù)的影響。

For any algorithm, we first need to initialise the state value function, V(s), and have decided to set each of these to 0 as shown below.

對(duì)于任何算法,我們首先需要初始化狀態(tài)值函數(shù)V(s),并已決定將每個(gè)值設(shè)置為0,如下所示。

Next, we let the model simulate experience in the environment based on our observed probability distribution. The model starts the piece of paper in a random state, and the outcome of each action under our policy is based on our observed probabilities. So, for example, say we have the first three simulated episodes to be the following:

接下來(lái),我們讓模型根據(jù)觀(guān)察到的概率分布模擬環(huán)境經(jīng)驗(yàn)。 該模型以隨機(jī)狀態(tài)開(kāi)始工作,而在我們的政策下,每個(gè)操作的結(jié)果都基于我們觀(guān)察到的概率。 舉例來(lái)說(shuō),假設(shè)我們的前三個(gè)模擬情節(jié)如下:

With these episodes we can calculate our first few updates to our state value function using each of the three models given. For now, we pick arbitrary values of 0.5 for both alpha and gamma to make our hand calculations simpler. We will show later the impact these parameters have on results.

通過(guò)這些事件,我們可以使用給定的三個(gè)模型分別計(jì)算對(duì)狀態(tài)值函數(shù)的前幾個(gè)更新。 現(xiàn)在,我們將任意的alpha和gamma值選擇為0.5,以使我們的手算更加簡(jiǎn)單。 稍后我們將顯示此變量對(duì)結(jié)果的影響。

First, we apply Temporal Difference 0, the simplest of our models, and the first three value updates are as follows:

首先,我們應(yīng)用時(shí)間差0,這是我們模型中最簡(jiǎn)單的,并且前三個(gè)值更新如下:

So how have these been calculated? Well, because our example is small, we can show the calculations by hand.
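For reference, the TD(0) rule being applied by hand is V(s) ← V(s) + α[r + γ·V(s') − V(s)]. A minimal sketch of two such backups follows; the episode shown is hypothetical (the actual simulated episodes were given in a figure), with alpha = gamma = 0.5 as above:

```python
ALPHA, GAMMA = 0.5, 0.5

def td0_update(V, s, r, s_next):
    """One TD(0) backup: move V[s] a fraction ALPHA toward r + GAMMA * V[s']."""
    V[s] += ALPHA * (r + GAMMA * V.get(s_next, 0.0) - V[s])

# Every state value initialised to 0, as in the article.
V = {s: 0.0 for s in ["A", "B", "C", "D", "G", "M", "Teacher"]}

# Hypothetical episode: G passes to the Teacher (step reward -0.04),
# who then places the paper in the bin (terminal reward +1).
td0_update(V, "G", -0.04, "Teacher")
td0_update(V, "Teacher", 1.0, None)  # terminal step: V(s') is taken as 0

print(V["Teacher"], V["G"])  # 0.5 -0.02
```

Note how G gains nothing from the positive terminal reward on this first pass; the +1 only starts to reach G once V("Teacher") is non-zero in later episodes, which is exactly the propagation effect discussed below.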

那么如何計(jì)算這些呢? 好吧,因?yàn)槲覀兊氖纠苄?#xff0c;所以我們可以手動(dòng)顯示計(jì)算結(jié)果。

So what can we observe at this early stage? Firstly, using TD(0) appears unfair to some states, for example person D, who, at this stage, has gained nothing from the paper reaching the bin two out of three times. Their update has only been affected by the value of the next state, but this emphasises how the positive and negative rewards propagate outwards from the corner towards the other states.

那么在這個(gè)早期階段我們可以觀(guān)察到什么呢? 首先,對(duì)于某些州,使用TD(0)似乎是不公平的,例如,人D,在此階段,三分之二的紙都沒(méi)有進(jìn)入紙箱。 他們的更新僅受下一階段的價(jià)值的影響,但這強(qiáng)調(diào)了正面和負(fù)面獎(jiǎng)勵(lì)如何從角落向各州向外傳播。

As we run more episodes, the positive and negative terminal rewards will spread out further and further across all states. This is shown roughly in the diagram below, where we can see that the two episodes that resulted in a positive outcome impact the values of the Teacher and G states, whereas the single negative episode has punished person M.

To show this, we can try more episodes. If we repeat the same three paths already given we produce the following state value function:

(Please note, we have repeated these three episodes for simplicity in this example but the actual model would have episodes where the outcomes are based on the observed transition probability function.)

(請(qǐng)注意,在本示例中,為簡(jiǎn)單起見(jiàn),我們重復(fù)了這三個(gè)情節(jié),但實(shí)際模型中的情節(jié)將基于觀(guān)察到的過(guò)渡概率函數(shù)。)

The diagram above shows the terminal rewards propagating outwards from the top right corner to the states. From this, we may decide to update our policy, as it is clear that the negative terminal reward passes through person M and therefore B and C are impacted negatively. Therefore, based on V27, we may decide to update our policy by selecting the next best state value for each state, as shown in the figure below.

There are two causes for concern in this example: the first is that person A's best action is to throw the paper into the bin and net a negative reward. This is because none of the episodes have visited this person, which emphasises the multi-armed bandit problem. In this small example there are very few states, so it would require many episodes to visit them all, but we need to ensure this is done.

The reason this action appears better for this person is that neither of the terminal states has a value; rather, the positive and negative outcomes are captured only in the terminal rewards. We could, if our situation required it, initialise V0 with figures for the terminal states based on the outcomes.

這個(gè)動(dòng)作對(duì)這個(gè)人更好的原因是,這兩個(gè)終極狀態(tài)都不具有價(jià)值,而正負(fù)結(jié)果都在終極獎(jiǎng)勵(lì)中。 然后,如果我們的情況需要,我們可以根據(jù)結(jié)果用終端狀態(tài)的數(shù)字初始化V0。

Secondly, the state value of person M is flipping back and forth between -0.03 and -0.51 (approximately) between episodes, and we need to address why this is happening. This is caused by our learning rate, alpha. So far we have only introduced our parameters (the learning rate alpha and the discount rate gamma) but have not explained in detail how they impact results.

A large learning rate may cause the results to oscillate, but conversely it should not be so small that the values take forever to converge. This is shown further in the figure below, which plots the total V(s) for every episode: although there is a general increasing trend, the total is diverging back and forth between episodes. Another good explanation of the learning rate is as follows:

較高的學(xué)習(xí)率可能會(huì)導(dǎo)致結(jié)果振蕩,但反之,則不應(yīng)太小而導(dǎo)致永遠(yuǎn)收斂。 下圖進(jìn)一步顯示了這一點(diǎn),該圖演示了每個(gè)情節(jié)的總V(s),我們可以清楚地看到,盡管總體趨勢(shì)呈上升趨勢(shì),但在情節(jié)之間來(lái)回變化。 學(xué)習(xí)率的另一個(gè)很好的解釋如下:

“In the game of golf when the ball is far away from the hole, the player hits it very hard to get as close as possible to the hole. Later when he reaches the flagged area, he chooses a different stick to get accurate short shot.

So it’s not that he won’t be able to put the ball in the hole without choosing the short shot stick, he may send the ball ahead of the target two or three times. But it would be best if he plays optimally and uses the right amount of power to reach the hole.”

因此,這并不是說(shuō)他不選擇短射棒就無(wú)法將球放入洞中,而是可以將球傳給目標(biāo)兩次或三次。 但是,如果他發(fā)揮最佳狀態(tài)并使用適量的力量到達(dá)洞口,那將是最好的選擇。”

See also: "Learning rate of a Q learning agent" on stackoverflow.com, which discusses how the learning rate influences the convergence rate and convergence itself.

Q學(xué)習(xí)代理 的學(xué)習(xí)率學(xué)習(xí)率如何影響收斂率和收斂本身的問(wèn)題。 如果學(xué)習(xí)率是… stackoverflow.com

There are some complex methods for establishing the optimal learning rate for a problem but, as with any machine learning algorithm, if the environment is simple enough you can iterate over different values until convergence is reached. This is akin to tuning the step size in stochastic gradient descent. In a recent RL project, I demonstrated the impact of reducing alpha using an animated visual, shown below. This demonstrates the oscillation when alpha is large and how this becomes smoothed as alpha is reduced.

有一些復(fù)雜的方法可以確定問(wèn)題的最佳學(xué)習(xí)率,但是,與任何機(jī)器學(xué)習(xí)算法一樣,如果環(huán)境足夠簡(jiǎn)單,則可以迭代不同的值,直到達(dá)到收斂為止。 這也被稱(chēng)為隨機(jī)梯度樣。 在最近的RL項(xiàng)目中 ,我演示了使用動(dòng)畫(huà)效果降低alpha的影響,如下所示。 這說(shuō)明了當(dāng)alpha很大時(shí)的振蕩,以及當(dāng)alpha減小時(shí)如何平滑。

Likewise, our discount rate must be a number between 0 and 1, and oftentimes it is taken to be close to 0.9. The discount factor tells us how important future rewards are: a value close to 1 indicates that they will be considered important, whereas moving it towards 0 will make the model consider future steps less and less.

With both of these in mind, we can change both alpha from 0.5 to 0.2 and gamma from 0.5 to 0.9 and we achieve the following results:

Because our learning rate is now much smaller, the model takes longer to learn and the values are generally smaller. Most noticeable is the value for the teacher, which is clearly the best state. The trade-off for this increased computation time is that our value for M is no longer oscillating to the degree it was before. We can now see this in the diagram below, which shows the sum of V(s) under our updated parameters. Although it is not perfectly smooth, the total V(s) slowly increases at a much smoother rate than before and appears to converge as we would like, but requires approximately 75 episodes to do so.
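The effect of alpha on oscillation can be seen even in a toy update that chases an alternating target, standing in for the conflicting positive and negative episode outcomes. This is purely illustrative and not the article's classroom model:

```python
def track(alpha, targets, v=0.0):
    """Repeatedly move v a fraction alpha of the way toward each target."""
    history = []
    for t in targets:
        v += alpha * (t - v)
        history.append(v)
    return history

targets = [1.0, -1.0] * 20  # alternating signal, like mixed episode results

def swing(history):
    """Width of the oscillation band once the updates have settled."""
    return max(history[-6:]) - min(history[-6:])

wide = swing(track(alpha=0.5, targets=targets))    # roughly 0.67
narrow = swing(track(alpha=0.2, targets=targets))  # roughly 0.22
assert wide > narrow  # a smaller learning rate smooths the oscillation
```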

由于我們的學(xué)習(xí)率現(xiàn)在小得多,因此該模型需要更長(zhǎng)的時(shí)間來(lái)學(xué)習(xí),并且值通常較小。 最明顯的是對(duì)于老師來(lái)說(shuō),這顯然是最好的狀態(tài)。 但是,這種折衷是為了增加計(jì)算時(shí)間,這意味著我們的M值不再像以前那樣震蕩。 現(xiàn)在,我們可以在下圖中看到更新后的參數(shù)后的V(s)之和。 盡管不是很平滑,但總V(s)的增長(zhǎng)速度卻比以前平滑得多,并且似乎可以按照我們的意愿收斂,但大約需要75集。

改變目標(biāo)結(jié)果 (Changing the Goal Outcome)

Another crucial advantage of RL that we haven’t mentioned in too much detail is that we have some control over the environment. Currently, the rewards are based on what we decided would be best to get the model to reach the positive outcome in as few steps as possible.

我們沒(méi)有過(guò)多詳細(xì)提到的RL的另一個(gè)關(guān)鍵優(yōu)勢(shì)是,我們可以控制環(huán)境。 目前,獎(jiǎng)勵(lì)基于我們認(rèn)為最好的方法,即使模型以盡可能少的步驟達(dá)到正面結(jié)果。

However, say the teacher changed and the new one didn't mind the students throwing the paper into the bin, so long as it reached it. Then we can change our negative reward to reflect this, and the optimal policy will change.

但是,說(shuō)老師換了,新老師不介意學(xué)生只要把紙扔進(jìn)垃圾箱就把它扔進(jìn)垃圾箱。 然后,我們可以圍繞此改變我們的負(fù)面獎(jiǎng)勵(lì),最優(yōu)政策也會(huì)改變。

This is particularly useful for business solutions. For example, say you are planning a strategy and know that certain transitions are less desired than others, then this can be taken into account and changed at will.

這對(duì)于業(yè)務(wù)解決方案特別有用。 例如,假設(shè)您正在計(jì)劃一項(xiàng)策略,并且知道某些過(guò)渡要比其他過(guò)渡少,那么可以考慮并隨意更改。

結(jié)論 (Conclusion)

We have now created a simple Reinforcement Learning model from observed data. There are many things that could be improved or taken further, including using a more complex model, but this should be a good introduction for those who wish to try applying RL to their own real-life problems.

現(xiàn)在,我們根據(jù)觀(guān)察到的數(shù)據(jù)創(chuàng)建了一個(gè)簡(jiǎn)單的強(qiáng)化學(xué)習(xí)模型。 有許多事情可以改進(jìn)或進(jìn)一步,包括使用更復(fù)雜的模型,但這對(duì)于那些希望嘗試并將其應(yīng)用于實(shí)際問(wèn)題的人來(lái)說(shuō)應(yīng)該是一個(gè)很好的介紹。

I hope you enjoyed reading this article, if you have any questions please feel free to comment below.

Thanks

Sterling

Originally published at: https://www.freecodecamp.org/news/how-to-apply-reinforcement-learning-to-real-life-planning-problems-90f8fa3dc0c5/

強(qiáng)化學(xué)習(xí)應(yīng)用于組合優(yōu)化問(wèn)題

總結(jié)

以上是生活随笔為你收集整理的强化学习应用于组合优化问题_如何将强化学习应用于现实生活中的计划问题的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問(wèn)題。

如果覺(jué)得生活随笔網(wǎng)站內(nèi)容還不錯(cuò),歡迎將生活随笔推薦給好友。