- Hands-On Q-Learning with Python
- Nazia Habib
Markov chains
MDPs are based on Markov chains. A Markov chain is a model of a system of events in which the probability of the next state depends only on the current state, not on the sequence of states that came before it.
Here's a very simple example of a process that can be modeled as a Markov chain. Suppose that we're playing a game with a binary outcome repeatedly—at the end of every round of the game, we will either win or lose:

The chain has two states and the system moves back and forth between them. The arrows between them are the probabilities that we will move to the corresponding states. If we are in the Win state, there is a 60% chance that we will stay in Win and a 40% chance that we will move to Lose at the next timestep (the time it takes to play another round). Likewise, if we lose this round, there is a 70% chance that we will win the next round and a 30% chance that we will lose again.
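To make the transition structure concrete, here is a minimal Python sketch of the Win/Lose chain above. The state names and probabilities come from the description; the function name and the ten-round simulation loop are purely illustrative.

```python
import random

# Transition probabilities for the Win/Lose chain described above.
# transitions[state] maps each possible next state to its probability.
transitions = {
    "Win":  {"Win": 0.6, "Lose": 0.4},
    "Lose": {"Win": 0.7, "Lose": 0.3},
}

def next_state(state):
    """Sample the next state using only the current state."""
    states = list(transitions[state].keys())
    probs = list(transitions[state].values())
    return random.choices(states, weights=probs)[0]

# Simulate ten rounds of play, starting from a win.
state = "Win"
for t in range(10):
    state = next_state(state)
    print(t, state)
```

Note that `next_state` reads nothing but the current state; that is the defining property of the chain.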
As another example, suppose that we're playing a simple stochastic game that pays out rewards. Here is a partial state diagram of that game; we start in the state marked $5:

Every time we flip a fair coin, we win $1 if the coin comes up as heads or lose $1 if it comes up as tails. We start with $5 and flip the coin three times. A more general version of this process is called a random walk. The following tree diagram demonstrates a two-step random walk with a coin flip:

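As a small sketch of the tree, the following code (assuming the $5 starting amount and the fair coin described above; the variable names are illustrative) enumerates the four branches of a two-step walk along with their probabilities:

```python
from itertools import product

# Enumerate every branch of a two-step random walk starting from $5.
# +1 for heads, -1 for tails; each branch has probability 0.5 * 0.5.
start = 5
for flips in product([+1, -1], repeat=2):
    path = [start]
    for step in flips:
        path.append(path[-1] + step)
    print(flips, "->", path, "probability", 0.5 ** len(flips))
```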
The following are the results when we flip the coin three times:
- First flip: Heads – $1 win
- Second flip: Tails – $1 loss
- Third flip: Heads – $1 win
We now have $6; we can think of this as our current state, which we will call S0. From this state, we can go to one of two states, as follows:
- S1: The state of having $5 – 50% probability
- S2: The state of having $7 – 50% probability
The event that will cause us to go into one of these two states is a coin flip; that is, it is a stochastic event that has a probability of 50% for either outcome (because we have defined this as a fair coin).
We know the current state we are in (the state of having $6) and the two possible outcomes that will result from the coin flip event (either having $5 or having $7). Note that although we have continuously been playing this game and our current state depends on the previous states that we have been in, we don't need to know how much money we started with (or how much we won or lost previously) to know how much we will have on the next flip and what the probability is of either outcome.
In other words, we do not need any knowledge of previous states to determine the probabilities of the possible next states. Whatever we happen to know about the past, the calculation for the next state will always be the same.
This memorylessness is the defining characteristic of a problem that can be modeled as a Markov chain. If I have $X now, I know for sure that I will have either $X+1 or $X-1 after the next coin flip. It doesn't matter if I had $X+1 or $X-1 after the previous coin flip; for the purposes of predicting what I will get next, all that matters is what I have now. Knowing about the past will not change my predictions about the future.
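As a short sketch of this memoryless property (the function name and the simulation loop are illustrative, not from the text), the next-state distribution can be written as a function of the current amount alone:

```python
import random

def next_dollar_distribution(current):
    """The distribution over next states depends only on the current amount."""
    return {current + 1: 0.5, current - 1: 0.5}

# Whatever path led us to $6, the prediction for the next flip is identical.
print(next_dollar_distribution(6))   # {7: 0.5, 5: 0.5}

# Simulating a few more flips confirms that each step reads only the current state.
money = 6
for _ in range(3):
    money += random.choice([+1, -1])   # fair coin: heads +$1, tails -$1
    print(money)
```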