- Hands-On Q-Learning with Python
- Nazia Habib
Markov chains
MDPs are based on Markov chains. A Markov chain is a model of a sequence of events in which the probability of the next state depends only on the current state, not on the sequence of states that came before it.
Here's a very simple example of a process that can be modeled as a Markov chain. Suppose that we're repeatedly playing a game with a binary outcome, so that at the end of every round we will either win or lose:

The chain has two states, and the system moves back and forth between them. The arrows between them are labeled with the probabilities of moving to the corresponding states. If we are in the Win state, there is a 60% chance that we will stay in Win and a 40% chance that we will move to Lose at the next timestep (the time it takes to play another round). Likewise, if we lose this round, there is a 70% chance that we will win the next round and a 30% chance that we will lose again.
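This transition diagram can be written down directly as a lookup table of probabilities. The following is a minimal sketch (the names `transition_probs` and `next_state` are ours, not from the book) that stores the Win/Lose chain as a Python dictionary and samples the next state using only the current one:

```python
import random

# Each state maps to the probabilities of moving to each possible next state.
transition_probs = {
    'Win':  {'Win': 0.6, 'Lose': 0.4},
    'Lose': {'Win': 0.7, 'Lose': 0.3},
}

def next_state(current_state):
    """Sample the next state using only the current state's transition probabilities."""
    states = list(transition_probs[current_state].keys())
    weights = list(transition_probs[current_state].values())
    return random.choices(states, weights=weights)[0]

# Simulate ten rounds starting from a win.
state = 'Win'
for _ in range(10):
    state = next_state(state)
    print(state)
```

Because each row of the table sums to 1 and the sampling looks only at the current state, this small simulator already behaves like the chain in the diagram.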
As another example, suppose that we're playing a simple stochastic game that pays out rewards. Here is a partial state diagram of that game; we start in the state marked $5:

Every time we flip a fair coin, we win $1 if the coin comes up heads or lose $1 if it comes up tails. We start with $5 and flip the coin three times. A more general version of this process is called a random walk. The following tree diagram demonstrates a two-step random walk with a coin flip:

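To make the random walk concrete, here is a short sketch (the function name and default arguments are ours, not from the book) that simulates the coin-flipping game for a given starting amount and number of flips:

```python
import random

def random_walk(start_money=5, num_flips=3):
    """Simulate the coin-flipping game: +$1 on heads, -$1 on tails."""
    money = start_money
    for _ in range(num_flips):
        if random.random() < 0.5:   # heads
            money += 1
        else:                       # tails
            money -= 1
    return money

print(random_walk())  # one of 2, 4, 6, or 8, depending on the flips
```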
The following are the results when we flip the coin three times:
- First flip: Heads – $1 win
- Second flip: Tails – $1 loss
- Third flip: Heads – $1 win
We now have $6; we can think of this as our current state, which we will call S0. From this state, we can go to one of two states, as follows:
- S1: The state of having $5 – 50% probability
- S2: The state of having $7 – 50% probability
The event that will cause us to go into one of these two states is a coin flip; that is, it is a stochastic event that has a probability of 50% for either outcome (because we have defined this as a fair coin).
We know the current state we are in (the state of having $6) and the two possible outcomes of the coin flip event (having either $5 or $7). Note that although we arrived at this state by playing the game continuously, and our current state therefore depends on the states that came before it, we don't need to know how much money we started with, or how much we won or lost along the way, to know which amounts we might have after the next flip and the probability of each outcome.
In other words, we do not need any knowledge of previous states to work out the probabilities for the next state. Whatever we know about the past, our probability calculation for the next state is always the same.
This memorylessness is the defining characteristic of a problem that can be modeled as a Markov chain. If I have $X now, I know for sure that I will have either $X+1 or $X-1 after the next coin flip. It doesn't matter if I had $X+1 or $X-1 after the previous coin flip; for the purposes of predicting what I will get next, all that matters is what I have now. Knowing about the past will not change my predictions about the future.
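As a small illustration of this memorylessness, the distribution over the next state can be computed from the current amount alone. The helper below is a hypothetical sketch, not code from the book:

```python
def next_state_distribution(current_money):
    """The distribution over the next state depends only on the current amount."""
    return {current_money + 1: 0.5,   # heads: win $1
            current_money - 1: 0.5}   # tails: lose $1

# However we arrived at $6, the prediction for the next flip is the same.
print(next_state_distribution(6))  # {7: 0.5, 5: 0.5}
```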