- Hands-On Q-Learning with Python
- Nazia Habib
Markov chains
MDPs are based on Markov chains. A Markov chain is a model of a sequence of events in which the probability of the next state depends only on the current state, not on the sequence of states that came before it.
Here's a very simple example of a process that can be modeled as a Markov chain. Suppose that we're repeatedly playing a game with a binary outcome, so that at the end of every round we will either win or lose:

The chain has two states, and the system moves back and forth between them. The arrows between them are labeled with the probabilities of moving to the corresponding states. If we are in the Win state, there is a 60% chance that we will stay in Win and a 40% chance that we will move to Lose at the next timestep (the time it takes to play another round). Likewise, if we lose this round, there is a 70% chance that we will win the next round and a 30% chance that we will lose again.
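This transition diagram can be written down directly as a lookup table of probabilities. The following is a minimal sketch (the names `transition_probs` and `next_state` are ours, not from the book) that stores the Win/Lose chain as a Python dictionary and samples the next state using only the current one:

```python
import random

# Each state maps to the probabilities of moving to each possible next state.
transition_probs = {
    'Win':  {'Win': 0.6, 'Lose': 0.4},
    'Lose': {'Win': 0.7, 'Lose': 0.3},
}

def next_state(current_state):
    """Sample the next state using only the current state's transition probabilities."""
    states = list(transition_probs[current_state].keys())
    weights = list(transition_probs[current_state].values())
    return random.choices(states, weights=weights)[0]

# Simulate ten rounds starting from a win.
state = 'Win'
for _ in range(10):
    state = next_state(state)
    print(state)
```

Because each row of the table sums to 1 and the sampling looks only at the current state, this small simulator already behaves like the chain in the diagram.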
As another example, suppose that we're playing a simple stochastic game that pays out rewards. Here is a partial state diagram of that game; we start in the state marked $5:

Every time we flip a fair coin, we win $1 if the coin comes up heads or lose $1 if it comes up tails. We start with $5 and flip the coin three times. A more general version of this process is called a random walk. The following tree diagram demonstrates a two-step random walk with a coin flip:

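To make the random walk concrete, here is a short sketch (the function name and default arguments are ours, not from the book) that simulates the coin-flipping game for a given starting amount and number of flips:

```python
import random

def random_walk(start_money=5, num_flips=3):
    """Simulate the coin-flipping game: +$1 on heads, -$1 on tails."""
    money = start_money
    for _ in range(num_flips):
        if random.random() < 0.5:   # heads
            money += 1
        else:                       # tails
            money -= 1
    return money

print(random_walk())  # one of 2, 4, 6, or 8, depending on the flips
```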
The following are the results when we flip the coin three times:
- First flip: Heads – $1 win
- Second flip: Tails – $1 loss
- Third flip: Heads – $1 win
We now have $6; we can think of this as our current state, which we will call S0. From this state, we can go to one of two states, as follows:
- S1: The state of having $5 – 50% probability
- S2: The state of having $7 – 50% probability
The event that will cause us to go into one of these two states is a coin flip; that is, it is a stochastic event that has a probability of 50% for either outcome (because we have defined this as a fair coin).
We know the current state we are in (the state of having $6) and the two possible outcomes of the coin flip event (having either $5 or $7). Note that although we arrived at this state by playing the game continuously, and our current state therefore depends on the states that came before it, we don't need to know how much money we started with, or how much we won or lost along the way, to know which amounts we might have after the next flip and the probability of each outcome.
In other words, we do not need any knowledge of previous states to work out the probabilities for the next state. Whatever we know about the past, our probability calculation for the next state is always the same.
This memorylessness is the defining characteristic of a problem that can be modeled as a Markov chain. If I have $X now, I know for sure that I will have either $X+1 or $X-1 after the next coin flip. It doesn't matter if I had $X+1 or $X-1 after the previous coin flip; for the purposes of predicting what I will get next, all that matters is what I have now. Knowing about the past will not change my predictions about the future.
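As a small illustration of this memorylessness, the distribution over the next state can be computed from the current amount alone. The helper below is a hypothetical sketch, not code from the book:

```python
def next_state_distribution(current_money):
    """The distribution over the next state depends only on the current amount."""
    return {current_money + 1: 0.5,   # heads: win $1
            current_money - 1: 0.5}   # tails: lose $1

# However we arrived at $6, the prediction for the next flip is the same.
print(next_state_distribution(6))  # {7: 0.5, 5: 0.5}
```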