
Markov decision process (MDP)

A Markov decision process is a framework used to represent the environment of a reinforcement learning problem. It can be drawn as a directed graph (each edge points from one node to another): each node represents a possible state of the environment, and each edge leaving a state represents an action that can be taken in that state. For example, consider the following MDP:

Figure 5: A sample Markov Decision Process

The preceding MDP represents what a typical day of a programmer could look like. Each circle represents a particular state the programmer can be in, where the blue state (Wake Up) is the initial state (the state the agent is in at t=0) and the orange state (Publish Code) is the terminal state. Each arrow represents a transition the programmer can make between states. Each state has a reward associated with it, and the higher the reward, the more desirable the state is.

We can tabulate the rewards as an adjacency matrix as well:

The left column of the table lists the possible states and the top row lists the possible actions; N/A means that the action cannot be taken from the given state. This system represents the decisions that a programmer can make throughout their day.
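To make this concrete, here is a minimal sketch of how such a reward table could be encoded in Python. The state names follow the figure, each action is named after the state it leads to, and the numeric rewards are illustrative placeholders consistent with the description that follows (both coding and Netflix carry negative rewards, with Netflix slightly higher); they are not the exact values from the table above.

# A minimal sketch of the programmer's-day MDP as a nested dictionary.
# Outer keys are states; inner keys are the actions available in that state,
# each named after the state it leads to. Missing (state, action) pairs
# correspond to the N/A entries of the adjacency matrix.
# NOTE: the numeric rewards are illustrative placeholders, not the values
# from the table above.
rewards = {
    "Wake Up":        {"Code and debug": -2, "Netflix": -1},
    "Netflix":        {"Netflix": -1},          # the binge-watching loop
    "Code and debug": {"Nap": 1, "Sleep": 3, "Deploy": 10},
    "Nap":            {"Wake Up": 0},
    "Deploy":         {"Sleep": 3},
    # Transitions into the terminal Publish Code state are omitted here.
}

def available_actions(state):
    """Return the actions that can be taken from `state` (none if terminal)."""
    return list(rewards.get(state, {}).keys())

print(available_actions("Wake Up"))   # ['Code and debug', 'Netflix']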

When the programmer wakes up, they can either decide to work (code and debug the code) or watch Netflix. Notice that the reward for watching Netflix is higher than that of coding and debugging. For the programmer in question, watching Netflix seems like a more rewarding activity, while coding and debugging is perhaps a chore (which, I hope, is not the case for the reader!). However, both actions yield negative rewards, while our objective is to maximize the cumulative reward. If the programmer chooses to watch Netflix, they will be stuck in an endless loop of binge-watching, which continuously lowers the cumulative reward. On the other hand, more rewarding states become available if the programmer decides to code diligently. Let's look at the possible trajectories, that is, the sequences of actions the programmer can take:

  • Wake Up | Netflix | Netflix | ...
  • Wake Up | Code and debug | Nap | Wake Up | Code and debug | Nap | ...
  • Wake Up | Code and debug | Sleep
  • Wake Up | Code and debug | Deploy | Sleep

Both the first and second trajectories represent infinite loops. Let's calculate the cumulative reward for each trajectory, setting the discount factor, γ, to a value less than 1 so that the infinite sums converge.
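In general, for a trajectory that yields rewards r_1, r_2, r_3, ..., the discounted cumulative reward is

    G = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{k+1}

Because γ < 1, later steps contribute less and less, so even an infinite trajectory such as the endless Netflix loop has a finite cumulative reward.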

It is easy to see that the first and second trajectories, despite never reaching a terminal state, will never yield a positive cumulative reward. The fourth trajectory yields the highest reward (successfully deploying code is a highly rewarding accomplishment!).
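As a sanity check on this argument, the short sketch below computes the discounted return of each trajectory, reusing the same illustrative placeholder rewards as before (Netflix = -1, code and debug = -2, nap = +1, wake up = 0, sleep = +3, deploy = +10) and an assumed discount factor of 0.9. The exact numbers from the table will differ, but the qualitative conclusion does not: the two loops stay negative and the deploy trajectory comes out on top.

# Discounted-return sanity check with illustrative rewards and gamma = 0.9.
GAMMA = 0.9

def discounted_return(reward_sequence):
    """Sum of GAMMA**k * r_k over a (possibly truncated) reward sequence."""
    return sum(GAMMA ** k * r for k, r in enumerate(reward_sequence))

# The two looping trajectories are truncated after ~100 steps; with
# GAMMA < 1 the neglected tail is vanishingly small.
trajectories = {
    "Netflix | Netflix | ...":           [-1] * 100,
    "Code | Nap | Wake Up | ... (loop)": [-2, 1, 0] * 34,
    "Code | Sleep":                      [-2, 3],
    "Code | Deploy | Sleep":             [-2, 10, 3],
}

for name, reward_seq in trajectories.items():
    print(f"{name:<35} return = {discounted_return(reward_seq):7.2f}")

With these placeholder numbers, the two loops come out at roughly -10.0 and -4.1, while the third and fourth trajectories score about 0.7 and 9.4, matching the conclusion above.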

What we have calculated are the value functions of four policies that a programmer can follow throughout their day. Recall that the value function is the expected cumulative reward starting from a given state and following a policy. We have taken four possible policies and evaluated how each leads to a different cumulative reward; this exercise is also called policy evaluation. Moreover, the equations we applied to calculate the expected rewards are known as Bellman expectation equations. The Bellman equations are a set of equations used to evaluate and improve policies and value functions, and they are foundational to building a theoretical understanding of reinforcement learning.
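For reference, the Bellman expectation equation for the state-value function of a policy \pi relates the value of a state to the values of its possible successor states:

    V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]

Policy evaluation, as carried out informally above, amounts to solving (or repeatedly applying) this equation for every state under a fixed policy.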

While we will not cover the Bellman equations in depth, we highly recommend that the reader do so in order to build a solid understanding of reinforcement learning. For more information, refer to Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. Barto (reference at the end of this chapter).

Now that you have learned about some of the key terms and concepts of reinforcement learning, you may be wondering how we teach a reinforcement learning agent to maximize its reward, or, in other words, to find that the fourth trajectory is the best one. In this book, you will work on solving this question for numerous tasks and problems, all using deep learning. While we encourage you to be familiar with the basics of deep learning, the following sections will serve as a light refresher on the field.
