Python Reinforcement Learning Projects
Sean Saito, Yang Wenzhuo, Rajalingappaa Shanmugamani
Markov decision process (MDP)
A Markov decision process is a framework used to represent the environment of a reinforcement learning problem. It is a graphical model with directed edges (meaning that one node of the graph points to another node). Each node represents a possible state in the environment, and each edge pointing out of a state represents an action that can be taken in the given state. For example, consider the following MDP:
The preceding MDP represents what a typical day of a programmer could look like. Each circle represents a particular state the programmer can be in, where the blue state (Wake Up) is the initial state (or the state the agent is in at t=0), and the orange state (Publish Code) denotes the terminal state. Each arrow represents a transition that the programmer can make between states. Each state has a reward associated with it, and the higher the reward, the more desirable the state is.
We can tabulate the rewards as an adjacency matrix as well:
The left column lists the possible states and the top row lists the possible actions; N/A means that the action cannot be performed from the given state. This matrix represents the decisions that a programmer can make throughout their day.
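As a rough illustration, such a reward table can be represented in Python as a nested dictionary keyed by state and then by action. The state and action names below follow the example, but the numeric rewards are placeholders rather than the actual values from the book's table.

```python
# rewards[state][action] -> reward for taking `action` from `state`.
# Missing entries play the role of N/A (the action is not available in that state).
# Numeric values are placeholders for illustration only.
rewards = {
    "Wake Up":        {"Netflix": -2, "Code and debug": -5},
    "Netflix":        {"Netflix": -2},
    "Code and debug": {"Nap": 0, "Sleep": 0, "Deploy": 10},
    "Nap":            {"Wake Up": 0},
    "Deploy":         {"Sleep": 0},
}

def reward(state, action):
    """Return the reward for (state, action), or None if the action is N/A in that state."""
    return rewards.get(state, {}).get(action)
```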
When the programmer wakes up, they can either decide to work (code and debug the code) or watch Netflix. Notice that the reward for watching Netflix is higher than that of coding and debugging. For the programmer in question, watching Netflix seems like the more rewarding activity, while coding and debugging is perhaps a chore (which, I hope, is not the case for the reader!). However, both actions yield negative rewards, even though our objective is to maximize the cumulative reward. If the programmer chooses to watch Netflix, they will be stuck in an endless loop of binge-watching that continuously lowers the cumulative reward. If, instead, they decide to code diligently, more rewarding states become available. Let's look at the possible trajectories, or sequences of actions, that the programmer can take:
- Wake Up | Netflix | Netflix | ...
- Wake Up | Code and debug | Nap | Wake Up | Code and debug | Nap | ...
- Wake Up | Code and debug | Sleep
- Wake Up | Code and debug | Deploy | Sleep
Both the first and second trajectories represent infinite loops. Let's calculate the discounted cumulative reward for each, where future rewards are weighted by a discount factor, γ, set to a value between 0 and 1:
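A minimal sketch of this calculation in Python follows; the per-step reward values and the discount factor of 0.9 are assumptions for illustration (not the actual numbers from the reward table), and the two infinite trajectories are truncated after a fixed number of steps.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical per-step rewards along each trajectory (placeholder values);
# the two looping trajectories are truncated rather than summed to infinity.
trajectories = {
    "Wake Up | Netflix | Netflix | ...":              [-2] * 50,
    "Wake Up | Code and debug | Nap | Wake Up | ...": [-5, 0, 0] * 20,
    "Wake Up | Code and debug | Sleep":               [-5, 0],
    "Wake Up | Code and debug | Deploy | Sleep":      [-5, 10, 0],
}

def discounted_return(rewards, gamma=GAMMA):
    """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... for a list of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

for name, rewards in trajectories.items():
    print(f"{name}: {discounted_return(rewards):.2f}")
```

With these assumed values, the two looping trajectories keep accumulating negative reward, while the trajectory that ends in deploying the code comes out ahead.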
It is easy to see that both the first and second trajectories, despite not reaching a terminal state, will never return positive rewards. The fourth trajectory yields the highest reward (successfully deploying code is a highly rewarding accomplishment!).
What we have calculated are the value functions for four policies that a programmer can follow to go through their day. Recall that the value function is the expected cumulative reward starting from a given state and following a policy. We have observed four possible policies and evaluated how each leads to a different cumulative reward; this exercise is also called policy evaluation. Moreover, the equations we have applied to calculate the expected rewards are known as Bellman expectation equations. The Bellman equations are a set of equations used to evaluate and improve policies and value functions, helping a reinforcement learning agent learn better. Though a thorough introduction to Bellman equations is outside the scope of this book, they are foundational to a theoretical understanding of reinforcement learning, and we encourage the reader to look into them further.
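For reference, the Bellman expectation equation for the state-value function of a policy π expresses the value of a state recursively, as the expected immediate reward plus the discounted value of the next state:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ r_{t+1} + \gamma \, V^{\pi}(s_{t+1}) \mid s_t = s \right]
```

Here, γ is the discount factor, and the expectation is taken over the actions chosen by the policy and the environment's state transitions.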
Now that you have learned about some of the key terms and concepts of reinforcement learning, you may be wondering how we teach a reinforcement learning agent to maximize its reward, or in other words, how it learns that the fourth trajectory is the best. In this book, you will work on answering this question for numerous tasks and problems, all using deep learning. While we encourage you to be familiar with the basics of deep learning, the following sections will serve as a light refresher on the field.