Introduction
In the previous chapter, we studied the main elements of Reinforcement Learning (RL). We described an agent as an entity that perceives the state of an environment and acts on that environment in order to achieve a goal. An agent behaves according to a policy, which determines how it selects an action based on the current environment state. In the second half of the previous chapter, we introduced Gym and Baselines, two Python libraries that simplify the environment representation and the algorithm implementation, respectively.
We mentioned that RL considers problems as Markov Decision Processes (MDPs), without entering into the details and without giving a formal definition.
In this chapter, we will formally describe what an MDP is, its properties, and its characteristics. When facing a new problem in RL, we have to ensure that the problem can be formalized as an MDP; otherwise, applying RL techniques is impossible.
Before presenting a formal definition of MDPs, we need to understand Markov Chains (MCs) and Markov Reward Processes (MRPs), which are simplified special cases of MDPs. An MC only models state transitions, without rewards or actions. Consider the game of snakes and ladders: the next state depends entirely on the number shown on the dice, and the player makes no decisions. An MRP adds a reward component to each state transition. MCs and MRPs are useful for understanding the characteristics of MDPs gradually, and we will look at specific examples of both later in the chapter.
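To build intuition for state transitions in a Markov Chain, the following sketch simulates a tiny two-state chain with NumPy. The states and transition probabilities are invented for illustration; the only property that matters is that the next state depends solely on the current state.

```python
import numpy as np

# Hypothetical two-state Markov Chain. Each row of P gives the
# distribution over next states, so every row must sum to 1.
states = ["sunny", "rainy"]
P = np.array([
    [0.8, 0.2],   # transition probabilities from "sunny"
    [0.4, 0.6],   # transition probabilities from "rainy"
])

rng = np.random.default_rng(0)

def simulate(start, steps):
    """Sample a trajectory: the next state depends only on the current one."""
    trajectory = [start]
    s = start
    for _ in range(steps):
        s = rng.choice(len(states), p=P[s])
        trajectory.append(s)
    return trajectory

print([states[s] for s in simulate(0, 5)])
```

Note that the simulation involves no actions and no rewards; adding a reward to each transition would turn this chain into an MRP.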
Along with MDPs, this chapter also presents the concepts of the state-value function and the action-value function, which are used to evaluate how good a state is for an agent and how good an action taken in a given state is. State-value functions and action-value functions are the building blocks of the algorithms used to solve real-world problems. The concepts of state-value functions and action-value functions are highly related to the agent's policy and the environment dynamics, as we will learn later in this chapter.
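In the standard notation, which the chapter develops in detail, both functions are expected returns under a policy $\pi$:

```latex
v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right],
\qquad
q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right],
```

where $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ is the discounted return and $\gamma \in [0, 1]$ is the discount factor.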
The final part of this chapter presents two Bellman equations, namely the Bellman expectation equation and the Bellman optimality equation. These equations are helpful in the context of RL in order to evaluate the behavior of an agent and find a policy that maximizes the agent's performance in an MDP.
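For reference, the two equations in their standard form for the state-value function are:

```latex
% Bellman expectation equation
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
           \left[ r + \gamma\, v_\pi(s') \right]

% Bellman optimality equation
v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)
         \left[ r + \gamma\, v_*(s') \right]
```

The expectation equation evaluates a given policy $\pi$, while the optimality equation characterizes the value of the best achievable policy.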
In this chapter, we will practice with some MDP examples, such as the student MDP and Gridworld. We will implement the solution methods and equations explained in this chapter using Python, SciPy, and NumPy.