
Model

In the previous sections, we discussed how the environment is not fully known to the agent. In other words, the agent usually does not know what the internal algorithm of the environment looks like. The agent thus needs to interact with the environment to gain information and learn how to maximize its expected cumulative reward. However, it is possible for the agent to have an internal replica, or a model, of the environment. The agent can use the model to predict how the environment would react to some action in a given state. A model of the stock market, for example, is tasked with predicting what prices will look like in the future. If the model is accurate, the agent can then use its value function to assess how desirable future states look. More formally, a model can be denoted as a function, $P(s' \mid s, a)$, that predicts the probability of the next state given the current state and an action:

$$P(s' \mid s, a) = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$$

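To make this concrete, here is a minimal sketch of a tabular model. The state names, actions, and probabilities are invented for illustration (they are not from this book); the point is simply that a model stores, for each state-action pair, a distribution over next states that the agent can query or sample from:

```python
import random

# A tabular model: for each (state, action) pair, a distribution over
# possible next states. The entries below are purely illustrative.
model = {
    ("s0", "buy"):  {"s_up": 0.6, "s_down": 0.4},
    ("s0", "hold"): {"s_up": 0.5, "s_down": 0.5},
}

def next_state_probability(state, action, next_state):
    """Return P(next_state | state, action) according to the model."""
    return model.get((state, action), {}).get(next_state, 0.0)

def sample_next_state(state, action):
    """Simulate one step of the environment by sampling from the model."""
    dist = model[(state, action)]
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs, k=1)[0]

print(next_state_probability("s0", "buy", "s_up"))  # 0.6
print(sample_next_state("s0", "hold"))              # 's_up' or 's_down'
```

With such a model, the agent can "imagine" the outcome of an action without actually taking it in the real environment.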
In other scenarios, the model of the environment can be used to enumerate possible future states. This is commonly used in turn-based games, such as chess and tic-tac-toe, where the rules and scope of possible actions are clearly defined. Trees are often used to illustrate the possible sequence of actions and states in turn-based games:

Figure 4: A model using its value function to assess possible moves

In the preceding example of the tic-tac-toe game, $s'$ denotes the possible states that taking the action $a$ (represented as the shaded circle) could yield in a given state, $s$. Moreover, we can calculate the value of each state using the agent's value function. The middle and bottom states would yield a high value since the agent would be one step away from victory, whereas the top state would yield a medium value since the agent needs to prevent the opponent from winning.
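The following sketch shows this enumerate-and-evaluate idea for tic-tac-toe. It is not the book's code; the board representation and the toy value function (which only detects an immediate win) are assumptions made for illustration:

```python
# Board: a tuple of 9 cells, each 'X', 'O', or ' '. The agent plays 'X'.

def possible_next_states(board, player):
    """Enumerate every state reachable by placing `player` on an empty cell."""
    return [board[:i] + (player,) + board[i + 1:]
            for i, cell in enumerate(board) if cell == " "]

def value(board, player="X"):
    """Toy value function: 1.0 if `player` has three in a row, else 0.0."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
    for a, b, c in lines:
        if board[a] == board[b] == board[c] == player:
            return 1.0
    return 0.0

board = ("X", "O", "X",
         "O", "X", " ",
         " ", " ", "O")

# Use the model to enumerate the agent's possible moves and score each one.
for s_next in possible_next_states(board, "X"):
    print(s_next, value(s_next))
```

In practice, the value function would assign graded values (such as the medium value for the defensive move above) rather than only detecting wins, but the loop is the same: the model proposes the reachable states and the value function ranks them.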

Let's review the terms we have covered so far:

     
Term | Description | What does it output?
--- | --- | ---
Policy | The algorithm or function that outputs the decisions the agent makes | A single decision (deterministic policy) or a vector of probabilities over possible actions (stochastic policy)
Value Function | The function that describes how good or bad a given state is | A scalar value representing the expected cumulative reward
Model | The agent's representation of the environment, which predicts how the environment will react to the agent's actions | The probability of the next state given an action and the current state, or an enumeration of possible states given the rules of the environment

In the following sections, we will use these concepts to learn about one of the most fundamental frameworks in reinforcement learning: the Markov decision process.
