
Model

In the previous sections, we discussed how the environment is not fully known to the agent. In other words, the agent usually does not know the internal workings of the environment. The agent thus needs to interact with it to gain information and learn how to maximize its expected cumulative reward. However, it is possible for the agent to have an internal replica, or a model, of the environment. The agent can use the model to predict how the environment would react to some action in a given state. A model of the stock market, for example, is tasked with predicting what prices will look like in the future. If the model is accurate, the agent can then use its value function to assess how desirable future states look. More formally, a model can be denoted as a function, $P$, that predicts the probability of the next state given the current state and an action:

$$P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$$
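To make this concrete, here is a minimal sketch of a tabular model in Python. It simply stores a probability distribution over next states for each (state, action) pair and lets the agent query or sample from it; the class name `TabularModel`, its methods, and the toy two-state environment are illustrative assumptions, not something defined in this book.

```python
import random

class TabularModel:
    """A sketch of a model: stores P(next_state | state, action) in a table."""

    def __init__(self):
        # transitions[(state, action)] -> {next_state: probability}
        self.transitions = {}

    def set_transition(self, state, action, next_state_probs):
        """Record the distribution over next states for a (state, action) pair."""
        self.transitions[(state, action)] = next_state_probs

    def predict(self, state, action, next_state):
        """Return P(next_state | state, action), the quantity in the formula above."""
        return self.transitions.get((state, action), {}).get(next_state, 0.0)

    def sample(self, state, action):
        """Sample a plausible next state, which is useful when planning ahead."""
        probs = self.transitions[(state, action)]
        states, weights = zip(*probs.items())
        return random.choices(states, weights=weights)[0]


# Usage: a toy two-state environment where "up" usually moves the agent from s0 to s1.
model = TabularModel()
model.set_transition("s0", "up", {"s1": 0.9, "s0": 0.1})
print(model.predict("s0", "up", "s1"))  # 0.9
```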
In other scenarios, the model of the environment can be used to enumerate possible future states. This is commonly used in turn-based games, such as chess and tic-tac-toe, where the rules and scope of possible actions are clearly defined. Trees are often used to illustrate the possible sequence of actions and states in turn-based games:

Figure 4: A model using its value function to assess possible moves

In the preceding example of the tic-tac-toe game, $s'$ denotes the possible states that taking the action $a$ (represented as the shaded circle) could yield in a given state, $s$. Moreover, we can calculate the value of each state using the agent's value function. The middle and bottom states would yield a high value since the agent would be one step away from victory, whereas the top state would yield a medium value since the agent needs to prevent the opponent from winning.
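The following sketch shows how such a model can enumerate the successor states of a tic-tac-toe position and rank them with a value function. The board representation, the helper names, and the toy value function here are assumptions made for illustration, not the book's implementation.

```python
def legal_moves(board):
    """Indices of empty cells on a 3x3 board stored as a flat list of 9 entries."""
    return [i for i, cell in enumerate(board) if cell == " "]

def next_states(board, player):
    """The model: enumerate every state reachable by one move of `player`."""
    states = []
    for move in legal_moves(board):
        child = board.copy()
        child[move] = player
        states.append(child)
    return states

def is_winner(board, player):
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
    return any(all(board[i] == player for i in line) for line in lines)

def value(board, player):
    """A toy value function: 1.0 for a win, 0.8 if one move from winning, else 0.5."""
    if is_winner(board, player):
        return 1.0
    if any(is_winner(child, player) for child in next_states(board, player)):
        return 0.8
    return 0.5

# Usage: X enumerates each possible move and picks the highest-valued successor state.
board = ["X", "O", " ",
         " ", "X", " ",
         "O", " ", " "]
candidates = next_states(board, "X")
best = max(candidates, key=lambda s: value(s, "X"))
print(best)
```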

Let's review the terms we have covered so far:

     
| Term | Description | What does it output? |
| --- | --- | --- |
| Policy | The algorithm or function that outputs the decisions the agent makes | A single decision (deterministic policy) or a vector of probabilities over possible actions (stochastic policy) |
| Value function | The function that describes how good or bad a given state is | A scalar value representing the expected cumulative reward |
| Model | An agent's representation of the environment, which predicts how the environment will react to the agent's actions | The probability of the next state given an action and the current state, or an enumeration of possible states given the rules of the environment |

In the following sections, we will use these concepts to learn about one of the most fundamental frameworks in reinforcement learning: the Markov decision process.
