
Model

In the previous sections, we discussed how the environment is not fully known to the agent. In other words, the agent usually does not know how the environment works internally. The agent thus needs to interact with it to gain information and learn how to maximize its expected cumulative reward. However, it is possible for the agent to have an internal replica, or a model, of the environment. The agent can use the model to predict how the environment would react to some action in a given state. A model of the stock market, for example, is tasked with predicting what prices will look like in the future. If the model is accurate, the agent can then use its value function to assess how desirable future states look. More formally, a model can be denoted as a function, P, that predicts the probability of the next state, s', given the current state, s, and an action, a:

P(s' | s, a) = Pr(S_{t+1} = s' | S_t = s, A_t = a)
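For a small, discrete environment, such a model can be as simple as a lookup table of transition probabilities. The sketch below is purely illustrative: the state names, action names, and probabilities are hypothetical values chosen for the example, not taken from any particular environment.

```python
# A minimal sketch of a tabular environment model. The states ("s0", "s1", ...),
# actions ("a0", "a1"), and probabilities are hypothetical, chosen only to
# illustrate the idea of P(s' | s, a).
model = {
    "s0": {
        "a0": {"s1": 0.8, "s2": 0.2},  # taking a0 in s0 usually leads to s1
        "a1": {"s2": 1.0},             # taking a1 in s0 always leads to s2
    },
    "s1": {
        "a0": {"s0": 0.5, "s2": 0.5},
    },
}

def predict_next_state_probs(state, action):
    """Return the model's predicted distribution over next states."""
    return model[state][action]

print(predict_next_state_probs("s0", "a0"))  # {'s1': 0.8, 's2': 0.2}
```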
In other scenarios, the model of the environment can be used to enumerate possible future states. This is commonly used in turn-based games, such as chess and tic-tac-toe, where the rules and scope of possible actions are clearly defined. Trees are often used to illustrate the possible sequence of actions and states in turn-based games:

Figure 4: A model using its value function to assess possible moves

In the preceding example of the tic-tac-toe game, s' denotes the possible states that taking the action a (represented as the shaded circle) could yield in a given state, s. Moreover, we can calculate the value of each state using the agent's value function. The middle and bottom states would yield a high value, since the agent would be one step away from victory, whereas the top state would yield a medium value, since the agent needs to prevent the opponent from winning.
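The following sketch shows what this looks like in code for tic-tac-toe: the model enumerates every board reachable in one move, and a value function scores each resulting board. The 9-character board encoding and the deliberately simple value function are assumptions made for this illustration, not a full agent.

```python
# A rough sketch: the model enumerates possible next states in tic-tac-toe,
# and a toy value function ranks them. Board encoding and value function are
# simplified assumptions for this example.

def enumerate_next_states(board, player):
    """The model: list every board that can result from one legal move."""
    next_states = []
    for i, cell in enumerate(board):
        if cell == " ":                               # empty square -> legal move
            next_states.append(board[:i] + player + board[i + 1:])
    return next_states

def value_function(board, player):
    """Toy value: 1.0 if the board is a win for the player, 0.0 otherwise."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),         # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),         # columns
             (0, 4, 8), (2, 4, 6)]                    # diagonals
    for a, b, c in lines:
        if board[a] == board[b] == board[c] == player:
            return 1.0
    return 0.0

# The board is a 9-character string read row by row: "X", "O", or " ".
board = "XX " + "OO " + "   "
for next_board in enumerate_next_states(board, "X"):
    print(repr(next_board), value_function(next_board, "X"))
# The move that completes the top row scores 1.0; the other moves score 0.0.
```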

Let's review the terms we have covered so far:

| Term | Description | What does it output? |
| --- | --- | --- |
| Policy | The algorithm or function that outputs decisions the agent makes | A single decision (deterministic policy) or a vector of probabilities over possible actions (stochastic policy) |
| Value Function | The function that describes how good or bad a given state is | A scalar value representing the expected cumulative reward |
| Model | An agent's representation of the environment, which predicts how the environment will react to the agent's actions | The probability of the next state given an action and the current state, or an enumeration of possible states given the rules of the environment |

In the following sections, we will use these concepts to learn about one of the most fundamental frameworks in reinforcement learning: the Markov decision process.
