
Model

In the previous sections, we discussed how the environment is not fully known to the agent. In other words, the agent usually does not know what the internal algorithm of the environment looks like. The agent thus needs to interact with the environment to gain information and learn how to maximize its expected cumulative reward. However, it is possible for the agent to have an internal replica, or a model, of the environment. The agent can use the model to predict how the environment would react to some action in a given state. A model of the stock market, for example, is tasked with predicting what prices will look like in the future. If the model is accurate, the agent can then use its value function to assess how desirable future states look. More formally, a model can be denoted as a function, $P(s' \mid s, a)$, that predicts the probability of the next state given the current state and an action:

$$P(s' \mid s, a) = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$$

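To make this concrete, here is a minimal sketch of a tabular model. The state names, actions, and probabilities are invented for illustration (they are not from this book); the point is simply that a model stores, for each state-action pair, a distribution over next states that the agent can query or sample from:

```python
import random

# A tabular model: for each (state, action) pair, a distribution over
# possible next states. The entries below are purely illustrative.
model = {
    ("s0", "buy"):  {"s_up": 0.6, "s_down": 0.4},
    ("s0", "hold"): {"s_up": 0.5, "s_down": 0.5},
}

def next_state_probability(state, action, next_state):
    """Return P(next_state | state, action) according to the model."""
    return model.get((state, action), {}).get(next_state, 0.0)

def sample_next_state(state, action):
    """Simulate one step of the environment by sampling from the model."""
    dist = model[(state, action)]
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs, k=1)[0]

print(next_state_probability("s0", "buy", "s_up"))  # 0.6
print(sample_next_state("s0", "hold"))              # 's_up' or 's_down'
```

With such a model, the agent can "imagine" the outcome of an action without actually taking it in the real environment.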
In other scenarios, the model of the environment can be used to enumerate possible future states. This is commonly used in turn-based games, such as chess and tic-tac-toe, where the rules and scope of possible actions are clearly defined. Trees are often used to illustrate the possible sequence of actions and states in turn-based games:

Figure 4: A model using its value function to assess possible moves

In the preceding example of the tic-tac-toe game, $s'$ denotes the possible states that taking the action $a$ (represented as the shaded circle) could yield in a given state, $s$. Moreover, we can calculate the value of each state using the agent's value function. The middle and bottom states would yield a high value since the agent would be one step away from victory, whereas the top state would yield a medium value since the agent needs to prevent the opponent from winning.
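The following sketch shows this enumerate-and-evaluate idea for tic-tac-toe. It is not the book's code; the board representation and the toy value function (which only detects an immediate win) are assumptions made for illustration:

```python
# Board: a tuple of 9 cells, each 'X', 'O', or ' '. The agent plays 'X'.

def possible_next_states(board, player):
    """Enumerate every state reachable by placing `player` on an empty cell."""
    return [board[:i] + (player,) + board[i + 1:]
            for i, cell in enumerate(board) if cell == " "]

def value(board, player="X"):
    """Toy value function: 1.0 if `player` has three in a row, else 0.0."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
    for a, b, c in lines:
        if board[a] == board[b] == board[c] == player:
            return 1.0
    return 0.0

board = ("X", "O", "X",
         "O", "X", " ",
         " ", " ", "O")

# Use the model to enumerate the agent's possible moves and score each one.
for s_next in possible_next_states(board, "X"):
    print(s_next, value(s_next))
```

In practice, the value function would assign graded values (such as the medium value for the defensive move above) rather than only detecting wins, but the loop is the same: the model proposes the reachable states and the value function ranks them.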

Let's review the terms we have covered so far:

     
Term | Description | What does it output?
--- | --- | ---
Policy | The algorithm or function that outputs the decisions the agent makes | A single decision (deterministic policy) or a vector of probabilities over possible actions (stochastic policy)
Value Function | The function that describes how good or bad a given state is | A scalar value representing the expected cumulative reward
Model | The agent's representation of the environment, which predicts how the environment will react to the agent's actions | The probability of the next state given an action and the current state, or an enumeration of possible states given the rules of the environment

In the following sections, we will use these concepts to learn about one of the most fundamental frameworks in reinforcement learning: the Markov decision process.
