Model
In the previous sections, we discussed how the environment is not fully known to the agent. In other words, the agent usually has no knowledge of the environment's internal dynamics. The agent thus needs to interact with it to gain information and learn how to maximize its expected cumulative reward. However, it is possible for the agent to have an internal replica, or a model, of the environment. The agent can use the model to predict how the environment would react to some action in a given state. A model of the stock market, for example, is tasked with predicting what the prices will look like in the future. If the model is accurate, the agent can then use its value function to assess how desirable future states look. More formally, a model can be denoted as a function, $\mathcal{P}$, that predicts the probability of the next state given the current state and an action:

$$\mathcal{P}(s' \mid s, a) = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$$
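As a sketch of what such a model might look like in code, consider the tabular transition_model below; the states, actions, and probabilities are made up purely for illustration:

```python
# A minimal, hypothetical sketch of a tabular model P(s' | s, a).
# All states, actions, and probabilities here are invented for illustration.
transition_model = {
    # (state, action): {next_state: probability}
    ("s0", "buy"):  {"s_up": 0.6, "s_down": 0.4},
    ("s0", "sell"): {"s_up": 0.3, "s_down": 0.7},
}

def predict_next_state(state, action):
    """Return the model's probability distribution over next states."""
    return transition_model[(state, action)]

print(predict_next_state("s0", "buy"))  # {'s_up': 0.6, 's_down': 0.4}
```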
In other scenarios, the model of the environment can be used to enumerate possible future states. This is commonly used in turn-based games, such as chess and tic-tac-toe, where the rules and scope of possible actions are clearly defined. Trees are often used to illustrate the possible sequence of actions and states in turn-based games:
[Figure: a game tree showing the states reachable from the current tic-tac-toe position]

In the preceding tic-tac-toe example, $s'$ denotes the possible states that taking the action $a$ (represented as the shaded circle) could yield in a given state, $s$. Moreover, we can calculate the value of each state using the agent's value function. The middle and bottom states would yield a high value, since the agent would be one step away from victory, whereas the top state would yield only a medium value, since the agent still needs to prevent the opponent from winning.
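To make the enumeration concrete, here is a minimal sketch of listing the successor states of a tic-tac-toe position and scoring each one; the successor_states helper and the toy value_function heuristic are hypothetical, invented for illustration rather than taken from this book:

```python
# Hypothetical sketch: enumerate the successor states of a tic-tac-toe board
# and rank them with a toy value function. Cells hold "X", "O", or None.
def successor_states(board, player):
    """Yield every board reachable in one move by `player`."""
    for i, cell in enumerate(board):
        if cell is None:
            child = list(board)
            child[i] = player
            yield tuple(child)

def value_function(board):
    """Toy heuristic: count the lines that X could still complete."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
    return sum(all(board[i] != "O" for i in line) for line in lines)

board = ("X", "O", None,
         None, "X", None,
         "O", None, None)
for child in successor_states(board, "X"):
    print(child, "->", value_function(child))
```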
Let's review the terms we have covered so far:
| Term | Description | What does it output? |
| --- | --- | --- |
| Policy | The algorithm or function that outputs the decisions the agent makes | A single action (deterministic policy) or a vector of probabilities over possible actions (stochastic policy) |
| Value function | The function that describes how good or bad a given state is | A scalar value representing the expected cumulative reward |
| Model | An agent's representation of the environment, which predicts how the environment will react to the agent's actions | The probability of the next state given an action and the current state, or an enumeration of possible states given the rules of the environment |
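To see how these three pieces fit together, here is a minimal, hypothetical sketch of a single decision step; the policy, value, and model functions below, along with their states and actions, are invented for illustration and are not code from this book:

```python
# Hypothetical sketch tying the three components together for one decision step.
def policy(state):
    """Stochastic policy: a probability for each possible action (made up)."""
    return {"left": 0.5, "right": 0.5}

def value(state):
    """Value function: a scalar estimate of expected cumulative reward."""
    return {"s0": 0.0, "s_left": 1.0, "s_right": 2.0}[state]

def model(state, action):
    """Model: a distribution over next states for (state, action)."""
    return {"s_left": 1.0} if action == "left" else {"s_right": 1.0}

# Pick the action whose predicted next states look best under the value function.
action = max(policy("s0"),
             key=lambda a: sum(p * value(s) for s, p in model("s0", a).items()))
print(action)  # 'right'
```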
In the following sections, we will use these concepts to learn about one of the most fundamental frameworks in reinforcement learning: the Markov decision process.