- PyTorch 1.x Reinforcement Learning Cookbook
- Yuxi (Hayden) Liu
Performing policy evaluation
We have just developed an MDP and computed the value function of the optimal policy using matrix inversion. We also mentioned the limitation of this approach: inverting an m × m matrix becomes expensive when m is large (say, 1,000, 10,000, or 100,000). In this recipe, we will talk about a simpler approach called policy evaluation.
Policy evaluation is an iterative algorithm. It starts with arbitrary initial values for the states and then iteratively updates them based on the Bellman expectation equation until they converge. In each iteration, the value of a state, s, under a policy, π, is updated as follows:
$$V_{\pi}(s) \leftarrow \sum_{a} \pi(s, a)\left[R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V_{\pi}(s')\right]$$
Here, π(s, a) denotes the probability of taking action a in state s under policy π, T(s, a, s') is the transition probability from state s to state s' by taking action a, R(s, a) is the reward received in state s by taking action a, and γ is the discount factor.
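To make the update concrete, here is a minimal sketch of one sweep of this update in PyTorch. The function name bellman_update and the tensor layout (trans_matrix of shape [n_state, n_action, n_state], rewards of shape [n_state, n_action], and policy of shape [n_state, n_action]) are assumptions for illustration, not the book's exact interface:

import torch

def bellman_update(V, trans_matrix, rewards, policy, gamma):
    # One sweep of the Bellman expectation update over all states.
    # Assumed shapes: V [n_state], trans_matrix [n_state, n_action, n_state],
    # rewards [n_state, n_action], policy [n_state, n_action].
    V_new = torch.zeros_like(V)
    n_state, n_action = policy.shape
    for s in range(n_state):
        for a in range(n_action):
            # Expected return of taking action a in state s: immediate reward
            # plus the discounted value of the successor states.
            q_sa = rewards[s, a] + gamma * torch.dot(trans_matrix[s, a], V)
            # Weight by the probability of choosing action a under policy pi.
            V_new[s] += policy[s, a] * q_sa
    return V_new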
There are two ways to terminate the iterative updating process. One is to set a fixed number of iterations, such as 1,000 or 10,000, which can be difficult to tune. The other is to specify a threshold (usually 0.0001, 0.00001, or something similar) and terminate the process once the value of every state changes by less than the specified threshold between consecutive iterations.
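The threshold-based stopping rule can be sketched as follows; policy_evaluation is a hypothetical wrapper around the bellman_update sketch above that returns the converged values once no state changes by more than the threshold between sweeps. A fixed-iteration variant would simply replace the convergence check with a loop over the chosen number of sweeps:

import torch

def policy_evaluation(policy, trans_matrix, rewards, gamma, threshold):
    # Repeat the Bellman expectation sweep until the largest per-state
    # change falls below the threshold (for example, 0.0001).
    V = torch.zeros(policy.shape[0])
    while True:
        V_new = bellman_update(V, trans_matrix, rewards, policy, gamma)
        if torch.max(torch.abs(V_new - V)) < threshold:
            return V_new
        V = V_new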
In the next section, we will perform policy evaluation on the study-sleep-game process under the optimal policy and a random policy.