
Performing policy evaluation

We have just developed an MDP and computed the value function of the optimal policy using matrix inversion. We also mentioned the limitation of inverting an m × m matrix when m is large (say, 1,000, 10,000, or 100,000). In this recipe, we will talk about a simpler approach called policy evaluation.

Policy evaluation is an iterative algorithm. It starts with arbitrary values for all states and then repeatedly updates them based on the Bellman expectation equation until they converge. In each iteration, the value of a state, s, under a policy, π, is updated as follows:

$$V^{\pi}(s) := \sum_{a} \pi(s, a) \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s') V^{\pi}(s') \right]$$

Here, π(s, a) denotes the probability of taking action a in state s under policy π, T(s, a, s') is the transition probability from state s to state s' by taking action a, R(s, a) is the reward received in state s by taking action a, and γ is the discount factor.
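As a quick illustration, one sweep of this update over all states can be written with tensor operations, as in the following sketch. The tensor names (policy, trans_matrix, rewards) and the randomly generated MDP are placeholders for illustration only, not the variables used later in the recipe:

```python
import torch

# Hypothetical MDP tensors for this sketch:
# policy[s, a] = π(s, a), trans_matrix[s, a, s'] = T(s, a, s'), rewards[s, a] = R(s, a)
n_state, n_action, gamma = 3, 2, 0.9
policy = torch.full((n_state, n_action), 1.0 / n_action)
trans_matrix = torch.softmax(torch.rand(n_state, n_action, n_state), dim=2)  # rows sum to 1
rewards = torch.rand(n_state, n_action)
V = torch.zeros(n_state)                                   # arbitrary initial values

# One full sweep of the Bellman expectation update over all states
expected_next = torch.einsum('sat,t->sa', trans_matrix, V)   # Σ_s' T(s, a, s') * V(s')
V = (policy * (rewards + gamma * expected_next)).sum(dim=1)  # Σ_a π(s, a) * [R(s, a) + γ ...]
```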

There are two ways to terminate the iterative updating process. One is to set a fixed number of iterations, such as 1,000 or 10,000, which can be difficult to choose well. The other is to specify a threshold (usually 0.0001, 0.00001, or something similar) and terminate the process once the values of all states change by less than the specified threshold.
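Putting the update rule and the threshold-based stopping rule together, a minimal policy evaluation loop could look like the following sketch. The function name and tensor layout (trans_matrix indexed as [s, a, s'], rewards as [s, a]) are assumptions for this sketch and may differ from the recipe's own code:

```python
import torch

def policy_evaluation(policy, trans_matrix, rewards, gamma, threshold):
    """Repeat the Bellman expectation update until the largest change
    across all state values drops below `threshold`.

    policy[s, a]            -- π(s, a)
    trans_matrix[s, a, s']  -- T(s, a, s')
    rewards[s, a]           -- R(s, a)
    """
    n_state = policy.shape[0]
    V = torch.zeros(n_state)                # arbitrary initial values
    while True:
        V_new = torch.zeros(n_state)
        for s in range(n_state):
            for a, action_prob in enumerate(policy[s]):
                V_new[s] += action_prob * (
                    rewards[s, a] + gamma * torch.dot(trans_matrix[s, a], V)
                )
        max_delta = torch.max(torch.abs(V_new - V))
        V = V_new
        if max_delta <= threshold:          # every state changed by less than the threshold
            break
    return V
```

Calling this function with, for example, a uniform random policy and a threshold of 0.0001 stops the iteration exactly as described above: as soon as no state value moves by more than the threshold between two sweeps.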

In the next section, we will perform policy evaluation on the study-sleep-game process under the optimal policy and a random policy.
