
Performing policy evaluation

We have just developed an MDP and computed the value function of the optimal policy using matrix inversion. We also mentioned the limitation of that approach: inverting an m * m matrix becomes expensive when m is large (say, 1,000, 10,000, or 100,000). In this recipe, we will talk about a simpler approach called policy evaluation.

Policy evaluation is an iterative algorithm. It starts with arbitrary values for all states and then repeatedly updates them based on the Bellman expectation equation until they converge. In each iteration, the value of a state, s, under a policy, π, is updated as follows:

V_{k+1}(s) = \sum_{a} \pi(s, a) \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s') V_k(s') \right]

Here, π(s, a) denotes the probability of taking action a in state s under policy π, T(s, a, s') is the transition probability of moving from state s to state s' by taking action a, R(s, a) is the reward received in state s by taking action a, and γ is the discount factor.
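To make the update concrete, here is a minimal sketch of one sweep of this update in PyTorch. It is not the recipe's final implementation; the names policy, trans_matrix, rewards, gamma, and V are assumptions for illustration, where policy[s, a] holds π(s, a), trans_matrix[s, a] holds the row T(s, a, ·), and rewards[s, a] holds R(s, a):

```python
import torch

def bellman_expectation_sweep(policy, trans_matrix, rewards, gamma, V):
    """Apply one sweep of the Bellman expectation update to every state.

    policy: [n_states, n_actions] tensor, policy[s, a] = pi(s, a)
    trans_matrix: [n_states, n_actions, n_states] tensor, trans_matrix[s, a, s'] = T(s, a, s')
    rewards: [n_states, n_actions] tensor, rewards[s, a] = R(s, a)
    gamma: discount factor
    V: current value estimate, one entry per state
    """
    n_states, n_actions = policy.shape
    V_new = torch.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            # pi(s, a) * (R(s, a) + gamma * sum over s' of T(s, a, s') * V(s'))
            V_new[s] += policy[s, a] * (
                rewards[s, a] + gamma * torch.dot(trans_matrix[s, a], V)
            )
    return V_new
```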

There are two ways to terminate the iterative updating process. One is to set a fixed number of iterations, such as 1,000 or 10,000, although the right number can be hard to choose. The other is to specify a threshold (usually 0.0001, 0.00001, or something similar) and terminate the process once the values of all states change by less than the specified threshold.
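The threshold-based stopping rule could be wrapped around the sweep above as follows; again, this is only a sketch under the same assumed names, not the implementation used later in this recipe:

```python
def policy_evaluation(policy, trans_matrix, rewards, gamma, threshold=0.0001):
    """Repeat the Bellman expectation sweep until the values converge."""
    V = torch.zeros(policy.shape[0])
    while True:
        V_new = bellman_expectation_sweep(policy, trans_matrix, rewards, gamma, V)
        # terminate once the largest change across all states falls below the threshold
        if torch.max(torch.abs(V_new - V)) < threshold:
            return V_new
        V = V_new
```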

In the next section, we will perform policy evaluation on the study-sleep-game process under the optimal policy and a random policy.
