
Performing policy evaluation

We have just developed an MDP and computed the value function of the optimal policy using matrix inversion. We also mentioned the limitation of inverting an m × m matrix when m is large (say, 1,000, 10,000, or 100,000). In this recipe, we will talk about a simpler approach called policy evaluation.

Policy evaluation is an iterative algorithm. It starts with arbitrary state values and then iteratively updates them based on the Bellman expectation equation until they converge. In each iteration, the value of a state, s, under a policy, π, is updated as follows:

V_π(s) ← Σ_a π(s, a) Σ_{s'} T(s, a, s') [R(s, a) + γ V_π(s')]

Here, π(s, a) denotes the probability of taking action a in state s under policy π, T(s, a, s') is the transition probability from state s to state s' by taking action a, R(s, a) is the reward received in state s by taking action a, and γ is the discount factor.
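To make the update rule concrete, here is a minimal sketch of a single sweep of the Bellman expectation update, assuming the policy, transition probabilities, and rewards are available as PyTorch tensors. The function name, tensor shapes, and variable names are illustrative assumptions, not code from this recipe:

import torch

def bellman_expectation_sweep(V, policy, T, R, gamma):
    """One sweep of the Bellman expectation update over all states.
    V:      [n_state] current value estimates
    policy: [n_state, n_action] probability of each action in each state
    T:      [n_state, n_action, n_state] transition probabilities
    R:      [n_state, n_action] immediate rewards
    gamma:  discount factor
    """
    n_state, n_action = policy.shape
    V_new = torch.zeros(n_state)
    for s in range(n_state):
        for a in range(n_action):
            # Expected value of successor states, weighted by T(s, a, s')
            expected_next = torch.dot(T[s, a], V)
            # Weight by the policy's probability of taking action a in state s
            V_new[s] += policy[s, a] * (R[s, a] + gamma * expected_next)
    return V_new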

There are two ways to terminate the iterative updating process. One is to set a fixed number of iterations, such as 1,000 or 10,000, which can be difficult to tune. The other is to specify a threshold (typically 0.0001, 0.00001, or something similar) and terminate the process once the change in every state's value is smaller than that threshold.
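As a sketch of the threshold-based termination, the loop below reuses the bellman_expectation_sweep function sketched above and stops once the largest per-state change falls below the threshold. Again, the names and the default threshold value are assumptions for illustration:

def policy_evaluation(policy, T, R, gamma, threshold=1e-4):
    """Iteratively apply the Bellman expectation update until the
    largest change in any state's value is below the threshold."""
    n_state = policy.shape[0]
    V = torch.zeros(n_state)  # start from arbitrary (here, zero) state values
    while True:
        V_new = bellman_expectation_sweep(V, policy, T, R, gamma)
        # Largest absolute change across all states in this iteration
        max_delta = torch.max(torch.abs(V_new - V))
        V = V_new
        if max_delta < threshold:
            break
    return V

With the fixed-iteration alternative, the while loop would simply be replaced by a for loop over the chosen number of sweeps, at the cost of either wasting iterations after convergence or stopping too early.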

In the next section, we will perform policy evaluation on the study-sleep-game process under the optimal policy and a random policy.
