
How it works...

We have just seen how effective it is to compute the value of a policy using policy evaluation. It is a simple, convergent iterative approach from the dynamic programming family, or more specifically, approximate dynamic programming. It starts with random guesses for the values and then iteratively updates them according to the Bellman expectation equation until they converge.
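For reference, the per-state update applies the Bellman expectation equation under the fixed policy π, written here in standard MDP notation (the recipe itself does not spell these symbols out):

$$V(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V(s')\big]$$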

In Step 5, the policy evaluation function does the following tasks (see the sketch after this list):

  • Initializes the policy values as all zeros.
  • Updates the values based on the Bellman expectation equation.
  • Computes the maximal change of the values across all states.
  • If the maximal change is greater than the threshold, it keeps updating the values. Otherwise, it terminates the evaluation process and returns the latest values.
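The loop below is a minimal sketch of such an evaluation routine, assuming the policy, transition probabilities, and rewards are stored as PyTorch tensors; the function name, argument names, and tensor layout are illustrative and not necessarily those used in the recipe's Step 5:

```python
import torch

def evaluate_policy(policy, trans_matrix, rewards, gamma=0.99, threshold=1e-4):
    """Iteratively approximate the state values of a fixed policy.

    policy:       [n_state, n_action] action probabilities under the policy
    trans_matrix: [n_state, n_action, n_state] transition probabilities
    rewards:      [n_state] reward for landing in each state (a simplifying
                  assumption for this sketch)
    """
    n_state, n_action = policy.shape
    V = torch.zeros(n_state)                          # all-zero initial values
    while True:
        V_new = torch.zeros(n_state)
        for s in range(n_state):
            for a in range(n_action):
                # Bellman expectation update: expected reward plus the
                # discounted value of the successor states
                V_new[s] += policy[s, a] * torch.dot(
                    trans_matrix[s, a], rewards + gamma * V)
        max_delta = torch.max(torch.abs(V_new - V))   # maximal change over states
        V = V_new
        if max_delta <= threshold:                    # converged: stop sweeping
            return V
```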

Since policy evaluation uses iterative approximation, its result might not be exactly the same as that of the matrix inversion method, which computes the values exactly. In fact, we don't really need the value function to be that precise. Policy evaluation also sidesteps the curse of dimensionality: the exact method becomes impractical as the state space grows, whereas iterative evaluation lets the computation scale up to millions of states. Therefore, we usually prefer policy evaluation over matrix inversion.
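For comparison, the exact solution mentioned above can be obtained by solving the linear system V = P^π(R + γV) directly. The sketch below keeps the same illustrative tensor layout as before; the matrix inversion it performs is exactly the step that becomes prohibitive for large state spaces:

```python
import torch

def evaluate_policy_exact(policy, trans_matrix, rewards, gamma=0.99):
    """Solve V = P_pi (R + gamma * V) in closed form, for comparison only."""
    n_state = policy.shape[0]
    # Transition matrix induced by the policy: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    P_pi = torch.einsum('sa,sat->st', policy, trans_matrix)
    # (I - gamma * P_pi) V = P_pi R; the cubic-cost inversion below is what
    # makes the exact method impractical for very large state spaces
    return torch.inverse(torch.eye(n_state) - gamma * P_pi) @ P_pi @ rewards
```

On small problems, the values returned by the two functions should agree to within roughly the evaluation threshold.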

One more thing to remember is that policy evaluation is used to predict how good a return we will get from a given policy; it is not used to solve control problems.
