How it works...

We have just seen how effective it is to compute the value of a policy using policy evaluation. It is a simple, convergent, iterative approach in the dynamic programming family, or, to be more specific, approximate dynamic programming. It starts with arbitrary initial guesses for the values and then iteratively updates them according to the Bellman expectation equation until they converge.
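In standard notation (not reproduced in this recipe), the update applied at each iteration k can be written as follows, where \pi(a \mid s) is the policy, P(s' \mid s, a) the transition probability, R the reward, and \gamma the discount factor:

V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma V_k(s') \big]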

In Step 5, the policy evaluation function performs the following tasks (a minimal sketch of such a function follows the list):

  • Initializes the policy values as all zeros.
  • Updates the values based on the Bellman expectation equation.
  • Computes the maximal change of the values across all states.
  • If the maximal change is greater than the threshold, it keeps updating the values. Otherwise, it terminates the evaluation process and returns the latest values.
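The following is a minimal sketch of a policy evaluation function implementing these four tasks. It is not the exact code from the recipe; it assumes that policy is an [n_state, n_action] matrix of action probabilities, trans_matrix is an [n_state, n_action, n_state] tensor of transition probabilities, and rewards is an [n_state] vector indexed by the next state:

```python
import torch

def policy_evaluation(policy, trans_matrix, rewards, gamma, threshold):
    """Iteratively evaluate a policy using the Bellman expectation equation.

    Assumed inputs (for illustration only):
      policy:       [n_state, n_action] action probabilities
      trans_matrix: [n_state, n_action, n_state] transition probabilities
      rewards:      [n_state] reward received on entering each state
      gamma:        discount factor
      threshold:    stop once the maximal value change falls below this
    """
    n_state = policy.shape[0]
    V = torch.zeros(n_state)                  # initialize all values to zero
    while True:
        V_new = torch.zeros(n_state)
        for state in range(n_state):
            for action, action_prob in enumerate(policy[state]):
                # Bellman expectation update for one (state, action) pair
                V_new[state] += action_prob * torch.sum(
                    trans_matrix[state, action] * (rewards + gamma * V))
        # maximal change of the values across all states
        max_delta = torch.max(torch.abs(V_new - V))
        V = V_new
        if max_delta <= threshold:            # converged: stop and return
            break
    return V
```

The stopping criterion mirrors the last bullet: as long as the maximal change exceeds the threshold, the loop keeps sweeping over all states; once it drops below the threshold, the latest values are returned.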

Since policy evaluation uses iterative approximation, its result might not be exactly the same as that of the matrix inversion method, which computes the solution exactly. In practice, we don't really need the value function to be that precise. Moreover, the iterative approach sidesteps the curse of dimensionality that makes exact computation intractable once the problem scales up to millions of states. Therefore, we usually prefer policy evaluation over matrix inversion.
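For comparison, the matrix inversion method solves the Bellman expectation equation in closed form, V = (I - gamma * P_pi)^(-1) R_pi, where P_pi and R_pi are the transition matrix and reward vector induced by the policy. The following is a small illustrative example with made-up numbers, not taken from the recipe:

```python
import torch

gamma = 0.5
# Hypothetical 2-state transition matrix and reward vector under the policy
P_pi = torch.tensor([[0.7, 0.3],
                     [0.4, 0.6]])
R_pi = torch.tensor([1.0, 2.0])

# Exact policy evaluation by matrix inversion: V = (I - gamma * P_pi)^(-1) R_pi
V_exact = torch.inverse(torch.eye(2) - gamma * P_pi) @ R_pi
print(V_exact)
```

Inverting an n x n matrix costs roughly O(n^3), which is why this exact approach stops being practical for large state spaces, while the iterative sweeps of policy evaluation remain feasible.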

One more thing to remember is that policy evaluation is used to predict how much reward we can expect from a given policy; it is not used for control problems.