PyTorch 1.x Reinforcement Learning Cookbook
Yuxi (Hayden) Liu
How it works...
We have just seen how effective it is to compute the value of a policy using policy evaluation. It is a simple, convergent, iterative approach in the dynamic programming family, or, to be more specific, approximate dynamic programming. It starts with random guesses for the values and then iteratively updates them according to the Bellman expectation equation until they converge.
In Step 5, the policy evaluation function does the following tasks (a minimal sketch in code follows the list):
- Initializes the state values to all zeros.
- Updates the values based on the Bellman expectation equation.
- Computes the maximal change of the values across all states.
- If the maximal change is greater than the threshold, it keeps updating the values. Otherwise, it terminates the evaluation process and returns the latest values.
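To make these steps concrete, here is a minimal sketch of such a function in PyTorch. It assumes a stochastic policy tensor of shape [n_state, n_action], a transition tensor trans_matrix of shape [n_state, n_action, n_state], and a per-state rewards vector; these names and shapes are illustrative rather than the recipe's exact signature.

```python
import torch

def policy_evaluation(policy, trans_matrix, rewards, gamma, threshold):
    """Iteratively estimate state values under a fixed policy
    using the Bellman expectation equation."""
    n_state = policy.shape[0]
    V = torch.zeros(n_state)                 # initialize all values to zero
    while True:
        V_new = torch.zeros(n_state)
        for state in range(n_state):
            for action, action_prob in enumerate(policy[state]):
                # expected immediate reward plus the discounted value
                # of the successor states under this action
                V_new[state] += action_prob * (
                    rewards[state]
                    + gamma * torch.dot(trans_matrix[state, action], V)
                )
        # maximal change of the values across all states
        max_delta = torch.max(torch.abs(V_new - V))
        V = V_new
        if max_delta <= threshold:           # converged: stop updating
            break
    return V
```

Each sweep writes into a fresh tensor, so every update within a sweep reads the values from the previous sweep, matching the synchronous update described in the list above.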
Since policy evaluation uses iterative approximation, its result might not be exactly the same as that of the matrix inversion method, which uses exact computation. In fact, we don't really need the value function to be that precise. The iterative approach also sidesteps the curse of dimensionality that makes the exact solution impractical, allowing the computation to scale up to millions or even billions of states. Therefore, we usually prefer policy evaluation over the matrix inversion method.
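For comparison, the exact values of a fixed policy satisfy the linear system V = r + γPV, where P is the policy-averaged transition matrix and r is the expected per-state reward, so V can be recovered by inverting (I - γP). Here is a minimal sketch, assuming P_pi and r_pi have already been computed from the policy (both names are illustrative):

```python
import torch

def exact_policy_values(P_pi, r_pi, gamma):
    """Solve V = r_pi + gamma * P_pi @ V in closed form."""
    n_state = P_pi.shape[0]
    A = torch.eye(n_state) - gamma * P_pi
    # matrix inversion costs on the order of n_state**3 operations,
    # which is why this route stops scaling as the state space grows
    return torch.matmul(torch.inverse(A), r_pi)
```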
One more thing to remember is that policy evaluation is used to predict how much return we will get from a given policy; it is not used for control problems.