- PyTorch 1.x Reinforcement Learning Cookbook
- Yuxi (Hayden) Liu
How it works...
In this oversimplified study-sleep-game process, the optimal policy, that is, the policy that achieves the highest total reward, is to choose action a0 in all steps. However, most cases won't be that straightforward. Also, the actions taken in individual steps won't necessarily be the same; they usually depend on the state. So, in real-world cases, we will have to solve an MDP by finding the optimal policy.
The value function of a policy measures how good it is for an agent to be in each state, given the policy being followed. The greater the value, the better the state.
In Step 4, we calculated the value, V, of the optimal policy using matrix inversion. According to the Bellman Equation, the relationship between the value at step t+1 and that at step t can be expressed as follows:

$$V_{t+1} = R + \gamma T V_t$$

Here, R is the reward vector with one entry per state, T is the transition matrix induced by the policy, and γ is the discount factor.
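As a quick illustration of this recursion, here is a minimal PyTorch sketch that repeatedly applies the update until the values stop changing; the three-state transition matrix, reward vector, and discount factor are made-up numbers for illustration, not the ones used in the recipe.

```python
import torch

# Hypothetical 3-state MDP under a fixed policy (made-up numbers, not from the recipe).
T = torch.tensor([[0.8, 0.1, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.1, 0.2, 0.7]])   # transition matrix induced by the policy
R = torch.tensor([1.0, 0.0, -1.0])    # reward vector, one entry per state
gamma = 0.5                           # discount factor
threshold = 1e-10                     # stop when values change less than this

# Repeatedly apply V_{t+1} = R + gamma * T * V_t until the values converge.
V = torch.zeros(3)
while True:
    V_new = R + gamma * torch.mv(T, V)
    if torch.max(torch.abs(V_new - V)) < threshold:
        break
    V = V_new

print(V)  # converged state values under the fixed policy
```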
When the value converges, which means Vt+1 = Vt = V, we can derive the value, V, as follows:

$$V = R + \gamma T V \quad\Rightarrow\quad (I - \gamma T)V = R \quad\Rightarrow\quad V = (I - \gamma T)^{-1} R$$

Here, I is the identity matrix with 1s on the main diagonal.
One advantage of solving an MDP with matrix inversion is that you always get an exact answer. The downside is scalability: since we need to invert an m × m matrix (where m is the number of possible states), the computation becomes costly when the number of states is large.
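To make the matrix-inversion approach concrete, here is a minimal sketch in PyTorch; the helper name cal_value_matrix_inversion and the example MDP numbers are hypothetical, not taken from the recipe itself.

```python
import torch

def cal_value_matrix_inversion(gamma, trans_matrix, rewards):
    """Hypothetical helper: solve V = (I - gamma * T)^(-1) R for a fixed policy."""
    m = rewards.shape[0]                                  # number of states
    inverse = torch.inverse(torch.eye(m) - gamma * trans_matrix)
    return torch.mv(inverse, rewards)

# Same made-up 3-state MDP as in the iterative sketch above.
T = torch.tensor([[0.8, 0.1, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.1, 0.2, 0.7]])
R = torch.tensor([1.0, 0.0, -1.0])

V = cal_value_matrix_inversion(gamma=0.5, trans_matrix=T, rewards=R)
print(V)  # matches the iterative result, but obtained in one exact solve
```

On recent PyTorch versions, torch.linalg.solve(torch.eye(m) - gamma * trans_matrix, rewards) gives the same result without forming the explicit inverse; either way, the cost grows roughly cubically with the number of states m, which is the scalability concern mentioned above.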