- PyTorch 1.x Reinforcement Learning Cookbook
- Yuxi (Hayden) Liu
How it works...
In this oversimplified study-sleep-game process, the optimal policy, that is, the policy that achieves the highest total reward, is to choose action a0 in every step. However, it won't be that straightforward in most cases. Also, the actions taken in individual steps won't necessarily be the same; they usually depend on the state. So, in real-world cases, we solve an MDP by finding the optimal policy.
The value function of a policy measures how good it is for an agent to be in each state, given the policy being followed. The greater the value, the better the state.
In Step 4, we calculated the value, V, of the optimal policy using matrix inversion. According to the Bellman equation, the relationship between the value at step t+1 and the value at step t can be expressed as follows:

$$V_{t+1} = R + \gamma P V_t$$

Here, R is the reward vector (one entry per state), P is the state-transition matrix under the policy being followed, and γ is the discount factor.
When the value converges, which means V_{t+1} = V_t = V, we can derive the value, V, as follows:

$$V = R + \gamma P V \;\Longrightarrow\; (I - \gamma P)V = R \;\Longrightarrow\; V = (I - \gamma P)^{-1} R$$
Here, I is the identity matrix with 1s on the main diagonal.
One advantage of solving an MDP with matrix inversion is that you always get an exact answer. The downside is scalability: since we need to invert an m × m matrix (where m is the number of possible states), the computation becomes costly when the number of states is large, as the inversion takes on the order of m³ operations.
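As a concrete illustration, here is a minimal PyTorch sketch of this closed-form policy evaluation. The transition matrix, reward vector, and discount factor below are made-up illustrative values for a 3-state MDP, not the numbers used in the recipe:

```python
import torch

# A minimal sketch of closed-form policy evaluation: V = (I - gamma * P)^(-1) R.
# P, R, and gamma are illustrative values, not the ones from the recipe.

gamma = 0.5                                 # discount factor (assumed)
P = torch.tensor([[0.8, 0.1, 0.1],          # P[i, j]: probability of moving
                  [0.1, 0.6, 0.3],          # from state i to state j under
                  [0.7, 0.2, 0.1]])         # the policy being evaluated
R = torch.tensor([[1.0], [0.0], [-1.0]])    # immediate reward for each state

def evaluate_policy(gamma, P, R):
    """Solve the Bellman equation exactly: V = (I - gamma * P)^(-1) R."""
    m = R.shape[0]                          # number of states
    inverse = torch.inverse(torch.eye(m) - gamma * P)
    return torch.mm(inverse, R)             # m x 1 column of state values

V = evaluate_policy(gamma, P, R)
print(V)  # one value per state; the higher the value, the better the state
```

The single call to torch.inverse is what makes this approach exact but also what limits it: for an MDP with many states, iterative methods become preferable to inverting the full m × m matrix.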