
How it works...

In this oversimplified study-sleep-game process, the optimal policy, that is, the policy that achieves the highest total reward, is to choose action a0 in every step. In most cases, however, it won't be that straightforward, and the actions taken in individual steps won't necessarily be the same; they usually depend on the current state. So, in real-world cases, we have to solve an MDP by finding the optimal policy.

The value function of a policy measures how good it is for an agent to be in each state, given the policy being followed. The greater the value, the better the state.
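Concretely, the value of a state under a policy π can be written as the expected discounted return obtained by starting from that state and following π thereafter (this is the standard definition; here γ denotes the discount factor):

V(s) = E[ R1 + γ * R2 + γ^2 * R3 + ... | S0 = s, following π ]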

In Step 4, we calculated the value, V, of the optimal policy using matrix inversion. Writing R for the reward vector under the policy, T for its transition matrix, and γ for the discount factor, the Bellman equation expresses the relationship between the value at step t+1 and the value at step t as follows:

Vt+1 = R + γ * T * Vt

When the value converges, which means Vt+1 = Vt, the equation becomes V = R + γ * T * V, so (I - γ * T) * V = R, and we can derive the value, V, as follows:

V = (I - γ * T)^(-1) * R

Here, I is the identity matrix with 1s on the main diagonal.
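To make this concrete, here is a minimal PyTorch sketch of the closed-form computation. The 3 x 3 transition matrix, reward vector, and discount factor below are made-up placeholders for illustration, not the exact values used earlier in this recipe:

```python
import torch

# Hypothetical policy inputs: a 3-state transition matrix T, a reward
# vector R, and a discount factor gamma (placeholder values only).
gamma = 0.5
T = torch.tensor([[0.8, 0.1, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.7, 0.2, 0.1]])
R = torch.tensor([1.0, 0.0, -1.0])

def value_by_matrix_inversion(gamma, trans_matrix, rewards):
    """Compute V = (I - gamma * T)^(-1) * R for the given policy."""
    m = rewards.shape[0]
    # Invert the m x m matrix (I - gamma * T), then multiply by R.
    inverse = torch.inverse(torch.eye(m) - gamma * trans_matrix)
    return torch.matmul(inverse, rewards.reshape(-1, 1))

V = value_by_matrix_inversion(gamma, T, R)
print(V)  # one value per state under the policy
```

The inversion of the m x m matrix is what makes this approach exact but expensive for large state spaces, as discussed next.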

One advantage of solving an MDP with matrix inversion is that you always get an exact answer. The downside is its poor scalability: since we need to compute the inverse of an m * m matrix (where m is the number of possible states), the computation becomes costly when the number of states is large.
