
There's more...

Let's experiment with different values of the discount factor. We start with 0, which means we only care about the immediate reward:

>>> gamma = 0
>>> V = cal_value_matrix_inversion(gamma, trans_matrix, R)
>>> print("The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([[ 1.],
        [ 0.],
        [-1.]])

This is consistent with the reward function, since with a discount factor of 0 we only look at the reward received in the next move.
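
For reference, here is a minimal sketch of the matrix-inversion approach behind cal_value_matrix_inversion (the body shown is an assumption and may differ slightly from the version defined earlier in this recipe; it assumes trans_matrix is the 3 x 3 transition matrix and R is a 3 x 1 reward tensor). It solves the Bellman equation V = R + gamma * P * V in closed form as V = (I - gamma * P)^(-1) * R, so with gamma = 0 it reduces to V = R, which is exactly the output above:

import torch

def cal_value_matrix_inversion(gamma, trans_matrix, rewards):
    # Closed-form solution of the Bellman equation V = R + gamma * P * V,
    # obtained by rearranging it to V = (I - gamma * P)^(-1) * R
    inv = torch.inverse(torch.eye(rewards.shape[0]) - gamma * trans_matrix)
    return torch.mm(inv, rewards)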

As the discount factor increases toward 1, future rewards are taken into account in addition to the immediate reward. Let's take a look at γ = 0.99:

>>> gamma = 0.99
>>> V = cal_value_matrix_inversion(gamma, trans_matrix, R)
>>> print("The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([[65.8293],
        [64.7194],
        [63.4876]])
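
Whatever discount factor we choose, the resulting values should satisfy the Bellman expectation equation V = R + gamma * P * V, since that is exactly the linear system the matrix inversion solves. A quick sanity check (assuming trans_matrix and R are the transition matrix and 3 x 1 reward tensor from the main recipe) should therefore return True:

>>> torch.allclose(V, R + gamma * torch.mm(trans_matrix, V), atol=1e-4)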