
There's more...

We decide to experiment with different values for the discount factor. Let's start with 0, which means we only care about the immediate reward:

>>> gamma = 0
>>> V = cal_value_matrix_inversion(gamma, trans_matrix, R)
>>> print("The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([[ 1.],
        [ 0.],
        [-1.]])

This is consistent with the reward function, since we only look at the reward received in the next move.
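This behavior follows from the matrix-inversion form of the Bellman equation, V = (I - γP)^(-1)R: when γ = 0, the term inside the inverse reduces to the identity matrix, so the value function is simply the reward vector. Here is a minimal sketch of that closed-form computation; the function name value_matrix_inversion_sketch is hypothetical, and trans_matrix and rewards are assumed to be the transition matrix and reward vector defined earlier in this recipe:

>>> import torch
>>> # Closed-form policy evaluation (sketch): V = (I - gamma * P)^(-1) * R
>>> # With gamma = 0, (I - gamma * P) is the identity, so V equals R
>>> def value_matrix_inversion_sketch(gamma, trans_matrix, rewards):
...     identity = torch.eye(trans_matrix.shape[0])
...     return torch.mm(torch.inverse(identity - gamma * trans_matrix),
...                     rewards.view(-1, 1))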

As the discount factor increases toward 1, future rewards are taken into account. Let's take a look at γ = 0.99:

>>> gamma = 0.99
>>> V = cal_value_matrix_inversion(gamma, trans_matrix, R)
>>> print("The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([[65.8293],
        [64.7194],
        [63.4876]])
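To get a feel for how the value function grows as the discount factor approaches 1, we can sweep over a few values of gamma and reuse the same function. This is a minimal sketch, assuming cal_value_matrix_inversion, trans_matrix, and R are as defined earlier in this recipe:

>>> # Compare value functions for several discount factors
>>> for gamma in [0, 0.5, 0.9, 0.99]:
...     V = cal_value_matrix_inversion(gamma, trans_matrix, R)
...     print("gamma = {}:\n{}".format(gamma, V))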