- PyTorch 1.x Reinforcement Learning Cookbook
- Yuxi (Hayden) Liu
There's more...
Let's experiment with different values for the discount factor, starting with 0, which means we only care about the immediate reward:
>>> gamma = 0
>>> V = cal_value_matrix_inversion(gamma, trans_matrix, R)
>>> print("The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([[ 1.],
        [ 0.],
        [-1.]])
This is consistent with the reward function as we only look at the reward received in the next move.
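The function cal_value_matrix_inversion, along with trans_matrix and R, comes from the main section of this recipe. For reference, a minimal sketch of how it can be implemented is shown below (the shapes are assumptions: trans_matrix is an n_state x n_state row-stochastic tensor and rewards is an n_state x 1 tensor). It solves the Bellman equation V = R + γPV in closed form as V = (I - γP)^(-1)R:
>>> import torch
>>> def cal_value_matrix_inversion(gamma, trans_matrix, rewards):
...     # Rearranging V = R + gamma * P @ V gives (I - gamma * P) @ V = R,
...     # so V can be obtained directly by matrix inversion.
...     n_state = rewards.shape[0]
...     inverse = torch.inverse(torch.eye(n_state) - gamma * trans_matrix)
...     return torch.mm(inverse, rewards.reshape(-1, 1))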
As the discount factor increases toward 1, future rewards are considered. Let's take a look at γ = 0.99:
>>> gamma = 0.99
>>> V = cal_value_matrix_inversion(gamma, trans_matrix, R)
>>> print("The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([[65.8293],
        [64.7194],
        [63.4876]])
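With γ = 0.99, rewards are effectively accumulated over roughly 1/(1 - γ) = 100 future steps, which is why all three values are large and close together: the long-run behavior dominates, and the states differ mainly in the rewards collected during the first few moves. As a sanity check, the closed-form solution can be cross-validated with iterative policy evaluation, which repeatedly applies the Bellman update until the values stop changing (a sketch under the same shape assumptions; the helper name and convergence threshold are illustrative, not from the recipe):
>>> def cal_value_iterative(gamma, trans_matrix, rewards, threshold=1e-6):
...     # Repeatedly apply V <- R + gamma * P @ V until the largest
...     # per-state change falls below the threshold.
...     R = rewards.reshape(-1, 1)
...     V = torch.zeros_like(R)
...     while True:
...         V_new = R + gamma * torch.mm(trans_matrix, V)
...         if torch.max(torch.abs(V_new - V)) < threshold:
...             return V_new
...         V = V_new
>>> V_iter = cal_value_iterative(gamma, trans_matrix, R)
V_iter should match the closed-form V above up to the convergence threshold, since for γ < 1 the Bellman update is a contraction with a unique fixed point.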