PyTorch 1.x Reinforcement Learning Cookbook
Yuxi (Hayden) Liu
How to do it...
Creating an MDP can be done via the following steps:
- Import PyTorch and define the transition matrix:
>>> import torch
>>> T = torch.tensor([[[0.8, 0.1, 0.1],
...                    [0.1, 0.6, 0.3]],
...                   [[0.7, 0.2, 0.1],
...                    [0.1, 0.8, 0.1]],
...                   [[0.6, 0.2, 0.2],
...                    [0.1, 0.4, 0.5]]]
...                  )
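Here, T[s, a, s'] is the probability of moving from state s to state s' when taking action a. As a quick sanity check (our sketch, not part of the original recipe), each T[s, a, :] slice is a probability distribution over next states, so every row should sum to 1:
>>> # each T[state, action, :] row should sum to 1 (sanity check, not in the book)
>>> torch.allclose(T.sum(dim=-1), torch.ones(3, 2))
True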
- Define the reward function and the discount factor:
>>> R = torch.tensor([1., 0, -1.])
>>> gamma = 0.5
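For intuition (an illustrative sketch, not from the recipe), the discount factor scales a reward received k steps in the future by gamma ** k, so with gamma = 0.5 a reward two steps ahead counts only a quarter as much:
>>> # discounted return of a hypothetical reward sequence (illustration only)
>>> future_rewards = [1., 1., 1.]
>>> sum(gamma ** k * r for k, r in enumerate(future_rewards))
1.75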
- The optimal policy in this case is selecting action a0 in all circumstances:
>>> action = 0
- We calculate the value, V, of the optimal policy using the matrix inversion method in the following function:
>>> def cal_value_matrix_inversion(gamma, trans_matrix, rewards):
...     inv = torch.inverse(torch.eye(rewards.shape[0])
...                         - gamma * trans_matrix)
...     V = torch.mm(inv, rewards.reshape(-1, 1))
...     return V
We will demonstrate how to derive the value in the next section.
- We feed all variables we have to the function, including the transition probabilities associated with action a0:
>>> trans_matrix = T[:, action]
>>> V = cal_value_matrix_inversion(gamma, trans_matrix, R)
>>> print("The value function under the optimal
policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([[ 1.6787],
[ 0.6260],
[-0.4820]])
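As an optional check (our sketch, not part of the recipe), the computed values should satisfy the Bellman expectation equation for this policy, V = R + gamma * P * V with P being the transition matrix under action a0, which is exactly the linear system the matrix inversion method solves:
>>> # verify V = R + gamma * P * V under action a0 (sanity check, not in the book)
>>> torch.allclose(V, R.reshape(-1, 1) + gamma * torch.mm(trans_matrix, V))
True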