
How to do it...

Creating an MDP can be done via the following steps:

  1. Import PyTorch and define the transition matrix:
>>> import torch
>>> T = torch.tensor([[[0.8, 0.1, 0.1],
...                    [0.1, 0.6, 0.3]],
...                   [[0.7, 0.2, 0.1],
...                    [0.1, 0.8, 0.1]],
...                   [[0.6, 0.2, 0.2],
...                    [0.1, 0.4, 0.5]]])
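
Here, T is a 3 x 2 x 3 tensor: given how it is indexed in step 5 (T[:, action]), the first dimension is the current state, the second is the action (a0 or a1), and the third is the next state, so T[s, a] holds a probability distribution over next states. As an optional sanity check (a minimal sketch, not part of the original recipe), we can confirm the shape and that each distribution sums to 1:
>>> # Each (state, action) pair should define a valid distribution over next states
>>> print(T.shape)
torch.Size([3, 2, 3])
>>> print(T.sum(dim=-1))  # expect a 3 x 2 tensor of (approximately) ones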
  2. Define the reward function and the discount factor:
>>> R = torch.tensor([1., 0, -1.])
>>> gamma = 0.5
  3. The optimal policy in this case is selecting action a0 in all circumstances:
>>> action = 0
  4. We calculate the value, V, of the optimal policy using the matrix inversion method in the following function:
>>> def cal_value_matrix_inversion(gamma, trans_matrix, rewards):
...     inv = torch.inverse(torch.eye(rewards.shape[0])
...                         - gamma * trans_matrix)
...     V = torch.mm(inv, rewards.reshape(-1, 1))
...     return V

We will demonstrate how to derive the value in the next section.
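
As a side note on the design choice: this function solves the linear system (I - gamma * T)V = R by explicitly inverting the matrix, which is perfectly fine for a 3 x 3 problem. For larger MDPs, a linear solver is generally more numerically stable. The following is a minimal alternative sketch; the helper name is our own, and it assumes a PyTorch version that provides torch.linalg.solve (roughly 1.9 or later):
>>> # Equivalent computation with a linear solver instead of an explicit inverse
>>> # (assumes torch.linalg.solve is available in the installed PyTorch version)
>>> def cal_value_linear_solve(gamma, trans_matrix, rewards):
...     A = torch.eye(rewards.shape[0]) - gamma * trans_matrix
...     return torch.linalg.solve(A, rewards.reshape(-1, 1))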

  5. We feed all the variables we have into the function, including the transition probabilities associated with action a0:
>>> trans_matrix = T[:, action]
>>> V = cal_value_matrix_inversion(gamma, trans_matrix, R)
>>> print("The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([[ 1.6787],
        [ 0.6260],
        [-0.4820]])
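
As an optional check (a minimal sketch, not part of the original recipe), we can verify that these values are self-consistent: under the fixed policy of always taking a0, V should satisfy the Bellman expectation equation V = R + gamma * T_a0 * V, so plugging the result back into the right-hand side reproduces it up to floating-point error:
>>> # Plug V back into the Bellman expectation equation for action a0;
>>> # the reconstructed values should match the ones computed above
>>> V_check = R.reshape(-1, 1) + gamma * torch.mm(trans_matrix, V)
>>> print(torch.allclose(V, V_check))
True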