
How to do it...

Let's develop a policy evaluation algorithm and apply it to our study-sleep-game process as follows:

  1. Import PyTorch and define the transition matrix:
 >>> import torch
>>> T = torch.tensor([[[0.8, 0.1, 0.1],
... [0.1, 0.6, 0.3]],
... [[0.7, 0.2, 0.1],
... [0.1, 0.8, 0.1]],
... [[0.6, 0.2, 0.2],
... [0.1, 0.4, 0.5]]]
... )
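The tensor T has the shape [n_state, n_action, n_state]: for each of the three states and each of the two actions, it stores a probability distribution over next states. As an optional sanity check (not part of the recipe itself), we can confirm the shape and that each of these distributions sums to 1:

>>> print(T.shape)
torch.Size([3, 2, 3])
>>> print(torch.sum(T, dim=2))  # each entry should be 1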
  2. Define the reward function and the discount factor (let's use 0.5 for now):
 >>> R = torch.tensor([1., 0, -1.])
>>> gamma = 0.5
  3. Define the threshold used to determine when to stop the evaluation process:
 >>> threshold = 0.0001
  4. Define the optimal policy where action a0 is chosen under all circumstances:
 >>> policy_optimal = torch.tensor([[1.0, 0.0],
... [1.0, 0.0],
... [1.0, 0.0]])
  5. Develop a policy evaluation function that takes in a policy, transition matrix, rewards, discount factor, and threshold, and computes the value function:
>>> def policy_evaluation(
...         policy, trans_matrix, rewards, gamma, threshold):
...     """
...     Perform policy evaluation
...     @param policy: policy matrix containing actions and their
...            probability in each state
...     @param trans_matrix: transition matrix
...     @param rewards: rewards for each state
...     @param gamma: discount factor
...     @param threshold: the evaluation will stop once changes in the
...            values for all states are less than the threshold
...     @return: values of the given policy for all possible states
...     """
...     n_state = policy.shape[0]
...     V = torch.zeros(n_state)
...     while True:
...         V_temp = torch.zeros(n_state)
...         for state, actions in enumerate(policy):
...             for action, action_prob in enumerate(actions):
...                 V_temp[state] += action_prob * (rewards[state]
...                     + gamma * torch.dot(
...                         trans_matrix[state, action], V))
...         max_delta = torch.max(torch.abs(V - V_temp))
...         V = V_temp.clone()
...         if max_delta <= threshold:
...             break
...     return V
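Before plugging in a policy, note that the two inner for loops can also be written with tensor operations. The following is a minimal optional sketch of an equivalent vectorized version (policy_evaluation_vectorized is a name introduced here, not part of the recipe), assuming each row of the policy sums to 1:

>>> def policy_evaluation_vectorized(
...         policy, trans_matrix, rewards, gamma, threshold):
...     n_state = policy.shape[0]
...     V = torch.zeros(n_state)
...     # P_pi[s, s'] = sum over a of policy[s, a] * trans_matrix[s, a, s']
...     P_pi = torch.einsum('sat,sa->st', trans_matrix, policy)
...     while True:
...         # Bellman expectation update for all states at once
...         V_temp = rewards + gamma * torch.matmul(P_pi, V)
...         if torch.max(torch.abs(V - V_temp)) <= threshold:
...             return V_temp
...         V = V_temp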
  6. Now let's plug in the optimal policy and all other variables:
>>> V = policy_evaluation(policy_optimal, T, R, gamma, threshold)
>>> print(
...     "The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([ 1.6786, 0.6260, -0.4821])

This is almost the same as what we got using matrix inversion.
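As a cross-check, the exact values of a fixed policy can also be obtained in closed form from V = (I - gamma * P)^(-1) * R, where P is the transition matrix induced by the policy. The following sketch, using only the variables defined above, recomputes the values for the optimal policy (which always takes action a0) and should agree with the iterative result up to the threshold:

>>> P_optimal = T[:, 0, :]   # transition matrix when always taking a0
>>> V_exact = torch.matmul(
...     torch.inverse(torch.eye(3) - gamma * P_optimal), R)
>>> print(V_exact)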

  7. We now experiment with another policy, a random policy where actions are picked with equal probabilities:
>>> policy_random = torch.tensor([[0.5, 0.5],
... [0.5, 0.5],
... [0.5, 0.5]])
  8. Plug in the random policy and all other variables:
>>> V = policy_evaluation(policy_random, T, R, gamma, threshold)
>>> print(
...     "The value function under the random policy is:\n{}".format(V))
The value function under the random policy is:
tensor([ 1.2348, 0.2691, -0.9013])
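As expected, the optimal policy yields a value at least as high as the random policy in every state. A small optional check (V_optimal and V_random are names introduced here for illustration):

>>> V_optimal = policy_evaluation(policy_optimal, T, R, gamma, threshold)
>>> V_random = policy_evaluation(policy_random, T, R, gamma, threshold)
>>> print(torch.all(V_optimal >= V_random))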