
How to do it...

Let's develop a policy evaluation algorithm and apply it to our study-sleep-game process as follows:

  1. Import PyTorch and define the transition matrix:
>>> import torch
>>> T = torch.tensor([[[0.8, 0.1, 0.1],
...                    [0.1, 0.6, 0.3]],
...                   [[0.7, 0.2, 0.1],
...                    [0.1, 0.8, 0.1]],
...                   [[0.6, 0.2, 0.2],
...                    [0.1, 0.4, 0.5]]]
...                   )
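T has shape (3, 2, 3): for each of the three states and two actions, T[state, action] holds the distribution over next states. As a quick, purely illustrative sanity check (not part of the recipe), we can confirm that every state-action row sums to 1:

>>> print(T.shape)
torch.Size([3, 2, 3])
>>> print(torch.allclose(T.sum(dim=-1), torch.ones(3, 2)))
True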
  2. Define the reward function and the discount factor (let's use 0.5 for now):
>>> R = torch.tensor([1., 0, -1.])
>>> gamma = 0.5
  3. Define the threshold used to determine when to stop the evaluation process:
>>> threshold = 0.0001
  4. Define the optimal policy where action a0 is chosen under all circumstances:
>>> policy_optimal = torch.tensor([[1.0, 0.0],
...                                [1.0, 0.0],
...                                [1.0, 0.0]])
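Each row of policy_optimal is a probability distribution over the two actions for the corresponding state; since the first column is all ones, this policy always selects a0. A quick illustrative check (not part of the recipe):

>>> print(torch.argmax(policy_optimal, dim=1))
tensor([0, 0, 0])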
  5. Develop a policy evaluation function that takes in a policy, transition matrix, rewards, discount factor, and a threshold, and computes the value function:
>>> def policy_evaluation(
...         policy, trans_matrix, rewards, gamma, threshold):
...     """
...     Perform policy evaluation
...     @param policy: policy matrix containing actions and their
...                    probability in each state
...     @param trans_matrix: transition matrix
...     @param rewards: rewards for each state
...     @param gamma: discount factor
...     @param threshold: the evaluation will stop once the changes in
...                       values for all states are less than the threshold
...     @return: values of the given policy for all possible states
...     """
...     n_state = policy.shape[0]
...     V = torch.zeros(n_state)
...     while True:
...         V_temp = torch.zeros(n_state)
...         for state, actions in enumerate(policy):
...             for action, action_prob in enumerate(actions):
...                 # Bellman expectation update for this state-action pair
...                 V_temp[state] += action_prob * (rewards[state] +
...                     gamma * torch.dot(trans_matrix[state, action], V))
...         # largest change across all states in this sweep
...         max_delta = torch.max(torch.abs(V - V_temp))
...         V = V_temp.clone()
...         if max_delta <= threshold:
...             break
...     return V
  6. Now let's plug in the optimal policy and all other variables:
>>> V = policy_evaluation(policy_optimal, T, R, gamma, threshold)
>>> print(
...     "The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([ 1.6786,  0.6260, -0.4821])

This is almost the same as what we got using matrix inversion.

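For reference, the matrix-inversion result mentioned above comes from solving the Bellman equation V = R + gamma * P_pi * V in closed form, where P_pi is the transition matrix induced by the policy. The following is a minimal sketch of that computation (the names P_pi and V_exact are illustrative; it assumes the same T, R, and gamma defined earlier):

>>> # transition matrix induced by the policy: P_pi[s, s'] = sum_a policy[s, a] * T[s, a, s']
>>> P_pi = torch.einsum('sa,sab->sb', policy_optimal, T)
>>> # closed-form solution: V = (I - gamma * P_pi)^-1 R
>>> V_exact = torch.inverse(torch.eye(3) - gamma * P_pi) @ R
>>> print(V_exact)

The printed values should agree with the iterative result above to within the chosen threshold.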
  7. We now experiment with another policy, a random policy in which both actions are picked with equal probability in every state:
>>> policy_random = torch.tensor([[0.5, 0.5],
...                               [0.5, 0.5],
...                               [0.5, 0.5]])
  8. Plug in the random policy and all other variables:
>>> V = policy_evaluation(policy_random, T, R, gamma, threshold)
>>> print(
...     "The value function under the random policy is:\n{}".format(V))
The value function under the random policy is:
tensor([ 1.2348,  0.2691, -0.9013])
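As expected, the value function under the random policy is lower than under the all-a0 policy in every state. If we keep both results under separate names (illustrative only; the recipe reuses the name V), the comparison is easy to make explicit:

>>> V_optimal = policy_evaluation(policy_optimal, T, R, gamma, threshold)
>>> V_random = policy_evaluation(policy_random, T, R, gamma, threshold)
>>> print(V_optimal > V_random)  # True for every state here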