- PyTorch 1.x Reinforcement Learning Cookbook
- Yuxi (Hayden) Liu
How to do it...
Let's develop a policy evaluation algorithm and apply it to our study-sleep-game process as follows:
- Import PyTorch and define the transition matrix:
>>> import torch
>>> T = torch.tensor([[[0.8, 0.1, 0.1],
...                    [0.1, 0.6, 0.3]],
...                   [[0.7, 0.2, 0.1],
...                    [0.1, 0.8, 0.1]],
...                   [[0.6, 0.2, 0.2],
...                    [0.1, 0.4, 0.5]]])
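Each row T[state, action] is a probability distribution over next states, so it should sum to 1. As a quick sanity check (this check is our own addition, not part of the original recipe), we can verify this:
>>> # each T[state, action] row should be a valid probability distribution
>>> assert torch.allclose(T.sum(dim=-1), torch.ones(3, 2))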
- Define the reward function and the discount factor (let's use 0.5 for now):
>>> R = torch.tensor([1., 0, -1.])
>>> gamma = 0.5
- Define the threshold used to determine when to stop the evaluation process:
>>> threshold = 0.0001
- Define the optimal policy where action a0 is chosen under all circumstances:
>>> policy_optimal = torch.tensor([[1.0, 0.0],
...                                [1.0, 0.0],
...                                [1.0, 0.0]])
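Each row of the policy matrix corresponds to a state, and each column holds the probability of taking action a0 or a1 in that state; here all of the probability mass is on a0. The same one-hot rows could also be built from a vector of action indices, as in this small illustration (our own sketch, not part of the recipe):
>>> # 0 means action a0 is chosen in every state
>>> torch.eye(2)[torch.tensor([0, 0, 0])]
tensor([[1., 0.],
        [1., 0.],
        [1., 0.]])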
- Develop a policy evaluation function that takes in a policy, transition matrix, rewards, discount factor, and a threshold and computes the value function:
>>> def policy_evaluation(
...         policy, trans_matrix, rewards, gamma, threshold):
...     """
...     Perform policy evaluation
...     @param policy: policy matrix containing actions and their
...                    probability in each state
...     @param trans_matrix: transition matrix
...     @param rewards: rewards for each state
...     @param gamma: discount factor
...     @param threshold: the evaluation will stop once changes in the
...                       values of all states are less than the threshold
...     @return: values of the given policy for all possible states
...     """
...     n_state = policy.shape[0]
...     V = torch.zeros(n_state)
...     while True:
...         V_temp = torch.zeros(n_state)
...         for state, actions in enumerate(policy):
...             for action, action_prob in enumerate(actions):
...                 # Bellman expectation update for this (state, action) pair
...                 V_temp[state] += action_prob * (rewards[state] +
...                     gamma * torch.dot(trans_matrix[state, action], V))
...         max_delta = torch.max(torch.abs(V - V_temp))
...         V = V_temp.clone()
...         if max_delta <= threshold:
...             break
...     return V
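The two nested loops can also be replaced with tensor operations, which scales better to larger state and action spaces. The following is a minimal vectorized sketch (the name policy_evaluation_vectorized and the use of torch.einsum are our own choices; it assumes each policy row sums to 1, in which case it performs exactly the same update):
>>> def policy_evaluation_vectorized(
...         policy, trans_matrix, rewards, gamma, threshold):
...     n_state = policy.shape[0]
...     V = torch.zeros(n_state)
...     while True:
...         # expected next-state value for every (state, action) pair
...         next_values = torch.einsum('san,n->sa', trans_matrix, V)
...         # Bellman expectation update, weighting actions by the policy
...         V_temp = rewards + gamma * (policy * next_values).sum(dim=1)
...         max_delta = torch.max(torch.abs(V - V_temp))
...         V = V_temp
...         if max_delta <= threshold:
...             break
...     return V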
- Now let's plug in the optimal policy and all other variables:
>>> V = policy_evaluation(policy_optimal, T, R, gamma, threshold)
>>> print("The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([ 1.6786, 0.6260, -0.4821])
This is almost the same as what we got using matrix inversion.
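If you want to cross-check this result here as well, the Bellman expectation equation for this policy, V = R + gamma * P * V, can be solved in closed form as V = (I - gamma * P)^-1 * R, where P is the transition matrix of the chain when a0 is always taken. A quick sketch of that check (the variable names are our own):
>>> # transition matrix of the chain under the all-a0 policy
>>> P_optimal = T[:, 0, :]
>>> V_closed_form = torch.mm(
...     torch.inverse(torch.eye(3) - gamma * P_optimal), R.view(-1, 1)).view(-1)
>>> # V_closed_form should be very close to the iterative result above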
- We now experiment with another policy, a random policy, where each action is picked with equal probability:
>>> policy_random = torch.tensor([[0.5, 0.5],
...                               [0.5, 0.5],
...                               [0.5, 0.5]])
- Plug in the random policy and all other variables:
>>> V = policy_evaluation(policy_random, T, R, gamma, threshold)
>>> print("The value function under the random policy is:\n{}".format(V))
The value function under the random policy is:
tensor([ 1.2348, 0.2691, -0.9013])
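As the two outputs show, the optimal policy yields a higher value than the random policy in every state. To make the comparison explicit, we could evaluate both policies side by side, as in this brief sketch (the variable names are our own):
>>> V_optimal = policy_evaluation(policy_optimal, T, R, gamma, threshold)
>>> V_random = policy_evaluation(policy_random, T, R, gamma, threshold)
>>> print(V_optimal >= V_random)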