- PyTorch 1.x Reinforcement Learning Cookbook
- Yuxi (Hayden) Liu
How to do it...
Let's develop a policy evaluation algorithm and apply it to our study-sleep-game process as follows:
- Import PyTorch and define the transition matrix:
>>> import torch
>>> T = torch.tensor([[[0.8, 0.1, 0.1],
...                    [0.1, 0.6, 0.3]],
...                   [[0.7, 0.2, 0.1],
...                    [0.1, 0.8, 0.1]],
...                   [[0.6, 0.2, 0.2],
...                    [0.1, 0.4, 0.5]]])
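Each row T[state, action] is a probability distribution over next states, so it should sum to 1. As a quick sanity check (this check is our own addition, not part of the original recipe), we can verify this:
>>> # each T[state, action] row should be a valid probability distribution
>>> assert torch.allclose(T.sum(dim=-1), torch.ones(3, 2))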
- Define the reward function and the discount factor (let's use 0.5 for now):
>>> R = torch.tensor([1., 0, -1.])
>>> gamma = 0.5
- Define the threshold used to determine when to stop the evaluation process:
>>> threshold = 0.0001
- Define the optimal policy where action a0 is chosen under all circumstances:
>>> policy_optimal = torch.tensor([[1.0, 0.0],
...                                [1.0, 0.0],
...                                [1.0, 0.0]])
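Each row of the policy matrix corresponds to a state, and each column holds the probability of taking action a0 or a1 in that state; here all of the probability mass is on a0. The same one-hot rows could also be built from a vector of action indices, as in this small illustration (our own sketch, not part of the recipe):
>>> # 0 means action a0 is chosen in every state
>>> torch.eye(2)[torch.tensor([0, 0, 0])]
tensor([[1., 0.],
        [1., 0.],
        [1., 0.]])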
- Develop a policy evaluation function that takes in a policy, transition matrix, rewards, discount factor, and a threshold and computes the value function:
>>> def policy_evaluation(
...         policy, trans_matrix, rewards, gamma, threshold):
...     """
...     Perform policy evaluation
...     @param policy: policy matrix containing actions and their
...                    probability in each state
...     @param trans_matrix: transition matrix
...     @param rewards: rewards for each state
...     @param gamma: discount factor
...     @param threshold: the evaluation will stop once changes in the
...                       values of all states are less than the threshold
...     @return: values of the given policy for all possible states
...     """
...     n_state = policy.shape[0]
...     V = torch.zeros(n_state)
...     while True:
...         V_temp = torch.zeros(n_state)
...         for state, actions in enumerate(policy):
...             for action, action_prob in enumerate(actions):
...                 # Bellman expectation update for this (state, action) pair
...                 V_temp[state] += action_prob * (rewards[state] +
...                     gamma * torch.dot(trans_matrix[state, action], V))
...         max_delta = torch.max(torch.abs(V - V_temp))
...         V = V_temp.clone()
...         if max_delta <= threshold:
...             break
...     return V
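The two nested loops can also be replaced with tensor operations, which scales better to larger state and action spaces. The following is a minimal vectorized sketch (the name policy_evaluation_vectorized and the use of torch.einsum are our own choices; it assumes each policy row sums to 1, in which case it performs exactly the same update):
>>> def policy_evaluation_vectorized(
...         policy, trans_matrix, rewards, gamma, threshold):
...     n_state = policy.shape[0]
...     V = torch.zeros(n_state)
...     while True:
...         # expected next-state value for every (state, action) pair
...         next_values = torch.einsum('san,n->sa', trans_matrix, V)
...         # Bellman expectation update, weighting actions by the policy
...         V_temp = rewards + gamma * (policy * next_values).sum(dim=1)
...         max_delta = torch.max(torch.abs(V - V_temp))
...         V = V_temp
...         if max_delta <= threshold:
...             break
...     return V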
- Now let's plug in the optimal policy and all other variables:
>>> V = policy_evaluation(policy_optimal, T, R, gamma, threshold)
>>> print("The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([ 1.6786, 0.6260, -0.4821])
This is almost the same as what we got using matrix inversion.
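If you want to cross-check this result here as well, the Bellman expectation equation for this policy, V = R + gamma * P * V, can be solved in closed form as V = (I - gamma * P)^-1 * R, where P is the transition matrix of the chain when a0 is always taken. A quick sketch of that check (the variable names are our own):
>>> # transition matrix of the chain under the all-a0 policy
>>> P_optimal = T[:, 0, :]
>>> V_closed_form = torch.mm(
...     torch.inverse(torch.eye(3) - gamma * P_optimal), R.view(-1, 1)).view(-1)
>>> # V_closed_form should be very close to the iterative result above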
- We now experiment with another policy, a random policy, where each action is picked with equal probability:
>>> policy_random = torch.tensor([[0.5, 0.5],
...                               [0.5, 0.5],
...                               [0.5, 0.5]])
- Plug in the random policy and all other variables:
>>> V = policy_evaluation(policy_random, T, R, gamma, threshold)
>>> print("The value function under the random policy is:\n{}".format(V))
The value function under the random policy is:
tensor([ 1.2348, 0.2691, -0.9013])
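As the two outputs show, the optimal policy yields a higher value than the random policy in every state. To make the comparison explicit, we could evaluate both policies side by side, as in this brief sketch (the variable names are our own):
>>> V_optimal = policy_evaluation(policy_optimal, T, R, gamma, threshold)
>>> V_random = policy_evaluation(policy_random, T, R, gamma, threshold)
>>> print(V_optimal >= V_random)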