Book title: PyTorch 1.x Reinforcement Learning Cookbook
Author: Yuxi (Hayden) Liu
Last updated: 2021-06-24 12:34:43

How to do it...

Now, it is time to implement the policy gradient algorithm with PyTorch:

As before, import the necessary packages, create an environment instance, and obtain the dimensions of the observation and action space:

>>> import gym
>>> import torch
>>> env = gym.make('CartPole-v0')
>>> n_state = env.observation_space.shape[0]
>>> n_action = env.action_space.n

We define the run_episode function, which simulates an episode given the input weight and returns the total reward and the gradients computed. More specifically, it does the following tasks in each step:

Calculates the probabilities, probs, for both actions based on the current state and input weight
Samples an action, action, based on the resulting probabilities
Computes the derivatives, d_softmax, of the softmax function with the probabilities as input
Divides the resulting derivatives, d_softmax, by the probabilities, probs, to get the derivatives, d_log, of the log term with respect to the policy
Applies the chain rule to compute the gradient, grad, of the weights
Records the resulting gradient, grad
Performs the action, accumulates the reward, and updates the state
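
The sampling step in the list above can be checked in isolation. The following is a minimal sketch, not part of the original recipe: the probability values, the sample count, and the comparison with torch.multinomial are assumptions chosen for illustration. It shows that a Bernoulli draw on the probability of action 1 picks action 1 at roughly that rate, and that a multinomial draw over both probabilities behaves the same way.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical action probabilities: [P(action 0), P(action 1)]
probs = torch.tensor([0.3, 0.7])
n_samples = 10000

# Bernoulli draws on P(action 1), as in the recipe: 1.0 means action 1
bernoulli_actions = torch.bernoulli(
    torch.full((n_samples,), probs[1].item()))

# An equivalent draw that also works with more than two actions
multinomial_actions = torch.multinomial(
    probs, n_samples, replacement=True)

# Fraction of draws that selected action 1 under each method
bernoulli_rate = bernoulli_actions.mean().item()
multinomial_rate = multinomial_actions.float().mean().item()
print(bernoulli_rate, multinomial_rate)
```

Both rates should land near 0.7. The Bernoulli form is enough for CartPole because there are only two actions.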

Putting all of this into code, we have the following:

>>> def run_episode(env, weight):
...     state = env.reset()
...     grads = []
...     total_reward = 0
...     is_done = False
...     while not is_done:
...         state = torch.from_numpy(state).float()
...         # Action scores and probabilities under the current weight
...         z = torch.matmul(state, weight)
...         probs = torch.softmax(z, dim=0)
...         # Sample action 1 with probability probs[1], else action 0
...         action = int(torch.bernoulli(probs[1]).item())
...         # Jacobian of the softmax with respect to z
...         d_softmax = (torch.diag(probs)
...                      - probs.view(-1, 1) * probs)
...         # Gradient of log(probs[action]) with respect to z
...         d_log = d_softmax[action] / probs[action]
...         # Chain rule: gradient with respect to the weight
...         grad = state.view(-1, 1) * d_log
...         grads.append(grad)
...         state, reward, is_done, _ = env.step(action)
...         total_reward += reward
...     return total_reward, grads

After an episode finishes, it returns the total reward obtained in this episode and the gradients computed for the individual steps. These two outputs will be used to update the weight.
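
As a sanity check on the gradient steps, the manual computation can be compared with PyTorch's automatic differentiation. This is a minimal sketch under stated assumptions, not part of the original recipe: the random state and weight stand in for one CartPole step (4 state features, 2 actions), and the chosen action is fixed by hand.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for one CartPole step
state = torch.rand(4)
weight = torch.rand(4, 2, requires_grad=True)
action = 1

# Manual gradient, following the same steps as run_episode
z = torch.matmul(state, weight)
probs = torch.softmax(z, dim=0)
d_softmax = torch.diag(probs) - probs.view(-1, 1) * probs
d_log = d_softmax[action] / probs[action]
manual_grad = (state.view(-1, 1) * d_log).detach()

# Reference gradient of log(probs[action]) from autograd
log_prob = torch.log_softmax(z, dim=0)[action]
log_prob.backward()
auto_grad = weight.grad

grads_match = torch.allclose(manual_grad, auto_grad, atol=1e-5)
print(grads_match)
```

If the two gradients agree, the value stored in grads for each step is the gradient of the log probability of the chosen action with respect to the weight.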

Let's make it 1,000 episodes for now:

>>> n_episode = 1000

This means we will call run_episode n_episode times.

Initiate the weight:

>>> weight = torch.rand(n_state, n_action)

We will also record the total reward for every episode:

>>> total_rewards = []

At the end of each episode, we need to update the weight using the computed gradients. For every step of the episode, the weight moves by learning rate * gradient calculated in this step * total reward in the remaining steps. Because CartPole gives a reward of 1 for every step, the total reward in the remaining steps at step i is simply total_reward - i. Here, we choose 0.001 as the learning rate:

>>> learning_rate = 0.001
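
The "total reward in the remaining steps" can be computed explicitly for any reward sequence. The following minimal sketch, with a hypothetical helper named rewards_to_go and a made-up five-step episode, shows that when every step yields a reward of 1 (as in CartPole), the explicit computation matches the shortcut total_reward - i.

```python
def rewards_to_go(rewards):
    """Return, for each step i, the sum of rewards from step i onward."""
    result = []
    running = 0.0
    # Walk backward so each step adds its reward to the running sum
    for reward in reversed(rewards):
        running += reward
        result.append(running)
    result.reverse()
    return result

# Hypothetical five-step episode with a reward of 1 per step
rewards = [1.0] * 5
total_reward = sum(rewards)

to_go = rewards_to_go(rewards)
shortcut = [total_reward - i for i in range(len(rewards))]
print(to_go, shortcut)
```

The shortcut only holds when every step gives the same reward of 1. For other environments, the explicit form would be needed.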

Now, we can run n_episode episodes:

 >>> for episode in range(n_episode):
 ...     total_reward, gradients = run_episode(env, weight)
 ...     print('Episode {}: {}'.format(episode + 1, total_reward))
 ...     for i, gradient in enumerate(gradients):
 ...         weight += learning_rate * gradient * (total_reward - i)
 ...     total_rewards.append(total_reward)
 ……
 ……
 Episode 101: 200.0
 Episode 102: 200.0
 Episode 103: 200.0
 Episode 104: 190.0
 Episode 105: 133.0
 ……
 ……
 Episode 996: 200.0
 Episode 997: 200.0
 Episode 998: 200.0
 Episode 999: 200.0
 Episode 1000: 200.0

Now, we calculate the average total reward achieved by the policy gradient algorithm:

>>> print('Average total reward over {} episode: {}'.format(
...     n_episode, sum(total_rewards) / n_episode))
 Average total reward over 1000 episode: 179.728

We also plot the total reward for every episode as follows:

 >>> import matplotlib.pyplot as plt
 >>> plt.plot(total_rewards)
 >>> plt.xlabel('Episode')
 >>> plt.ylabel('Reward')
 >>> plt.show()

In the resulting plot, the total reward shows a clear upward trend before it settles at the maximum value.

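We can also see that the rewards still oscillate after the algorithm converges. This is because the policy gradient algorithm learns a stochastic policy: each action is sampled from the computed probabilities, so an episode can occasionally end early even under a good policy.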
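
To look past this episode-to-episode noise, one option is to smooth the reward curve with a moving average. The following is a minimal sketch, not part of the original recipe: moving_average and the window size are hypothetical choices, and the short sample list stands in for total_rewards.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError('window must be between 1 and len(values)')
    # Start with the sum of the first window, then slide it forward
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages

# Hypothetical rewards standing in for total_rewards
sample_rewards = [20.0, 40.0, 200.0, 190.0, 200.0]
smoothed = moving_average(sample_rewards, 2)
print(smoothed)
```

With the real data, a call such as plt.plot(moving_average(total_rewards, 50)) would draw the smoothed curve. The window of 50 is an arbitrary choice.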

Now, let's see how the learned policy performs on 100 new episodes:

 >>> n_episode_eval = 100
 >>> total_rewards_eval = []
 >>> for episode in range(n_episode_eval):
 ...     total_reward, _ = run_episode(env, weight)
 ...     print('Episode {}: {}'.format(episode+1, total_reward))
 ...     total_rewards_eval.append(total_reward)
 ...
 Episode 1: 200.0
 Episode 2: 200.0
 Episode 3: 200.0
 Episode 4: 200.0
 Episode 5: 200.0
 ……
 ……
 Episode 96: 200.0
 Episode 97: 200.0
 Episode 98: 200.0
 Episode 99: 200.0
 Episode 100: 200.0

Let's see the average performance:

>>> print('Average total reward over {} episode: {}'.format(
...     n_episode_eval, sum(total_rewards_eval) / n_episode_eval))
Average total reward over 100 episode: 199.78

The average reward for the testing episodes is close to the maximum value of 200 for the learned policy. You can re-run the evaluation multiple times. The results are pretty consistent.
