- PyTorch 1.x Reinforcement Learning Cookbook
- Yuxi (Hayden) Liu
How to do it...
Let's go ahead and implement the hill-climbing algorithm with PyTorch:
- As before, import the necessary packages, create an environment instance, and obtain the dimensions of the observation and action space:
>>> import gym
>>> import torch
>>> env = gym.make('CartPole-v0')
>>> n_state = env.observation_space.shape[0]
>>> n_action = env.action_space.n
- We will reuse the run_episode function we defined in the previous recipe, so we will not repeat it here. Again, given the input weight, it simulates an episode and returns the total reward. (A minimal sketch of the function is included after this step in case you don't have it handy.)
Let's make it 1,000 episodes for now:
>>> n_episode = 1000
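In case run_episode is not already defined in your session, the following is a minimal sketch consistent with the linear-mapping policy from the previous recipe (it assumes the classic Gym API used throughout this book, where reset returns an observation and step returns a four-element tuple):
>>> def run_episode(env, weight):
...     # Simulate one episode: map the state to an action via the linear weight
...     state = env.reset()
...     total_reward = 0
...     is_done = False
...     while not is_done:
...         state = torch.from_numpy(state).float()
...         # Pick the action with the highest score under the linear mapping
...         action = torch.argmax(torch.matmul(state, weight))
...         state, reward, is_done, _ = env.step(action.item())
...         total_reward += reward
...     return total_reward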
- We need to keep track of the best total reward on the fly, as well as the corresponding weight. So, let's specify their starting values:
>>> best_total_reward = 0
>>> best_weight = torch.rand(n_state, n_action)
We will also record the total reward for every episode:
>>> total_rewards = []
- As we mentioned, we will add some noise to the weight for each episode. In fact, we will apply a scale to the noise so that the noise won't overwhelm the weight. Here, we will choose 0.01 as the noise scale:
>>> noise_scale = 0.01
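Note that torch.rand samples uniformly from [0, 1), so every entry of the perturbation noise_scale * torch.rand(n_state, n_action) is non-negative and strictly smaller than noise_scale; in other words, each weight entry is nudged by at most 0.01. You can verify this with a quick, optional check:
>>> noise = noise_scale * torch.rand(n_state, n_action)
>>> print(noise.min().item(), noise.max().item())  # both values lie in [0, 0.01)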
- Now, we can run the training for n_episode episodes. Starting from the randomly initialized weight, we do the following for each episode:
- Add random noise to the weight
- Let the agent take actions according to the linear mapping
- Run an episode with the perturbed weight and obtain the total reward
- If the current reward is greater than the best one obtained so far, update the best reward and the weight
- Otherwise, the best reward and the weight remain unchanged
- Also, keep a record of the total reward
Put this into code as follows:
>>> for episode in range(n_episode):
...     weight = best_weight + noise_scale * torch.rand(n_state, n_action)
...     total_reward = run_episode(env, weight)
...     if total_reward >= best_total_reward:
...         best_total_reward = total_reward
...         best_weight = weight
...     total_rewards.append(total_reward)
...     print('Episode {}: {}'.format(episode + 1, total_reward))
...
Episode 1: 56.0
Episode 2: 52.0
Episode 3: 85.0
Episode 4: 106.0
Episode 5: 41.0
...
Episode 996: 39.0
Episode 997: 51.0
Episode 998: 49.0
Episode 999: 54.0
Episode 1000: 41.0
We also calculate the average total reward achieved by the hill-climbing version of linear mapping:
>>> print('Average total reward over {} episode: {}'.format(
...     n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 50.024
- To assess training with the hill-climbing algorithm, we repeat the training process several times by re-running the code from Step 4 to Step 6 (a small helper that automates this is sketched after the results below). We observe that the average total reward fluctuates a lot. These are the results we got from 10 runs:
Average total reward over 1000 episode: 9.261
Average total reward over 1000 episode: 88.565
Average total reward over 1000 episode: 51.796
Average total reward over 1000 episode: 9.41
Average total reward over 1000 episode: 109.758
Average total reward over 1000 episode: 55.787
Average total reward over 1000 episode: 189.251
Average total reward over 1000 episode: 177.624
Average total reward over 1000 episode: 9.146
Average total reward over 1000 episode: 102.311
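If you want to reproduce this experiment yourself, one option (not part of the recipe's code) is to wrap Steps 4 to 6 in a small helper, here called train_constant_noise, and call it 10 times:
>>> def train_constant_noise(env, n_episode=1000, noise_scale=0.01):
...     # Hypothetical helper: hill climbing with a constant noise scale (Steps 4 to 6)
...     best_total_reward = 0
...     best_weight = torch.rand(n_state, n_action)
...     total_rewards = []
...     for episode in range(n_episode):
...         weight = best_weight + noise_scale * torch.rand(n_state, n_action)
...         total_reward = run_episode(env, weight)
...         if total_reward >= best_total_reward:
...             best_total_reward = total_reward
...             best_weight = weight
...         total_rewards.append(total_reward)
...     return sum(total_rewards) / n_episode
>>> for _ in range(10):
...     print('Average total reward over {} episode: {}'.format(
...         n_episode, train_constant_noise(env, n_episode)))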
What could cause such variance? It turns out that if the initial weight is bad, adding noise at a small scale will have little effect on improving the performance. This will cause poor convergence. On the other hand, if the initial weight is good, adding noise at a big scale might move the weight away from the optimal weight and jeopardize the performance. How can we make the training of the hill-climbing model more stable and reliable? We can actually make the noise scale adaptive to the performance, just like the adaptive learning rate in gradient descent. Let's see Step 8 for more details.
- To make the noise adaptive, we do the following:
- Specify a starting noise scale.
- If the performance in an episode improves, decrease the noise scale. In our case, we take half of the scale, but set 0.0001 as the lower bound.
- If the performance in an episode drops, increase the noise scale. In our case, we double the scale, but set 2 as the upper bound.
Put this into code:
>>> noise_scale = 0.01
>>> best_total_reward = 0
>>> total_rewards = []
>>> for episode in range(n_episode):
...     weight = best_weight + noise_scale * torch.rand(n_state, n_action)
...     total_reward = run_episode(env, weight)
...     if total_reward >= best_total_reward:
...         best_total_reward = total_reward
...         best_weight = weight
...         noise_scale = max(noise_scale / 2, 1e-4)
...     else:
...         noise_scale = min(noise_scale * 2, 2)
...     print('Episode {}: {}'.format(episode + 1, total_reward))
...     total_rewards.append(total_reward)
...
Episode 1: 9.0
Episode 2: 9.0
Episode 3: 9.0
Episode 4: 10.0
Episode 5: 10.0
...
Episode 996: 200.0
Episode 997: 200.0
Episode 998: 200.0
Episode 999: 200.0
Episode 1000: 200.0
The reward is increasing as the episodes progress. It reaches the maximum of 200 within the first 100 episodes and stays there. The average total reward also looks promising:
>>> print('Average total reward over {} episode: {}'.format(
...     n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 186.11
We also plot the total reward for every episode as follows:
>>> import matplotlib.pyplot as plt
>>> plt.plot(total_rewards)
>>> plt.xlabel('Episode')
>>> plt.ylabel('Reward')
>>> plt.show()
In the resulting plot, we can see a clear upward trend before it plateaus at the maximum value:
[Plot: total reward per episode, rising over the early episodes and plateauing at the maximum of 200]
Feel free to run the new training process a few times. The results are very stable compared to learning with a constant noise scale.
- Now, let's see how the learned policy performs on 100 new episodes:
>>> n_episode_eval = 100
>>> total_rewards_eval = []
>>> for episode in range(n_episode_eval):
...     total_reward = run_episode(env, best_weight)
...     print('Episode {}: {}'.format(episode + 1, total_reward))
...     total_rewards_eval.append(total_reward)
...
Episode 1: 200.0
Episode 2: 200.0
Episode 3: 200.0
Episode 4: 200.0
Episode 5: 200.0
...
Episode 96: 200.0
Episode 97: 200.0
Episode 98: 200.0
Episode 99: 200.0
Episode 100: 200.0
Let's see the average performance:
>>> print('Average total reward over {} episode: {}'.format(
...     n_episode_eval, sum(total_rewards_eval) / n_episode_eval))
Average total reward over 100 episode: 199.94
The average reward for the testing episodes is close to the maximum of 200 that we obtained with the learned policy. You can re-run the evaluation multiple times. The results are pretty consistent.