官术网_书友最值得收藏!

How to do it...

Now, it is time to implement the policy gradient algorithm with PyTorch:

  1. As before, import the necessary packages, create an environment instance, and obtain the dimensions of the observation and action space:
>>> import gym
>>> import torch
>>> env = gym.make('CartPole-v0')
>>> n_state = env.observation_space.shape[0]
>>> n_action = env.action_space.n
  1. We define the run_episode function, which simulates an episode given the input weight and returns the total reward and the gradients computed. More specifically, it does the following tasks in each step:
  • Calculates the probabilities, probs, for both actions based on the current state and input weight
  • Samples an action, action, based on the resulting probabilities
  • Computes the derivatives, d_softmax, of the softmax function with the probabilities as input
  • Divides the resulting derivatives, d_softmax, by the probabilities, probs, to get the derivatives, d_log, of the log term with respect to the policy
  • Applies the chain rule to compute the gradient, grad, of the weights
  • Records the resulting gradient, grad
  • Performs the action, accumulates the reward, and updates the state

Putting all of this into code, we have the following:

 >>> def run_episode(env, weight):
... state = env.reset()
... grads = []
... total_reward = 0
... is_done = False
... while not is_done:
... state = torch.from_numpy(state).float()
... z = torch.matmul(state, weight)
... probs = torch.nn.Softmax()(z)
... action = int(torch.bernoulli(probs[1]).item())
... d_softmax = torch.diag(probs) -
probs.view(-1, 1) * probs
... d_log = d_softmax[action] / probs[action]
... grad = state.view(-1, 1) * d_log
... grads.append(grad)
... state, reward, is_done, _ = env.step(action)
... total_reward += reward
... if is_done:
... break
... return total_reward, grads

After an episode finishes, it returns the total reward obtained in this episode and the gradients computed for the individual steps. These two outputs will be used to update the weight.

  1. Let's make it 1,000 episodes for now:
>>> n_episode = 1000

This means we will run run_episode and n_episodetimes.

  1. Initiate the weight:
>>> weight = torch.rand(n_state, n_action)

We will also record the total reward for every episode:

>>> total_rewards = []
  1. At the end of each episode, we need to update the weight using the computed gradients. For every step of the episode, the weight moves by learning rate * gradient calculated in this step * total reward in the remaining steps. Here, we choose 0.001 as the learning rate:
>>> learning_rate = 0.001

Now, we can run n_episodeepisodes:

 >>> for episode in range(n_episode):
... total_reward, gradients = run_episode(env, weight)
... print('Episode {}: {}'.format(episode + 1, total_reward))
... for i, gradient in enumerate(gradients):
... weight += learning_rate * gradient * (total_reward - i)
... total_rewards.append(total_reward)
……
……
Episode 101: 200.0
Episode 102: 200.0
Episode 103: 200.0
Episode 104: 190.0
Episode 105: 133.0
……
……
Episode 996: 200.0
Episode 997: 200.0
Episode 998: 200.0
Episode 999: 200.0
Episode 1000: 200.0
  1. Now, we calculate the average total reward achieved by the policy gradient algorithm:
 >>> print('Average total reward over {} episode: {}'.format(
n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 179.728
  1. We also plot the total reward for every episode as follows:
 >>> import matplotlib.pyplot as plt
>>> plt.plot(total_rewards)
>>> plt.xlabel('Episode')
>>> plt.ylabel('Reward')
>>> plt.show()

In the resulting plot, we can see a clear upward trend before it stays at the maximum value:

We can also see that the rewards oscillate even after it converges. This is because the policy gradient algorithm is a stochastic policy.

  1. Now, let's see how the learned policy performs on 100 new episodes:  
 >>> n_episode_eval = 100
>>> total_rewards_eval = []
>>> for episode in range(n_episode_eval):
... total_reward, _ = run_episode(env, weight)
... print('Episode {}: {}'.format(episode+1, total_reward))
... total_rewards_eval.append(total_reward)
...
Episode 1: 200.0
Episode 2: 200.0
Episode 3: 200.0
Episode 4: 200.0
Episode 5: 200.0
……
……
Episode 96: 200.0
Episode 97: 200.0
Episode 98: 200.0
Episode 99: 200.0
Episode 100: 200.0

Let's see the average performance:

>>> print('Average total reward over {} episode: {}'.format(n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 199.78

The average reward for the testing episodes is close to the maximum value of 200 for the learned policy. You can re-run the evaluation multiple times. The results are pretty consistent.

主站蜘蛛池模板: 广饶县| 军事| 乳源| 句容市| 伊春市| 盘锦市| 隆德县| 建始县| 安达市| 息烽县| 济阳县| 长子县| 新和县| 横峰县| 页游| SHOW| 南陵县| 北辰区| 罗甸县| 顺平县| 济南市| 城口县| 泊头市| 沁源县| 富蕴县| 高碑店市| 安远县| 泰安市| 县级市| 邛崃市| 昌都县| 成都市| 玉林市| 丹棱县| 彭州市| 洛南县| 远安县| 都江堰市| 板桥市| 句容市| 西城区|