官术网_书友最值得收藏!

How to do it...

Now, it is time to implement the policy gradient algorithm with PyTorch:

  1. As before, import the necessary packages, create an environment instance, and obtain the dimensions of the observation and action space:
>>> import gym
>>> import torch
>>> env = gym.make('CartPole-v0')
>>> n_state = env.observation_space.shape[0]
>>> n_action = env.action_space.n
  1. We define the run_episode function, which simulates an episode given the input weight and returns the total reward and the gradients computed. More specifically, it does the following tasks in each step:
  • Calculates the probabilities, probs, for both actions based on the current state and input weight
  • Samples an action, action, based on the resulting probabilities
  • Computes the derivatives, d_softmax, of the softmax function with the probabilities as input
  • Divides the resulting derivatives, d_softmax, by the probabilities, probs, to get the derivatives, d_log, of the log term with respect to the policy
  • Applies the chain rule to compute the gradient, grad, of the weights
  • Records the resulting gradient, grad
  • Performs the action, accumulates the reward, and updates the state

Putting all of this into code, we have the following:

 >>> def run_episode(env, weight):
... state = env.reset()
... grads = []
... total_reward = 0
... is_done = False
... while not is_done:
... state = torch.from_numpy(state).float()
... z = torch.matmul(state, weight)
... probs = torch.nn.Softmax()(z)
... action = int(torch.bernoulli(probs[1]).item())
... d_softmax = torch.diag(probs) -
probs.view(-1, 1) * probs
... d_log = d_softmax[action] / probs[action]
... grad = state.view(-1, 1) * d_log
... grads.append(grad)
... state, reward, is_done, _ = env.step(action)
... total_reward += reward
... if is_done:
... break
... return total_reward, grads

After an episode finishes, it returns the total reward obtained in this episode and the gradients computed for the individual steps. These two outputs will be used to update the weight.

  1. Let's make it 1,000 episodes for now:
>>> n_episode = 1000

This means we will run run_episode and n_episodetimes.

  1. Initiate the weight:
>>> weight = torch.rand(n_state, n_action)

We will also record the total reward for every episode:

>>> total_rewards = []
  1. At the end of each episode, we need to update the weight using the computed gradients. For every step of the episode, the weight moves by learning rate * gradient calculated in this step * total reward in the remaining steps. Here, we choose 0.001 as the learning rate:
>>> learning_rate = 0.001

Now, we can run n_episodeepisodes:

 >>> for episode in range(n_episode):
... total_reward, gradients = run_episode(env, weight)
... print('Episode {}: {}'.format(episode + 1, total_reward))
... for i, gradient in enumerate(gradients):
... weight += learning_rate * gradient * (total_reward - i)
... total_rewards.append(total_reward)
……
……
Episode 101: 200.0
Episode 102: 200.0
Episode 103: 200.0
Episode 104: 190.0
Episode 105: 133.0
……
……
Episode 996: 200.0
Episode 997: 200.0
Episode 998: 200.0
Episode 999: 200.0
Episode 1000: 200.0
  1. Now, we calculate the average total reward achieved by the policy gradient algorithm:
 >>> print('Average total reward over {} episode: {}'.format(
n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 179.728
  1. We also plot the total reward for every episode as follows:
 >>> import matplotlib.pyplot as plt
>>> plt.plot(total_rewards)
>>> plt.xlabel('Episode')
>>> plt.ylabel('Reward')
>>> plt.show()

In the resulting plot, we can see a clear upward trend before it stays at the maximum value:

We can also see that the rewards oscillate even after it converges. This is because the policy gradient algorithm is a stochastic policy.

  1. Now, let's see how the learned policy performs on 100 new episodes:  
 >>> n_episode_eval = 100
>>> total_rewards_eval = []
>>> for episode in range(n_episode_eval):
... total_reward, _ = run_episode(env, weight)
... print('Episode {}: {}'.format(episode+1, total_reward))
... total_rewards_eval.append(total_reward)
...
Episode 1: 200.0
Episode 2: 200.0
Episode 3: 200.0
Episode 4: 200.0
Episode 5: 200.0
……
……
Episode 96: 200.0
Episode 97: 200.0
Episode 98: 200.0
Episode 99: 200.0
Episode 100: 200.0

Let's see the average performance:

>>> print('Average total reward over {} episode: {}'.format(n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 199.78

The average reward for the testing episodes is close to the maximum value of 200 for the learned policy. You can re-run the evaluation multiple times. The results are pretty consistent.

主站蜘蛛池模板: 张北县| 昌吉市| 昌都县| 隆尧县| 南投市| 沙坪坝区| 兴山县| 柳州市| 乌兰浩特市| 青铜峡市| 惠州市| 东台市| 兴海县| 长丰县| 元阳县| 土默特右旗| 文昌市| 新和县| 长海县| 榆树市| 新野县| 揭西县| 白银市| 阳信县| 荣昌县| 年辖:市辖区| 宜兴市| 远安县| 兴业县| 汉源县| 合川市| 蕲春县| 兴化市| 黑山县| 平阳县| 兴化市| 承德县| 石泉县| 周口市| 临潭县| 石泉县|