- PyTorch 1.x Reinforcement Learning Cookbook
- Yuxi (Hayden) Liu
How to do it...
Now, it is time to implement the policy gradient algorithm with PyTorch:
- As before, import the necessary packages, create an environment instance, and obtain the dimensions of the observation and action space:
>>> import gym
>>> import torch
>>> env = gym.make('CartPole-v0')
>>> n_state = env.observation_space.shape[0]
>>> n_action = env.action_space.n
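Before defining the episode simulation, it may help to see what this linear softmax policy computes for a single state: the state is multiplied by the weight matrix to get one score per action, and a softmax turns the scores into action probabilities. This is a minimal sketch; the state values below are made up for illustration:

```python
import torch

torch.manual_seed(0)
n_state, n_action = 4, 2                 # CartPole-v0 dimensions
weight = torch.rand(n_state, n_action)   # random linear policy weights
state = torch.tensor([0.01, -0.02, 0.03, 0.04])  # hypothetical observation

z = torch.matmul(state, weight)          # one score per action
probs = torch.softmax(z, dim=0)          # scores -> action probabilities
print(probs.sum().item())                # probabilities sum to 1 (up to float error)
```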
- We define the run_episode function, which simulates an episode given the input weight and returns the total reward and the gradients computed. More specifically, it does the following tasks in each step:
- Calculates the probabilities, probs, for both actions based on the current state and input weight
- Samples an action, action, based on the resulting probabilities
- Computes the derivatives, d_softmax, of the softmax function with the probabilities as input
- Divides the resulting derivatives, d_softmax, by the probabilities, probs, to get the derivatives, d_log, of the log term with respect to the policy
- Applies the chain rule to compute the gradient, grad, of the weights
- Records the resulting gradient, grad
- Performs the action, accumulates the reward, and updates the state
Putting all of this into code, we have the following:
>>> def run_episode(env, weight):
...     state = env.reset()
...     grads = []
...     total_reward = 0
...     is_done = False
...     while not is_done:
...         state = torch.from_numpy(state).float()
...         z = torch.matmul(state, weight)
...         probs = torch.nn.Softmax(dim=0)(z)
...         action = int(torch.bernoulli(probs[1]).item())
...         d_softmax = torch.diag(probs) - probs.view(-1, 1) * probs
...         d_log = d_softmax[action] / probs[action]
...         grad = state.view(-1, 1) * d_log
...         grads.append(grad)
...         state, reward, is_done, _ = env.step(action)
...         total_reward += reward
...         if is_done:
...             break
...     return total_reward, grads
After an episode finishes, it returns the total reward obtained in this episode and the gradients computed for the individual steps. These two outputs will be used to update the weight.
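As a quick sanity check (not part of the recipe), the manually derived gradient of the log-policy can be compared against PyTorch's autograd on a hypothetical state and weight:

```python
import torch

torch.manual_seed(0)
state = torch.rand(4)                                # random CartPole-like observation
weight = torch.rand(4, 2, requires_grad=True)        # policy weights, tracked by autograd
action = 1

# Manual gradient, exactly as computed inside run_episode
z = torch.matmul(state, weight)
probs = torch.softmax(z, dim=0)
d_softmax = torch.diag(probs) - probs.view(-1, 1) * probs
d_log = d_softmax[action] / probs[action]
grad_manual = (state.view(-1, 1) * d_log).detach()

# Autograd gradient of log pi(action | state) for comparison
torch.log(probs[action]).backward()
print(torch.allclose(grad_manual, weight.grad))
```

The two gradients agree because dividing the softmax Jacobian row by the action's probability yields the derivative of the log-probability, and the chain rule through the linear layer multiplies in the state.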
- Let's make it 1,000 episodes for now:
>>> n_episode = 1000
This means we will run run_episode n_episode times.
- Initialize the weight:
>>> weight = torch.rand(n_state, n_action)
We will also record the total reward for every episode:
>>> total_rewards = []
- At the end of each episode, we need to update the weight using the computed gradients. For every step of the episode, the weight moves by learning rate * gradient calculated in this step * total reward in the remaining steps. Since CartPole yields a reward of +1 per step, the reward remaining after step i is simply total_reward - i. Here, we choose 0.001 as the learning rate:
>>> learning_rate = 0.001
Now, we can run n_episode episodes:
>>> for episode in range(n_episode):
...     total_reward, gradients = run_episode(env, weight)
...     print('Episode {}: {}'.format(episode + 1, total_reward))
...     for i, gradient in enumerate(gradients):
...         weight += learning_rate * gradient * (total_reward - i)
...     total_rewards.append(total_reward)
……
……
Episode 101: 200.0
Episode 102: 200.0
Episode 103: 200.0
Episode 104: 190.0
Episode 105: 133.0
……
……
Episode 996: 200.0
Episode 997: 200.0
Episode 998: 200.0
Episode 999: 200.0
Episode 1000: 200.0
- Now, we calculate the average total reward achieved by the policy gradient algorithm:
>>> print('Average total reward over {} episodes: {}'.format(
...         n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episodes: 179.728
- We also plot the total reward for every episode as follows:
>>> import matplotlib.pyplot as plt
>>> plt.plot(total_rewards)
>>> plt.xlabel('Episode')
>>> plt.ylabel('Reward')
>>> plt.show()
In the resulting plot, we can see a clear upward trend before it stays at the maximum value:
[Figure: total reward per episode]
We can also see that the rewards oscillate even after convergence. This is because the policy gradient algorithm learns a stochastic policy: actions are sampled from a probability distribution rather than chosen deterministically.
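The oscillation can be seen in isolation with a small sketch: even for fixed, near-converged action probabilities (the 0.1/0.9 split below is a made-up example), a greedy policy always picks the same action, while a sampled policy occasionally deviates:

```python
import torch

torch.manual_seed(0)
probs = torch.tensor([0.1, 0.9])   # hypothetical near-converged action probabilities

# A greedy policy would always pick the same action...
greedy_action = int(torch.argmax(probs).item())

# ...but the learned policy samples, so roughly 1 choice in 10 deviates
samples = [int(torch.bernoulli(probs[1]).item()) for _ in range(1000)]
print(greedy_action, sum(samples) / len(samples))
```

Those occasional deviations are what cause the episode rewards to dip below the maximum from time to time.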
- Now, let's see how the learned policy performs on 100 new episodes:
>>> n_episode_eval = 100
>>> total_rewards_eval = []
>>> for episode in range(n_episode_eval):
...     total_reward, _ = run_episode(env, weight)
...     print('Episode {}: {}'.format(episode + 1, total_reward))
...     total_rewards_eval.append(total_reward)
...
Episode 1: 200.0
Episode 2: 200.0
Episode 3: 200.0
Episode 4: 200.0
Episode 5: 200.0
……
……
Episode 96: 200.0
Episode 97: 200.0
Episode 98: 200.0
Episode 99: 200.0
Episode 100: 200.0
Let's see the average performance:
>>> print('Average total reward over {} episodes: {}'.format(
...         n_episode_eval, sum(total_rewards_eval) / n_episode_eval))
Average total reward over 100 episodes: 199.78
The average reward over the testing episodes is close to the maximum of 200 for the learned policy. You can re-run the evaluation multiple times; the results are quite consistent.