官术网_书友最值得收藏!

There's more...

We can also plot the total reward for every episode in the training phase:

>>> import matplotlib.pyplot as plt
>>> plt.plot(total_rewards)
>>> plt.xlabel('Episode')
>>> plt.ylabel('Reward')
>>> plt.show()

This will generate the following plot:

If you have not installed matplotlib, you can do so via the following command:

conda install matplotlib

We can see that the reward for each episode is pretty random, and that there is no trend of improvement as we go through the episodes. This is basically what we expected.

In the plot of reward versus episodes, we can see that there are some episodes in which the reward reaches 200. We can end the training phase whenever this occurs since there is no room to improve. Incorporating this change, we now have the following for the training phase:

 >>> n_episode = 1000
>>> best_total_reward = 0
>>> best_weight = None
>>> total_rewards = []
>>> for episode in range(n_episode):
... weight = torch.rand(n_state, n_action)
... total_reward = run_episode(env, weight)
... print('Episode {}: {}'.format(episode+1, total_reward))
... if total_reward > best_total_reward:
... best_weight = weight
... best_total_reward = total_reward
... total_rewards.append(total_reward)
... if best_total_reward == 200:
... break
Episode 1: 9.0
Episode 2: 8.0
Episode 3: 10.0
Episode 4: 10.0
Episode 5: 10.0
Episode 6: 9.0
Episode 7: 17.0
Episode 8: 10.0
Episode 9: 43.0
Episode 10: 10.0
Episode 11: 10.0
Episode 12: 106.0
Episode 13: 8.0
Episode 14: 32.0
Episode 15: 98.0
Episode 16: 10.0
Episode 17: 200.0

The policy achieving the maximal reward is found in episode 17. Again, this may vary a lot because the weights are generated randomly for each episode. To compute the expectation of training episodes needed, we can repeat the preceding training process 1,000 times and take the average of the training episodes:

 >>> n_training = 1000
>>> n_episode_training = []
>>> for _ in range(n_training):
... for episode in range(n_episode):
... weight = torch.rand(n_state, n_action)
... total_reward = run_episode(env, weight)
... if total_reward == 200:
... n_episode_training.append(episode+1)
... break
>>> print('Expectation of training episodes needed: ',
sum(n_episode_training) / n_training)
Expectation of training episodes needed: 13.442

On average, we expect that it takes around 13 episodes to find the best policy.

主站蜘蛛池模板: 汽车| 德安县| 忻州市| 商洛市| 通榆县| 蓝山县| 武汉市| 北流市| 垦利县| 汝南县| 沈丘县| 资源县| 田林县| 遂平县| 乌兰察布市| 定陶县| 霍山县| 道孚县| 定安县| 林口县| 乌兰浩特市| 肃南| 嘉峪关市| 从江县| 青川县| 沛县| 垣曲县| 思南县| 江山市| 宜兰市| 江津市| 会昌县| 禹城市| 格尔木市| 乐安县| 永仁县| 宣化县| 苍梧县| 牟定县| 榆社县| 甘南县|