
There's more...

We can observe that the reward reaches the maximum value within the first 100 episodes. Can we simply stop training once the reward hits 200, as we did with the random search policy? That might not be a good idea. Remember that in hill climbing the agent keeps making continuous improvements: even after it finds a weight that yields the maximum reward, it can still search around that weight for the optimal point. Here, we define the optimal policy as one that solves the CartPole problem, and by the commonly used criterion, CartPole is considered solved when the average reward over 100 consecutive episodes is no less than 195.

We refine the stopping criterion accordingly:

>>> noise_scale = 0.01
>>> best_total_reward = 0
>>> total_rewards = []
>>> for episode in range(n_episode):
...     weight = best_weight + noise_scale * torch.rand(n_state, n_action)
...     total_reward = run_episode(env, weight)
...     if total_reward >= best_total_reward:
...         best_total_reward = total_reward
...         best_weight = weight
...         noise_scale = max(noise_scale / 2, 1e-4)
...     else:
...         noise_scale = min(noise_scale * 2, 2)
...     print('Episode {}: {}'.format(episode + 1, total_reward))
...     total_rewards.append(total_reward)
...     if episode >= 99 and sum(total_rewards[-100:]) >= 19500:
...         break
...
Episode 1: 9.0
Episode 2: 9.0
Episode 3: 10.0
Episode 4: 10.0
Episode 5: 9.0
...
Episode 133: 200.0
Episode 134: 200.0
Episode 135: 200.0
Episode 136: 200.0
Episode 137: 200.0

At episode 137, the problem is considered solved: the average reward over the last 100 episodes has reached 195.
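The loop above combines two refinements: an adaptive noise scale and a moving-average stopping test. They can be isolated in a minimal standalone sketch; the helper names `update_noise_scale` and `is_solved` are hypothetical and not part of the recipe's code:

```python
# Hypothetical helpers isolating the two refinements used in the loop above.

def update_noise_scale(noise_scale, improved):
    """Halve the noise scale (floor 1e-4) after an improvement,
    so the search narrows around a good weight; double it (cap 2)
    otherwise, so the search widens after a miss."""
    if improved:
        return max(noise_scale / 2, 1e-4)
    return min(noise_scale * 2, 2)

def is_solved(total_rewards, window=100, threshold=195):
    """Return True once the average reward over the last `window`
    episodes reaches `threshold` (195 for CartPole)."""
    if len(total_rewards) < window:
        return False
    return sum(total_rewards[-window:]) / window >= threshold

print(update_noise_scale(0.01, True))   # narrows the search: 0.005
print(is_solved([200.0] * 100))         # True
```

Note that `is_solved(total_rewards)` is equivalent to the `sum(total_rewards[-100:]) >= 19500` check in the loop, since an average of 195 over 100 episodes is a total of 19,500.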
