
There's more...

We can observe that the reward reaches the maximum value within the first 100 episodes. Can we simply stop training when the reward reaches 200, as we did with the random search policy? That might not be a good idea. Remember that the agent makes continuous improvements in hill climbing: even if it finds a weight that generates the maximum reward, it can still search around that weight for the optimal point. Here, we define the optimal policy as one that can solve the CartPole problem. According to the CartPole wiki page, the problem is considered solved when the average reward over 100 consecutive episodes is no less than 195.
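This "solved" criterion can be captured in a small helper function (a hypothetical sketch, not part of the recipe's code; the names `is_solved`, `window`, and `threshold` are my own):

```python
def is_solved(total_rewards, window=100, threshold=195.0):
    """Return True once the average reward over the latest
    `window` episodes is at least `threshold`."""
    if len(total_rewards) < window:
        # Not enough episodes yet to evaluate the criterion
        return False
    return sum(total_rewards[-window:]) / window >= threshold

print(is_solved([200.0] * 100))  # True
print(is_solved([180.0] * 100))  # False
```

Note that `sum(total_rewards[-100:]) >= 19500` in the training loop below is the same check, with both sides multiplied by 100 to avoid the division.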

We refine the stopping criterion accordingly:

>>> noise_scale = 0.01
>>> best_total_reward = 0
>>> total_rewards = []
>>> for episode in range(n_episode):
...     weight = best_weight + noise_scale * torch.rand(n_state, n_action)
...     total_reward = run_episode(env, weight)
...     if total_reward >= best_total_reward:
...         best_total_reward = total_reward
...         best_weight = weight
...         noise_scale = max(noise_scale / 2, 1e-4)
...     else:
...         noise_scale = min(noise_scale * 2, 2)
...     print('Episode {}: {}'.format(episode + 1, total_reward))
...     total_rewards.append(total_reward)
...     if episode >= 99 and sum(total_rewards[-100:]) >= 19500:
...         break
...
Episode 1: 9.0
Episode 2: 9.0
Episode 3: 10.0
Episode 4: 10.0
Episode 5: 9.0
...
Episode 133: 200.0
Episode 134: 200.0
Episode 135: 200.0
Episode 136: 200.0
Episode 137: 200.0

At episode 137, the problem is considered solved.
