There's more...

We can observe that the reward reaches the maximum value within the first 100 episodes. Can we just stop training when the reward reaches 200, as we did with the random search policy? That might not be a good idea. Remember that the agent makes continuous improvements in hill climbing: even if it finds a weight that generates the maximum reward, it can still search around this weight for the optimal point. Here, we define the optimal policy as one that can solve the CartPole problem. According to the OpenAI Gym wiki page for CartPole-v0, "solved" means that the average reward over 100 consecutive episodes is no less than 195.

We refine the stopping criterion accordingly:

>>> noise_scale = 0.01
>>> best_total_reward = 0
>>> total_rewards = []
>>> for episode in range(n_episode):
...     # Perturb the best weight found so far with scaled random noise
...     weight = best_weight + noise_scale * torch.rand(n_state, n_action)
...     total_reward = run_episode(env, weight)
...     if total_reward >= best_total_reward:
...         best_total_reward = total_reward
...         best_weight = weight
...         # Reward matched or beat the best: narrow the search, floor of 1e-4
...         noise_scale = max(noise_scale / 2, 1e-4)
...     else:
...         # Reward got worse: widen the search, capped at 2
...         noise_scale = min(noise_scale * 2, 2)
...     print('Episode {}: {}'.format(episode + 1, total_reward))
...     total_rewards.append(total_reward)
...     # Stop once the last 100 episodes average at least 195 (sum >= 19500)
...     if episode >= 99 and sum(total_rewards[-100:]) >= 19500:
...         break
...
Episode 1: 9.0
Episode 2: 9.0
Episode 3: 10.0
Episode 4: 10.0
Episode 5: 9.0
……
……
Episode 133: 200.0
Episode 134: 200.0
Episode 135: 200.0
Episode 136: 200.0
Episode 137: 200.0

At episode 137, the problem is considered solved.
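
The stopping check in the loop above compares the sum of the last 100 rewards against 19500, which is the same as requiring a mean of at least 195 over 100 episodes. As a minimal sketch, the check could be factored into a helper so the window and threshold are explicit. The function name `is_solved` and its parameters `window` and `threshold` are hypothetical and not part of the recipe, and the reward lists are made-up numbers used only to exercise the check:

```python
def is_solved(total_rewards, window=100, threshold=195.0):
    """Return True once the mean of the last `window` rewards is >= threshold.

    Hypothetical helper: equivalent to the recipe's
    `episode >= 99 and sum(total_rewards[-100:]) >= 19500` check
    when window=100 and threshold=195.0.
    """
    if len(total_rewards) < window:
        # Not enough episodes yet to fill one full window
        return False
    recent = total_rewards[-window:]
    return sum(recent) / window >= threshold


# Illustrative reward histories (made-up numbers, not results from the recipe)
too_short = [200.0] * 99                        # only 99 episodes recorded
warm_up_then_max = [10.0] * 37 + [200.0] * 100  # last 100 episodes all at 200
mixed = [200.0] * 97 + [9.0] * 3                # three poor episodes in the window

print(is_solved(too_short))         # False
print(is_solved(warm_up_then_max))  # True
print(is_solved(mixed))             # False: mean is 194.27
```

Inside the training loop, the last two lines of the body would then read `if is_solved(total_rewards): break`.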
