- PyTorch 1.x Reinforcement Learning Cookbook
- Yuxi (Hayden) Liu
- 2021-06-24 12:34:42
There's more...
We can observe that the reward reaches the maximum value within the first 100 episodes. Can we just stop training when the reward reaches 200, as we did with the random search policy? That might not be a good idea. Remember that the agent makes continuous improvements in hill climbing: even if it finds a weight that generates the maximum reward, it can still search around this weight for the optimal point. Here, we define the optimal policy as the one that can solve the CartPole problem. According to the environment's wiki page, "solved" means that the average reward over 100 consecutive episodes is no less than 195.
We refine the stopping criterion accordingly:
>>> # env, n_state, n_action, n_episode, run_episode, and best_weight
>>> # are assumed to be defined earlier in the recipe
>>> noise_scale = 0.01
>>> best_total_reward = 0
>>> total_rewards = []
>>> for episode in range(n_episode):
...     # Perturb the best weight found so far with scaled random noise
...     weight = best_weight + noise_scale * torch.rand(n_state, n_action)
...     total_reward = run_episode(env, weight)
...     if total_reward >= best_total_reward:
...         # Improvement: keep the new weight and narrow the search
...         best_total_reward = total_reward
...         best_weight = weight
...         noise_scale = max(noise_scale / 2, 1e-4)
...     else:
...         # No improvement: widen the search
...         noise_scale = min(noise_scale * 2, 2)
...     print('Episode {}: {}'.format(episode + 1, total_reward))
...     total_rewards.append(total_reward)
...     # Stop once the last 100 episodes average at least 195
...     if episode >= 99 and sum(total_rewards[-100:]) >= 19500:
...         break
...
Episode 1: 9.0
Episode 2: 9.0
Episode 3: 10.0
Episode 4: 10.0
Episode 5: 9.0
……
Episode 133: 200.0
Episode 134: 200.0
Episode 135: 200.0
Episode 136: 200.0
Episode 137: 200.0
At episode 137, the problem is considered solved.
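The two rules used in the loop, the adaptive noise scale and the 100-episode stopping check, can also be pulled out into small helpers. The following is only a minimal sketch in plain Python: the function names `update_noise_scale` and `is_solved`, and the parameters `lower`, `upper`, `window`, and `threshold`, are hypothetical and do not come from the recipe. The reward list at the end is synthetic and only illustrates the check. It is not output from a training run.

```python
def update_noise_scale(noise_scale, improved, lower=1e-4, upper=2.0):
    """Halve the noise scale after an improvement, double it otherwise.

    The result is clamped to [lower, upper]; the defaults mirror the
    bounds used in the loop above (1e-4 and 2).
    """
    if improved:
        return max(noise_scale / 2, lower)
    return min(noise_scale * 2, upper)


def is_solved(total_rewards, window=100, threshold=195.0):
    """Return True if the mean of the last `window` rewards >= threshold.

    Equivalent to `sum(total_rewards[-100:]) >= 19500` once at least
    100 episodes have been recorded.
    """
    if len(total_rewards) < window:
        return False
    recent = total_rewards[-window:]
    return sum(recent) / window >= threshold


# Synthetic rewards for illustration only (not results from the recipe):
# 37 poor episodes followed by 100 perfect ones.
rewards = [9.0] * 37 + [200.0] * 100

print(is_solved(rewards[:100]))  # only 63 of the first 100 episodes are perfect
print(is_solved(rewards))        # the last 100 episodes all score 200
print(update_noise_scale(0.01, improved=True))
print(update_noise_scale(0.01, improved=False))
```

Inside the training loop, `noise_scale = update_noise_scale(noise_scale, total_reward >= best_total_reward)` and `if is_solved(total_rewards): break` would stand in for the inline expressions, provided the comparison is evaluated before `best_total_reward` is updated.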