The random CartPole agent

Although the environment is much more complex than our first example in The anatomy of the agent section, the code of the agent is much shorter. This is the power of reusability, abstractions, and third-party libraries!

So, here is the code (you can find it in Chapter02/02_cartpole_random.py):

import gym

if __name__ == "__main__":
    env = gym.make("CartPole-v0")
    total_reward = 0.0
    total_steps = 0
    obs = env.reset()

Here, we create the environment and initialize the step counter and the reward accumulator. On the last line, we reset the environment to obtain the first observation (which we won't use, as our agent chooses actions randomly and ignores observations):

    while True:
        action = env.action_space.sample()
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        total_steps += 1
        if done:
            break

    print("Episode done in %d steps, total reward %.2f" % (total_steps, total_reward))

In this loop, we sample a random action, then ask the environment to execute it and return the next observation (obs), the reward, and the done flag. If the episode is over, we stop the loop and report how many steps we have taken and how much reward has been accumulated. If you run this example, you will see something like the following (not exactly the same output, due to the agent's randomness):

rl_book_samples/Chapter02$ python 02_cartpole_random.py
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Episode done in 12 steps, total reward 12.00

As with the interactive session, the warning is not related to our code, but to Gym's internals. On average, our random agent takes 12–15 steps before the pole falls and the episode ends. Most of the environments in Gym have a "reward boundary," which is the average reward that the agent should gain over 100 consecutive episodes to "solve" the environment. For CartPole, this boundary is 195, which means that, on average, the agent must hold the pole for 195 time steps or longer. From this perspective, our random agent's performance looks poor. However, don't be disappointed too early: we are just at the beginning, and soon we will solve CartPole and many other much more interesting and challenging environments.
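Since the reward boundary is defined over 100 consecutive episodes, it is natural to check how far the random agent is from solving the environment by averaging its reward over many runs. The following sketch does exactly that; `average_random_reward` is a hypothetical helper (not part of the chapter's code) and assumes the same classic Gym API used above, where `env.step()` returns four values:

```python
import gym


def average_random_reward(env_name="CartPole-v0", episodes=100):
    # Run a purely random policy for the given number of episodes
    # and return the mean total reward per episode.
    env = gym.make(env_name)
    rewards = []
    for _ in range(episodes):
        env.reset()
        episode_reward = 0.0
        done = False
        while not done:
            # Sample an action uniformly from the action space,
            # ignoring the observation entirely.
            action = env.action_space.sample()
            _, reward, done, _ = env.step(action)
            episode_reward += reward
        rewards.append(episode_reward)
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    avg = average_random_reward()
    print("Average reward over 100 episodes: %.2f" % avg)
```

For CartPole, this average stays far below the boundary of 195, confirming that random action selection is not enough to solve the environment.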
