- Hands-On Q-Learning with Python
- Nazia Habib
States and actions in Taxi-v2
So, what are the 500 states that the taxi environment can be in, and what are the actions it can take from those states? Let's take a look at this in action.
You instantiate a taxi environment in OpenAI Gym and render it. The small grid printed underneath the code is your game environment: the yellow rectangle represents the taxi agent, and the four letters indicate the pickup and drop-off locations for your passengers.
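A minimal sketch of that setup might look like the following (it assumes an older Gym release in which the Taxi-v2 environment ID is still registered; newer releases ship Taxi-v3 instead):

import gym

# Create the Taxi-v2 environment (requires an older Gym release
# that still registers this environment ID).
env = gym.make("Taxi-v2")

state = env.reset()   # start in a random state
env.render()          # print the 5 x 5 grid to the console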

We have a 5 x 5 game grid, which entails the following:
- There are 25 possible spaces for the taxi agent to be in at any time.
- There are 5 possible locations for the passenger: inside the taxi or at any of the 4 marked pickup/drop-off points.
- There are 4 possible destinations (each of the 4 marked locations can be the one where the passenger actually wants to be dropped off).
This gives 25 x 5 x 4 = 500 possible states.
The state we are enumerating could, therefore, be represented with the following state vector:
S = <taxi location, passenger location, destination location>
The three variables in the state vector represent the three factors that could change in each state.
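As a quick sanity check, we can confirm the state count against the environment and convert between the flat state index and the state vector. The sketch below assumes the underlying Taxi implementation exposes the encode and decode helpers found in standard Gym releases (reached through env.unwrapped):

print(env.observation_space.n)             # 500 states

# Encode the state vector <taxi at row 3, column 1; passenger at
# location 2; destination index 0> into a single integer, then
# decode it back into its components.
encoded = env.unwrapped.encode(3, 1, 2, 0)
print(encoded)
print(list(env.unwrapped.decode(encoded)))  # [3, 1, 2, 0]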
Some of the states that we'll enumerate in our list of 500 are unreachable. For example, if the passenger is at the correct destination, then that iteration of the game is over. The taxi must also be at the destination at that point, since this is a Terminal state and the taxi will not make any additional moves. So, any state that has the passenger at the destination and the taxi at a different location will never be encountered, but we still represent these states in the state space for simplicity.
The six possible actions in Taxi-v2 are as follows:
- South (0)
- North (1)
- East (2)
- West (3)
- Pickup (4)
- Drop-off (5)
These actions are discrete and deterministic; at each step, we choose an action to take based on the Q-learning algorithm we will design and implement. If we have no algorithm in place yet, we can choose to take a random action. Notice that we cannot take every possible action from every state. We cannot turn left (that is, west) if there is a wall to our left, for example.
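We can also confirm the size of the action space and draw a random action. The name mapping below is just our own labelling for readable output; it is not part of the Gym API:

print(env.action_space.n)   # 6 actions

# Human-readable labels for the six action indices (our own labels).
action_names = ["South", "North", "East", "West", "Pickup", "Drop-off"]

action = env.action_space.sample()    # choose a random action
print(action, action_names[action])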
We reset the environment and go to a random state when we start an episode:
state = env.reset()
The agent chooses an action to take in the environment, and every time it does, the environment returns four variables:
- observation: This refers to the new state that we are in.
- reward: This indicates the reward that we have received.
- done: This tells us whether we have successfully dropped off the passenger at the correct location.
- info: This provides us with any additional information that we may need for debugging.
We collect these variables as follows:
observation, reward, done, info = env.step(env.action_space.sample())
This will cause the agent to take one step through the task loop and have the environment return the required variables. It will be useful to keep track of these variables and report them as we step through our taxi model, which we'll start building in the next chapter.
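For example, a minimal loop that takes random actions and reports these variables at each step might look like this (the print format is just one possible way to track them):

state = env.reset()
done = False
total_reward = 0

while not done:
    action = env.action_space.sample()                  # no policy yet, so act randomly
    observation, reward, done, info = env.step(action)
    total_reward += reward
    print(observation, reward, done, info)
    state = observation                                 # move on to the new state

print("Episode finished, total reward:", total_reward)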
A potential taxi reward structure might work as follows:
- A reward of +20 for successfully dropping off the passenger
- A penalty of 10 for an incorrect pickup or drop-off (that is, a reward of -10)
- A reward of -1 for every other action, such as moving one square in any direction
The longer the agent takes to execute a successful drop-off, the more points it loses. We don't want the agent to take unnecessary steps, but we also don't want it to attempt illegal pickups or drop-offs in order to finish faster, so the penalty for an ordinary movement step (-1) is kept small compared to the penalty for an illegal action (-10). An optimal solution is one in which the agent completes a correct drop-off in the minimum number of timesteps.
Again, once the Q-table converges to its final values and stops updating with each iteration, the agent will have discovered the optimal path to reaching the destination as quickly as possible.
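As a preview of the update rule we will build in the next chapter, a minimal sketch of a Q-table and the standard Q-learning update looks like the following (the values of alpha and gamma are illustrative):

import numpy as np

# One row per state and one column per action: a 500 x 6 table for Taxi-v2.
Q = np.zeros([env.observation_space.n, env.action_space.n])

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # Standard Q-learning update: move Q[state, action] towards the observed
    # reward plus the discounted best value of the next state.
    Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])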