- Hands-On Q-Learning with Python
- Nazia Habib
Actions and rewards
An action is any decision that we make from a state. The state that we are in determines the actions we can take. If we are in a maze and to the right of a wall, we can't turn left, but in other locations, we can turn left. Turning left may or may not be in the list of possible actions that we can take in any particular state.
A reward is the outcome we receive for making a decision in an environment. Our Q-learning agent will keep track of the rewards it receives and will try to maximize the future rewards that it expects to receive with each action it takes.
The reward function for a driving simulator can be something straightforward, such as the following (a rough code sketch appears after the list):
- +1 for moving one block
- +10 for making the correct move that will get you closer to the destination
- +100 for reaching the destination
- -20 for violating traffic laws
- -50 for hitting another vehicle
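Here is one way such a reward scheme might be expressed in code. This is a minimal sketch: the event names and values mirror the list above and are purely illustrative, not part of any real simulator's API.

```python
# Illustrative reward values for a hypothetical driving simulator.
# The event names and numbers mirror the list above; they are not
# part of any real environment's API.
REWARDS = {
    'moved_one_block': 1,
    'correct_move_toward_destination': 10,
    'reached_destination': 100,
    'violated_traffic_law': -20,
    'hit_vehicle': -50,
}

def reward_for(event):
    """Look up the reward for a named driving event (0 if unknown)."""
    return REWARDS.get(event, 0)
```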
Every action we take leads to a reward, and each reward is noted by the learning agent as it explores its environment and learns which actions are best to take in each state. For a Q-learning agent, these values are stored in a Q-table, a simple lookup table that maps each state-action pair to a value. We will be creating a Q-table as part of our first project, which uses the OpenAI Gym Taxi-v2 environment. The following ASCII screenshot shows a representation of the environment:

The Taxi-v2 environment simulates a taxicab driving around a small grid, picking up passengers and dropping them off at the correct locations. Retrieving the action space and state space from our taxi environment lets us know how many discrete actions and states we have:
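A minimal sketch of that retrieval, assuming an older Gym release in which Taxi-v2 is still registered (newer releases replace it with Taxi-v3):

```python
import gym  # assumes an older Gym release where Taxi-v2 is registered

env = gym.make('Taxi-v2')
env.reset()
env.render()  # prints the ASCII grid of the taxi world

print(env.action_space)       # Discrete(6)
print(env.observation_space)  # Discrete(500)
```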

The following is a representation of a Q-table for Taxi-v2. Note that it lists 500 states and 6 actions (South, North, East, West, Pickup, and Dropoff):

When the Q-table is initialized, each state-action pair has a value of zero, because the agent has not seen any rewards yet and has not set a value for any action. Once it explores its environment, it starts to fill in values for each state-action pair and to use those values to decide what actions to take the next time it is in those states.
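As a minimal sketch, and assuming the env object created above, the table can be built as a NumPy array of zeros:

```python
import numpy as np

# One row per state (500), one column per action (6), all values start at zero.
q_table = np.zeros((env.observation_space.n, env.action_space.n))
print(q_table.shape)  # (500, 6)
```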
Suppose that we, as the agent, are in a particular state, such as state 300, with 6 possible actions to choose from (as in the taxi example). Depending on how much exploration we have done and how many iterations we have gone through, each action will have a different value. Let's say that East has a value of 2, Pickup has a value of 9, and all other actions have a value of 0.
Pickup is, therefore, the highest-valued action: taking the argmax over the Q-values for state 300 selects Pickup, so we have a high probability of choosing Pickup as our next action, depending on our hyperparameter values. Given that Pickup is valued above all other actions at this point, it is very likely the correct action to take.
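To make this concrete, here are the hypothetical Q-values for state 300 described above, ordered as in the Q-table columns (South, North, East, West, Pickup, Dropoff); np.argmax returns index 4, which corresponds to Pickup:

```python
import numpy as np

# Hypothetical Q-values for state 300, in the order
# [South, North, East, West, Pickup, Dropoff]
q_state_300 = np.array([0.0, 0.0, 2.0, 0.0, 9.0, 0.0])

best_action = int(np.argmax(q_state_300))
print(best_action)  # 4, i.e. Pickup
```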
Depending on our agent's policy (that is, the function it uses to choose actions based on states), however, it may or may not actually choose Pickup. If it is using an epsilon-greedy strategy, for example, it might choose a random action instead, which could turn out to be completely wrong.
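One common way to implement such a policy is epsilon-greedy action selection. The following is a generic sketch (the function name and the default value of epsilon are illustrative), assuming the q_table array from earlier:

```python
import random
import numpy as np

def choose_action(q_table, state, epsilon=0.1):
    """Epsilon-greedy policy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        # Explore: pick any of the actions uniformly at random.
        return random.randrange(q_table.shape[1])
    # Exploit: pick the currently highest-valued action for this state.
    return int(np.argmax(q_table[state]))
```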
This is important to bear in mind as we choose a decision-making strategy. We do not always want to choose the current highest-valued action, as there may be other higher-valued actions that we haven't discovered yet. This process is called exploration, and we'll discuss several methodologies for using it to find optimal reward paths.