- Hands-On Q-Learning with Python
- Nazia Habib
Your Q-learning agent in its environment
Let's talk about the self-driving taxi agent that we'll be building. Recall that the Taxi-v2 environment has 500 states, and 6 possible actions that can be taken from each state.
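As a quick check of those numbers, here is a minimal sketch that creates the environment and prints the sizes of its state and action spaces. It assumes a Gym release in which Taxi-v2 is still registered; newer releases ship the same environment as Taxi-v3.

```python
import gym

# Create the Taxi environment and confirm the sizes quoted above.
env = gym.make('Taxi-v2')       # registered as 'Taxi-v3' in newer Gym releases

print(env.observation_space.n)  # 500 discrete states
print(env.action_space.n)       # 6 discrete actions
```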
Your objective in the taxi environment is to pick up a passenger at one location, and drop them off at their desired destination in as few timesteps as possible.
You receive points for a successful drop-off and lose points for the time it takes to complete the task. You also lose points for incorrect actions, such as dropping the passenger off at the wrong location.
To encourage speed, the environment deducts one point for every move you make, so each extra timestep spent getting to the pickup and drop-off locations lowers your final score.
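The short sketch below shows that reward signal in action, using the classic Gym API in which env.step() returns (state, reward, done, info). It simply takes a few random actions for illustration; each move adds -1 to the running total, while the environment's defaults also give a large bonus for a successful drop-off and a penalty for an illegal pickup or drop-off.

```python
import gym

env = gym.make('Taxi-v2')   # use 'Taxi-v3' on newer Gym releases
state = env.reset()
total_reward = 0

for _ in range(10):
    action = env.action_space.sample()            # random action, for illustration only
    state, reward, done, info = env.step(action)
    total_reward += reward                        # most random steps simply add -1
    if done:
        break

print(total_reward)
```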
Your agent's goal in solving this problem is to find the optimal policy for getting the passenger to their destination as efficiently as possible, netting the maximum reward for itself. While it navigates the environment, it will learn the best action to take from each state, which will serve as its policy function.
Remember that because Q-learning is value-based rather than policy-based, it does not take your agent's actual policy into account, and we will not explicitly enumerate this policy. Instead, the Q-learning algorithm calculates the value of each state-action pair from the reward received plus the highest possible value of the actions available in the next state, thereby assuming that your agent follows the optimal policy from that point on.
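As a concrete illustration, here is a minimal sketch of that update, assuming a NumPy Q-table and illustrative hyperparameter names alpha (learning rate) and gamma (discount factor). It is not the final agent we will build, just the core calculation:

```python
import numpy as np

n_states, n_actions = 500, 6
q_table = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9   # illustrative learning rate and discount factor

def q_update(state, action, reward, next_state):
    # The target uses the highest-valued action available from the next state,
    # regardless of which action the agent actually takes next (off-policy).
    best_next_value = np.max(q_table[next_state])
    td_target = reward + gamma * best_next_value
    q_table[state, action] += alpha * (td_target - q_table[state, action])
```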
We will continue to explore this concept in more detail through the functions that you will write for your agent. The OpenAI Gym package provides the game environment, and you will implement the Q-learning algorithm yourself. You can then use the same environment to implement other RL algorithms and compare their performance.