- Hands-On Q-Learning with Python
- Nazia Habib
What is RL?
An RL agent is an optimization process that learns from experience, using data it collects through its own observations of its environment. It starts out with no explicit knowledge of the task: it learns by trial and error what happens when it makes decisions, keeps track of its successful decisions, and makes those same decisions under the same circumstances in the future.
In fields other than AI, RL is also studied under the name approximate dynamic programming. It takes much of its basic operating structure from behavioral psychology, and many of its mathematical constructs, such as utility functions, come from fields such as economics and game theory.
Let's get familiar with some key concepts in RL:
- Agent: This is the decision-making entity.
- Environment: This is the world in which the agent operates, such as a game to win or task to accomplish.
- State: This is where the agent is in its environment. When you define the states that an agent can be in, think about what it needs to know about its environment. For example, a self-driving car will need to know whether the next traffic light is red or green and whether there are pedestrians in the crosswalk; these are defined as state variables.
- Action: This is the next move that the agent chooses to take.
- Reward: This is the feedback that the agent gets from the environment for taking that action.
- Policy: This is a function that maps the agent's states to its actions. For your first RL agent, it will be as simple as a lookup table, called the Q-table, which will act as your agent's brain (see the sketch after this list).
- Value: This is the long-term reward the agent can expect from taking an action, accounting for the future actions it could take afterwards, as opposed to the immediate reward it receives for that one action (value is also commonly called utility).
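To make these terms concrete, here is a minimal, hypothetical sketch in Python. The environment, states, actions, and rewards are invented purely for illustration, and no learning happens yet, so the Q-table stays empty:

```python
import random

# A toy, hypothetical environment invented for illustration: the agent starts
# in state 0 and is rewarded for reaching state 3.
class ToyEnvironment:
    def __init__(self):
        self.state = 0                      # where the agent currently is

    def step(self, action):
        # Actions are +1 (move right) and -1 (move left). The environment
        # returns the next state and a reward (its feedback for that action).
        self.state = max(0, min(3, self.state + action))
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3
        return self.state, reward, done

ACTIONS = (+1, -1)

# The policy's lookup table (the Q-table): it maps (state, action) pairs to
# value estimates. It starts empty here; a learning algorithm would fill it in.
q_table = {}

def choose_action(state, epsilon=0.1):
    # Mostly act greedily with respect to the Q-table, occasionally explore.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))

env = ToyEnvironment()
state, done = env.state, False
while not done:
    action = choose_action(state)           # the agent's next move
    state, reward, done = env.step(action)  # the environment's feedback
```

The agent observes a state, consults its Q-table policy to pick an action, and receives a reward and a new state from the environment; a learning algorithm such as Q-learning would use that feedback to update the table.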
The first type of RL agent that you will create is a model-free agent. A model-free RL agent knows nothing about a state it has not seen, so it cannot estimate the reward it would receive from an unknown state. In other words, it cannot generalize about its environment. We will explore the differences between model-free and model-based learning in greater depth later in the book.
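One way to picture that limitation is that a tabular Q-function only holds entries for state-action pairs the agent has actually encountered; anything else falls back to a default of zero. A minimal sketch, with hypothetical state and action names:

```python
from collections import defaultdict

# A tabular Q-function is just a lookup table keyed by (state, action).
Q = defaultdict(float)               # unseen pairs default to 0.0

Q[("seen_state", "right")] = 0.75    # a value learned from experience
print(Q[("unseen_state", "right")])  # 0.0 -- no model, no estimate, no generalization
```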
The two major model-free RL algorithms are called Q-learning and state-action-reward-state-action (SARSA). The algorithm that we will use throughout the book is Q-learning.
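Both algorithms maintain a Q-table and nudge its entries toward a target after each step; the difference lies in how they estimate the value of the next state. Here is a sketch of the two update rules (the function and variable names are mine; alpha is the learning rate and gamma is the discount factor):

```python
def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Q-learning (off-policy): bootstrap on the best action available in the
    # next state, regardless of which action the agent will actually take.
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (td_target - q.get((s, a), 0.0))

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # SARSA (on-policy): bootstrap on the action the agent actually chose in
    # the next state, hence state-action-reward-state-action.
    td_target = r + gamma * q.get((s_next, a_next), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (td_target - q.get((s, a), 0.0))
```

SARSA bootstraps on the action the agent will actually take next, while Q-learning bootstraps on the best action available; this is the on-policy versus off-policy distinction covered in the comparison section mentioned below.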
As we will see in the SARSA versus Q-learning – on-policy or off? section comparing the two algorithms, Q-learning can be treated as a variant of SARSA. We choose Q-learning as our introductory RL algorithm because it is relatively simple and straightforward to learn. As we build up our RL skills, we can branch out into other algorithms that are more complicated to learn but can give us better results.