
The decision-making process 

A learning agent's high-level algorithm looks like the following:

  1. Take note of what state you're in.
  2. Take an action based on your policy and receive a reward.
  3. Take note of the reward you received by taking that action in that state.

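To make this loop concrete, here is a minimal Python sketch of the observe-act-record cycle. The two-state toy environment, the step() function, and the random placeholder policy are illustrative assumptions rather than the book's own example:

    import random

    # A toy environment, purely illustrative: step() returns the next state
    # and a reward for the action taken in the given state.
    def step(state, action):
        next_state = random.choice([0, 1])
        reward = 1.0 if (state, action) == (0, 1) else 0.0
        return next_state, reward

    def policy(state):
        # Placeholder policy: choose an action at random.
        return random.choice([0, 1])

    state = 0
    history = []
    for _ in range(5):
        action = policy(state)                     # 1. note the state, act on the policy
        next_state, reward = step(state, action)   # 2. receive a reward
        history.append((state, action, reward))    # 3. note the reward for that state-action pair
        state = next_state

    print(history)
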
We can express this mathematically using a Markov decision process (MDP). We'll discuss MDPs in more detail throughout the book. For now, we need to be aware that an MDP describes an environment for RL in which the current state tells us everything we need to know about future states. 

In short, this means that if we know the current state of the environment in an MDP, we don't need to know anything about past states to determine what future states will be, or to decide what action to take in the current state. The following diagram illustrates an MDP:

The preceding diagram shows a stochastic MDP with three states: S0, S1, and S2. When we are in state S0, we can take either action a0 or action a1. If we take action a0, there is a 50% chance that we will end up in state S2 and a 50% chance we will end up back in state S0. If we take action a1, there is a 100% chance we will end up in state S2, and so on.
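
This transition structure can be written down as a simple lookup table. In the following sketch, only the entries for state S0 come from the description above; the entries for S1 and S2 (the "and so on") are placeholder assumptions:

    import random

    # Transition probabilities for the MDP described above.
    # Only the S0 entries are taken from the text; the rest are assumed placeholders.
    transitions = {
        ("S0", "a0"): [("S2", 0.5), ("S0", 0.5)],
        ("S0", "a1"): [("S2", 1.0)],
        ("S1", "a0"): [("S0", 1.0)],                 # assumed
        ("S2", "a1"): [("S0", 0.6), ("S2", 0.4)],    # assumed
    }

    def sample_next_state(state, action):
        # Sample the next state according to the transition probabilities.
        outcomes, probs = zip(*transitions[(state, action)])
        return random.choices(outcomes, weights=probs, k=1)[0]

    print(sample_next_state("S0", "a0"))   # S2 or S0, each about half the time
    print(sample_next_state("S0", "a1"))   # always S2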

Different actions can have different probable outcomes, which we determine over time through observation. In the MDPs we work with, there is a reward associated with each step, and our goal is to maximize the rewards we receive by learning the outcomes of each action we choose to take. The more we learn about the environment over time, the better positioned we are to take the high-reward actions we've already seen.
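
One simple way to keep track of the outcomes of each action is to maintain a running average of the reward observed for each state-action pair. The following sketch is only an illustration of that idea, not the learning algorithm the book develops later:

    from collections import defaultdict

    counts = defaultdict(int)
    value_estimates = defaultdict(float)

    def record(state, action, reward):
        # Update the running average reward observed for this state-action pair.
        key = (state, action)
        counts[key] += 1
        value_estimates[key] += (reward - value_estimates[key]) / counts[key]

    def best_action(state, actions):
        # Pick the action with the highest estimated reward seen so far.
        return max(actions, key=lambda a: value_estimates[(state, a)])

    record("S0", "a0", 0.0)
    record("S0", "a1", 1.0)
    print(best_action("S0", ["a0", "a1"]))   # a1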

Recall that because this environment is stochastic, we do not always end up in the same state when we take a given action. If the environment were deterministic, taking the same action in the same state would always lead to the same next state.
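
The difference is easy to see in code. In this small sketch, the 50/50 split mirrors the S0/a0 example above and everything else is assumed; the stochastic step can return different next states for the same input, while the deterministic step cannot:

    import random

    def stochastic_step(state, action):
        # The same state and action can lead to different next states.
        return random.choice(["S0", "S2"])

    def deterministic_step(state, action):
        # The same state and action always leads to the same next state.
        return "S2"

    print([stochastic_step("S0", "a0") for _ in range(5)])      # a mix of S0 and S2
    print([deterministic_step("S0", "a0") for _ in range(5)])   # always S2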

Stochastic environments are also referred to as probabilistic. That is, they incorporate inherent randomness, so the same parameter values and initial conditions can lead to different outcomes. Virtually all natural processes in the real world are stochastic to some degree.

As we'll discuss later in the book, sources of randomness and probability can be modeled as our own uncertainty about an environment rather than as a property inherent to the environment itself. In other words, an event will either happen or not happen; probability and randomness are not properties inherent to that event, but exist only in our perception of it. In this model, probability is therefore inherently subjective.

This formulation of stochastic processes is a foundational concept of Bayesian reasoning and the source of many useful mathematical models of agency that are driven by belief and the continual updating of knowledge based on observation. We'll dive deeper into these topics in Chapter 8, Further Q-Learning Research and Future Projects, when we talk about multi-armed bandits and optimization processes, but they are useful to investigate in other contexts as well.
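
As a small preview of that idea, the following sketch updates a belief about how often an action pays off as new observations arrive. The Beta prior and the made-up observations are illustrative assumptions, not an example from Chapter 8:

    # Belief about an action's success rate, expressed as a Beta(alpha, beta) distribution.
    alpha, beta = 1.0, 1.0          # uniform prior: we know nothing yet

    observations = [1, 0, 1, 1, 0]  # 1 = the action paid off, 0 = it did not
    for outcome in observations:
        alpha += outcome            # each success strengthens the belief...
        beta += 1 - outcome         # ...each failure weakens it

    print(alpha / (alpha + beta))   # posterior mean belief: 4/7, roughly 0.57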
