- Hands-On Q-Learning with Python
- Nazia Habib
The decision-making process
A learning agent's high-level algorithm looks like the following (a minimal code sketch follows this list):
- Take note of what state you're in.
- Take an action based on your policy and receive a reward.
- Take note of the reward you received by taking that action in that state.
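As a rough illustration, here is a minimal Python sketch of that loop. It assumes a Gym-style environment interface (`env.reset()`, `env.step()`) and placeholder `policy` and `update` functions; these names are illustrative, not code from this book:

```python
# A minimal sketch of the agent loop described above, assuming a Gym-style
# environment and hypothetical policy/update callables.

def run_episode(env, policy, update, max_steps=100):
    state = env.reset()                              # note what state you're in
    for _ in range(max_steps):
        action = policy(state)                       # act based on your policy
        next_state, reward, done, _ = env.step(action)
        update(state, action, reward, next_state)    # note the reward you received
        state = next_state
        if done:
            break
```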
We can express this mathematically using a Markov decision process (MDP). We'll discuss MDPs in more detail throughout the book. For now, we need to be aware that an MDP describes an environment for RL in which the current state tells us everything we need to know about future states.
What this means, in short, is that if we know the current state of the environment in an MDP, we don't need to know anything about past states in order to determine what future states will be, or to decide what action to take in the current state. The following diagram shows an illustration of an MDP:

The preceding diagram shows a stochastic MDP with three states: S0, S1, and S2. When we are in state S0, we can take either action a0 or action a1. If we take action a0, there is a 50% chance that we will end up in state S2 and a 50% chance we will end up back in state S0. If we take action a1, there is a 100% chance we will end up in state S2, and so on.
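One way to make these transitions concrete is to encode them as a nested dictionary mapping each state and action to a probability distribution over next states. The sketch below fills in only the transitions stated above; the remaining entries would follow the same pattern:

```python
# Transition probabilities from the diagram, stored as
# transitions[state][action] = {next_state: probability}.
# Only the transitions spelled out in the text are filled in here.
transitions = {
    'S0': {
        'a0': {'S0': 0.5, 'S2': 0.5},   # 50% back to S0, 50% on to S2
        'a1': {'S2': 1.0},              # always moves to S2
    },
    # entries for S1 and S2 would follow the same pattern
}
```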
The different actions we choose can have different probable outcomes, which we will learn over time through observation. In the MDPs that we work with, there will be rewards associated with each step, and our goal will be to maximize the rewards we receive by knowing the outcomes of each action we choose to take. The more we learn about this environment over time, the better positioned we are to take high-reward actions we've seen before.
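To make the idea of per-step rewards concrete, we might attach a reward to each transition with a simple lookup table. The values below are made up purely for illustration; the actual rewards depend on the MDP at hand:

```python
# Hypothetical rewards keyed by (state, action, next_state); the numbers are
# illustrative only, not taken from the diagram.
rewards = {
    ('S0', 'a0', 'S0'): 0.0,
    ('S0', 'a0', 'S2'): 1.0,
    ('S0', 'a1', 'S2'): 1.0,
}

# The quantity we want to maximize is the total reward collected along a trajectory.
trajectory = [('S0', 'a0', 'S0'), ('S0', 'a0', 'S2')]
total_reward = sum(rewards[step] for step in trajectory)
print(total_reward)   # 1.0
```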
Recall that because this environment is stochastic, we do not always end up in the same state when we take a given action. If the environment were deterministic, we would always end up in the same state after each action we took.
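We can see this in code by sampling the same state-action pair repeatedly from the transition table sketched earlier. The sampling helper below is our own illustration: in a stochastic environment the outcomes vary, whereas a deterministic environment would return the same next state every time:

```python
import random
from collections import Counter

def sample_next_state(transitions, state, action):
    """Draw a next state according to the stored transition probabilities."""
    dist = transitions[state][action]
    return random.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]

# Using the transition table sketched earlier: taking a0 in S0 a thousand
# times gives a mix of outcomes rather than a single fixed next state.
transitions = {'S0': {'a0': {'S0': 0.5, 'S2': 0.5}, 'a1': {'S2': 1.0}}}
counts = Counter(sample_next_state(transitions, 'S0', 'a0') for _ in range(1000))
print(counts)   # e.g. Counter({'S0': 507, 'S2': 493}) -- roughly 50/50
```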
Stochastic environments are also referred to as probabilistic. That is, they incorporate inherent randomness so that the same parameter values and initial conditions can lead to different outcomes. Virtually all natural processes in the real world are stochastic to some degree and involve some level of randomness.
As we'll discuss later in the book, sources of randomness and probability can be modeled as our own uncertainty about an environment rather than as a property inherent to the environment itself. In other words, an event will either happen or not happen; probability and randomness are not properties inherent to that event, but exist only in our perception of it. In this model, therefore, probability is inherently subjective.
This formulation of stochastic processes is a foundational concept of Bayesian reasoning and the source of many useful mathematical models of agency that are driven by belief and the continual updating of knowledge based on observation. We'll dive deeper into these topics in Chapter 8, Further Q-Learning Research and Future Projects, when we talk about multi-armed bandits and optimization processes, but they are useful to investigate in other contexts as well.