
The decision-making process 

A learning agent's high-level algorithm looks like the following:

  1. Take note of what state you're in.
  2. Take an action based on your policy and receive a reward.
  3. Take note of the reward you received by taking that action in that state.

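To make this loop concrete, here is a minimal Python sketch of the observe-act-record cycle. The two-state toy environment, the step() function, and the random placeholder policy are illustrative assumptions rather than the book's own example:

    import random

    # A toy environment, purely illustrative: step() returns the next state
    # and a reward for the action taken in the given state.
    def step(state, action):
        next_state = random.choice([0, 1])
        reward = 1.0 if (state, action) == (0, 1) else 0.0
        return next_state, reward

    def policy(state):
        # Placeholder policy: choose an action at random.
        return random.choice([0, 1])

    state = 0
    history = []
    for _ in range(5):
        action = policy(state)                     # 1. note the state, act on the policy
        next_state, reward = step(state, action)   # 2. receive a reward
        history.append((state, action, reward))    # 3. note the reward for that state-action pair
        state = next_state

    print(history)
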
We can express this mathematically using a Markov decision process (MDP). We'll discuss MDPs in more detail throughout the book. For now, we need to be aware that an MDP describes an environment for RL in which the current state tells us everything we need to know about future states. 

In short, this means that if we know the current state of the environment in an MDP, we don't need to know anything about past states to determine what future states will be, or to decide what action to take in the current state. The following diagram illustrates an MDP:

The preceding diagram shows a stochastic MDP with three states: S0, S1, and S2. When we are in state S0, we can take either action a0 or action a1. If we take action a0, there is a 50% chance that we will end up in state S2 and a 50% chance we will end up back in state S0. If we take action a1, there is a 100% chance we will end up in state S2, and so on.
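
This transition structure can be written down as a simple lookup table. In the following sketch, only the entries for state S0 come from the description above; the entries for S1 and S2 (the "and so on") are placeholder assumptions:

    import random

    # Transition probabilities for the MDP described above.
    # Only the S0 entries are taken from the text; the rest are assumed placeholders.
    transitions = {
        ("S0", "a0"): [("S2", 0.5), ("S0", 0.5)],
        ("S0", "a1"): [("S2", 1.0)],
        ("S1", "a0"): [("S0", 1.0)],                 # assumed
        ("S2", "a1"): [("S0", 0.6), ("S2", 0.4)],    # assumed
    }

    def sample_next_state(state, action):
        # Sample the next state according to the transition probabilities.
        outcomes, probs = zip(*transitions[(state, action)])
        return random.choices(outcomes, weights=probs, k=1)[0]

    print(sample_next_state("S0", "a0"))   # S2 or S0, each about half the time
    print(sample_next_state("S0", "a1"))   # always S2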

Different actions can have different probable outcomes, which we determine over time through observation. In the MDPs we work with, there is a reward associated with each step, and our goal is to maximize the rewards we receive by learning the outcomes of each action we choose to take. The more we learn about the environment over time, the better positioned we are to take the high-reward actions we've already seen.
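
One simple way to keep track of the outcomes of each action is to maintain a running average of the reward observed for each state-action pair. The following sketch is only an illustration of that idea, not the learning algorithm the book develops later:

    from collections import defaultdict

    counts = defaultdict(int)
    value_estimates = defaultdict(float)

    def record(state, action, reward):
        # Update the running average reward observed for this state-action pair.
        key = (state, action)
        counts[key] += 1
        value_estimates[key] += (reward - value_estimates[key]) / counts[key]

    def best_action(state, actions):
        # Pick the action with the highest estimated reward seen so far.
        return max(actions, key=lambda a: value_estimates[(state, a)])

    record("S0", "a0", 0.0)
    record("S0", "a1", 1.0)
    print(best_action("S0", ["a0", "a1"]))   # a1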

Recall that because this environment is stochastic, we do not always end up in the same state when we take a given action. If the environment were deterministic, taking the same action in the same state would always lead to the same next state.
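
The difference is easy to see in code. In this small sketch, the 50/50 split mirrors the S0/a0 example above and everything else is assumed; the stochastic step can return different next states for the same input, while the deterministic step cannot:

    import random

    def stochastic_step(state, action):
        # The same state and action can lead to different next states.
        return random.choice(["S0", "S2"])

    def deterministic_step(state, action):
        # The same state and action always leads to the same next state.
        return "S2"

    print([stochastic_step("S0", "a0") for _ in range(5)])      # a mix of S0 and S2
    print([deterministic_step("S0", "a0") for _ in range(5)])   # always S2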

Stochastic environments are also referred to as probabilistic. That is, they incorporate inherent randomness, so the same parameter values and initial conditions can lead to different outcomes. Virtually all natural processes in the real world are stochastic to some degree.

As we'll discuss later in the book, sources of randomness and probability can be modeled as our own uncertainty about an environment rather than as a property inherent to the environment itself. In other words, an event will either happen or not happen; probability and randomness are not properties inherent to that event, but exist only in our perception of it. In this model, probability is therefore inherently subjective.

This formulation of stochastic processes is a foundational concept of Bayesian reasoning and the source of many useful mathematical models of agency that are driven by belief and the continual updating of knowledge based on observation. We'll dive deeper into these topics in Chapter 8, Further Q-Learning Research and Future Projects, when we talk about multi-armed bandits and optimization processes, but they are useful to investigate in other contexts as well.
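
As a small preview of that idea, the following sketch updates a belief about how often an action pays off as new observations arrive. The Beta prior and the made-up observations are illustrative assumptions, not an example from Chapter 8:

    # Belief about an action's success rate, expressed as a Beta(alpha, beta) distribution.
    alpha, beta = 1.0, 1.0          # uniform prior: we know nothing yet

    observations = [1, 0, 1, 1, 0]  # 1 = the action paid off, 0 = it did not
    for outcome in observations:
        alpha += outcome            # each success strengthens the belief...
        beta += 1 - outcome         # ...each failure weakens it

    print(alpha / (alpha + beta))   # posterior mean belief: 4/7, roughly 0.57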
