
The decision-making process 

A learning agent's high-level algorithm looks like the following:

  1. Take note of what state you're in.
  2. Take an action based on your policy and receive a reward.
  3. Take note of the reward you received by taking that action in that state.
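This loop can be sketched in a few lines of code. The following is a minimal, illustrative sketch only: the `env` object with `reset()` and `step()` methods and the random `policy` function are hypothetical placeholders, not part of any particular library.

```python
import random

def policy(state):
    # Placeholder policy: pick one of two actions at random.
    return random.choice([0, 1])

def run_episode(env, num_steps=10):
    state = env.reset()                        # 1. Take note of what state we're in
    history = []
    for _ in range(num_steps):
        action = policy(state)                 # 2. Take an action based on our policy...
        next_state, reward = env.step(action)  #    ...and receive a reward
        history.append((state, action, reward))  # 3. Record the reward for that state-action pair
        state = next_state
    return history
```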

We can express this mathematically using a Markov decision process (MDP). We'll discuss MDPs in more detail throughout the book. For now, we need to be aware that an MDP describes an environment for RL in which the current state tells us everything we need to know about future states. 

What this means, in short, is that if we know the current state of the environment in an MDP, we don't need to know anything about any past states in order to determine what future states will be, or decide what actions to take in this current state. The following diagram shows an illustration of an MDP:

The preceding diagram shows a stochastic MDP with three states: S0, S1, and S2. When we are in state S0, we can take either action a0 or action a1. If we take action a0, there is a 50% chance that we will end up in state S2 and a 50% chance we will end up back in state S0. If we take action a1, there is a 100% chance we will end up in state S2, and so on.
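One simple way to represent these transitions in code is as a nested dictionary mapping each state and action to a list of (next state, probability) pairs. The sketch below encodes only the transitions spelled out in the text above; the remaining states from the diagram would be filled in the same way.

```python
import random

# transition_model[state][action] -> list of (next_state, probability) pairs
transition_model = {
    "S0": {
        "a0": [("S2", 0.5), ("S0", 0.5)],  # 50% to S2, 50% back to S0
        "a1": [("S2", 1.0)],               # 100% to S2
    },
    # "S1": {...}, "S2": {...}  # remaining states omitted here
}

def sample_next_state(state, action):
    """Draw the next state according to the transition probabilities."""
    outcomes = transition_model[state][action]
    states, probs = zip(*outcomes)
    return random.choices(states, weights=probs, k=1)[0]

# Taking a0 in S0 lands in S2 about half the time and back in S0 otherwise.
print(sample_next_state("S0", "a0"))
```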

The different actions we choose can have different probable outcomes that we will determine over time through observation. In the MDPs that we work with, there will be rewards associated with each step, and our goal will be to maximize the rewards we receive by learning the outcomes of each action we choose to take. The more we learn about the environment over time, the better positioned we are to take actions that we have seen yield high rewards.
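As a sketch of what "learning outcomes through observation" can mean, we can count how often each state-action pair leads to each next state and track the average reward it yields. The bookkeeping below is an illustrative assumption about how such experience might be recorded, not a prescribed data structure from the book.

```python
from collections import defaultdict

transition_counts = defaultdict(lambda: defaultdict(int))  # (state, action) -> next_state counts
reward_sums = defaultdict(float)                           # (state, action) -> total reward seen
visit_counts = defaultdict(int)                            # (state, action) -> number of visits

def record(state, action, reward, next_state):
    """Store one observed transition."""
    transition_counts[(state, action)][next_state] += 1
    reward_sums[(state, action)] += reward
    visit_counts[(state, action)] += 1

def estimated_outcomes(state, action):
    """Empirical probability of each next state for this state-action pair."""
    counts = transition_counts[(state, action)]
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def average_reward(state, action):
    """Average reward observed so far for this state-action pair."""
    return reward_sums[(state, action)] / visit_counts[(state, action)]
```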

Recall that because this environment is stochastic, we will not always end up in the same state after taking a given action in a given state. If the environment were deterministic, the same action taken in the same state would always lead to the same next state.

Stochastic environments are also referred to as probabilistic. That is, they incorporate inherent randomness so that the same parameter values and initial conditions can lead to different outcomes. Virtually all natural processes in the real world are stochastic to some degree and involve some level of randomness.

As we'll discuss later in the book, sources of randomness and probability can be modeled as reflecting our own uncertainty about an environment rather than a property inherent to the environment itself. In other words, an event will either happen or not happen. Probability and randomness are not properties inherent to that event; they exist only in our perception of the event. In this model, therefore, probability is inherently subjective.

This formulation of stochastic processes is a foundational concept of Bayesian reasoning and the source of many useful mathematical models of agency that are driven by belief and the continual updating of knowledge based on observation. We'll dive deeper into these topics in Chapter 8, Further Q-Learning Research and Future Projects, when we talk about multi-armed bandits and optimization processes, but they are useful to investigate in other contexts as well.
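As a small taste of belief updating, the sketch below tracks a Bayesian belief about a single bandit arm with win/lose rewards. It assumes a Beta prior over the arm's payout probability, which is one common modeling choice rather than the only one, and the class name and observation sequence are hypothetical.

```python
class BernoulliArmBelief:
    """Belief about the payout probability of one arm with 0/1 rewards."""

    def __init__(self, alpha=1.0, beta=1.0):
        # Beta(1, 1) is a uniform prior: no initial opinion about the arm.
        self.alpha = alpha
        self.beta = beta

    def update(self, reward):
        # reward is 1 (win) or 0 (loss); each observation shifts the belief.
        self.alpha += reward
        self.beta += 1 - reward

    def expected_payout(self):
        # Posterior mean of the Beta distribution.
        return self.alpha / (self.alpha + self.beta)

belief = BernoulliArmBelief()
for observed in [1, 0, 1, 1]:    # four hypothetical pulls of the arm
    belief.update(observed)
print(belief.expected_payout())  # ~0.667 after seeing 3 wins in 4 pulls
```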
