Markov Decision Process

To keep the problem computationally tractable, the agent-environment interaction is modeled as a Markov decision process (MDP). An MDP is a discrete-time stochastic control process.

Stochastic processes are mathematical models used to study the evolution of phenomena that follow random or probabilistic laws. Every natural phenomenon contains a random component, owing both to its very nature and to observational errors. As a result, at every instant t, the observation of the phenomenon yields a random variable s_t. It is not possible to predict with certainty what the result will be; one can only state that it will take one of several possible values, each of which has a given probability.

A stochastic process is called Markovian when, having chosen a certain instant t for observation, the evolution of the process from t onward depends only on the state at t and not on any previous instants. In other words, given the state at the moment of observation, the future evolution of the process does not depend on its past.
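
In symbols, writing s_t for the state observed at instant t, the Markov property can be stated as below. This is the standard textbook formulation for a discrete-time process, given here as an illustration rather than as notation taken from this chapter:

```latex
P(s_{t+1} = s' \mid s_t, s_{t-1}, \dots, s_0) = P(s_{t+1} = s' \mid s_t)
```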

In an MDP, at each time step, the process is in a state s ∈ S, and the decision maker may choose any action a ∈ A that is available in state s. The process responds at the next time step by randomly moving into a new state s' and giving the decision maker a corresponding reward r(s, s').
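
A single step of such a process might be sketched as follows. The two states, the two actions, the probabilities, and the rewards are all hypothetical values chosen for illustration; only the structure (choose an action, move randomly to s', receive r(s, s')) comes from the description above.

```python
import random

# Hypothetical two-state MDP; every name and number here is illustrative.
# TRANSITIONS[state][action] is a list of (next_state, probability) pairs.
TRANSITIONS = {
    "cool": {"slow": [("cool", 1.0)],
             "fast": [("cool", 0.5), ("hot", 0.5)]},
    "hot": {"slow": [("cool", 0.5), ("hot", 0.5)],
            "fast": [("hot", 1.0)]},
}

# REWARDS[(s, s_next)] plays the role of r(s, s').
REWARDS = {
    ("cool", "cool"): 1.0,
    ("cool", "hot"): -1.0,
    ("hot", "cool"): 0.5,
    ("hot", "hot"): -2.0,
}


def step(state, action, rng):
    """Sample the next state s' and return the pair (s', r(s, s'))."""
    next_states, probs = zip(*TRANSITIONS[state][action])
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return next_state, REWARDS[(state, next_state)]


rng = random.Random(0)  # seeded generator so that runs are repeatable
print(step("cool", "fast", rng))
```

Because the next state is drawn at random, repeated calls with the same state and action can return different results; only the transition probabilities are fixed.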

The following diagram shows the agent-environment interaction in an MDP:

The agent-environment interaction shown in the preceding diagram can be schematized as follows:

  • The agent and the environment interact at discrete intervals over time, t = 0, 1, 2, …, n.
  • At each interval, the agent receives a representation of the state s_t of the environment.
  • Each element s_t ∈ S, where S is the set of possible states.
  • Once the state is recognized, the agent must take an action a_t ∈ A(s_t), where A(s_t) is the set of possible actions in the state s_t.
  • The choice of action depends on the objective to be achieved and is determined by the policy, denoted by the symbol π, which associates an action a ∈ A(s) with each state s. The term π_t(s, a) represents the probability that action a is carried out in the state s at time t.
  • During the next time interval t + 1, as part of the consequence of the action a_t, the agent receives a numerical reward r_{t+1} ∈ R.
  • The other consequence of the action is the new state s_{t+1}. At this point, the agent must again encode the state and choose an action.
  • This iteration repeats until the agent achieves its objective.
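
The loop described in the list above can be sketched in a few lines. The corridor environment, the goal state, the reward values, and the fixed stochastic policy are all assumptions made for this example; they are not part of the original text.

```python
import random

GOAL = 4  # hypothetical goal state at the end of a 5-cell corridor (0..4)
ACTIONS = {"left": -1, "right": +1}


def policy(state, rng):
    """pi(s, a): choose 'right' with probability 0.8 and 'left' with 0.2."""
    return rng.choices(["right", "left"], weights=[0.8, 0.2], k=1)[0]


def environment(state, action):
    """Return (s_{t+1}, r_{t+1}) for the action a_t taken in state s_t."""
    next_state = min(max(state + ACTIONS[action], 0), GOAL)
    reward = 1.0 if next_state == GOAL else -0.1
    return next_state, reward


def run_episode(seed=0, max_steps=1000):
    """Interact until the goal is reached or max_steps have elapsed."""
    rng = random.Random(seed)
    state, total_reward, t = 0, 0.0, 0
    while state != GOAL and t < max_steps:
        action = policy(state, rng)                  # agent picks a_t
        state, reward = environment(state, action)   # environment responds
        total_reward += reward
        t += 1
    return state, t, total_reward


final_state, steps, total = run_episode()
print(final_state, steps, total)
```

The `max_steps` cap is a practical safeguard added for the sketch, so that the loop terminates even if the objective is never reached.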

The state s_{t+1} depends only on the previous state and the action taken (this is the MDP assumption), as follows:

s_{t+1} = δ(s_t, a_t)

Here, δ represents the state transition function.

In summary:

  • In an MDP, the agent can perceive the state s ∈ S in which it is and has a set A of actions at its disposal
  • At each discrete interval of time t, the agent detects the current state s_t and decides to take an action a_t ∈ A
  • The environment responds by providing a reward (a reinforcement) r_t = r(s_t, a_t) and moving into the state s_{t+1} = δ(s_t, a_t)
  • The r and δ functions are part of the environment; they depend only on the current state and action (not the previous ones) and are not necessarily known to the agent
  • The goal of reinforcement learning is to learn a policy that, for each state s in which the system is located, tells the agent which action to take in order to maximize the total reinforcement received during the entire action sequence
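
To make the summary concrete, the sketch below stores r and δ as lookup tables for a hypothetical three-state deterministic MDP and adds up the reinforcement collected by two fixed policies. The state names, the rewards, and the optional discount factor `gamma` are assumptions introduced for illustration.

```python
# Hypothetical deterministic MDP with three states; all names are illustrative.
DELTA = {  # delta(s, a) -> next state
    ("s0", "go"): "s1", ("s0", "stay"): "s0",
    ("s1", "go"): "s2", ("s1", "stay"): "s1",
    ("s2", "go"): "s2", ("s2", "stay"): "s2",
}
REWARD = {  # r(s, a) -> immediate reward
    ("s0", "go"): 0.0, ("s0", "stay"): 0.0,
    ("s1", "go"): 10.0, ("s1", "stay"): 0.0,
    ("s2", "go"): 0.0, ("s2", "stay"): 0.0,
}


def total_reinforcement(policy, start, steps, gamma=1.0):
    """Sum gamma**t * r(s_t, a_t) over `steps` steps while following `policy`.

    gamma=1.0 gives the plain (undiscounted) total; gamma < 1 is an
    optional assumption that weights later rewards less.
    """
    state, total = start, 0.0
    for t in range(steps):
        action = policy[state]
        total += (gamma ** t) * REWARD[(state, action)]
        state = DELTA[(state, action)]
    return total


always_go = {"s0": "go", "s1": "go", "s2": "go"}
always_stay = {"s0": "stay", "s1": "stay", "s2": "stay"}
print(total_reinforcement(always_go, "s0", steps=5))    # 10.0
print(total_reinforcement(always_stay, "s0", steps=5))  # 0.0
```

Here the policy is a plain dictionary from states to actions, which is enough for a deterministic policy; comparing the two totals shows why a learner would prefer `always_go` in this toy setting.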

Let's go deeper into some of the terms used:

  • A reward function defines the goal in a reinforcement learning problem. It maps each detected state of the environment (or state-action pair) to a single number, the reward. As already mentioned, the agent's only goal is to maximize the total reward it receives in the long term. The reward function therefore defines which events are good and which are bad for the agent. It must be specified correctly, because it serves as the basis for changing the policy. For example, if an action selected by the policy is followed by a low reward, the policy can be changed to select a different action in that situation in the future.
  • A policy defines the behavior of the learning agent at a given time. It maps the detected states of the environment to the actions to take when in those states. This corresponds to what, in psychology, would be called a set of stimulus-response rules or associations. The policy is the fundamental part of a reinforcement learning agent, in the sense that it alone is enough to determine behavior.
  • A value function represents how good a state is for an agent. It is equal to the total reward the agent can expect to collect starting from the state s. The value function depends on the policy with which the agent selects the actions to be performed.
  • An action-value function returns the value, that is, the expected return (overall reward), of taking action a in a certain state s and following a policy thereafter.
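
As a rough illustration of the last two terms, the sketch below estimates the value function and one action value for a fixed policy on a hypothetical three-state deterministic MDP. The lookup tables, the discount factor `GAMMA = 0.9`, and the simple repeated-sweep evaluation are all assumptions made for this example rather than a method prescribed by the text.

```python
GAMMA = 0.9  # assumed discount factor; the text does not specify one

STATES = ["s0", "s1", "s2"]

# Hypothetical deterministic MDP; all names and numbers are illustrative.
DELTA = {  # delta(s, a) -> next state
    ("s0", "go"): "s1", ("s0", "stay"): "s0",
    ("s1", "go"): "s2", ("s1", "stay"): "s1",
    ("s2", "go"): "s2", ("s2", "stay"): "s2",
}
REWARD = {  # r(s, a) -> immediate reward
    ("s0", "go"): 0.0, ("s0", "stay"): 0.0,
    ("s1", "go"): 10.0, ("s1", "stay"): 0.0,
    ("s2", "go"): 0.0, ("s2", "stay"): 0.0,
}


def evaluate_policy(policy, sweeps=100):
    """Repeatedly apply V(s) = r(s, pi(s)) + GAMMA * V(delta(s, pi(s)))."""
    value = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        value = {
            s: REWARD[(s, policy[s])] + GAMMA * value[DELTA[(s, policy[s])]]
            for s in STATES
        }
    return value


def action_value(value, state, action):
    """Q(s, a): take `action` once in `state`, then follow the policy."""
    return REWARD[(state, action)] + GAMMA * value[DELTA[(state, action)]]


always_go = {s: "go" for s in STATES}
v = evaluate_policy(always_go)
print(v)
print(action_value(v, "s0", "stay"))
```

Under the `always_go` policy, the state s1 is worth about 10 and s0 about 9, while staying in s0 for one step before following the policy is worth roughly 8.1, slightly less than moving immediately.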