Book title: Keras Reinforcement Learning Projects
Author: Giuseppe Ciaburro
Last updated: 2021-08-13 15:26:03

Markov Decision Process

To keep the problem computationally tractable, the agent-environment interaction is modeled as a Markov decision process (MDP). An MDP is a discrete-time stochastic control process.

Stochastic processes are mathematical models used to study the evolution of phenomena that follow random or probabilistic laws. All natural phenomena contain a random component, both by their very nature and because of observational errors. As a result, at every instant t, the result of observing the phenomenon is a random variable s_t. It is not possible to predict the result with certainty; one can only state that it will take one of several possible values, each with a given probability.

A stochastic process is called Markovian when, having chosen a certain instant t for observation, the evolution of the process from t onward depends only on the state at t and not in any way on the previous instants. In other words, given the state at the moment of observation, the future evolution of the process does not depend on the past.
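
As a hedged restatement in symbols, the Markov property can be written as follows. The conditional-probability notation P(· | ·) is an assumption here, since the text states the property only in words; the second line is the form usually used once actions a_t are involved.

```latex
P(s_{t+1} \mid s_t, s_{t-1}, \dots, s_0) = P(s_{t+1} \mid s_t)

P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```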

In an MDP, at each time step, the process is in a state s ∈ S, and the decision maker may choose any action a ∈ A that is available in state s. At the next time step, the process responds by moving randomly into a new state s' and giving the decision maker a corresponding reward r(s, s').
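
A single stochastic step of this kind might be sketched as follows. This is a minimal illustration, not code from the book: the two states, the two actions, the transition probabilities, and the reward rule are all hypothetical.

```python
import random

# Hypothetical two-state MDP; every name and number here is illustrative.
# TRANSITIONS[state][action] -> list of (next_state, probability)
TRANSITIONS = {
    "s0": {"a0": [("s0", 0.8), ("s1", 0.2)],
           "a1": [("s0", 0.1), ("s1", 0.9)]},
    "s1": {"a0": [("s0", 0.5), ("s1", 0.5)],
           "a1": [("s0", 0.3), ("s1", 0.7)]},
}

def reward(state, next_state):
    """r(s, s'): assumed to be +1 for landing in 's1' and 0 otherwise."""
    return 1.0 if next_state == "s1" else 0.0

def step(state, action, rng):
    """Sample s' from the distribution for (s, a) and return (s', r(s, s'))."""
    next_states, probs = zip(*TRANSITIONS[state][action])
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return next_state, reward(state, next_state)

rng = random.Random(0)  # seeded so the run is repeatable
next_state, r = step("s0", "a1", rng)
print(next_state, r)
```

The next state depends only on the current state and action, which is the Markov property in code form: `step` never looks at earlier history.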

The following diagram shows the agent-environment interaction in an MDP:

The agent-environment interaction shown in the preceding diagram can be schematized as follows:

The agent and the environment interact at discrete intervals over time, t = 0, 1, 2, …, n.
At each interval, the agent receives a representation of the state s_t of the environment.
Each state s_t is an element of S, the set of possible states.
Once the state is recognized, the agent must take an action a_t ∈ A(s_t), where A(s_t) is the set of possible actions in the state s_t.
The choice of the action to be taken depends on the objective to be achieved and is mapped through the policy, indicated with the symbol π, which associates an action a_t ∈ A(s) with each state s. The term π_t(s,a) represents the probability that action a is carried out in the state s.
During the next time interval t + 1, as a consequence of the action a_t, the agent receives a numerical reward r_{t + 1} ∈ R for the action a_t previously taken.
The other consequence of the action is the new state s_{t + 1}. At this point, the agent must again observe the state and choose an action.
This cycle repeats until the agent achieves its objective.
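
The cycle described above can be sketched as a short loop. The corridor environment, the goal state, the reward of 1 for reaching it, and the fixed action probabilities are all assumptions made for illustration; nothing here is learned, the policy is simply given.

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = (-1, +1)    # move left or move right

def available_actions(state):
    """A(s): the actions that keep the agent inside the corridor."""
    return [a for a in ACTIONS if 0 <= state + a < N_STATES]

def policy(state):
    """pi(s, a): probability of each available action in state s."""
    actions = available_actions(state)
    if len(actions) == 1:
        return {actions[0]: 1.0}
    return {-1: 0.2, +1: 0.8}   # assumed bias toward the goal

def env_step(state, action):
    """Environment response: the new state and the reward for reaching it."""
    next_state = state + action
    r = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, r

def run_episode(rng, max_steps=1000):
    state, total_reward, t = 0, 0.0, 0
    while state != N_STATES - 1 and t < max_steps:
        probs = policy(state)                       # pi_t(s_t, .)
        action = rng.choices(list(probs), weights=list(probs.values()))[0]
        state, r = env_step(state, action)          # s_{t+1} and r_{t+1}
        total_reward += r
        t += 1
    return state, total_reward, t

final_state, total_reward, steps = run_episode(random.Random(0))
print(final_state, total_reward, steps)
```

The `max_steps` cap is a safeguard added for the sketch; the text itself only says that the cycle continues until the objective is achieved.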

The state s_{t + 1} depends on the previous state and the action taken (which is what makes the process an MDP), as follows:

s_{t + 1} = δ (s_t,a_t)

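Here, δ represents the state transition function. Written this way, the transition is deterministic; in the stochastic case, δ is replaced by a probability distribution over the possible next states.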
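
For a small deterministic problem, δ can be pictured as a lookup table. The three states, the two actions, and the helper names below are hypothetical and serve only to make the formula concrete.

```python
# Hypothetical deterministic MDP: DELTA maps (state, action) to the next state.
DELTA = {
    ("A", "right"): "B",
    ("B", "right"): "C",
    ("B", "left"): "A",
    ("C", "left"): "B",
}

def delta(state, action):
    """s_{t+1} = delta(s_t, a_t); raises KeyError if the action is not in A(s)."""
    return DELTA[(state, action)]

def rollout(start, actions):
    """Apply a sequence of actions and return the list of visited states."""
    states = [start]
    for a in actions:
        states.append(delta(states[-1], a))
    return states

trajectory = rollout("A", ["right", "right", "left"])
print(trajectory)  # ['A', 'B', 'C', 'B']
```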

In summary:

In an MDP, the agent can perceive the state s ∈ S in which it is and has a set A of actions at its disposal
At each discrete interval of time t, the agent detects the current state s_t and decides to implement an action a_t ∈ A
The environment responds by providing a reward (a reinforcement) r_t = r(s_t, a_t) and moving into the state s_{t + 1} = δ(s_t, a_t)
The r and δ functions are part of the environment; they depend only on the current state and action (not the previous ones) and are not necessarily known to the agent
The goal of reinforcement learning is to learn a policy that, for each state s in which the system is located, indicates to the agent an action to maximize the total reinforcement received during the entire action sequence
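
To make the last point concrete, the sketch below enumerates every deterministic policy of a tiny hypothetical MDP and keeps the one with the highest total reward over a fixed horizon. This is exhaustive search, not a reinforcement learning algorithm: it assumes that r and δ are fully known, which the text notes is not generally the case. The states, actions, rewards, and horizon are all invented for illustration.

```python
from itertools import product

STATES = ("A", "B", "C")
ACTIONS = ("stay", "move")
NEXT = {"A": "B", "B": "C", "C": "A"}   # where "move" leads from each state

def delta(s, a):
    """Transition function: 'stay' keeps the state, 'move' follows NEXT."""
    return s if a == "stay" else NEXT[s]

def r(s, a):
    """Reward function: staying pays a state-dependent amount, moving pays 0."""
    if a == "stay":
        return {"A": 1.0, "B": 0.0, "C": 2.0}[s]
    return 0.0

def total_reward(policy, start="A", horizon=5):
    """Sum of r(s_t, a_t) over `horizon` steps when following `policy`."""
    s, total = start, 0.0
    for _ in range(horizon):
        a = policy[s]
        total += r(s, a)
        s = delta(s, a)
    return total

# Every deterministic policy is one choice of action per state (2^3 = 8 here).
policies = [dict(zip(STATES, choice))
            for choice in product(ACTIONS, repeat=len(STATES))]
best_policy = max(policies, key=total_reward)
best_return = total_reward(best_policy)
print(best_policy, best_return)
```

Staying in A collects 1 per step (5 in total), while moving to C first and then staying collects 0 + 0 + 2 + 2 + 2 = 6, so the policy that accepts two unrewarded steps comes out ahead. This is the sense in which the agent maximizes reward over the whole action sequence rather than at each step.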

Let's go deeper into some of the terms used:

A reward function defines the goal in a reinforcement learning problem. It maps the detected states of the environment into a single number, thereby defining a reward. As already mentioned, the agent's only goal is to maximize the total reward it receives in the long term. The reward function therefore defines which events are good and which are bad for the agent. It must be specified correctly, because it is used as the basis for changing the policy. For example, if an action selected by the policy is followed by a low reward, the policy can be changed to select other actions in that situation in the future.
A policy defines the behavior of the learning agent at a given time. It maps the detected states of the environment to the actions to take in those states. This corresponds to what, in psychology, would be called a set of stimulus-response rules or associations. The policy is the fundamental part of a reinforcement learning agent, in the sense that it alone is enough to determine behavior.
A value function represents how good a state is for an agent. It is equal to the total reward the agent can expect to receive starting from the state s. The value function depends on the policy with which the agent selects the actions to be performed.
An action-value function returns the value, that is, the expected return (overall reward), of taking action a in a certain state s and then following a policy.
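
The last two definitions can be illustrated numerically. The sketch below assumes a deterministic corridor with a terminal goal state, a fixed policy that always moves right, and a discount factor γ = 0.9; the discount factor in particular is an assumption, since the text speaks only of total reward. Under these assumptions the value function satisfies V(s) = r(s, π(s)) + γ V(δ(s, π(s))), and the action-value function is Q(s, a) = r(s, a) + γ V(δ(s, a)).

```python
GAMMA = 0.9            # assumed discount factor
STATES = (0, 1, 2, 3)  # hypothetical corridor; state 3 is terminal
TERMINAL = 3

def delta(s, a):
    """Deterministic transition, clamped to the corridor."""
    return min(max(s + a, 0), TERMINAL)

def r(s, a):
    """Reward of 1 for stepping into the terminal state, 0 otherwise."""
    return 1.0 if s != TERMINAL and delta(s, a) == TERMINAL else 0.0

def policy(s):
    """Fixed policy to evaluate: always move right."""
    return +1

def evaluate_policy(sweeps=50):
    """Repeat the backup V(s) = r + GAMMA * V(s') until values settle."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            if s == TERMINAL:
                continue            # no reward is collected after the goal
            a = policy(s)
            v[s] = r(s, a) + GAMMA * v[delta(s, a)]
    return v

def q_value(v, s, a):
    """Q(s, a): take action a once, then follow the policy."""
    return r(s, a) + GAMMA * v[delta(s, a)]

V = evaluate_policy()
print(V)
print(q_value(V, 1, -1), q_value(V, 1, +1))
```

Here the values grow as the state gets closer to the goal, and Q(1, +1) equals V(1) because moving right is what the policy would do anyway, whereas Q(1, -1) is lower because the detour delays the reward.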
