
Learning the Markov decision process 

The Markov property is widely used in RL. It states that the environment's response at time t+1 depends only on the state and action at time t; in other words, the immediate future depends only on the present, not on the past. This property simplifies the math considerably and underlies much of RL, as well as related fields such as robotics.

Consider a system that transitions from state s_0 to s_1 by taking an action a_0 and receiving a reward r_1, then from s_1 to s_2 by taking action a_1, and so on until time t. If the probability of being in a state s' at time t+1 can be written as follows, then the system is said to satisfy the Markov property:

P(s_{t+1} = s' | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = P(s_{t+1} = s' | s_t, a_t)
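As a rough illustration (the state names, actions, and probability values here are invented for this sketch, not taken from the text), the following Python snippet samples the next state from a distribution indexed only by the current state and action; no earlier history is consulted:

import random

# Illustrative transition distributions: P(s_{t+1} | s_t, a_t) is looked up
# using only the current state and action, never the earlier trajectory.
P = {
    ("s0", "a0"): {"s1": 0.9, "s0": 0.1},
    ("s1", "a1"): {"s2": 0.8, "s1": 0.2},
}

def next_state(s, a):
    """Sample s_{t+1} given only (s_t, a_t) -- the Markov property."""
    dist = P[(s, a)]
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(next_state("s0", "a0"))  # "s1" with probability 0.9, "s0" with probability 0.1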

Note that the probability of being in state s_{t+1} depends only on s_t and a_t, and not on the earlier history. An environment whose state transitions and rewards satisfy the following two conditions is said to be a Markov Decision Process (MDP):

State transition probability: P(s' | s, a) = P(s_{t+1} = s' | s_t = s, a_t = a)

Reward function: R(s, a, s') = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s']
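To make these two ingredients concrete, here is a minimal sketch of an MDP in Python; the two-state battery example and all of its numbers are hypothetical, chosen only to show how P(s' | s, a) and R(s, a, s') can be stored and used to sample one environment step:

import random

# Hypothetical MDP defined by its transition probabilities and rewards.
P = {  # state transition probability: (s, a) -> {s': probability}
    ("low", "charge"):  {"high": 1.0},
    ("low", "search"):  {"low": 0.7, "high": 0.3},
    ("high", "charge"): {"high": 1.0},
    ("high", "search"): {"high": 0.6, "low": 0.4},
}

R = {  # reward function: (s, a, s') -> immediate reward
    ("low", "charge", "high"):  0.0,
    ("low", "search", "low"):   1.0,
    ("low", "search", "high"): -3.0,
    ("high", "charge", "high"): 0.0,
    ("high", "search", "high"): 2.0,
    ("high", "search", "low"):  2.0,
}

def step(s, a):
    """Sample (s_{t+1}, r_{t+1}) from P and R given the current (s_t, a_t)."""
    dist = P[(s, a)]
    s_next = random.choices(list(dist), weights=list(dist.values()))[0]
    return s_next, R[(s, a, s_next)]

print(step("high", "search"))  # for example: ('high', 2.0)

Because both P and R take only the current state and action (plus the sampled next state) as inputs, everything needed to run the environment satisfies the Markov property defined above.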

Let's now define the very foundation of RL: the Bellman equation. This equation will help in providing an iterative solution for obtaining value functions.
