
Learning the Markov decision process 

The Markov property is central to RL: it states that the environment's response at time t+1 depends only on the state and action at time t. In other words, the immediate future depends only on the present, not on the past. This property simplifies the math considerably and is exploited in many fields, such as RL and robotics.

Consider a system that transitions from state s0 to s1 by taking an action a0 and receiving a reward r1, then from s1 to s2 by taking action a1, and so on up to time t. If the probability of being in a state s' at time t+1 can be written as the following function, then the system is said to satisfy the Markov property:
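In standard notation, this condition says that the next state is conditionally independent of everything that happened before time t; one common way of writing it is:

P(s_{t+1} = s' \mid s_t, a_t) = P(s_{t+1} = s' \mid s_0, a_0, r_1, s_1, a_1, \ldots, r_t, s_t, a_t)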

Note that the probability of being in state st+1 depends only on st and at, not on the full history. An environment whose state transition probability and reward function take the following form is said to be a Markov Decision Process (MDP):
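In one common notation (the exact symbols vary from textbook to textbook), the state transition probability and expected reward of an MDP are written as:

\mathcal{P}_{ss'}^{a} = P(s_{t+1} = s' \mid s_t = s, a_t = a)

\mathcal{R}_{ss'}^{a} = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s']

Together with the state space, the action space, and a discount factor, these two functions fully specify the MDP.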

Let's now define the very foundation of RL: the Bellman equation. This equation provides an iterative way of computing value functions.
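As a preview of that iterative solution, here is a minimal sketch of value iteration on a small, made-up two-state MDP. The transition probabilities, rewards, and discount factor below are hypothetical and chosen purely for illustration; the update inside the loop is one form of the Bellman backup discussed next.

```python
import numpy as np

# Hypothetical two-state, two-action MDP used only for illustration;
# none of these numbers come from the text.
gamma = 0.9  # discount factor

# P[s, a, s'] = probability of landing in state s' after taking action a in state s
P = np.array([
    [[0.8, 0.2],    # state 0, action 0
     [0.1, 0.9]],   # state 0, action 1
    [[0.5, 0.5],    # state 1, action 0
     [0.0, 1.0]],   # state 1, action 1
])

# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])

# Repeatedly back up state values with the Bellman optimality update:
#   V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) * V(s') ]
V = np.zeros(P.shape[0])
for _ in range(200):
    Q = R + gamma * np.einsum('sap,p->sa', P, V)  # action values Q(s, a)
    V = Q.max(axis=1)                             # greedy backup over actions

print("Estimated optimal state values:", V)
```

Each sweep applies the same backup to every state, and the values converge because the discount factor makes the update a contraction; this is the kind of iterative computation the Bellman equation enables.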
