
Understanding policy, value, and advantage functions

A policy defines the guidelines for an agent's behavior at a given state. In mathematical terms, a policy is a mapping from a state of the agent to the action to be taken at that state. It is like a stimulus-response rule that the agent follows as it learns to explore the environment. In RL literature, it is usually denoted as π(a_t|s_t) – that is, it is a conditional probability distribution of taking an action a_t in a given state s_t. Policies can be deterministic, wherein the exact value of a_t is known at s_t, or stochastic, where a_t is sampled from a distribution – typically a Gaussian distribution, but it can also be any other probability distribution.
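
As a rough illustration, here is a minimal Python sketch of the two kinds of policy, assuming a simple linear state-to-action mapping; deterministic_policy, stochastic_gaussian_policy, weights, and sigma are all illustrative names, not part of any particular library:

```python
import numpy as np

def deterministic_policy(state, weights):
    # The same state always maps to the same action.
    return weights @ state

def stochastic_gaussian_policy(state, weights, sigma=0.1):
    # The action is sampled from a Gaussian whose mean depends on the
    # state, so repeated calls in the same state can return different actions.
    mean = weights @ state
    return np.random.normal(loc=mean, scale=sigma)

state = np.array([0.5, -0.2])
weights = np.eye(2)  # toy parameters for illustration

print(deterministic_policy(state, weights))        # always [0.5, -0.2]
print(stochastic_gaussian_policy(state, weights))  # varies between calls
```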

In RL, value functions are used to define how good a state of an agent is. They are typically denoted by V(s) at state s and represent the expected long-term cumulative discounted reward for being in that state. V(s) is given by the following expression, where E[.] is an expectation over samples, γ ∈ [0, 1) is the discount factor, and r_t is the reward received at time step t:

V(s) = E[ Σ_{k=0}^∞ γ^k r_{t+k} | s_t = s ]
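
To connect this definition to code, here is a minimal Monte Carlo sketch of estimating V(s): average the discounted return over many episodes started from the same state. The rollout function here is a hypothetical stand-in for real environment interaction:

```python
import numpy as np

GAMMA = 0.99  # discount factor

def discounted_return(rewards, gamma=GAMMA):
    # G = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    return sum(gamma**k * r for k, r in enumerate(rewards))

def estimate_value(rollout_from_state, n_episodes=1000):
    # The empirical mean over sampled returns approximates the
    # expectation E[.] in the definition of V(s).
    returns = [discounted_return(rollout_from_state()) for _ in range(n_episodes)]
    return np.mean(returns)

# Dummy rollout for illustration: a fixed-length episode with noisy rewards.
rng = np.random.default_rng(0)
dummy_rollout = lambda: rng.normal(loc=1.0, scale=0.5, size=10)
print(estimate_value(dummy_rollout))  # approx. sum of gamma^k for k < 10, i.e. ~9.56
```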

Note that V(s) does not tell us anything about the optimal actions that an agent needs to take at state s. Instead, it is a measure of how good a state is. So, how can an agent figure out the optimal action a_t to take in a given state s_t at time instant t? For this, you can also define an action-value function, given by the following expression:

Q(s, a) = E[ Σ_{k=0}^∞ γ^k r_{t+k} | s_t = s, a_t = a ]
Note that Q(s,a) is a measure of how good it is to take action a in state s and follow the same policy thereafter. It is thus different from V(s), which is a measure of how good a given state is. We will see in the following chapters how the value function is used to train the agent in the RL setting.
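
As a toy illustration of how Q(s,a) lets the agent rank actions in a state, consider a small tabular Q-function; the values here are purely made up:

```python
import numpy as np

# Toy Q-table: rows index states, columns index actions (values made up).
Q = np.array([[0.1, 0.5, 0.2],   # state 0
              [0.7, 0.3, 0.0]])  # state 1

def greedy_action(Q, state):
    # Q(s, a) ranks the actions available in a state; the greedy
    # choice is simply the action with the highest action value.
    return int(np.argmax(Q[state]))

print(greedy_action(Q, state=0))  # 1, since Q(0, 1) = 0.5 is the largest
print(greedy_action(Q, state=1))  # 0, since Q(1, 0) = 0.7 is the largest
```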

The advantage function is defined as follows:

A(s,a) = Q(s,a) - V(s)
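
A quick numerical sketch, reusing the toy Q-table from the previous example and, for simplicity, taking V(s) to be the value of the greedy action in each state (one common choice): with that choice, the best action in each state has zero advantage and all other actions have negative advantage.

```python
import numpy as np

Q = np.array([[0.1, 0.5, 0.2],
              [0.7, 0.3, 0.0]])

# For this sketch, take V(s) = max_a Q(s, a), i.e. the value of the
# greedy action in each state (keepdims lets the subtraction broadcast).
V = Q.max(axis=1, keepdims=True)
A = Q - V  # A(s, a) = Q(s, a) - V(s)

print(A)
# [[-0.4  0.  -0.3]
#  [ 0.  -0.4 -0.7]]
```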

This advantage function is known to reduce the variance of policy gradients, a topic that will be discussed in depth in a later chapter.

The classic RL textbook is Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto, The MIT Press, 1998.

We will now define what an episode is and its significance in an RL context.
