
The value function for optimality

Agents should be able to reason about both immediate and future rewards. Therefore, each state the agent encounters is assigned a value that also reflects this future information; this is called the value function. This is where the concept of delayed rewards comes in: the actions taken in the present state determine the rewards the agent can potentially receive in the future.

V(s), the value of state s, is defined as the expected value of the rewards to be received in the future over all the actions taken from this state onward until the agent reaches the goal state. In short, the value function tells us how good it is to be in a given state: the higher the value, the better the state.
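As a quick illustration, here is a minimal sketch of how rewards received later from a state are folded into a single number. The discount factor gamma and the reward sequence are illustrative assumptions, not values from the text:

def discounted_return(rewards, gamma=0.9):
    """Sum the rewards received after a state, weighting later rewards less."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards collected from some state until the goal is reached (made-up numbers).
future_rewards = [0, 0, 1, 0, 5]
print(discounted_return(future_rewards))  # later rewards contribute less to the total

The further away a reward is, the more heavily it is discounted, which is exactly how delayed rewards are traded off against immediate ones.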

The reward assigned to each (s, a, s') triple is fixed. This is not the case with the value of a state; it is subject to change with every action taken within an episode, and across different episodes as well.

One solution that comes to mind: instead of a value function, why don't we simply store the knowledge of every possible state?

The answer is simple: it would be time-consuming and expensive, and this cost grows exponentially with the size of the state space. Therefore, it is better to store the knowledge of the current state, that is, V(s):

V(s) = E[R(t+1) + γR(t+2) + γ²R(t+3) + ... | S(t) = s]

That is, V(s) is the expected sum of all future rewards, discounted by the factor γ, given that the agent is in state s at time t.
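Since V(s) is an expectation, one simple way to picture it is as the average discounted return over many sampled episodes starting from s. The sketch below assumes a toy, hypothetical episode generator purely to illustrate that averaging; it is not an implementation of any particular algorithm from this chapter:

import random

GAMMA = 0.9  # assumed discount factor

def sample_episode_rewards():
    """Return a random sequence of rewards from state s until the goal (toy model)."""
    length = random.randint(3, 6)
    return [random.choice([0, 0, 1]) for _ in range(length - 1)] + [10]

def estimate_value(num_episodes=10_000):
    """Monte Carlo estimate of V(s): the mean of the discounted returns."""
    total = 0.0
    for _ in range(num_episodes):
        rewards = sample_episode_rewards()
        total += sum((GAMMA ** k) * r for k, r in enumerate(rewards))
    return total / num_episodes

print(estimate_value())

Averaging over more episodes gives a better estimate of the expectation in the formula above.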

More details on the value function will be covered in Chapter 3, The Markov Decision Process and Partially Observable MDP.
