
The value function for optimality

Agents should be able to think about both immediate and future rewards. Therefore, a value is assigned to each encountered state that also reflects this future information. This is called the value function. Here, the concept of delayed rewards comes in: the actions taken now determine the potential rewards the agent will receive in the future.

V(s), the value of a state, is defined as the expected sum of rewards to be received in the future for all the actions taken from this state onward, until the agent reaches the goal state. In essence, a value function tells us how good it is to be in a given state: the higher the value, the better the state.
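As a toy illustration (a hypothetical 1-D corridor, not an example from the book), suppose the agent only receives a reward of +1 when it reaches the goal at the right end of the corridor. States closer to the goal then have higher values, because the delayed reward is discounted fewer times:

gamma = 0.9
num_states = 5                           # states 0..4; state 4 is the terminal goal
goal = num_states - 1

def value(state):
    """Discounted value of the +1 goal reward under the 'always move right' policy."""
    if state == goal:
        return 0.0                       # terminal state: no future reward remains
    steps_to_goal = goal - state         # the +1 reward arrives after this many moves
    return gamma ** (steps_to_goal - 1)  # discounted once per extra step of delay

for s in range(num_states):
    print(f"state {s}: V(s) = {value(s):.3f}")

Running this prints values that increase monotonically from state 0 up to state 3, which is exactly the "higher value means better state" intuition.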

The reward assigned to each (s, a, s') triple is fixed. This is not the case with the value of a state; it is subject to change with every action taken in an episode, and across different episodes as well.

One solution comes to mind: instead of using the value function, why don't we store the knowledge of every possible state?

The answer is simple: it would be time-consuming and expensive, and the cost grows exponentially with the size of the state space. Therefore, it is better to store the knowledge of the current state, that is, V(s):

V(s) = E[R(t+1) + γR(t+2) + γ²R(t+3) + ... | S(t) = s], where γ is the discount factor applied to future rewards
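To make the expectation concrete, here is a minimal sketch (assuming a hypothetical random-walk environment and helper names of my own choosing, not the book's code) that estimates V(s) by averaging the discounted returns of many sampled episodes starting from state s:

import random

gamma = 0.9

def run_episode(start_state):
    """Hypothetical environment: random walk on states 0..4, reward +1 on reaching state 4."""
    state, rewards = start_state, []
    while state != 4:
        state = min(4, max(0, state + random.choice([-1, 1])))
        rewards.append(1.0 if state == 4 else 0.0)
    return rewards

def discounted_return(rewards):
    # G = R(t+1) + gamma*R(t+2) + gamma^2*R(t+3) + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def estimate_value(start_state, episodes=5000):
    returns = [discounted_return(run_episode(start_state)) for _ in range(episodes)]
    return sum(returns) / len(returns)   # the sample mean approximates the expectation

for s in range(4):
    print(f"V({s}) is approximately {estimate_value(s):.3f}")

As the number of sampled episodes grows, the sample mean of the discounted returns converges to the expectation in the formula above.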

More details on the value function will be covered in Chapter 3, The Markov Decision Process and Partially Observable MDP.
