
Value-based versus policy-based iteration

We'll be using value-based iteration for the projects in this book. The description of the Bellman equation given previously offers a very high-level understanding of how value-based iteration works. The main difference between the two approaches is that, in value-based iteration, the agent learns the expected value (the expected discounted cumulative reward) of each state-action pair, while in policy-based iteration the agent learns the function that maps states to actions directly.
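As a quick recap of what each approach learns (the notation here is only a sketch and may differ slightly from the form of the Bellman equation shown earlier):

\[
\text{Value-based: } \quad Q(s, a) \;=\; \mathbb{E}\big[\, r + \gamma \max_{a'} Q(s', a') \;\big|\; s,\, a \,\big]
\]
\[
\text{Policy-based: } \quad \pi(a \mid s) \;=\; \Pr(A_t = a \mid S_t = s)
\]

In other words, the value-based agent estimates how good each state-action pair is and acts on those estimates, while the policy-based agent estimates the action (or action distribution) itself.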

One simple way to describe this difference is that a value-based agent, even once it has mastered its environment, does not hold an explicit representation of its policy. It cannot give us an actual function that maps states to actions; it simply picks whichever action has the highest learned value in the current state. A policy-based agent, on the other hand, learns that function directly and can give it to us.
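A minimal sketch of this distinction in code, assuming a small tabular environment (the state and action counts, table sizes, and names here are illustrative, not taken from the book's projects):

import numpy as np

n_states, n_actions = 16, 4
rng = np.random.default_rng(0)

# Value-based agent: stores state-action values.
# Its policy is implicit -- it is recovered on the fly by acting
# greedily with respect to the learned Q-values.
Q = np.zeros((n_states, n_actions))

def value_based_action(state):
    # No explicit state -> action function is stored anywhere;
    # the action is derived from the value estimates.
    return int(np.argmax(Q[state]))

# Policy-based agent: stores the state -> action mapping itself,
# here as a table of action probabilities per state (a stochastic policy).
policy = np.full((n_states, n_actions), 1.0 / n_actions)

def policy_based_action(state):
    # The policy is an explicit, directly learned function of the state.
    return int(rng.choice(n_actions, p=policy[state]))

The value-based agent can only hand back its Q-table; its mapping from states to actions exists only through the argmax. The policy-based agent can hand back the policy table (or function) itself.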

Note that Q-learning and SARSA are both value-based algorithms. Because we are working with Q-learning in this book, we will not study policy-based iteration in detail here. The main things to bear in mind about policy-based iteration are that it lets us learn stochastic policies and that it is better suited to continuous action spaces.
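For reference, here is a sketch of why both count as value-based: each one updates the same kind of state-action value table. The learning rate, discount factor, and variable names below are illustrative placeholders.

alpha, gamma = 0.1, 0.99  # example learning rate and discount factor

def q_learning_update(Q, s, a, r, s_next):
    # Q-learning (off-policy): bootstraps from the greedy action
    # in the next state, regardless of which action is actually taken.
    # Q is a per-state array of action values, e.g. the table from the earlier sketch.
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # SARSA (on-policy): bootstraps from the action the agent
    # actually takes in the next state.
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

In both cases the object being learned is Q, not a policy, which is why a policy-based method is the natural choice when an explicitly stochastic policy is required.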
