
Understanding Q-learning

Q-learning is an off-policy algorithm that was first proposed by Christopher Watkins in 1989, and it remains one of the most widely used RL algorithms. Like SARSA, Q-learning maintains an estimate of the state-action value function for each state-action pair and recursively updates it, using the Bellman equation of dynamic programming, as new experiences are collected. It is an off-policy algorithm because its update evaluates the state-action value function at the action that maximizes the value, rather than at the action actually taken by the behavior policy. Q-learning is used for problems where the actions are discrete. For example, if the available actions are move north, move south, move east, and move west, and we must decide the optimal action in a given state, then Q-learning is applicable.
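To make the tabular setting concrete, here is a minimal Python sketch of the data structure Q-learning maintains: a table of Q-values indexed by discrete state-action pairs, using the four compass-direction actions mentioned above. The grid size and state indexing here are hypothetical, chosen only for illustration.

import numpy as np

# Hypothetical sizes: a small grid world with four compass-direction actions.
n_states = 16                       # e.g., a 4x4 grid flattened into 16 states
actions = ["north", "south", "east", "west"]
n_actions = len(actions)

# The Q-table stores one value per (state, action) pair, initialized to zero.
Q = np.zeros((n_states, n_actions))

# The greedy action in a state is the one with the largest Q-value.
state = 5
greedy_action = actions[int(np.argmax(Q[state]))]

During learning, the agent typically behaves epsilon-greedily with respect to this table, while the update itself (shown next) always bootstraps from the greedy value.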

In the classical Q-learning approach, the update is given as follows, where the max is performed over actions, that is, we choose the action a corresponding to the maximum value of Q at state s_{t+1}:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Here, α is the learning rate, a hyper-parameter that the user can specify, and γ is the discount factor.
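As a rough illustration, the following sketch applies this update to a single transition. The names used here (gamma for the discount factor, r for the reward, and the transition tuple) are placeholders for illustration, not code from this book.

import numpy as np

alpha = 0.1   # learning rate, a user-specified hyper-parameter
gamma = 0.99  # discount factor

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy target: bootstrap using the maximizing action at s_next,
    # regardless of which action the behavior policy will actually take.
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q

SARSA differs only in the target: it bootstraps with the Q-value of the action actually taken at s_{t+1}, which is what makes it on-policy, whereas Q-learning's use of the maximizing action makes it off-policy.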

Before we code the algorithms in Python, let's look at the kinds of problems we will consider.
