
Q-learning

Q-learning is one of the most widely used reinforcement learning algorithms. This is due to its ability to compare the expected utility of the available actions without requiring a model of the environment. Thanks to this technique, it is possible to find an optimal action for every given state in a finite MDP.

A general solution to the reinforcement learning problem is to estimate, through the learning process, an evaluation function. This function must be able to evaluate, through the sum of the rewards, how good or bad a particular policy is. In fact, Q-learning tries to maximize the value of the Q function (action-value function), which represents the maximum discounted future reward obtained when we perform action a in state s.
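
In symbols, this quantity corresponds to the standard optimal action-value function (here γ denotes the discount factor, a symbol that the paragraph above does not introduce explicitly):

\[
Q^{*}(s, a) = \max_{\pi} \, \mathbb{E}\left[\, R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \mid S_t = s,\ A_t = a,\ \pi \,\right]
\]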

Q-learning, like SARSA, estimates the action-value function q(s, a) incrementally, updating the value of the state-action pair at each step of the environment, following the general update rule used by TD methods for estimating values. Unlike SARSA, however, Q-learning is an off-policy method: while the behavior policy, derived from the current estimates of q(s, a), selects the actions that are actually executed, the value function updates its estimates following a strictly greedy target policy: given a state, the action considered in the update is always the one that maximizes q(s, a). However, the π policy still has an important role in estimating values because, through it, the state-action pairs to be visited and updated are determined.
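
Written out, the update just described is the standard Q-learning rule (α is the learning rate and γ the discount factor, symbols not named in the paragraph above):

\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]
\]

The max over a in the target is the greedy (target-policy) value; using it regardless of the action the behavior policy will actually take next is exactly what makes Q-learning off-policy.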

The following is the pseudocode for the Q-learning algorithm:

Initialize the action-value function Q(s, a) arbitrarily
Repeat (for each episode)
    Initialize s
    Repeat (for each step in the episode)
        choose a from s using a policy derived from Q (for example, ε-greedy)
        take action a
        observe r, s'
        Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'
    until s is terminal
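
Translated into runnable code, a minimal tabular sketch of this pseudocode might look as follows in Python (the tiny chain environment, the ε-greedy behavior policy, and the values of alpha, gamma, and epsilon are illustrative assumptions, not taken from the text):

import random

# Toy 5-state chain: action 0 moves left, action 1 moves right.
# Reaching the rightmost state gives a reward of 1 and ends the episode.
N_STATES, N_ACTIONS = 5, 2

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = (s_next == N_STATES - 1)
    return s_next, (1.0 if done else 0.0), done

def epsilon_greedy(Q, s, epsilon):
    # Behavior policy pi: explore with probability epsilon, otherwise act greedily
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[s][a])

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # initialize Q(s, a) arbitrarily
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(500):
    s = 0                                           # initialize s
    done = False
    while not done:                                 # repeat for each step in the episode
        a = epsilon_greedy(Q, s, epsilon)           # choose a from s using pi
        s_next, r, done = step(s, a)                # take action a, observe r, s'
        # Off-policy update: the target bootstraps from the greedy value in s'
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next                                  # update s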

Q-learning uses a table to store a value for each state-action pair. At each step, the agent observes the current state of the environment and, using the π policy, selects and executes an action. By executing the action, the agent obtains the reward Rt+1 and the new state St+1. At this point, the agent can update the estimate of Q(St, At).
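
Continuing the sketch above (so it reuses the Q, N_STATES, and N_ACTIONS names assumed there), the greedy policy can be read off the table after training by taking, for each state, the action with the highest stored value:

# One greedy action per state, extracted from the learned table
greedy_policy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(greedy_policy)   # on the toy chain, every non-terminal state should prefer action 1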
