- Hands-On Q-Learning with Python
- Nazia Habib
SARSA and the cliff-walking problem
In Q-learning, the agent starts out in state S, performs action A, looks at the highest estimated value of any action it could take from its new state, T, and updates its value for the state S-action A pair using that highest possible value. In SARSA, the agent starts in state S, takes action A and receives a reward, moves to state T, and selects action B. It then goes back and updates the value of the S-A pair using the reward it received for A plus its current estimate of the value of the T-B pair, that is, the value of the action it will actually take next rather than the best action it could take.
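As a minimal sketch of the two update rules (assuming a tabular Q of shape (n_states, n_actions), a learning rate alpha, and a discount factor gamma; the names here are illustrative, not taken from the book's code), the only difference is the value each rule bootstraps from:

```python
import numpy as np

def q_learning_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the best action available in the next state."""
    target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action the agent will actually take next."""
    target = reward + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```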
A famous illustration of the differences in performance between Q-learning and SARSA is the cliff-walking example from Sutton and Barto's Reinforcement Learning: An Introduction (1998):

[Figure: the cliff-walking gridworld. The start and goal states sit at opposite ends of the bottom row, separated by the cliff; the agent can take the short path along the cliff edge or a longer, safer path through the rows above it.]
There is a penalty of -1 for each step the agent takes and a penalty of -100 for falling off the cliff. The optimal path is therefore to run exactly along the edge of the cliff and reach the goal as quickly as possible: this minimizes the number of steps the agent takes and maximizes its return, as long as it never actually falls off the cliff.
Q-learning takes the optimal path in this example, while SARSA takes the safe path. The result is that, with an epsilon-greedy or other exploration-based policy, there is a nonzero risk that the Q-learning agent will fall off the cliff at any point along its route as a result of an exploratory action.
SARSA, unlike Q-learning, looks ahead to the next action to see what the agent will actually do at the next step and updates the Q-value of its current state-action pair accordingly. It therefore learns that the agent might fall off the cliff from the states along the edge, and that this would lead to a large negative reward, so it lowers the Q-values of those state-action pairs.
The result is that Q-learning assumes that the agent is following the best possible policy without attempting to resolve what that policy actually is, while SARSA takes into account the agent's actual policy (that is, what it ends up doing when it moves to the next state as opposed to the best possible thing it could be assumed to do).
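To make the comparison concrete, here is a minimal, self-contained sketch of the cliff-walking gridworld with both learners. The layout (a 4 x 12 grid with the cliff along the bottom row), the hyperparameters, and all of the names below are illustrative assumptions rather than the book's exact code; the two agents differ only in the bootstrap term of the update.

```python
import numpy as np

# Cliff-walking sketch: 4 x 12 grid, start at the bottom-left, goal at the
# bottom-right, cliff along the bottom row between them. Stepping into the
# cliff costs -100 and sends the agent back to the start; every other step
# costs -1.
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    r = min(max(state[0] + action[0], 0), ROWS - 1)
    c = min(max(state[1] + action[1], 0), COLS - 1)
    if r == 3 and 1 <= c <= 10:                # walked off the cliff
        return START, -100, False
    return (r, c), -1, (r, c) == GOAL

def epsilon_greedy(Q, state, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(ACTIONS))
    return int(np.argmax(Q[state]))

def train(on_policy, episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Train one agent; on_policy=True gives SARSA, False gives Q-learning."""
    Q = np.zeros((ROWS, COLS, len(ACTIONS)))
    for _ in range(episodes):
        state, done = START, False
        action = epsilon_greedy(Q, state, epsilon)
        while not done:
            next_state, reward, done = step(state, ACTIONS[action])
            next_action = epsilon_greedy(Q, next_state, epsilon)
            if on_policy:    # SARSA: bootstrap from the action actually taken next
                target = reward + gamma * Q[next_state][next_action]
            else:            # Q-learning: bootstrap from the best next action
                target = reward + gamma * np.max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q

sarsa_Q = train(on_policy=True)
qlearn_Q = train(on_policy=False)
```

With these tables, a greedy rollout from the start state using qlearn_Q will typically hug the cliff edge, while one using sarsa_Q will detour through the safer rows above it, which matches the behaviour described above.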