- TensorFlow Reinforcement Learning Quick Start Guide
- Kaushik Balakrishnan
Learning SARSA
SARSA is another on-policy algorithm that was very popular, particularly in the 1990s. It extends the TD learning we saw previously: SARSA maintains an estimate of the state-action value function, and as new experiences are encountered, this estimate is updated using the Bellman equation of dynamic programming. Extending the preceding TD algorithm to the state-action value function, Q(st, at), gives the SARSA update:

Q(st, at) ← Q(st, at) + α [rt+1 + γ Q(st+1, at+1) − Q(st, at)]

Here, from a given state st, we take action at, receive a reward rt+1, transition to a new state st+1, and then take an action at+1 from st+1, and so on. This quintuple (st, at, rt+1, st+1, at+1) gives the algorithm its name: SARSA. It is on-policy because the policy being updated is the same policy used to select the actions that appear in the update, at and at+1. For exploration, you can use, say, an ε-greedy policy.
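The update above can be sketched in a few lines of tabular code. The following is a minimal illustration, not the book's implementation: the environment is an assumed toy chain MDP (states 0 to 4, actions left/right, reward +1 on reaching the terminal state 4), and the `sarsa` function, hyperparameters, and ε-greedy helper are all hypothetical names chosen for this sketch.

```python
import numpy as np

# Toy deterministic chain MDP (an assumption for illustration):
# states 0..4; action 0 moves left, action 1 moves right;
# reward +1 on reaching state 4, which is terminal.
N_STATES, N_ACTIONS = 5, 2
GOAL = 4

def step(s, a):
    """Environment dynamics: returns (next_state, reward, done)."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    done = (s_next == GOAL)
    return s_next, (1.0 if done else 0.0), done

def epsilon_greedy(Q, s, eps, rng):
    """Explore with probability eps, otherwise act greedily w.r.t. Q."""
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[s]))

def sarsa(episodes=200, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, eps, rng)   # a_t chosen by the same policy
        done = False
        while not done:
            s_next, r, done = step(s, a)
            a_next = epsilon_greedy(Q, s_next, eps, rng)  # a_{t+1}, on-policy
            # SARSA update: target uses the action actually taken next.
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q

Q = sarsa()
# The learned greedy policy should move right (action 1) in every
# non-terminal state of this chain.
print([int(np.argmax(Q[s])) for s in range(GOAL)])
```

Note that the bootstrap term uses Q(st+1, at+1) for the action the behavior policy actually selects, which is exactly what makes SARSA on-policy; Q-learning would instead use max over actions at st+1.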