SARSA
The State-action-reward-state-action (SARSA) algorithm implements an on-policy temporal-difference (TD) method, in which the update of the action-value function is performed based on the outcome of the transition from state s to state s' through action a, following a selected policy π(s, a).
There are greedy policies, which always choose the action that provides the maximum estimated reward, and non-deterministic policies (ε-greedy, ε-soft, softmax), which ensure an element of exploration during the learning phase.
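As an illustration only (the function names and the use of NumPy below are my own assumptions, not code from the book), two such non-deterministic selection rules can be sketched as follows:

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon explore uniformly, otherwise exploit the best action
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def softmax_policy(q_values, tau=1.0):
    # Sample an action with probability proportional to exp(Q / tau)
    prefs = np.exp((np.asarray(q_values) - np.max(q_values)) / tau)  # subtract max for numerical stability
    probs = prefs / prefs.sum()
    return int(np.random.choice(len(q_values), p=probs))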
In SARSA, it is necessary to estimate the action-value function q(s, a), because the value function v(s) alone, in the absence of a model of the environment, is not sufficient to let the policy determine which action is best performed in a given state. The values are therefore estimated step by step, following the same Bellman-style update used for v(s), but applied to the state-action pair instead of a single state.
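Concretely, the standard tabular form of this step-by-step update (with a learning rate α and a discount factor γ, symbols not introduced in the excerpt itself) is:

Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]

where r is the reward observed after taking action a in state s, and a' is the action selected in the next state s' by the same policy.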
Being on-policy, SARSA estimates the action-value function based on the behavior of the policy π, and at the same time adjusts the greedy behavior of that policy with respect to the updated action-value estimates. The convergence of SARSA, and more generally of all TD methods, depends on the nature of the policies.
The following is the pseudocode for the SARSA algorithm:
Initialize Q(s, a) arbitrarily for all states s and actions a
Repeat (for each episode):
    Initialize s
    Choose a from s using the policy derived from Q (for example, ε-greedy)
    Repeat (for each step of the episode):
        Take action a, observe r, s'
        Choose a' from s' using the policy derived from Q (for example, ε-greedy)
        Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]
        s ← s'; a ← a'
    until s is terminal
The update rule of the action-value function uses all five elements of the transition, (s, a, r, s', a'); for this reason, it is called SARSA.
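To make the procedure concrete, the following is a minimal sketch of tabular SARSA in Python with NumPy; the tiny corridor environment, its size, and all hyperparameter values are illustrative assumptions and are not taken from the book:

import numpy as np

N_STATES = 6      # states 0..5; state 5 is the terminal goal
N_ACTIONS = 2     # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON, EPISODES = 0.1, 0.99, 0.1, 500   # illustrative hyperparameters

def step(state, action):
    # One transition of the corridor: reward 1 only when the goal is reached
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, float(done), done

def choose_action(q_row):
    # ε-greedy policy derived from the current action-value estimates
    if np.random.rand() < EPSILON:
        return np.random.randint(N_ACTIONS)
    return int(np.argmax(q_row))

Q = np.zeros((N_STATES, N_ACTIONS))       # arbitrary initial action-value function

for episode in range(EPISODES):
    s = 0                                 # initialize s
    a = choose_action(Q[s])               # choose a from s using the policy from Q
    done = False
    while not done:
        s_next, r, done = step(s, a)      # take action a, observe r, s'
        a_next = choose_action(Q[s_next]) # choose a' from s' using the policy from Q
        target = r if done else r + GAMMA * Q[s_next, a_next]
        Q[s, a] += ALPHA * (target - Q[s, a])   # SARSA update over (s, a, r, s', a')
        s, a = s_next, a_next             # update s, a

print(np.round(Q, 3))                     # learned action values after training

After enough episodes, acting greedily with respect to Q should drive the agent rightwards, toward the goal state.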