
Temporal difference learning

TD learning algorithms are based on reducing the differences between the estimates made by the agent at different times. A TD algorithm tries to predict a quantity that depends on the future values of a given signal; the name derives from the differences between predictions on successive time steps, which guide the learning process. The prediction at any time step is updated to bring it closer to the prediction of the same quantity at the next time step. In reinforcement learning, TD methods are used to predict a measure of the total amount of reward expected in the future.

TD learning combines the ideas of MC methods and DP.

MC methods solve reinforcement learning problems by averaging the returns obtained from sampled experience. DP is a set of algorithms that can be used to compute an optimal policy given a perfect model of the environment in the form of an MDP.
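
As a rough illustration of the MC side of this comparison, the following Python sketch estimates state values purely by averaging observed returns; the toy episodes and the discount factor are invented for the example and are not from the original text:

from collections import defaultdict

gamma = 0.9  # assumed discount factor for this toy example

# Each episode is a list of (state, reward) pairs: the reward received on
# leaving that state. The episodes themselves are invented for illustration.
episodes = [
    [("A", 0.0), ("B", 1.0), ("C", 10.0)],
    [("A", 0.0), ("C", 5.0)],
]

returns = defaultdict(list)
for episode in episodes:
    g = 0.0
    # Walk the episode backwards so g accumulates the discounted return
    # observed from each visited state onwards.
    for state, reward in reversed(episode):
        g = reward + gamma * g
        returns[state].append(g)

# The MC value estimate of a state is the average of the returns observed
# after visiting it (every-visit MC in this sketch).
V = {state: sum(gs) / len(gs) for state, gs in returns.items()}
print(V)

Note that every estimate here requires complete episodes; TD methods, as described next, remove that requirement by bootstrapping.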

Like MC, a TD algorithm can learn directly from raw experience, without a model of the dynamics of the environment. Like DP, it updates its estimates based partly on previously learned estimates, without waiting for the final outcome (bootstrapping). For a fixed policy, it converges if the step size is sufficiently small, or if it decreases over time.

Consecutive predictions are often related to each other, and TD methods are based on this assumption: they try to minimize the error between predictions at consecutive time steps. To do this, they compute the value function update using the Bellman equation. As already mentioned, the bootstrap technique is used to improve the prediction, thereby reducing the variance of the prediction at each update step.
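
A minimal Python sketch of this idea (tabular TD(0) prediction) might look as follows; the environment interface (env.reset / env.step), the policy callable, and the step-size and discount values are assumptions made for the illustration:

from collections import defaultdict

alpha = 0.1   # step size; convergence needs it small or decreasing over time
gamma = 0.9   # discount factor

V = defaultdict(float)  # tabular value estimates, initialised to zero

def td0_episode(env, policy):
    """Run one episode, updating V after every step by bootstrapping."""
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        # Bootstrapped target r + gamma * V(s'); the TD error is the
        # difference between this target and the current estimate V(s).
        target = reward + gamma * V[next_state] * (not done)
        V[state] += alpha * (target - V[state])
        state = next_state

Because the update happens after every step rather than at the end of the episode, the estimate is refined continuously from previously learned values.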

TD algorithms can be further distinguished by how they choose the actions on which their updates are based. In on-policy TD methods, the update is made on the basis of the results of actions determined by the policy being followed; in off-policy methods, different policies can be evaluated through hypothetical actions that are not actually taken. Unlike on-policy methods, off-policy methods can therefore separate the problem of exploration from that of control, learning policies that are not necessarily followed during the learning phase.
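
To make the distinction concrete, the following sketch contrasts the update targets typically used by an on-policy method (SARSA) and an off-policy method (Q-learning); all names and values here are hypothetical and serve only to show where the two rules differ:

from collections import defaultdict

alpha, gamma = 0.1, 0.9
actions = ["left", "right"]
Q = defaultdict(float)  # action-value estimates

# One hypothetical transition: next_action is the action the behaviour
# policy actually chose in the next state.
state, action, reward, next_state, next_action = "s0", "left", 1.0, "s1", "right"

# On-policy (SARSA): bootstrap on the action the policy really takes next.
sarsa_target = reward + gamma * Q[(next_state, next_action)]
Q[(state, action)] += alpha * (sarsa_target - Q[(state, action)])

# Off-policy (Q-learning): bootstrap on the greedy action, regardless of the
# action the behaviour policy will actually take.
q_target = reward + gamma * max(Q[(next_state, a)] for a in actions)
Q[(state, action)] += alpha * (q_target - Q[(state, action)])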

The most widely used TD learning algorithms are the following:

  • SARSA
  • Q-learning
  • Deep Q-learning

In the following sections, we will analyze the main characteristics of these algorithms and the substantial differences between them.
