
Temporal difference learning

TD learning algorithms are based on reducing the differences between estimates made by the agent at different times. These algorithms try to predict a quantity that depends on the future values of a given signal. The name derives from the differences between predictions made at successive time steps, which are used to guide the learning process: the prediction at any given time is updated to bring it closer to the prediction of the same quantity at the next time step. In reinforcement learning, TD methods are used to predict a measure of the total amount of reward expected in the future.

TD learning is a combination of the ideas behind the MC method and DP.

MC methods solve reinforcement learning problems by averaging the returns obtained from sampled experience. DP, on the other hand, is a set of algorithms that can be used to compute an optimal policy given a perfect model of the environment in the form of an MDP.
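
As a point of reference, the following minimal sketch illustrates the MC idea of estimating a state value as the average of observed returns. The episode data format (a list of state/return pairs) and the state encoding are assumptions made for illustration, not taken from the text:

```python
import numpy as np

def mc_value_estimate(episodes, n_states):
    """Estimate V(s) as the average of the returns observed from each state.

    'episodes' is assumed to be a list of (state, G) pairs, where G is the
    return observed after visiting 'state' under a fixed policy.
    """
    returns_sum = np.zeros(n_states)
    returns_count = np.zeros(n_states)
    for state, G in episodes:
        returns_sum[state] += G
        returns_count[state] += 1
    # Average the sampled returns; states never visited keep a value of 0
    return returns_sum / np.maximum(returns_count, 1)
```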

Like MC methods, a TD algorithm can learn directly from raw experience, without a model of the dynamics of the environment. Like DP, it updates its estimates based partly on previously learned estimates, without waiting for the final outcome (bootstrapping). For a fixed policy, it converges if the step size is sufficiently small, or if it decreases over time.

Consecutive predictions are often related to each other, and TD methods are based on this assumption. These methods try to minimize the error between predictions made at consecutive time steps. To do this, they compute the value function update using the Bellman equation. As already mentioned, the bootstrap technique is used to improve the prediction, thereby reducing the variance of the prediction at each update step.
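
The following is a minimal sketch of this update for state-value prediction (TD(0)) under a fixed policy. The environment interface (env.reset(), env.step(action)) and the policy function are assumptions made for illustration, not part of the original text:

```python
import numpy as np

def td0_prediction(env, policy, n_states, n_episodes=1000, gamma=0.99, alpha=0.1):
    V = np.zeros(n_states)                      # value estimates, one per state
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bellman-style bootstrap: the target uses the current estimate of
            # the next state instead of waiting for the final return
            td_target = reward + gamma * V[next_state] * (not done)
            td_error = td_target - V[state]
            V[state] += alpha * td_error        # a small (or decaying) alpha aids convergence
            state = next_state
    return V
```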

The different TD algorithms can be distinguished by how actions are chosen during learning. There are on-policy TD methods, in which updates are based on the results of actions determined by the policy currently being followed, and off-policy methods, in which other policies can be evaluated through hypothetical actions that are not actually taken. Unlike on-policy methods, off-policy methods can separate the problem of exploration from that of control, learning a policy that is not necessarily the one followed during the learning phase.
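
A hedged sketch of how this distinction shows up in the update target follows. Here Q is assumed to be a |S| x |A| NumPy table, and next_action is the action actually chosen by the behaviour policy; both are illustrative assumptions, not definitions from the text:

```python
import numpy as np

def sarsa_target(Q, reward, next_state, next_action, gamma=0.99):
    # On-policy (SARSA): bootstrap from the action the policy will really take
    return reward + gamma * Q[next_state, next_action]

def q_learning_target(Q, reward, next_state, gamma=0.99):
    # Off-policy (Q-learning): bootstrap from the greedy action, regardless of
    # which action the behaviour policy actually takes next
    return reward + gamma * np.max(Q[next_state])
```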

The most used TD learning algorithms are the following:

  • SARSA
  • Q-learning
  • Deep Q-learning

In the following sections, we will analyze the main characteristics of these algorithms and the substantial differences between them.
