
Rewards

In RL literature, the reward at a time instant t is typically denoted as rt. Thus, the total reward earned in an episode is given by R = r1 + r2 + ... + rT, where T is the length of the episode (which can be finite or infinite).

The concept of discounting is central to RL: a parameter called the discount factor, typically represented by γ with 0 ≤ γ ≤ 1, multiplies a reward k steps in the future by γ^k. Setting γ = 0 makes the agent myopic, aiming only for immediate rewards; γ = 1 makes the agent so far-sighted that it can procrastinate the accomplishment of the final goal indefinitely. Thus, a value of γ strictly between 0 and 1 is used to ensure that the agent is neither too myopic nor too far-sighted. The discount factor ensures that the agent prioritizes its actions to maximize the total discounted reward from time instant t, denoted Rt, which is given by the following:

R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{k=0}^{∞} γ^k r_{t+k}

Since γ < 1, rewards in the distant future are valued much less than rewards that the agent can earn in the immediate future. This helps the agent avoid wasting time and prioritize its actions. In practice, a value of γ between 0.9 and 0.99 is typically used in most RL problems.
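To make the formula concrete, here is a minimal sketch of how the discounted return Rt could be computed for every time step of a finite episode. The reward values and γ used below are illustrative, not from the text; the key idea is the backward recursion Rt = rt + γ·Rt+1.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards through the episode: R_t = r_t + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical episode: a reward of 1.0 at the first and last steps
rewards = [1.0, 0.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.9))
# [1.729, 0.81, 0.9, 1.0]
```

Computing the returns backwards in a single pass avoids evaluating each sum from scratch, which would cost O(T²) for an episode of length T.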
