- TensorFlow Reinforcement Learning Quick Start Guide
- Kaushik Balakrishnan
Rewards
In the RL literature, the reward received at time instant t is typically denoted rt. Thus, the total reward earned in an episode is given by R = r1 + r2 + ... + rT, where T is the length of the episode (which can be finite or infinite).
RL also uses the concept of discounting: a parameter called the discount factor, typically denoted γ with 0 ≤ γ ≤ 1, is raised to a power k and multiplies the reward k steps into the future. Setting γ = 0 makes the agent myopic, aiming only for immediate rewards; γ = 1 makes the agent so far-sighted that it may procrastinate the accomplishment of the final goal. Thus, a value of γ strictly between 0 and 1 is used to ensure that the agent is neither too myopic nor too far-sighted. γ ensures that the agent prioritizes its actions to maximize the total discounted reward from time instant t, Rt, which is given by the following:

Rt = rt+1 + γ rt+2 + γ² rt+3 + ... = Σ (k = 0 to ∞) γᵏ rt+k+1
Since 0 ≤ γ < 1, rewards in the distant future are valued much less than rewards the agent can earn in the immediate future. This helps the agent not waste time and to prioritize its actions. In practice, γ = 0.9-0.99 is typically used in most RL problems.
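The discounted return above can be sketched in a few lines of Python. This is a minimal illustration, not code from the book; the reward values are hypothetical, chosen only to show how γ down-weights later rewards.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute Rt = sum over k of gamma**k * r_{t+k+1}.

    rewards: list of future rewards, starting at r_{t+1}.
    gamma:   discount factor, 0 <= gamma <= 1.
    """
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical episode: a small immediate reward, then a large reward 3 steps later.
rewards = [1.0, 0.0, 0.0, 10.0]
print(discounted_return(rewards, gamma=0.9))  # 1.0 + 0.9**3 * 10.0 ≈ 8.29
print(discounted_return(rewards, gamma=0.0))  # 1.0 (myopic: only the immediate reward counts)
```

With γ = 0.9 the delayed reward of 10 still contributes about 7.29, whereas with γ = 0 it contributes nothing, matching the myopic behavior described above.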