Gamma – current versus future rewards
Let's discuss the concept of current rewards versus future rewards. Your agent's discount rate gamma has a value between zero and one, and its function is to discount future rewards against immediate rewards.
Your agent decides what action to take based not only on the reward it expects to receive for taking that action, but also on the future rewards it might be able to collect from the state it will be in after taking that action.
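To make this concrete, one common way to combine current and future rewards is the discounted return, where a reward arriving t steps in the future is weighted by gamma raised to the power t. Here is a minimal Python sketch of that calculation (the function name and reward values are illustrative, not from the book):

```python
# A minimal sketch (illustrative values) of how gamma discounts rewards
# that arrive further in the future.

def discounted_return(rewards, gamma):
    """Weight the reward received t steps ahead by gamma ** t and sum."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical sequence: one point of reward at each of four time steps.
rewards = [1, 1, 1, 1]

print(discounted_return(rewards, 0.9))  # 3.439 -- later rewards count less
```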
One easy way to illustrate discounting rewards is with the following example of a mouse in a maze collecting cheese as rewards and avoiding cats and traps (that is, electric shocks):

[Figure: a maze containing the mouse agent, cheese rewards worth one and three points, cats, and electric-shock traps]
The rewards closest to the cats, even though their point values are higher (three versus one), should be discounted if we want to maximize how long the mouse agent survives and how much cheese it collects. Those rewards come with a higher risk of the mouse being killed, so we lower their value accordingly. In other words, the cheese nearest the mouse (and farthest from the cats) should be given higher priority when the mouse decides what actions to take.
When we discount a future reward, we make it less valuable than an immediate reward, similar to how we account for the time value of money when making a loan: a dollar received today is treated as more valuable than a dollar received a year from now.
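The parallel is direct: gamma plays the same role as the discount factor 1 / (1 + r) in the time value of money. A quick sketch with a hypothetical interest rate (the numbers are illustrative, not from the book):

```python
# Illustrative aside: gamma corresponds to the time-value-of-money
# discount factor 1 / (1 + r).
interest_rate = 0.10                # hypothetical 10% annual rate
gamma = 1 / (1 + interest_rate)     # ~0.909

dollar_next_year = 1.0
print(gamma * dollar_next_year)     # ~0.91 -- a future dollar is worth less today
```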
The value of gamma that we choose varies according to how highly we value future rewards (the sketch after this list demonstrates both extremes):
- If we choose a value of zero for gamma, the agent will not care about future rewards at all and will only take current rewards into account
- Choosing a value of one for gamma will make the agent weigh future rewards just as heavily as current rewards
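A short sketch, again with illustrative numbers, makes the two extremes easy to see; the helper is the same discounted-return calculation from earlier:

```python
# Contrasting the extremes of gamma on the same hypothetical rewards.

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1, 3, 3]  # small reward now, larger rewards later

print(discounted_return(rewards, 0.0))  # 1.0  -- only the immediate reward counts
print(discounted_return(rewards, 1.0))  # 7.0  -- future rewards count fully
print(discounted_return(rewards, 0.5))  # 3.25 -- a middle ground
```

In practice, values between these extremes (often in the 0.9 to 0.99 range) are common, since a gamma of exactly one can prevent the discounted return from converging in tasks that never end.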