
Decaying epsilon

The more familiar your agent becomes with its environment, the less exploration we want it to do. As it discovers more and more rewards, the odds that it will find actions with higher reward values than the ones it has already discovered begin to decrease. It should increasingly stick with actions it knows are highly valued and do less and less exploration.

This concept is called exploration versus exploitation. Exploration refers to discovering new states that may be higher-valued than the ones our agent has already seen, and exploitation means visiting the highest-valued states it has seen to benefit from the rewards it already knows it will collect there. 

One popular illustration of this problem is the multi-armed bandit. A one-armed bandit is a slot machine, and an n-armed bandit is a hypothetical slot machine with n arms, each of which pays out with its own fixed, rigged probability.

We have a limited amount of money to put into this slot machine. Each arm will either give us a reward or not when we pull it, and each arm has a different probability of giving us a payout on each pull. 
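To make the setup concrete, here is a minimal sketch of such a bandit in Python. The `Bandit` class and the payout probabilities are illustrative assumptions, not values from the text:

```python
import random

class Bandit:
    """A minimal n-armed bandit: each arm pays out a reward of 1
    with its own fixed probability, unknown to the agent."""
    def __init__(self, payout_probs):
        self.payout_probs = payout_probs  # one payout probability per arm

    def pull(self, arm):
        # Return a reward of 1 with the arm's payout probability, else 0
        return 1 if random.random() < self.payout_probs[arm] else 0

bandit = Bandit([0.10, 0.25, 0.60, 0.45])  # a four-armed bandit
```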

We want to maximize our total rewards for the money that we put in. So, which arm should we pull on our next try? Take a look at the following diagram illustrating the state of the multi-armed bandit:

When we first begin to solve this task, we don't know what the true probability is that we'll receive a reward from any of the arms. The only way for us to learn what will happen is to pull the arms and see.

As the agent in this situation, we are learning what our environment is like as we go. When we don't know anything about the arms and their payouts, we want to pull all of the arms until we get a good idea of how often they will pay out. If we learn that one arm pays out more often than others, we want to start pulling that arm more often.

The only way we can have a clear idea of what the true payout probabilities are is by sampling the arms enough, but once we do have a clear signal for those probabilities we should follow it to maximize our total payout. So, how can we devise a strategy to do this?
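One simple strategy is epsilon-greedy: with probability epsilon we pull a random arm (explore), and otherwise we pull the arm with the highest estimated payout so far (exploit). The following sketch assumes the hypothetical `Bandit` class from above; the fixed epsilon of 0.1 is an illustrative choice:

```python
import random

def epsilon_greedy(bandit, n_arms, n_pulls, epsilon=0.1):
    counts = [0] * n_arms          # how many times each arm has been pulled
    estimates = [0.0] * n_arms     # running estimate of each arm's payout rate
    total_reward = 0

    for _ in range(n_pulls):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)         # explore: pick a random arm
        else:
            arm = estimates.index(max(estimates))  # exploit: pick the best estimate
        reward = bandit.pull(arm)
        counts[arm] += 1
        # Incrementally update this arm's estimated payout rate
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return total_reward, estimates
```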

We will explore solutions to the multi-armed bandit problem in later chapters, but it's important to be able to put it in the context of epsilon decay. This is because epsilon is the value that tells us how often we should explore new actions as opposed to exploiting the knowledge we already have of the actions we've taken; therefore, it's important that epsilon changes as we progress through an environment.

In the problems that we'll be working on, we'll see that we should be exploring less and exploiting more as we gain more knowledge of our environment, so epsilon should decrease as we progress. We'll discuss well-known strategies for decaying epsilon, or making it decrease, based on the parameters within the problem that we've set. 
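As a rough sketch of one common approach, epsilon can be decayed multiplicatively after each episode until it reaches a floor. The starting value, decay rate, and minimum below are illustrative assumptions, not values prescribed by the text:

```python
epsilon = 1.0        # start fully exploratory
epsilon_min = 0.01   # keep a small amount of exploration forever
decay_rate = 0.995   # multiplicative decay applied after each episode

for episode in range(1000):
    # ... run one episode, choosing random actions with probability epsilon ...
    epsilon = max(epsilon_min, epsilon * decay_rate)
```

With these values, epsilon falls gradually from 1.0 toward 0.01, so early episodes are dominated by exploration and later episodes by exploitation.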
