
Decaying epsilon

The more familiar your agent becomes with its environment, the less exploration we want it to do. As it discovers more and more rewards, the odds that it will find actions with higher reward values than the ones it has already discovered begin to decrease. It should increasingly stick with actions it knows are highly valued and explore less and less.

This concept is called exploration versus exploitation. Exploration refers to discovering new states that may be higher-valued than the ones our agent has already seen, and exploitation means visiting the highest-valued states it has seen to benefit from the rewards it already knows it will collect there. 
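To make the trade-off concrete, here is a minimal sketch of an epsilon-greedy selection rule in Python. The `q_values` array of estimated action values and the `epsilon_greedy` function name are illustrative assumptions, not code from this book:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """Explore with probability epsilon, otherwise exploit the current best estimate."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random
        return int(rng.integers(len(q_values)))
    # Exploit: pick the action with the highest estimated value so far
    return int(np.argmax(q_values))
```

With a high epsilon the agent mostly explores; as epsilon shrinks, it mostly exploits what it has already learned.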

One popular illustration of this problem is the multi-armed bandit. A one-armed bandit is a slot machine, and an n-armed bandit is a hypothetical slot machine with n arms, each of which pays out with its own fixed, unknown probability.

We have a limited amount of money to put into this slot machine. Each arm will either give us a reward or not when we pull it, and each arm has a different probability of giving us a payout on each pull. 

We want to maximize our total rewards for the money that we put in. So, which arm should we pull on our next try? Take a look at the following diagram illustrating the state of the multi-armed bandit:

When we first begin to solve this task, we don't know what the true probability is that we'll receive a reward from any of the arms. The only way for us to learn what will happen is to pull the arms and see.

As the agent in this situation, we are learning what our environment is like as we go. When we don't know anything about the arms and their payouts, we want to pull all of the arms until we get a good idea of how often they will pay out. If we learn that one arm pays out more often than others, we want to start pulling that arm more often.

The only way to get a clear idea of the true payout probabilities is to sample the arms enough, but once we do have a clear signal for those probabilities, we should follow it to maximize our total payout. So, how can we devise a strategy to do this?
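A small simulation can show how sampling sharpens our estimates of each arm's payout probability. The arm probabilities below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.25, 0.50, 0.75])  # hypothetical payout probabilities
pulls = np.zeros(3)                        # how many times each arm was pulled
wins = np.zeros(3)                         # how many times each arm paid out

for _ in range(1000):
    arm = rng.integers(3)                  # pure exploration: pick an arm at random
    pulls[arm] += 1
    wins[arm] += rng.random() < true_probs[arm]

estimates = wins / pulls                   # sampled estimate of each arm's payout rate
print(estimates)                           # approaches true_probs as the pull counts grow
```

Pure random sampling eventually reveals the best arm, but it wastes many pulls on arms we already suspect are poor, which is exactly why we want to shift from exploration to exploitation over time.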

We will explore solutions to the multi-armed bandit problem in later chapters, but it's important to put it in the context of epsilon decay. Epsilon is the value that tells us how often we should explore new actions rather than exploit the knowledge we already have of the actions we've taken; therefore, it's important that epsilon changes as we progress through an environment.

In the problems that we'll be working on, we'll see that we should be exploring less and exploiting more as we gain more knowledge of our environment, so epsilon should decrease as we progress. We'll discuss well-known strategies for decaying epsilon, or making it decrease, based on the parameters within the problem that we've set. 
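One common recipe, sketched below with made-up hyperparameters, multiplies epsilon by a decay factor after each episode while keeping it above a small floor so the agent never stops exploring entirely. The specific values are illustrative assumptions, not ones prescribed by this book:

```python
epsilon = 1.0        # start fully exploratory
epsilon_min = 0.01   # floor: never stop exploring completely (illustrative value)
decay_rate = 0.995   # multiplicative decay per episode (illustrative value)

for episode in range(1000):
    # ... act epsilon-greedily and learn from the episode here ...
    epsilon = max(epsilon_min, epsilon * decay_rate)
```

Linear schedules, which subtract a fixed amount per episode, are another common choice; which works best depends on how long the agent needs to keep exploring in a given problem.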
