- Hands-On Q-Learning with Python
- Nazia Habib
Decaying epsilon
The more familiar your agent becomes with its environment, the less exploration it should do. As it discovers more and more rewards, the odds that it will find actions with higher values than the ones it has already discovered begin to decrease. It should increasingly stick with actions it knows are highly valued and do less and less exploration.
This concept is called exploration versus exploitation. Exploration refers to discovering new states that may be higher-valued than the ones our agent has already seen, and exploitation means visiting the highest-valued states it has seen to benefit from the rewards it already knows it will collect there.
One popular illustration of this problem is the multi-armed bandit. A one-armed bandit is a slot machine, and an n-armed bandit is a hypothetical slot machine with n arms, each of which pays out with its own fixed, hidden probability.
We have a limited amount of money to put into this slot machine. Each arm will either give us a reward or not when we pull it, and each arm has a different probability of giving us a payout on each pull.
We want to maximize our total rewards for the money that we put in. So, which arm should we pull on our next try? Take a look at the following diagram illustrating the state of the multi-armed bandit:

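As a concrete illustration, here is a minimal sketch of an n-armed bandit in Python. The Bandit class, its pull method, and the payout probabilities are hypothetical names and values chosen for this example; the agent never sees the probabilities directly, it only observes the reward returned by each pull.

```python
import numpy as np

class Bandit:
    """A hypothetical n-armed bandit: each arm pays out 1 with a fixed,
    hidden probability, and 0 otherwise."""
    def __init__(self, payout_probs):
        self.payout_probs = payout_probs  # true payout rates, hidden from the agent

    def pull(self, arm):
        # Reward is 1 with the arm's payout probability, otherwise 0
        return 1 if np.random.random() < self.payout_probs[arm] else 0

# Four arms with made-up payout probabilities
bandit = Bandit([0.10, 0.25, 0.60, 0.45])
print(bandit.pull(2))  # prints 0 or 1
```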
When we first begin to solve this task, we don't know what the true probability is that we'll receive a reward from any of the arms. The only way for us to learn what will happen is to pull the arms and see.
As the agent in this situation, we are learning what our environment is like as we go. When we don't know anything about the arms and their payouts, we want to pull all of the arms until we get a good idea of how often they will pay out. If we learn that one arm pays out more often than others, we want to start pulling that arm more often.
The only way we can have a clear idea of what the true payout probabilities are is by sampling the arms enough, but once we do have a clear signal for those probabilities we should follow it to maximize our total payout. So, how can we devise a strategy to do this?
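One simple strategy, sketched below using the hypothetical Bandit class from earlier, is epsilon-greedy sampling: with probability epsilon we pull a random arm (exploration), and otherwise we pull the arm with the highest estimated payout so far (exploitation). The function name and the hyperparameter values are illustrative assumptions, not a definitive implementation.

```python
def epsilon_greedy(bandit, n_arms, n_pulls, epsilon=0.1):
    estimates = np.zeros(n_arms)  # running estimate of each arm's payout rate
    counts = np.zeros(n_arms)     # number of times each arm has been pulled
    total_reward = 0

    for _ in range(n_pulls):
        if np.random.random() < epsilon:
            arm = np.random.randint(n_arms)   # explore: pick a random arm
        else:
            arm = int(np.argmax(estimates))   # exploit: best arm seen so far

        reward = bandit.pull(arm)
        counts[arm] += 1
        # Incremental mean keeps estimates[arm] equal to the average observed reward
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, total_reward

estimates, total = epsilon_greedy(bandit, n_arms=4, n_pulls=1000)
print(estimates, total)
```

Notice that with a fixed epsilon of 0.1, the agent keeps spending 10% of its pulls on exploration even after its estimates have settled, which is exactly the motivation for decaying epsilon over time.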
We will explore solutions to the multi-armed bandit problem in later chapters, but it's important to place it in the context of epsilon decay. Epsilon is the value that tells us how often to explore new actions as opposed to exploiting the knowledge we already have of the actions we've taken; because that balance should shift over time, epsilon needs to change as we progress through an environment.
In the problems that we'll be working on, we'll see that we should be exploring less and exploiting more as we gain more knowledge of our environment, so epsilon should decrease as we progress. We'll discuss well-known strategies for decaying epsilon, or making it decrease, based on the parameters within the problem that we've set.
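As a preview, here is one common schedule, sketched with illustrative values (the specific numbers and the multiplicative form are assumptions for this example, not the book's prescribed settings): multiply epsilon by a decay factor after every episode and clamp it at a small minimum so the agent never stops exploring entirely.

```python
epsilon = 1.0          # start fully exploratory
epsilon_min = 0.01     # always keep a little exploration
epsilon_decay = 0.995  # multiplicative decay applied after each episode

for episode in range(1000):
    # ... run one episode, choosing a random action with probability epsilon
    #     and the greedy (highest-valued) action otherwise ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
```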