- Hands-On Q-Learning with Python
- Nazia Habib
Bandit optimization strategies
One strategy for deciding which arm to pull at this point is a simple greedy strategy; that is, pull the arm that currently looks best and don't worry about any of the other arms. This is, of course, a naive approach, because although arm 4 might currently look like the best arm to pull, we don't know yet whether any of the other arms will be better than 4.
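The greedy strategy can be sketched in a few lines. The estimates below are purely illustrative, standing in for the average rewards observed so far:

```python
# Hypothetical estimated mean rewards for five arms (indices 0-4),
# based on the pulls we have made so far; the numbers are illustrative.
q_values = [0.2, 0.5, 0.1, 0.3, 0.7]

# The greedy strategy: always pull the arm with the highest current estimate.
greedy_arm = q_values.index(max(q_values))
print(greedy_arm)  # arm 4 looks best, so greedy pulls it every time
```

Notice that nothing in this rule ever revisits the other arms, which is exactly why it can lock onto the wrong one.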
An even worse outcome might be that, since arm 2 was the first arm we pulled that gave us a reward, we might decide to keep pulling it forever and never discover a potentially better arm such as arm 4. In optimization problems, we call this getting stuck on a local maximum, as illustrated in the following diagram.
When we are stuck on a local maximum, all of the reward values we see near us are less than our current reward value, so from our limited information state we think that we have reached an optimal path and should stay there and continue to collect rewards. From where we are, we can't see the global maximum that could give us more rewards if we were to seek it out and locate it:

Getting stuck on a local maximum happens when an agent repeatedly surveys its available actions and chooses the same one over and over because it happens to have the current highest value. From where it is, the agent cannot reach any states that would give it enough information to increase the value of the other actions – note that it can never discover any values higher than the current local maximum if it does not branch out and take other actions.
There are other, better strategies than the greedy strategy that we will use, starting with the epsilon-greedy strategy, which we will discuss in depth at the beginning of Chapter 7, Decoupling Exploration and Exploitation in Multi-Armed Bandits. Let's cover it briefly here.
Remember that epsilon is the probability that we will take an action other than the current highest Q-valued action. Using the epsilon-greedy strategy with epsilon = 0.1, an agent will choose the current highest Q-valued machine with probability 0.9 (that is, 1 - epsilon) and any other machine with probability 0.1.
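A minimal sketch of that selection rule might look as follows. Note that in this common formulation the exploration branch draws uniformly over all arms, so it can occasionally re-pick the greedy arm as well:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an arm: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        # Explore: choose any arm uniformly at random
        # (this may happen to be the greedy arm too).
        return random.randrange(len(q_values))
    # Exploit: choose the arm with the highest current Q-value estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With epsilon = 0.1 and four arms, the best arm is pulled roughly 92% of the time (0.9 from exploitation plus 0.1/4 from exploration), while every other arm still gets sampled occasionally.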
We can set epsilon to any value we choose, and we can choose to decay it over time as we conduct repeated trials. As we'll see in our example implementations, epsilon-greedy strategies suffer when the agent gets stuck exploiting a suboptimal path before it has a chance to explore all paths and find the best one. Our goal will be to find a strategy that efficiently gets us to an optimal explore-exploit balance and brings us as quickly as possible to the highest-yielding actions (that is, exploitation paths).
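One simple way to decay epsilon over trials is multiplicatively; the starting value and decay rate below are illustrative choices, not values prescribed by the book:

```python
def decayed_epsilon(epsilon_start, decay_rate, trial):
    """Multiplicative epsilon decay: explore heavily at first,
    then shift toward exploitation as trials accumulate.
    Parameter values here are illustrative assumptions."""
    return epsilon_start * (decay_rate ** trial)

print(decayed_epsilon(1.0, 0.99, 0))    # 1.0  -> pure exploration early on
print(decayed_epsilon(1.0, 0.99, 100))  # ~0.366 -> mostly exploitation later
```

The schedule trades off the two phases: a high early epsilon helps the agent avoid locking onto a local maximum, while the shrinking tail lets it settle on the best arm it has found.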