
Bandit optimization strategies

One strategy for deciding which arm to pull at this point is a simple greedy strategy; that is, pull the arm that currently looks best and don't worry about any of the other arms. This is, of course, a naive approach, because although arm 4 might currently look like the best arm to pull, we don't know yet whether any of the other arms will be better than 4.
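As a minimal sketch (the arm count, value estimates, and function name here are illustrative, not taken from the text), a purely greedy agent simply pulls the arm with the highest current value estimate on every trial:

```python
import numpy as np

def greedy_choice(q_estimates):
    """Always pull the arm whose current value estimate is highest."""
    return int(np.argmax(q_estimates))

# Hypothetical running value estimates for five arms (indices 0-4)
q_estimates = np.array([0.0, 0.2, 0.5, 0.1, 0.7])
print(greedy_choice(q_estimates))  # always arm 4, even if an unexplored arm is truly better
```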

An even worse outcome might be that, since arm 2 was the first arm we pulled that gave us a reward, we decide to keep pulling it forever and never discover a potentially better arm such as arm 4. In optimization problems, we call this getting stuck on a local maximum, as illustrated in the following diagram.

When we are stuck on a local maximum, all of the reward values we see near us are less than our current reward value, so from our limited information state we think that we have reached an optimal path and should stay there and continue to collect rewards. From where we are, we can't see the global maximum that could give us more rewards if we were to seek it out and locate it:

Getting stuck on a local maximum happens when an agent continually takes stock of its available actions and chooses the same one over and over because it happens to have the highest current value. Because it cannot reach any states from where it is that would give it enough information to increase the value of the other actions, it can never discover any values higher than the current local maximum unless it branches out and takes other actions.

There are other, better strategies than the greedy strategy that we will use, starting with the epsilon-greedy strategy, which we will discuss in depth at the beginning of Chapter 7, Decoupling Exploration and Exploitation in Multi-Armed Bandits, on solving bandit problems. Let's cover it briefly here.

Remember that epsilon is the probability that we will take an action other than the current highest Q-valued action. Using the epsilon-greedy strategy with epsilon = 0.1, an agent will choose the current highest Q-valued machine with a probability of 0.9 (1 - epsilon) and any other machine with a probability of 0.1.
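A minimal sketch of that selection rule, assuming we track our Q-value estimates in a NumPy array (the function and variable names are illustrative):

```python
import numpy as np

def epsilon_greedy_choice(q_estimates, epsilon=0.1, rng=None):
    """With probability epsilon, explore a random arm; otherwise exploit the best-known arm."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))  # explore: pick any arm at random
    return int(np.argmax(q_estimates))              # exploit: pick the current best arm
```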

We can set epsilon to any value we choose, and we can decay it over time as we conduct repeated trials. As we'll see in our example implementations, epsilon-greedy strategies suffer when the agent gets stuck exploiting a suboptimal path before it gets a chance to explore all paths and find the best one. Our goal will be to find a strategy that efficiently gets us to an optimal explore-exploit balance and brings us as quickly as possible to the highest-yielding actions (that is, exploitation paths).
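One common way to decay epsilon over repeated trials, shown here only as an illustrative schedule rather than the book's implementation, is to multiply it by a factor slightly less than 1 after each pull so the agent explores heavily at first and exploits more as its estimates improve:

```python
epsilon, decay, min_epsilon = 1.0, 0.99, 0.01
for trial in range(1000):
    # ... choose an arm with epsilon_greedy_choice(q_estimates, epsilon), observe the reward,
    # and update q_estimates for the pulled arm ...
    epsilon = max(min_epsilon, epsilon * decay)  # explore less as the trials accumulate
```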
