
Bandit optimization strategies

One strategy for deciding which arm to pull at this point is a simple greedy strategy; that is, pull the arm that currently looks best and don't worry about any of the other arms. This is, of course, a naive approach, because although arm 4 might currently look like the best arm to pull, we don't know yet whether any of the other arms will be better than 4.
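To make the greedy strategy concrete, here is a minimal sketch of it in Python. The five payout probabilities, the number of pulls, and the sample-average update rule are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical five-armed bandit: each arm pays 1 with its own probability,
# which the agent does not know in advance.
true_payout_probs = [0.25, 0.40, 0.10, 0.55, 0.30]
n_arms = len(true_payout_probs)

q_values = np.zeros(n_arms)      # the agent's current reward estimate for each arm
pull_counts = np.zeros(n_arms)

for step in range(1000):
    arm = int(np.argmax(q_values))                       # greedy: pull the best-looking arm
    reward = float(rng.random() < true_payout_probs[arm])
    pull_counts[arm] += 1
    # incremental sample-average update of the chosen arm's estimate
    q_values[arm] += (reward - q_values[arm]) / pull_counts[arm]

print("estimates:", q_values.round(2), "pulls:", pull_counts)
```

Because this agent never deliberately explores, it tends to lock onto the first arm that happens to pay out, which is exactly the trap described next.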

An even worse outcome might be that, since arm 2 was the first arm we pulled that gave us a reward, we decide to keep pulling it forever and never discover a potentially better arm such as 4. In optimization problems, we call this getting stuck on a local maximum, as illustrated in the following diagram.

When we are stuck on a local maximum, all of the reward values we see near us are less than our current reward value, so from our limited information state we think that we have reached an optimal path and should stay there and continue to collect rewards. From where we are, we can't see the global maximum that could give us more rewards if we were to seek it out and locate it:

Getting stuck on a local maximum happens when an agent continually takes stock of its available actions and chooses the same one over and over because it happens to have the current highest value. Because it cannot reach any states, from where it is, that would give it enough information to increase the value of the other actions, it can never discover any values higher than the current local maximum unless it branches out and takes other actions.

There are other, better strategies than the greedy strategy that we will use, starting with the epsilon-greedy strategy, which we will discuss in depth at the beginning of Chapter 7, Decoupling Exploration and Exploitation in Multi-Armed Bandits, on solving bandit problems. Let's cover it briefly here.

Remember that epsilon is the probability that we will take an action other than the current highest Q-valued action. Using the epsilon-greedy strategy, when epsilon = 0.1, an agent will choose the current highest Q-valued machine with probability 0.9 (1 - epsilon) and any other machine with probability 0.1.
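As a sketch of that selection rule, assuming the numpy import, q_values array, and rng from the earlier sketch; note that this common form of epsilon-greedy explores uniformly over all arms, so the best-looking arm can occasionally be picked during an exploration step as well:

```python
def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick an arm at random; otherwise pick the best-looking arm."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: a uniformly random arm
    return int(np.argmax(q_values))               # exploit: the current highest Q-valued arm
```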

We can set epsilon to any value we choose, and we can choose to decay it over time as we conduct repeated trials. As we'll see in our example implementations, epsilon-greedy strategies suffer when the agent gets stuck on a suboptimal path before it has had a chance to explore all paths and find the best one. Our goal will be to find a strategy that efficiently gets us to an optimal explore-exploit balance and brings us as quickly as possible to the highest-yielding actions (that is, the exploitation paths).
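One simple way to decay epsilon over repeated trials is to multiply it by a constant after each pull, down to some floor. This sketch reuses the hypothetical bandit and the epsilon_greedy_action helper from the earlier sketches; the decay rate and floor are illustrative choices, not values from the text:

```python
epsilon = 0.1        # starting exploration probability
epsilon_min = 0.01   # keep a small amount of exploration
decay = 0.995        # multiplicative decay applied after each trial

for step in range(1000):
    arm = epsilon_greedy_action(q_values, epsilon, rng)
    reward = float(rng.random() < true_payout_probs[arm])
    pull_counts[arm] += 1
    q_values[arm] += (reward - q_values[arm]) / pull_counts[arm]
    epsilon = max(epsilon_min, epsilon * decay)   # explore less as the estimates improve
```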
