
Epsilon – exploration versus exploitation

Your agent's exploration rate, epsilon, also ranges from zero to one. As the agent explores its environment, it learns that some actions are better to take than others, but what about the states and actions it hasn't seen yet? We don't want it to get stuck at a local maximum, repeatedly taking the actions that currently have the highest values when there might be better actions it hasn't tried yet.

When you set your epsilon value, there will be a probability equal to epsilon that your agent will take a random (exploratory) action, and a probability equal to 1-epsilon that it will take the current highest Q-valued action for its current state. As we step through a full Q-table update example in the SARSA and the cliff-walking problem section, we'll see how the value that we choose for epsilon affects the rate at which the Q-table converges and the agent discovers the optimal solution. 
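To make this concrete, here is a minimal sketch of epsilon-greedy action selection in Python; the choose_action name and the Q-table shape are illustrative placeholders, not the code we'll build in the cliff-walking example:

import numpy as np

# A minimal sketch of epsilon-greedy action selection; q_table, state,
# and n_actions are placeholder names for illustration only.
def choose_action(q_table, state, n_actions, epsilon):
    if np.random.random() < epsilon:
        # Explore: take a uniformly random action with probability epsilon
        return np.random.randint(n_actions)
    # Exploit: take the highest Q-valued action for the current state
    return int(np.argmax(q_table[state]))

q_table = np.zeros((48, 4))   # e.g. 48 states x 4 actions for a small grid world
action = choose_action(q_table, state=0, n_actions=4, epsilon=0.1)

With epsilon set to 0.1, roughly one action in ten is exploratory and the rest exploit the current Q-table.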

As the agent gets more and more familiar with its environment, we want it to start sticking to the high-valued actions it's already discovered and do less exploration of the states it hasn't seen. We achieve this by having epsilon decay over time as the agent learns more about its environment and the Q-table converges on its final optimal values.
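One simple way to do this, sketched below with illustrative numbers rather than tuned values, is to multiply epsilon by a constant factor after each episode and clip it to a small floor so the agent never stops exploring entirely:

# Illustrative constant-decay schedule; the starting value, decay
# factor, floor, and episode count are example numbers only.
num_episodes = 500
epsilon = 1.0          # start fully exploratory
epsilon_decay = 0.99   # constant multiplicative decay applied per episode
epsilon_min = 0.01     # keep a little exploration even late in training

for episode in range(num_episodes):
    # ... run one episode, choosing actions epsilon-greedily ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)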

There are many different ways to decay epsilon, either by using a constant decay factor, as in the sketch above, or by basing the decay on some other internal variable. Ideally, we want the epsilon decay function to be directly based on the Q-values that we've already discovered. We'll discuss what this means in the next section.
