
Actions and rewards

An action is any decision that we make from a state, and the state we are in determines which actions are available to us. If we are in a maze with a wall immediately to our left, we can't turn left, but in other locations we can; whether turning left is in the list of possible actions depends on the particular state we are in.

A reward is the outcome we receive for making a decision in an environment. Our Q-learning agent will keep track of the rewards it receives and will try to maximize the future rewards that it expects to receive with each action it takes. 

The reward function for a driving simulator can be something straightforward, such as the following:

  • +1 for moving one block
  • +10 for making the correct move that will get you closer to the destination
  • +100 for reaching the destination
  • -20 for violating traffic laws
  • -50 for hitting another vehicle
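
A scheme like this maps naturally onto a small function. The following is a minimal sketch; the flag names are hypothetical and only illustrate how the bonuses and penalties listed above would be added up for a single step:

    def compute_reward(moved_one_block, closer_to_destination,
                       reached_destination, violated_traffic_law, hit_vehicle):
        # Hypothetical step flags; a real simulator would report these events itself.
        reward = 0
        if moved_one_block:
            reward += 1
        if closer_to_destination:
            reward += 10
        if reached_destination:
            reward += 100
        if violated_traffic_law:
            reward -= 20
        if hit_vehicle:
            reward -= 50
        return reward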

Every action we take leads to a reward, and each reward is noted by the learning agent as it explores its environment and learns the best action to take in each state. For a Q-learning agent, these rewards are stored in a Q-table, a simple lookup table that maps each state-action pair to a value. We will be creating a Q-table as part of our first project, which uses the OpenAI Gym Taxi-v2 environment; the environment renders as an ASCII screenshot of a small grid.
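
A minimal sketch of creating and rendering the environment, assuming an older Gym release in which Taxi-v2 is still registered:

    import gym

    env = gym.make('Taxi-v2')
    env.reset()     # place the taxi, passenger, and destination randomly
    env.render()    # prints the grid as ASCII text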

The Taxi-v2 environment simulates a taxicab driving around a small grid, picking up passengers and dropping them off at the correct locations. Retrieving the action space and state space from our taxi environment lets us know how many discrete actions and states we have:
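
Continuing with the env object created above, a quick sketch of how to query both spaces:

    print(env.action_space)          # Discrete(6)
    print(env.observation_space)     # Discrete(500)
    print(env.action_space.n, env.observation_space.n)   # 6 500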

The following is a representation of a Q-table for Taxi-v2. Note that it lists 500 states and 6 actions (South, North, East, West, Pickup, and Dropoff):
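
One straightforward way to represent such a table in code is a NumPy array with one row per state and one column per action. A minimal sketch, reusing the env object from above:

    import numpy as np

    # Rows are states; columns are South, North, East, West, Pickup, Dropoff.
    q_table = np.zeros([env.observation_space.n, env.action_space.n])
    print(q_table.shape)   # (500, 6)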

When the Q-table is initialized, each state-action pair has a value of zero, because the agent has not seen any rewards yet and has not set a value for any action. Once it explores its environment, it starts to fill in values for each state-action pair and to use those values to decide what actions to take the next time it is in those states. 
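
As a preview of how a single value gets filled in, here is a hedged sketch of one update step; the rule shown is the standard Q-learning update, and the alpha (learning rate) and gamma (discount factor) values are illustrative rather than the ones we'll settle on later:

    alpha, gamma = 0.1, 0.9                 # illustrative hyperparameter values

    state = env.reset()
    action = env.action_space.sample()      # explore with a random action
    next_state, reward, done, info = env.step(action)

    # Nudge the stored value toward the observed reward plus the discounted
    # value of the best action available from the next state.
    old_value = q_table[state, action]
    next_max = np.max(q_table[next_state])
    q_table[state, action] = old_value + alpha * (reward + gamma * next_max - old_value)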

If we, as the agent, are in a particular state, such as state 300, and have 6 possible actions to choose from (as in the taxi example), each action will have a different value depending on how much exploration we have done and how many iterations we have gone through. Let's say that East has a value of 2, Pickup has a value of 9, and all other actions have a value of 0.

Pickup is, therefore, the highest-valued action, and it is what our argmax function returns for this state. Depending on our hyperparameter values, we then have a high probability of choosing Pickup as our next action. Given that Pickup is valued well above all the other actions at this point, it is very likely the correct action to take.
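
Using the hypothetical values above for state 300 (actions ordered South, North, East, West, Pickup, Dropoff), the greedy choice falls straight out of NumPy's argmax:

    # Hypothetical Q-values for state 300.
    q_table[300] = [0, 0, 2, 0, 9, 0]

    greedy_action = np.argmax(q_table[300])
    print(greedy_action)   # 4, the index of Pickup, the highest-valued action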

Depending on our agent's policy (that is, the function it uses to choose actions based on states), however, it may or may not actually choose Pickup. If it is using an epsilon-greedy strategy, for example, it might choose a random action instead, which could turn out to be completely wrong.
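
A minimal sketch of an epsilon-greedy choice, continuing with the q_table above and using an illustrative epsilon value:

    import random

    def choose_action(state, epsilon=0.1):       # epsilon value is illustrative
        if random.random() < epsilon:
            return env.action_space.sample()     # explore: random action
        return np.argmax(q_table[state])         # exploit: highest-valued action so far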

This is important to bear in mind as we choose a decision-making strategy. We do not always want to choose the current highest-valued action, because there may be actions whose true value is higher but that we haven't tried often enough to discover yet. Deliberately trying out such actions is called exploration, and we'll discuss several methodologies for using it to find optimal reward paths.
