
The Q-learning approach to reinforcement learning

Q-learning is an attempt to learn the value Q(s,a) of taking a specific action, a, in a particular state, s. Consider a table in which the rows represent the states and the columns represent the actions. This is called a Q-table. We have to learn these values in order to find which action is best for the agent in a given state.
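For instance, a Q-table can be stored as a two-dimensional array with one row per state and one column per action (the sizes below are arbitrary, chosen only for illustration):

import numpy as np

n_states, n_actions = 16, 4                 # arbitrary sizes for illustration
q_table = np.zeros((n_states, n_actions))   # rows = states, columns = actions

state = 3                                   # an example state index
best_action = np.argmax(q_table[state])    # greedy: best known action in s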

Steps involved in Q-learning:

  1. Initialize the table of Q(s,a) with uniform values (say, all zeros).

  2. Observe the current state, s.

  3. Choose an action, a, using an epsilon-greedy or any other action-selection policy, and take the action.

  4. As a result, a reward, r, is received and a new state, s', is perceived.

  5. Update the Q value of the (s,a) pair in the table by using the following Bellman equation:

Q(s,a) = r + γ max_a' Q(s',a'), where γ is the discount factor

  6. Then, set the new state as the current state and repeat the process until the episode is complete, that is, until the terminal state is reached.

  7. Run multiple episodes to train the agent (a complete sketch of this loop follows below).
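Putting these steps together, here is a minimal, self-contained sketch of tabular Q-learning in Python. The tiny chain environment and all hyperparameter values are invented for illustration; the update in the inner loop is the equation from step 5:

import numpy as np

# A tiny deterministic "chain" environment, made up here so the sketch is
# self-contained: states 0..4 on a line, action 0 = left, action 1 = right;
# reaching state 4 yields a reward of 1 and ends the episode.
N_STATES, N_ACTIONS = 5, 2

def env_reset():
    return 0                                          # always start in state 0

def env_step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = (next_state == N_STATES - 1)
    return next_state, (1.0 if done else 0.0), done

gamma, epsilon = 0.9, 0.1                             # assumed hyperparameters
q_table = np.zeros((N_STATES, N_ACTIONS))             # step 1: all zeros

for episode in range(500):                            # step 7: many episodes
    state, done = env_reset(), False                  # step 2
    for _ in range(100):                              # cap the episode length
        if np.random.rand() < epsilon:                # step 3: epsilon-greedy
            action = np.random.randint(N_ACTIONS)
        else:                                         # greedy, random tie-break
            row = q_table[state]
            action = np.random.choice(np.flatnonzero(row == row.max()))
        next_state, reward, done = env_step(state, action)   # step 4
        # step 5: Bellman update, matching the equation above
        q_table[state, action] = reward + gamma * np.max(q_table[next_state])
        state = next_state                            # step 6
        if done:
            break

print(np.round(q_table, 2))

After training, reading the table row by row with argmax yields the greedy policy, which in this toy case is to always move right.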

To simplify, we can say that the Q-value for a given state, s, and action, a, is updated with the sum of the current reward, r, and the discounted (γ) maximum Q-value for the new state over all of its actions. The discount factor makes rewards received in the future worth less than rewards received now; for example, a reward of 100 today is worth more than a reward of 100 received at some point in the future, which is why we discount future rewards. Repeating this update process continuously makes the Q-table values converge to accurate estimates of the expected future reward for a given action in a given state.
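As a quick worked example of this update, with invented numbers:

gamma = 0.9                           # discount factor
reward = 1.0                          # current reward r
max_q_next = 0.5                      # max over a' of Q(s', a')
new_q = reward + gamma * max_q_next   # 1.0 + 0.9 * 0.5 = 1.45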

As the state and action spaces grow, maintaining a Q-table becomes difficult; in the real world, state spaces can be practically infinite. Thus, we need another approach that can produce Q(s,a) without a Q-table. One solution is to replace the Q-table with a function that takes the state as an input vector and outputs a vector of Q-values, one for each action in the given state. This function approximator can be represented by a neural network that predicts the Q-values. We can then add more layers and fit a deep neural network for better prediction of Q-values when the state and action spaces become large, which would be impossible with a Q-table. This gives rise to the Q-network and, if a deeper neural network such as a convolutional neural network is used, it results in a deep Q-network (DQN).
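As a minimal sketch of this idea, the following Keras model maps a state vector to one Q-value per action, replacing the table lookup; the state dimension, action count, and layer sizes are assumptions made purely for illustration:

import numpy as np
import tensorflow as tf

state_dim, n_actions = 4, 2           # assumed sizes, for illustration only

# Q-network: takes a state vector, outputs one Q-value per action.
q_network = tf.keras.Sequential([
    tf.keras.Input(shape=(state_dim,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(n_actions),  # linear outputs: Q(s, a) for each a
])

state = np.random.rand(1, state_dim).astype(np.float32)  # a dummy state
q_values = q_network(state).numpy()                      # shape: (1, n_actions)
best_action = int(np.argmax(q_values[0]))                # greedy action choice

The output layer is linear because Q-values are unbounded regression targets rather than probabilities.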

More details on Q-learning and deep Q-networks will be covered in Chapter 5, Q-Learning and Deep Q-Networks.
