
The Q-learning approach to reinforcement learning

Q-learning attempts to learn the value, Q(s,a), of taking a specific action, a, when the agent is in a particular state, s. Consider a table in which the rows correspond to the states and the columns correspond to the actions. This is called a Q-table. Thus, we have to learn these values to find which action is best for the agent in a given state.

Steps involved in Q-learning:

  1. Initialize the table of Q(s,a) with uniform values (say, all zeros).

  2. Observe the current state, s

  3. Choose an action, a, using an epsilon-greedy (or any other) action selection policy, and take that action

  4. As a result, a reward, r, is received and a new state, s', is perceived

  5. Update the Q value of the (s,a) pair in the table by using the following Bellman equation:

     Q(s,a) ← r + γ max_a' Q(s',a'), where γ is the discount factor

  6. Set the new state, s', as the current state and repeat the process until the terminal state is reached, which completes one episode

  7. Run multiple episodes to train the agent, as sketched in the code that follows this list
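The following is a minimal sketch of these steps in Python, using a toy, made-up chain environment (the ChainEnv class, the state and action sizes, and the hyperparameter values are all illustrative assumptions, not taken from the text):

import numpy as np

class ChainEnv:
    """Toy deterministic environment: a short 1-D chain of states.
    The agent starts in state 0; action 1 moves right, action 0 moves left.
    Reaching the last state ends the episode with a reward of 1."""
    def __init__(self, n_states=6):
        self.n_states, self.n_actions = n_states, 2
        self.s = 0
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + 1, self.n_states - 1) if a == 1 else max(self.s - 1, 0)
        done = self.s == self.n_states - 1
        return self.s, (1.0 if done else 0.0), done

env = ChainEnv()
gamma, epsilon, n_episodes = 0.9, 0.1, 500        # discount factor, exploration rate, episodes

# Step 1: initialize the Q-table with uniform values (all zeros)
Q = np.zeros((env.n_states, env.n_actions))

for episode in range(n_episodes):                 # step 7: run multiple episodes
    s = env.reset()                               # step 2: observe the current state
    done = False
    while not done:
        # Step 3: epsilon-greedy action selection (ties in the untrained table
        # are broken at random so the agent still explores)
        if np.random.rand() < epsilon:
            a = np.random.randint(env.n_actions)
        else:
            best = np.flatnonzero(Q[s] == Q[s].max())
            a = int(np.random.choice(best))
        s_next, r, done = env.step(a)             # step 4: receive reward and new state
        Q[s, a] = r + gamma * np.max(Q[s_next])   # step 5: Bellman update from the text
        s = s_next                                # step 6: the new state becomes current

print(np.round(Q, 2))                             # learned Q-values after training

Because the toy environment is deterministic, the step 5 update is applied directly; in stochastic environments, a learning rate is usually introduced so that each update only moves Q(s,a) part of the way toward r + γ max_a' Q(s',a').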

To simplify, we can say that the Q-value for a given state, s, and action, a, is updated with the sum of the current reward, r, and the discounted (by γ) maximum Q-value of the new state over all of its actions. The discount factor weights rewards received in the future less than rewards received now. For example, a reward of 100 received today is worth more than a reward of 100 received at some point in the future; conversely, a promised reward of 100 in the future is worth less than 100 today. Therefore, we discount future rewards. Repeating this update process continuously results in the Q-table values converging to accurate estimates of the expected future reward for a given action in a given state.
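As a concrete illustration, with a discount factor of γ = 0.9 (an arbitrary value chosen only for this example), a reward of 100 received one step in the future contributes 0.9 × 100 = 90 to the value of the current state, and the same reward two steps ahead contributes 0.9² × 100 = 81, so more distant rewards count for progressively less.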

When the state and action spaces become large, maintaining a Q-table is difficult. In real-world problems, the state space can be practically infinite. Thus, we need another approach that can produce Q(s,a) without a Q-table. One solution is to replace the Q-table with a function that takes the state as input, in the form of a vector, and outputs a vector of Q-values, one for each action available in that state. This function approximator can be represented by a neural network that predicts the Q-values. Thus, when the state and action spaces grow large, we can add more layers and fit a deep neural network for better prediction of Q-values, which would be impossible with a Q-table. This gives rise to the Q-network, and if a deeper neural network, such as a convolutional neural network, is used, it results in a deep Q-network (DQN).
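As an illustrative sketch of that idea (not the book's implementation), the following NumPy snippet defines a tiny two-layer network that maps a state vector to a vector of Q-values, one per action; the layer sizes are arbitrary, and the randomly initialized weights merely stand in for parameters that training, covered in Chapter 5, would actually learn:

import numpy as np

state_dim, hidden_dim, n_actions = 8, 32, 4      # illustrative sizes

rng = np.random.default_rng(0)
# Randomly initialized weights stand in for parameters that training would learn.
W1, b1 = rng.normal(scale=0.1, size=(state_dim, hidden_dim)), np.zeros(hidden_dim)
W2, b2 = rng.normal(scale=0.1, size=(hidden_dim, n_actions)), np.zeros(n_actions)

def q_network(state):
    """Map a state vector to a vector of Q-values, one per action."""
    hidden = np.maximum(0.0, state @ W1 + b1)    # ReLU hidden layer
    return hidden @ W2 + b2                      # linear output: one Q-value per action

state = rng.normal(size=state_dim)               # a stand-in state vector
q_values = q_network(state)
action = int(np.argmax(q_values))                # greedy action from the predicted Q-values

The Q-table lookup Q[s] is thus replaced by a function call, q_network(state), which works even when the states are continuous vectors that could never be enumerated as table rows.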

More details on Q-learning and deep Q-networks will be covered in Chapter 5, Q-Learning and Deep Q-Networks.
