
Q-learning

Q-learning is an off-policy temporal-difference reinforcement learning algorithm. What a mouthful! But fear not, let's not worry about what all of this means yet, and instead just see how the algorithm works. To do this, we'll use the game of chess we introduced in the previous section. As a reminder, the board configuration (the locations of the pieces) is the current state of the environment. Here, the agent can take actions, a, by moving pieces, thus changing the state into a new one. We'll represent a game of chess as a graph where the different board configurations are the graph's vertices, and the possible moves from each configuration are the edges. To make a move, the agent follows the edge from the current state, s, to a new state, s'. The basic Q-learning algorithm uses a Q-table to help the agent decide which moves to make. The Q-table contains one row for each board configuration, while its columns are all the possible actions that the agent can take (the moves). A table cell, q(s, a), contains the cumulative expected reward, called the Q-value. This is the potential total reward that the agent will receive for the remainder of the game if it takes an action, a, from its current state, s. At the beginning, the Q-table is initialized with arbitrary values. With that knowledge, let's see how Q-learning works:
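To make the table concrete, here is a minimal sketch of how it could be stored with NumPy, assuming we have already enumerated the board configurations and the moves; the sizes and indices below are hypothetical and only for illustration:

import numpy as np

# Hypothetical sizes: one row per enumerated board configuration,
# one column per enumerated move.
n_states = 1_000
n_actions = 200

# Every cell q(s, a) starts from the same arbitrary value (here, zero).
q_table = np.zeros((n_states, n_actions))

s, a = 42, 7          # hypothetical state and action indices
print(q_table[s, a])  # the current estimate of q(s, a)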

Initialize the Q-table with some arbitrary values
for each episode:
    Observe the initial state s
    for each step of the episode:
        Select a new action a using a policy based on the Q-table
        Observe the reward r and go to the new state s'
        Update q(s, a) in the Q-table using the Bellman equation
    until we reach a terminal state for the episode

An episode starts with a random initial state and finishes when we reach the terminal state. In our case, one episode would be one full game of chess.
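As an illustration only (chess itself has far too many states for a table), the following sketch runs the same loop on a tiny made-up environment: a corridor of six cells where the agent starts at one end and receives a reward of 1 for reaching the other. The environment, the epsilon-greedy policy, and the hyperparameter values are all assumptions made for the sake of a runnable example, not something prescribed by the algorithm above:

import numpy as np

# A made-up toy environment: a corridor of six cells. The agent starts in
# cell 0 and the episode ends when it reaches cell 5, which yields reward 1.
N_STATES, N_ACTIONS = 6, 2      # actions: 0 = move left, 1 = move right
TERMINAL = 5

def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = max(0, state - 1) if action == 0 else min(TERMINAL, state + 1)
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward, next_state == TERMINAL

# Hypothetical hyperparameters; the algorithm itself doesn't prescribe values.
ALPHA, GAMMA, EPSILON, EPISODES = 0.1, 0.9, 0.1, 500

q_table = np.zeros((N_STATES, N_ACTIONS))    # arbitrary initial Q-values
rng = np.random.default_rng(0)

for _ in range(EPISODES):                    # each episode
    s, done = 0, False                       # observe the initial state
    while not done:                          # each step of the episode
        # Policy based on the Q-table: mostly exploit, sometimes explore.
        if rng.random() < EPSILON:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(np.argmax(q_table[s]))
        s_next, r, done = step(s, a)         # observe the reward, go to the new state
        # Bellman-based update of q(s, a) (discussed in later chapters).
        target = r + GAMMA * np.max(q_table[s_next]) * (not done)
        q_table[s, a] += ALPHA * (target - q_table[s, a])
        s = s_next

print(np.argmax(q_table, axis=1))            # the greedy action in each state

After training, the greedy action in every non-terminal cell should point right, toward the reward.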

The question that arises is this: how does the agent's policy determine the next action? The policy takes into account the Q-values of all the possible actions from the current state: the higher the Q-value, the more attractive the action. Following the Q-table in this way exploits the existing knowledge. However, the policy will sometimes ignore the Q-table and choose a random action instead, in the hope of discovering higher potential rewards (exploration). In the beginning, the agent will effectively act at random, because the Q-table doesn't yet contain much useful information. As time progresses and the Q-table is gradually filled in, the agent will make more informed decisions when interacting with the environment.
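In code, this explore-or-exploit decision is often isolated into a small helper; the epsilon-greedy rule sketched below is one common choice, and both the rule and the value of epsilon are assumptions rather than something the text prescribes:

import numpy as np

def select_action(q_row, epsilon):
    """Choose an action for one state, given its row of the Q-table:
    explore with probability epsilon, otherwise exploit the action
    with the highest current Q-value."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_row))   # exploration: random action
    return int(np.argmax(q_row))               # exploitation: best known action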

We update q(s, a) after each new action by using the Bellman equation. The Bellman equation is beyond the scope of this introduction, and we'll discuss it in detail in later chapters. For now, it's enough to know that the updated value, q(s, a), is based on the newly received reward, r, as well as on the maximum possible Q-value, q*(s', a'), of the new state, s'.
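Without going into the derivation, the standard tabular form of this update can be sketched as follows; the learning rate alpha and discount factor gamma are hyperparameters that the text hasn't introduced yet, and their values here are assumed:

def q_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Move q(s, a) toward the observed reward r plus the discounted
    best Q-value of the new state s_next, that is, q*(s', a').
    q_table is a NumPy array like the one sketched earlier."""
    best_next = q_table[s_next].max()   # max over a' of q(s', a')
    q_table[s, a] += alpha * (r + gamma * best_next - q_table[s, a])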

This example was intended to help you understand the basic workings of Q-learning, but you might have noticed an issue with it: the Q-table needs a cell for every combination of board configuration and move, which would make it huge and impossible to fit in today's computer memory. Fortunately, there is a solution: we can replace the Q-table with a neural network that tells the agent what the optimal action is in each state. In recent years, this development has allowed reinforcement learning algorithms to achieve superhuman performance on tasks such as the game of Go, Dota 2, and Doom. In this book, we'll discuss how to apply Q-learning and other RL algorithms to some of these tasks.
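As a very rough sketch of that idea (the real systems behind Go, Dota 2, and Doom are far more involved), a small neural network can map an encoded state to one Q-value per action, taking the place of a row lookup in the table. PyTorch, the layer sizes, and the state and action encodings below are all illustrative assumptions:

import torch
import torch.nn as nn

STATE_SIZE = 64 * 12   # hypothetical board encoding: 64 squares x 12 piece types
N_ACTIONS = 4096       # hypothetical fixed enumeration of possible moves

# The network plays the role of the Q-table: given a state, it outputs an
# estimated Q-value for every action at once, without storing a giant table.
q_network = nn.Sequential(
    nn.Linear(STATE_SIZE, 256),
    nn.ReLU(),
    nn.Linear(256, N_ACTIONS),
)

state = torch.zeros(1, STATE_SIZE)            # a batch of one encoded board
q_values = q_network(state)                   # shape: (1, N_ACTIONS)
best_action = q_values.argmax(dim=1).item()   # the greedy move in this state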
