
Q-learning for FrozenLake

The whole example is in the Chapter05/02_frozenlake_q_learning.py file, and the difference from the previous version is really minor. The most obvious change is to our value table. In the previous example, we kept the value of the state, so the key in the dictionary was just a state. Now we need to store values of the Q-function, which has two parameters, state and action, so the key in the value table is now a composite (state, action) pair.
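To make this concrete, here is a minimal sketch of the two table layouts (the variable names are illustrative, not the ones from the example):

    import collections

    # Previous example: the table maps a state to its value, V(s)
    v_values = collections.defaultdict(float)
    v_values[0] = 0.35                 # value of state 0

    # This example: the table maps a (state, action) pair to its value, Q(s, a)
    q_values = collections.defaultdict(float)
    q_values[(0, 2)] = 0.35            # value of action 2 taken in state 0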

The second difference is in our calc_action_value function: we simply don't need it anymore, as our action values are already stored in the value table. Finally, the most important change in the code is in the agent's value_iteration method. Before, it was just a wrapper around the calc_action_value call, which did the job of the Bellman approximation. Now that this function is gone and has been replaced by the value table, we need to do this approximation inside the value_iteration method.

Let's look at the code. As it's almost the same, I'll jump directly to the most interesting value_iteration function:

    def value_iteration(self):
        # One sweep of the Bellman update over every (state, action) pair
        for state in range(self.env.observation_space.n):
            for action in range(self.env.action_space.n):
                action_value = 0.0
                target_counts = self.transits[(state, action)]
                total = sum(target_counts.values())
                for tgt_state, count in target_counts.items():
                    reward = self.rewards[(state, action, tgt_state)]
                    # The value of the target state is the Q-value of its best action
                    best_action = self.select_action(tgt_state)
                    action_value += (count / total) * (reward + GAMMA * self.values[(tgt_state, best_action)])
                self.values[(state, action)] = action_value

The code is very similar to calc_action_value in the previous example and, in fact, it does almost the same thing. For the given state and action, it calculates the value of this action using statistics about the target states that we've reached with that action. To calculate this value, we use the Bellman equation and our counters, which allow us to approximate the probability of each target state. However, the Bellman equation needs the value of the target state, and now we have to obtain it differently. Before, we had it stored in the value table (as we approximated the values of states), so we just took it from that table. We can't do this anymore, so we have to call the select_action method, which chooses the action with the largest Q-value, and then we take that Q-value as the value of the target state. Of course, we could implement another function to compute this state value, but select_action does almost everything we need, so we reuse it here.
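If we did want such a separate helper, it could look like the following sketch (this state_value function is not part of the example; it only makes explicit that the value of a state is the Q-value of its best action):

    def state_value(self, state):
        # The value of a state under a greedy policy is the largest
        # Q-value over all actions available in that state
        return max(self.values[(state, action)]
                   for action in range(self.env.action_space.n))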

There is another piece of this example that I'd like to emphasize here. Let's look at our select_action method:

    def select_action(self, state):
        # Greedily pick the action with the largest Q-value in the given state
        best_action, best_value = None, None
        for action in range(self.env.action_space.n):
            action_value = self.values[(state, action)]
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        return best_action

As I said, we don't have the calc_action_value method anymore, so, to select an action, we just iterate over the actions and look up their values in our values table. This might look like a minor improvement, but if you think about the data that we used in calc_action_value, it becomes obvious why learning the Q-function is much more popular in RL than learning the V-function.

Our calc_action_value function uses information about both the reward and the transition probabilities. This isn't a huge problem for the value iteration method, which relies on this information during training. However, in the next chapter, we'll learn about an extension of the value iteration method that doesn't require a probability approximation, but takes it directly from environment samples. For such methods, this dependency on probabilities adds an extra burden on the agent. In the case of Q-learning, all the agent needs to make a decision is the Q-values.
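To make the contrast concrete, here is a rough sketch of greedy action selection in both flavors. Neither function is part of the book's code: select_action_v follows the logic of the previous example (assuming a hypothetical v_values table keyed by state), while select_action_q is just a compact version of the select_action method shown above.

    def select_action_v(self, state):
        # With a V-function we still need the model: reward and transition
        # statistics are required to evaluate every action
        def action_value(action):
            counts = self.transits[(state, action)]
            total = sum(counts.values())
            return sum((c / total) * (self.rewards[(state, action, tgt)] +
                                      GAMMA * self.v_values[tgt])
                       for tgt, c in counts.items())
        return max(range(self.env.action_space.n), key=action_value)

    def select_action_q(self, state):
        # With a Q-function, a table lookup is all we need
        return max(range(self.env.action_space.n),
                   key=lambda action: self.values[(state, action)])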

I don't want to say that V-functions are completely useless, because they are an essential part of the actor-critic method, which we'll talk about in part three of this book. However, in the area of value learning, the Q-function is the definite favorite. With regard to convergence speed, both our versions are almost identical, but the Q-learning version requires four times more memory for the value table: FrozenLake has 16 states and four actions, so we store 16 × 4 = 64 values instead of 16.

rl_book_samples/Chapter05$ ./02_frozenlake_q_learning.py
[2017-10-13 12:38:56,658] Making new env: FrozenLake-v0
[2017-10-13 12:38:56,863] Making new env: FrozenLake-v0
Best reward updated 0.000 -> 0.050
Best reward updated 0.050 -> 0.200
Best reward updated 0.200 -> 0.350
Best reward updated 0.350 -> 0.700
Best reward updated 0.700 -> 0.750
Best reward updated 0.750 -> 0.850
Solved in 22 iterations!