Policy iteration

Unlike value iteration, in policy iteration we start with a random policy, then we find the value function of that policy; if the value function is not optimal, we find a new, improved policy. We repeat this process until we find the optimal policy.

There are two steps in policy iteration:

  1. Policy evaluation: evaluating the value function of a randomly estimated policy.
  2. Policy improvement: upon evaluating the value function, if it is not optimal, we find a new, improved policy. Both steps are spelled out as formulas right after this list.
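
In standard notation (a sketch of the usual equations, not necessarily the book's exact rendering), with P(s' | s, a) the transition probabilities, R(s, a, s') the rewards, and \gamma the discount factor, policy evaluation solves the Bellman expectation equation for the current policy \pi:

V^{\pi}(s) = \sum_{s'} P(s' \mid s, \pi(s)) \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right]

and policy improvement then acts greedily with respect to those values:

\pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{\pi}(s') \right]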

The steps involved in policy iteration are as follows:

  1. First, we initialize a random policy.
  2. Then we find the value function for that random policy and check whether it is optimal; this is called policy evaluation.
  3. If it is not optimal, we find a new, improved policy; this is called policy improvement.
  4. We repeat these steps until we find an optimal policy.

Let us build intuition by performing policy iteration manually, step by step.

Consider the same grid example we saw in the Value iteration section. Our goal is to find the optimal policy:

  1. Initialize a random policy function.

Let us initialize a random policy function by assigning a random action to each state, say:

A -> 0
B -> 1
C -> 0
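
A minimal Python sketch of this initialization, assuming the states are labelled 'A', 'B', and 'C' and the two available actions are encoded as the integers 0 and 1 (labels chosen purely for illustration):

import random

states = ['A', 'B', 'C']
actions = [0, 1]

# Build the initial policy by assigning a random action to every state
random_policy = {state: random.choice(actions) for state in states}
print(random_policy)   # for example: {'A': 0, 'B': 1, 'C': 0}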

  2. Find the value function for the randomly initialized policy.

Now we have to find the value function using our randomly initialized policy. Suppose that, after this computation, we have a value estimate for each of the states A, B, and C.
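
A minimal sketch of this policy-evaluation step, assuming the environment's model is available as a dictionary P[state][action] of (probability, next_state, reward) tuples; the transitions and the discount factor below are made up purely for illustration and are not the book's grid example:

gamma = 0.9  # discount factor (an assumption for this sketch)

# Illustrative transition model: P[s][a] -> list of (prob, next_state, reward)
P = {
    'A': {0: [(1.0, 'B', 0.0)], 1: [(1.0, 'A', 0.0)]},
    'B': {0: [(1.0, 'A', 0.0)], 1: [(1.0, 'C', 1.0)]},
    'C': {0: [(1.0, 'B', 0.0)], 1: [(1.0, 'C', 0.0)]},
}

def evaluate_policy(policy, theta=1e-6):
    """Iteratively apply the Bellman expectation equation until the value
    estimates change by less than theta."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Back up the value of the action chosen by the current policy
            v = sum(prob * (reward + gamma * V[next_s])
                    for prob, next_s, reward in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    return V

V = evaluate_policy({'A': 0, 'B': 1, 'C': 0})   # the random policy from above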

Now that we have a new value function according to our randomly initialized policy, let us compute a new policy from it. How do we do this? It is very similar to what we did in value iteration: we calculate the Q values using the new value function, and then, for each state, we pick the action with the maximum Q value as the new policy.

Let us say the new policy results in:

A -> 0
B -> 1
C -> 1
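
A minimal sketch of this policy-improvement step, written as a standalone function that takes the evaluated value function V and a transition model P in the same (probability, next_state, reward) form used in the evaluation sketch above:

def improve_policy(V, P, gamma=0.9):
    """Greedy policy improvement: compute Q(s, a) from the current value
    function and pick, in each state, the action with the highest Q value."""
    new_policy = {}
    for s, action_transitions in P.items():
        q_values = {
            a: sum(prob * (reward + gamma * V[next_s])
                   for prob, next_s, reward in transitions)
            for a, transitions in action_transitions.items()
        }
        new_policy[s] = max(q_values, key=q_values.get)
    return new_policy

Calling improve_policy(V, P) with the value function from the evaluation step yields the new policy, which we then compare with the old one as described next.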

We then compare our old policy, that is, the randomly initialized policy, with the new policy. If they are the same, we have attained convergence, that is, we have found the optimal policy. If not, we update our old policy (the random policy) to the new policy and repeat from step 2.

Sound confusing? Look at the pseudo code:

policy_iteration():
    Initialize random_policy
    for i in no_of_iterations:
        # Policy evaluation: compute the Q values under the current policy
        Q_value = value_function(random_policy)
        # Policy improvement: take the best action in each state
        new_policy = Maximum state action pair from Q_value
        # Convergence: stop once the policy no longer changes
        if random_policy == new_policy:
            break
        random_policy = new_policy
    return new_policy
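
To see the whole loop run end to end, here is a sketch of policy iteration on the FrozenLake environment from OpenAI Gym. FrozenLake and its env.P transition dictionary are assumptions of this sketch (the pseudocode above does not fix any particular environment), and depending on your Gym version you may need 'FrozenLake-v0' instead of 'FrozenLake-v1':

import numpy as np
import gym

# .unwrapped exposes the transition model env.P[state][action], a list of
# (probability, next_state, reward, done) tuples
env = gym.make('FrozenLake-v1').unwrapped
n_states = env.observation_space.n
n_actions = env.action_space.n
gamma = 1.0  # undiscounted, since every FrozenLake episode terminates

def evaluate_policy(policy, theta=1e-10):
    # Policy evaluation: sweep the Bellman expectation backup until convergence
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = sum(prob * (reward + gamma * V[next_s])
                    for prob, next_s, reward, _ in env.P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    return V

def improve_policy(V):
    # Policy improvement: act greedily with respect to the evaluated values
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q_values = [sum(prob * (reward + gamma * V[next_s])
                        for prob, next_s, reward, _ in env.P[s][a])
                    for a in range(n_actions)]
        policy[s] = int(np.argmax(q_values))
    return policy

def policy_iteration(no_of_iterations=1000):
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial policy
    for i in range(no_of_iterations):
        V = evaluate_policy(policy)
        new_policy = improve_policy(V)
        if np.array_equal(policy, new_policy):
            print('Policy converged after iteration', i + 1)
            break
        policy = new_policy
    return policy

print(policy_iteration())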