官术网_书友最值得收藏!

Policy iteration

Unlike value iteration, in policy iteration we start with the random policy, then we find the value function of that policy; if the value function is not optimal then we find the new improved policy. We repeat this process until we find the optimal policy.

There are two steps in policy iteration:

  1. Policy evaluation: Evaluating the value function of a randomly estimated policy.
  2. Policy improvement: Upon evaluating the value function, if it is not optimal, we find a new improved policy:

The steps involved in the policy iteration are as follows:

  1. First, we initialize some random policy
  2. Then we find the value function for that random policy and evaluate to check if it is optimal which is called policy evaluation
  3. If it is not optimal, we find a new improved policy, which is called policy improvement
  4. We repeat these steps until we find an optimal policy

Let us understand intuitively by performing policy iteration manually step by step.

Consider the same grid example we saw in the section Value iteration. Our goal is to find the optimal policy:

  1. Initialize a random policy function.

Let us initialize a random policy function by specifying random actions to each state:

say A -> 0

       B -> 1

       C -> 0

  1. Find the value function for the randomly initialized policy.

Now we have to find the value function using our randomly initialized policy. Let us say our value function after computation looks like the following:

 

Now that we have a new value function according to our randomly initialized policy, let us compute a new policy using our new value function. How do we do this? It is very similar to what we did in Value iteration. We calculate Q value for our new value function and then take actions for each state which has a maximum value as the new policy.

Let us say the new policy results in:

A - > 0

B - > 1

C -> 1

We check our old policy, that is, the randomly initialized policy, and the new policy. If they are same, then we have attained the convergence, that is, found the optimal policy. If not, we will update our old policy (random policy) as a new policy and repeat from step 2.

Sound confusing? Look at the pseudo code:

policy_iteration():
Initialize random policy
for i in no_of_iterations:
Q_value = value_function(random_policy)
new_policy = Maximum state action pair from Q value
if random_policy == new policy:
break
random_policy = new_policy
return policy
主站蜘蛛池模板: 淮南市| 通江县| 西安市| 中牟县| 惠来县| 滁州市| 女性| 墨竹工卡县| 牙克石市| 六盘水市| 水城县| 双峰县| 眉山市| 扎赉特旗| 扬州市| 瑞安市| 乌恰县| 南江县| 肥西县| 台江县| 莎车县| 贵溪市| 双辽市| 万荣县| 巴塘县| 五台县| 菏泽市| 梁河县| 祁连县| 顺义区| 呼伦贝尔市| 隆德县| 安龙县| 乌拉特中旗| 沾化县| 正定县| 高淳县| 九龙县| 江津市| 陆丰市| 南溪县|