官术网_书友最值得收藏!

  • Python Reinforcement Learning
  • Sudharsan Ravichandiran Sean Saito Rajalingappaa Shanmugamani Yang Wenzhuo
  • 637字
  • 2021-06-24 15:17:34

Policy iteration

In policy iteration, first we initialize a random policy. Then we will evaluate the random policies we initialized: are they good or not? But how can we evaluate the policies? We will evaluate our randomly initialized policies by computing value functions for them. If they are not good, then we find a new policy. We repeat this process until we find a good policy.

Now let us see how to solve the frozen lake problem using policy iteration.

Before looking at policy iteration, we will see how to compute a value function, given a policy. 

We initialize value_table as zero with the number of states:

value_table = np.zeros(env.nS)

Then, for each state, we get the action from the policy, and we compute the value function according to that action and state as follows:

        updated_value_table = np.copy(value_table)
for state in range(env.nS):
action = policy[state]
value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
for trans_prob, next_state, reward_prob, _ in env.P[state][action]])

We break this when the difference between value_table and updated_value_table is less than our threshold:

threshold = 1e-10
if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):
break

Look at the following complete function:

def compute_value_function(policy, gamma=1.0):
value_table = np.zeros(env.nS)
threshold = 1e-10
while True:
updated_value_table = np.copy(value_table)
for state in range(env.nS):
action = policy[state]
value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
for trans_prob, next_state, reward_prob, _ in env.P[state][action]])
if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):
break
return value_table

Now we will see how to perform policy iteration, step by step. 

First, we initialize random_policy as zero NumPy array with shape as number of states:

 random_policy = np.zeros(env.observation_space.n)

Then, for each iteration, we calculate the new_value_function according to our random policy:

new_value_function = compute_value_function(random_policy, gamma)

We will extract the policy using the calculated new_value_function. The extract_policy function is the same as the one we used in value iteration:

 new_policy = extract_policy(new_value_function, gamma)

Then we check whether we have reached convergence, that is, whether we found the optimal policy by comparing random_policy and the new policy. If they are the same, we will break the iteration; otherwise we update random_policy with new_policy:

if (np.all(random_policy == new_policy)):
print ('Policy-Iteration converged at step %d.' %(i+1))
break
random_policy = new_policy

Look at the complete policy_iteration function:

def policy_iteration(env,gamma = 1.0):
random_policy = np.zeros(env.observation_space.n)
no_of_iterations = 200000
gamma = 1.0
for i in range(no_of_iterations):
new_value_function = compute_value_function(random_policy, gamma)
new_policy = extract_policy(new_value_function, gamma)
if (np.all(random_policy == new_policy)):
print ('Policy-Iteration converged at step %d.' %(i+1))
break
random_policy = new_policy
return new_policy

Thus, we can get optimal_policy using policy_iteration

optimal_policy = policy_iteration(env, gamma = 1.0)

We will get some output, which is the optimal_policy, the actions to be performed in each state:

array([0., 3., 3., 3., 0., 0., 0., 0., 3., 1., 0., 0., 0., 2., 1., 0.])

The complete program is given as follows:

import gym
import numpy as np

env = gym.make('FrozenLake-v0')

def compute_value_function(policy, gamma=1.0):
value_table = np.zeros(env.nS)
threshold = 1e-10
while True:
updated_value_table = np.copy(value_table)
for state in range(env.nS):
action = policy[state]
value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
for trans_prob, next_state, reward_prob, _ in env.P[state][action]])
if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):
break
return value_table


def extract_policy(value_table, gamma = 1.0):
policy = np.zeros(env.observation_space.n)
for state in range(env.observation_space.n):
Q_table = np.zeros(env.action_space.n)
for action in range(env.action_space.n):
for next_sr in env.P[state][action]:
trans_prob, next_state, reward_prob, _ = next_sr
Q_table[action] += (trans_prob * (reward_prob + gamma * value_table[next_state]))
policy[state] = np.argmax(Q_table)

return policy

def policy_iteration(env,gamma = 1.0):
random_policy = np.zeros(env.observation_space.n)
no_of_iterations = 200000
gamma = 1.0
for i in range(no_of_iterations):
new_value_function = compute_value_function(random_policy, gamma)
new_policy = extract_policy(new_value_function, gamma)
if (np.all(random_policy == new_policy)):
print ('Policy-Iteration converged at step %d.' %(i+1))
break
random_policy = new_policy
return new_policy


print (policy_iteration(env))

Thus, we can derive the optimal policy, which specifies what action to perform in each state, using value and policy iteration to solve the frozen lake problem.

主站蜘蛛池模板: 安化县| 瓦房店市| 邯郸市| 厦门市| 偃师市| 拜泉县| 白城市| 凉城县| 礼泉县| 杭州市| 股票| 大竹县| 阳朔县| 科尔| 洞头县| 伊宁县| 长岛县| 凯里市| 台江县| 泊头市| 措美县| 娄烦县| 新泰市| 博白县| 公安县| 雅江县| 清涧县| 岚皋县| 潼关县| 囊谦县| 香港| 和政县| 呼伦贝尔市| 社旗县| 通海县| 桑日县| 昆明市| 宜阳县| 宁都县| 朝阳县| 社会|