- Python Reinforcement Learning
- Sudharsan Ravichandiran, Sean Saito, Rajalingappaa Shanmugamani, Yang Wenzhuo
Policy iteration
In policy iteration, we first initialize a random policy. Then we evaluate that randomly initialized policy: is it good or not? But how can we evaluate a policy? We evaluate it by computing its value function. If the policy is not good, we improve it and obtain a new policy. We repeat this process until we find a good policy.
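Before diving into the details, here is a rough sketch of how the pieces fit together; compute_value_function (policy evaluation) and extract_policy (policy improvement) are the helper functions we implement later in this section, so this loop only runs once they are defined:
policy = np.zeros(env.observation_space.n)            # start from an arbitrary (all-zeros) policy
while True:
    V = compute_value_function(policy, gamma=1.0)     # policy evaluation
    new_policy = extract_policy(V, gamma=1.0)         # policy improvement (greedy w.r.t. V)
    if np.all(new_policy == policy):                  # policy stopped changing: converged
        break
    policy = new_policy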
Now let us see how to solve the frozen lake problem using policy iteration.
Before looking at policy iteration, we will see how to compute a value function, given a policy.
We initialize value_table as an array of zeros, with one entry per state:
value_table = np.zeros(env.nS)
Then, for each state, we get the action from the policy, and we compute the value function according to that action and state as follows:
updated_value_table = np.copy(value_table)
for state in range(env.nS):
    action = policy[state]
    value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
                              for trans_prob, next_state, reward_prob, _ in env.P[state][action]])
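For a fixed policy, this update is the standard Bellman expectation backup, written here in common textbook notation (trans_prob, reward_prob, and next_state in the code correspond to P, R, and s'):
V^{\pi}(s) = \sum_{s'} P(s' \mid s, \pi(s)) \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right]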
We break out of the loop when the sum of absolute differences between value_table and updated_value_table falls below our threshold:
threshold = 1e-10
if np.sum(np.fabs(updated_value_table - value_table)) <= threshold:
    break
Look at the following complete function:
def compute_value_function(policy, gamma=1.0):
    value_table = np.zeros(env.nS)
    threshold = 1e-10
    while True:
        updated_value_table = np.copy(value_table)
        for state in range(env.nS):
            action = policy[state]
            value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
                                      for trans_prob, next_state, reward_prob, _ in env.P[state][action]])
        if np.sum(np.fabs(updated_value_table - value_table)) <= threshold:
            break
    return value_table
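As a quick, hypothetical sanity check (assuming env and np are already set up as in the complete program at the end of this section), we can evaluate the all-zeros policy and confirm that we get one value per state:
dummy_policy = np.zeros(env.nS)                       # always take action 0 in every state
V = compute_value_function(dummy_policy, gamma=1.0)
print(V.shape)                                        # (16,) for the 4x4 FrozenLake map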
Now we will see how to perform policy iteration, step by step.
First, we initialize random_policy as a NumPy array of zeros whose length is the number of states:
random_policy = np.zeros(env.observation_space.n)
Then, for each iteration, we calculate the new_value_function according to our random policy:
new_value_function = compute_value_function(random_policy, gamma)
We will extract the policy using the calculated new_value_function. The extract_policy function is the same as the one we used in value iteration:
new_policy = extract_policy(new_value_function, gamma)
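As a reminder, extract_policy computes Q-values from the value function and then acts greedily in each state; in standard notation:
Q(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right], \qquad \pi(s) = \arg\max_{a} Q(s, a)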
Then we check whether we have reached convergence, that is, whether we have found the optimal policy, by comparing random_policy with the new policy. If they are the same, we break out of the loop; otherwise we update random_policy with new_policy:
if np.all(random_policy == new_policy):
    print('Policy-Iteration converged at step %d.' % (i + 1))
    break
random_policy = new_policy
Look at the complete policy_iteration function:
def policy_iteration(env, gamma=1.0):
    random_policy = np.zeros(env.observation_space.n)
    no_of_iterations = 200000
    for i in range(no_of_iterations):
        new_value_function = compute_value_function(random_policy, gamma)
        new_policy = extract_policy(new_value_function, gamma)
        if np.all(random_policy == new_policy):
            print('Policy-Iteration converged at step %d.' % (i + 1))
            break
        random_policy = new_policy
    return new_policy
Thus, we can get optimal_policy using policy_iteration:
optimal_policy = policy_iteration(env, gamma = 1.0)
The output is optimal_policy, which gives the action to be performed in each state:
array([0., 3., 3., 3., 0., 0., 0., 0., 3., 1., 0., 0., 0., 2., 1., 0.])
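To make this easier to read, we can map each action index to its direction; this small helper is not part of the book's code, but in FrozenLake-v0 the actions are encoded as 0 = Left, 1 = Down, 2 = Right, 3 = Up:
actions = ['Left', 'Down', 'Right', 'Up']             # FrozenLake-v0 action encoding
print([actions[int(a)] for a in optimal_policy])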
The complete program is given as follows:
import gym
import numpy as np

env = gym.make('FrozenLake-v0')

def compute_value_function(policy, gamma=1.0):
    # Iteratively evaluate the given policy until the value table converges
    value_table = np.zeros(env.nS)
    threshold = 1e-10
    while True:
        updated_value_table = np.copy(value_table)
        for state in range(env.nS):
            action = policy[state]
            value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
                                      for trans_prob, next_state, reward_prob, _ in env.P[state][action]])
        if np.sum(np.fabs(updated_value_table - value_table)) <= threshold:
            break
    return value_table

def extract_policy(value_table, gamma=1.0):
    # Build Q-values from the value table and act greedily in each state
    policy = np.zeros(env.observation_space.n)
    for state in range(env.observation_space.n):
        Q_table = np.zeros(env.action_space.n)
        for action in range(env.action_space.n):
            for next_sr in env.P[state][action]:
                trans_prob, next_state, reward_prob, _ = next_sr
                Q_table[action] += (trans_prob * (reward_prob + gamma * value_table[next_state]))
        policy[state] = np.argmax(Q_table)
    return policy

def policy_iteration(env, gamma=1.0):
    # Alternate policy evaluation and policy improvement until the policy is stable
    random_policy = np.zeros(env.observation_space.n)
    no_of_iterations = 200000
    for i in range(no_of_iterations):
        new_value_function = compute_value_function(random_policy, gamma)
        new_policy = extract_policy(new_value_function, gamma)
        if np.all(random_policy == new_policy):
            print('Policy-Iteration converged at step %d.' % (i + 1))
            break
        random_policy = new_policy
    return new_policy

print(policy_iteration(env))
Thus, by solving the frozen lake problem with value and policy iteration, we can derive the optimal policy, which specifies what action to perform in each state.