PyTorch 1.x Reinforcement Learning Cookbook
Yuxi (Hayden) Liu
There's more...
To take a closer look, we also plot the policy values over the whole evaluation process.
To do so, we create a variant of the policy_evaluation function, policy_evaluation_history, that records the value estimate at every iteration:
>>> def policy_evaluation_history(
...         policy, trans_matrix, rewards, gamma, threshold):
...     n_state = policy.shape[0]
...     V = torch.zeros(n_state)
...     V_his = [V]
...     while True:
...         V_temp = torch.zeros(n_state)
...         for state, actions in enumerate(policy):
...             for action, action_prob in enumerate(actions):
...                 # Bellman expectation update under the given policy
...                 V_temp[state] += action_prob * (rewards[state]
...                     + gamma * torch.dot(trans_matrix[state, action], V))
...         max_delta = torch.max(torch.abs(V - V_temp))
...         V = V_temp.clone()
...         V_his.append(V)
...         if max_delta <= threshold:
...             break
...     return V, V_his
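If you are running this section on its own, the variables passed in below need to be defined first. The following is a minimal sketch with illustrative values (a three-state, two-action MDP); the actual T, R, and policy_optimal come from the earlier recipes in this chapter:
>>> import torch
>>> # Transition matrix of shape (n_state, n_action, n_state);
>>> # the probabilities here are illustrative assumptions
>>> T = torch.tensor([[[0.8, 0.1, 0.1],
...                    [0.1, 0.6, 0.3]],
...                   [[0.7, 0.2, 0.1],
...                    [0.1, 0.8, 0.1]],
...                   [[0.6, 0.2, 0.2],
...                    [0.1, 0.4, 0.5]]])
>>> R = torch.tensor([1., 0., -1.])  # reward received in each state
>>> gamma = 0.5                      # discount factor
>>> threshold = 0.0001               # convergence threshold
>>> # A deterministic policy: one row of action probabilities per state
>>> policy_optimal = torch.tensor([[1.0, 0.0],
...                                [1.0, 0.0],
...                                [1.0, 0.0]])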
Now we feed the policy_evaluation_history function with the optimal policy, a discount factor of 0.5, and other variables:
>>> V, V_history = policy_evaluation_history(
...     policy_optimal, T, R, gamma, threshold)
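Since V_his starts with the initial all-zero value vector, the number of evaluation sweeps performed is one less than the length of the returned history. As a quick check:
>>> print('Converged after {} iterations'.format(len(V_history) - 1))
>>> print('Policy values: {}'.format(V))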
We then plot the resulting history of values using the following lines of code:
>>> import matplotlib.pyplot as plt
>>> s0, = plt.plot([v[0] for v in V_history])
>>> s1, = plt.plot([v[1] for v in V_history])
>>> s2, = plt.plot([v[2] for v in V_history])
>>> plt.title('Optimal policy with gamma = {}'.format(str(gamma)))
>>> plt.xlabel('Iteration')
>>> plt.ylabel('Policy values')
>>> plt.legend([s0, s1, s2],
... ["State s0",
... "State s1",
... "State s2"], loc="upper left")
>>> plt.show()
We see the following result:
[Plot: policy values of states s0, s1, and s2 versus iteration, gamma = 0.5]
It is interesting to see the values stabilize between iterations 10 and 14 as the evaluation converges.
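As an aside, plt.plot returns a list of Line2D objects, and the trailing comma in s0, = plt.plot(...) unpacks the single line so it can be handed to plt.legend. Since every entry in V_history is a tensor holding one value per state, an equivalent sketch stacks the history and plots all three curves in one call:
>>> all_values = torch.stack(V_history).numpy()  # shape: (n_iterations + 1, n_state)
>>> plt.plot(all_values)  # one curve per column, that is, per state
>>> plt.legend(['State s0', 'State s1', 'State s2'], loc='upper left')
>>> plt.show()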
Next, we run the same code but with two different discount factors, 0.2 and 0.99. With a discount factor of 0.2, we get the following plot:
[Plot: policy values of states s0, s1, and s2 versus iteration, gamma = 0.2]
Comparing this plot to the one with a discount factor of 0.5, we can see that the smaller the factor, the faster the policy values converge.
We also get the following plot with a discount factor of 0.99:
[Plot: policy values of states s0, s1, and s2 versus iteration, gamma = 0.99]
By comparing this plot to the one with a discount factor of 0.5, we can see that the larger the factor, the longer it takes for the policy values to converge. This is because each Bellman update shrinks the approximation error by a factor of roughly gamma, so a gamma close to 1 reduces it slowly. The discount factor is a trade-off between rewards now and rewards in the future.
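To reproduce the comparison in a single pass, the following sketch reruns the evaluation for each discount factor and reports how many sweeps each takes to converge (the exact counts depend on the MDP and the threshold):
>>> for gamma_try in [0.2, 0.5, 0.99]:
...     _, history = policy_evaluation_history(
...         policy_optimal, T, R, gamma_try, threshold)
...     print('gamma = {}: converged in {} iterations'.format(
...         gamma_try, len(history) - 1))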