
There's more...

To take a closer look, we also plot the policy values over the whole evaluation process.

We first need to record the value for each iteration in the policy_evaluation function:

>>> def policy_evaluation_history(
...         policy, trans_matrix, rewards, gamma, threshold):
...     n_state = policy.shape[0]
...     V = torch.zeros(n_state)
...     V_his = [V]                # record the value estimates of every iteration
...     i = 0
...     while True:
...         V_temp = torch.zeros(n_state)
...         i += 1
...         for state, actions in enumerate(policy):
...             for action, action_prob in enumerate(actions):
...                 # expected immediate reward plus discounted value of next states
...                 V_temp[state] += action_prob * (rewards[state] + gamma *
...                     torch.dot(trans_matrix[state, action], V))
...         max_delta = torch.max(torch.abs(V - V_temp))
...         V = V_temp.clone()
...         V_his.append(V)
...         if max_delta <= threshold:
...             break
...     return V, V_his

Now we feed the policy_evaluation_history function with the optimal policy, a discount factor of 0.5, and other variables:

>>> V, V_history = policy_evaluation_history(
...     policy_optimal, T, R, gamma, threshold)
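
Here, policy_optimal, T, R, gamma, and threshold are the variables defined earlier in the recipe. If you are running this section on its own, the following is a minimal, purely illustrative setup with placeholder values (a three-state, two-action MDP) just to make the call executable; it is not the recipe's actual MDP:

>>> import torch
>>> # Placeholder MDP for illustration only -- the real T, R, and
>>> # policy_optimal are the ones built in the main recipe
>>> T = torch.rand(3, 2, 3)             # transition probabilities [state, action, next_state]
>>> T = T / T.sum(dim=2, keepdim=True)  # normalize each (state, action) row to sum to 1
>>> R = torch.tensor([1.0, 0.0, -1.0])  # placeholder reward for each state
>>> policy_optimal = torch.tensor([[1.0, 0.0],
...                                [1.0, 0.0],
...                                [1.0, 0.0]])  # take action 0 in every state
>>> gamma = 0.5                         # discount factor used in this run
>>> threshold = 0.0001                  # assumed convergence threshold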

We then plot the resulting history of values using the following lines of code:

>>> import matplotlib.pyplot as plt
>>> s0, = plt.plot([v[0] for v in V_history])
>>> s1, = plt.plot([v[1] for v in V_history])
>>> s2, = plt.plot([v[2] for v in V_history])
>>> plt.title('Optimal policy with gamma = {}'.format(str(gamma)))
>>> plt.xlabel('Iteration')
>>> plt.ylabel('Policy values')
>>> plt.legend([s0, s1, s2],
... ["State s0",
... "State s1",
... "State s2"], loc="upper left")
>>> plt.show()

We see the following result:

It is interesting to see the stabilization between iterations 10 and 14 during convergence.

Next, we run the same code but with two different discount factors, 0.2 and 0.99. We get the following plot with the discount factor at 0.2:

Comparing this plot with the one for a discount factor of 0.5, we can see that the smaller the factor, the faster the policy values converge.

We also get the following plot with a discount factor of 0.99:

Comparing the plot for a discount factor of 0.99 with the one for 0.5, we can see that the larger the factor, the longer it takes for the policy values to converge. The discount factor is a trade-off between rewards now and rewards in the future.
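
To reproduce the comparison in one pass, the following sketch (assuming the same policy_optimal, T, R, and threshold as above) re-runs the evaluation for each discount factor and prints how many iterations each run needs before the values stabilize:

>>> for g in (0.2, 0.5, 0.99):
...     _, history = policy_evaluation_history(
...         policy_optimal, T, R, g, threshold)
...     print('gamma = {}: converged in {} iterations'.format(
...         g, len(history) - 1))

The iteration counts should follow the pattern seen in the plots: the smaller the discount factor, the fewer iterations policy evaluation needs.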
