
Cliff walking with SARSA

We will now learn how to code the aforementioned equations in Python and implement the cliff walking problem with SARSA. First, let's import the numpy, sys, and matplotlib packages in Python. If you have not used these packages in the past, there are several Packt books on these topics to help you come up to speed. Type the following command to install the required packages in a Linux Terminal:

sudo apt-get install python-numpy python-scipy python-matplotlib

We will now walk through the code needed to solve the cliff walking problem. In a Terminal, use your favorite editor (for example, gedit, emacs, or vi) to code the following:

import numpy as np 
import sys
import matplotlib.pyplot as plt

We will use a 3 x 12 grid for the cliff walking problem, that is, 3 rows and 12 columns; the start location, the cliff, and the destination lie along an additional row just below this grid, at y = nrows. We also have 4 actions to take at any cell: you can go north, east, south, or west:

nrows = 3
ncols = 12
nact = 4

We will consider 100,000 episodes in total. For exploration, we will use ε-greedy with a value of ε = 0.1. We will keep ε constant, although the interested reader is encouraged to also try a variable ε, slowly annealed to zero over the course of the episodes (a minimal sketch of this idea follows the next code block).

The learning rate, α, is chosen as 0.1, and the discount factor γ = 0.95 is used, which are typical values for this problem:

nepisodes = 100000
epsilon = 0.1
alpha = 0.1
gamma = 0.95
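
As a minimal sketch of the annealing idea mentioned above (this snippet is illustrative and not used in the rest of the chapter; the linear schedule and the annealed_epsilon name are just one assumed choice), ε could be decayed to zero over the episodes:

# Hypothetical linear annealing schedule for epsilon (illustrative only):
# start at the constant value used above and reach zero at the final episode.
epsilon_start = 0.1

def annealed_epsilon(episode):
    return epsilon_start * (1.0 - episode / float(nepisodes))

Such a schedule could then be evaluated at the start of each episode, for example, epsilon = annealed_epsilon(n) inside the training loop shown later.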

We will next assign values to the rewards. For any normal action that does not fall into the cliff, the reward is -1; if the agent falls down the cliff, the reward is -100; for reaching the destination, the reward is also -1. Feel free to explore other values for these rewards later, and investigate their effect on the final Q values and on the path taken from start to destination:

reward_normal = -1
reward_cliff = -100
reward_destination = -1

The Q values for the state-action pairs are initialized to zero. We will use a NumPy array for Q of shape nrows x ncols x nact, that is, nact entries for each of the nrows x ncols cells:

Q = np.zeros((nrows,ncols,nact), dtype=float)
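
As a quick, purely illustrative check of this layout (not required for the rest of the code), the array is indexed as Q[y, x, a], that is, row first, then column, then action:

print(Q.shape)       # (3, 12, 4)
print(Q[1, 5, :])    # the four action values of the cell in row 1, column 5: [0. 0. 0. 0.]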

We will define a function to make the agent go to the start location, which has (x, y) co-ordinates of (x=0, y=nrows):

def go_to_start():
    # start coordinates
    y = nrows
    x = 0
    return x, y

Next, we define a function to take a random action, where we use a = 0 for moving top/north, a = 1 for moving right/east, a = 2 for moving bottom/south, and a = 3 for moving left/west. Specifically, we will use NumPy's random.randint() function, as follows:

def random_action():
    # a = 0 : top/north
    # a = 1 : right/east
    # a = 2 : bottom/south
    # a = 3 : left/west
    a = np.random.randint(nact)
    return a

We will now define the move function, which takes a given (x, y) location of the agent and the current action, a, and performs that action. It returns the new location of the agent after taking the action, (x1, y1), as well as the state of the agent, which we define as state = 0 if the agent is OK after taking the action, state = 1 for reaching the destination, and state = 2 for falling into the cliff. If the agent tries to leave the domain through the left, top, or right, it is sent back to the same grid cell (equivalent to taking no action):

def move(x,y,a):
    # state = 0: OK
    # state = 1: reached destination
    # state = 2: fell into cliff
    state = 0

    if (x == 0 and y == nrows and a == 0):
        # start location
        x1 = x
        y1 = y - 1
        return x1, y1, state
    elif (x == ncols-1 and y == nrows-1 and a == 2):
        # reached destination
        x1 = x
        y1 = y + 1
        state = 1
        return x1, y1, state
    else:
        # inside grid; perform action
        if (a == 0):
            x1 = x
            y1 = y - 1
        elif (a == 1):
            x1 = x + 1
            y1 = y
        elif (a == 2):
            x1 = x
            y1 = y + 1
        elif (a == 3):
            x1 = x - 1
            y1 = y

        # don't allow agent to leave boundary
        if (x1 < 0):
            x1 = 0
        if (x1 > ncols-1):
            x1 = ncols-1
        if (y1 < 0):
            y1 = 0
        if (y1 > nrows-1):
            # left the grid through the bottom row: fell into the cliff
            state = 2

        return x1, y1, state
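
As a small sanity check (illustrative only, not part of the listing), taking the north action from the start cell should land the agent in the bottom-left cell of the grid with state = 0:

x, y = go_to_start()
print(move(x, y, 0))   # expected: (0, 2, 0), that is, (x1, y1, state)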

We will next define the exploit function, which will take the (x, y) location of the agent and take the greedy action based on the Q values, that is, the action a that has the highest Q value at that (x, y) location. We will do this using NumPy's np.argmax(). If we are at the start location, we go north (a = 0); if we are one step away from the destination, we go south (a = 2):

def exploit(x,y,Q):
    # start location
    if (x == 0 and y == nrows):
        a = 0
        return a

    # one step north of the destination: go south
    if (x == ncols-1 and y == nrows-1):
        a = 2
        return a

    if (x == ncols-1 and y == nrows):
        print("exploit at destination not possible ")
        sys.exit()

    # interior location
    if (x < 0 or x > ncols-1 or y < 0 or y > nrows-1):
        print("error ", x, y)
        sys.exit()

    a = np.argmax(Q[y,x,:])
    return a

Next, we will perform the Bellman update using the following bellman() function:

def bellman(x,y,a,reward,Qs1a1,Q):
    if (y == nrows and x == 0):
        # at start location; no Bellman update possible
        return Q

    if (y == nrows and x == ncols-1):
        # at destination location; no Bellman update possible
        return Q

    # perform the Bellman update
    Q[y,x,a] = Q[y,x,a] + alpha*(reward + gamma*Qs1a1 - Q[y,x,a])
    return Q
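
For reference, the last line of bellman() is the SARSA update from the theory section, with Qs1a1 playing the role of Q(s', a'):

$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \, Q(s',a') - Q(s,a) \right] $$

Here, r is the reward just received, and (s', a') is the next state-action pair chosen by the same ε-greedy policy, which is what makes SARSA an on-policy method.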

We will then define a function to explore or exploit, depending on whether a random number is less than ε, the parameter we use in the ε-greedy exploration strategy. For this, we will use NumPy's np.random.uniform(), which outputs a real random number between zero and one:

def explore_exploit(x,y,Q):
    # if we end up at the start location, then exploit
    if (x == 0 and y == nrows):
        a = 0
        return a

    # draw a uniform random number in [0, 1)
    r = np.random.uniform()

    if (r < epsilon):
        # explore
        a = random_action()
    else:
        # exploit
        a = exploit(x,y,Q)
    return a

We now have all the functions required for the cliff walking problem. We will loop over the episodes; in each episode, the agent starts at the start location, chooses an action by exploring or exploiting, and then moves one step according to that action. Here is the Python code for this:

for n in range(nepisodes+1):

    # print every 1000 episodes
    if (n % 1000 == 0):
        print("episode #: ", n)

    # start
    x, y = go_to_start()

    # explore or exploit
    a = explore_exploit(x,y,Q)

    while(True):
        # move one step
        x1, y1, state = move(x,y,a)

We perform the Bellman update based on the reward received; note that this follows the equations presented earlier in this chapter, in the theory section. We end the episode if the agent reaches the destination or falls down the cliff; otherwise, we choose the next action using the same explore-or-exploit strategy and continue. In the following code, the state variable takes the value 1 for reaching the destination, the value 2 for falling down the cliff, and is 0 otherwise:

        # Bellman update
        if (state == 1):
            reward = reward_destination
            Qs1a1 = 0.0
            Q = bellman(x,y,a,reward,Qs1a1,Q)
            break
        elif (state == 2):
            reward = reward_cliff
            Qs1a1 = 0.0
            Q = bellman(x,y,a,reward,Qs1a1,Q)
            break
        elif (state == 0):
            reward = reward_normal
            # SARSA: choose the next action a1 with the same epsilon-greedy policy
            a1 = explore_exploit(x1,y1,Q)
            if (x1 == 0 and y1 == nrows):
                # start location
                Qs1a1 = 0.0
            else:
                Qs1a1 = Q[y1,x1,a1]

            Q = bellman(x,y,a,reward,Qs1a1,Q)
            x = x1
            y = y1
            a = a1

The preceding code runs through all the episodes, and we now have the converged values of Q. We will now plot them using matplotlib, one subplot for each of the four actions:

for i in range(nact):
    plt.subplot(nact,1,i+1)
    plt.imshow(Q[:,:,i])
    plt.axis('off')
    plt.colorbar()
    if (i == 0):
        plt.title('Q-north')
    elif (i == 1):
        plt.title('Q-east')
    elif (i == 2):
        plt.title('Q-south')
    elif (i == 3):
        plt.title('Q-west')
plt.savefig('Q_sarsa.png')
plt.clf()
plt.close()
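
As a complementary, purely illustrative check (a small sketch that is not needed for the rest of the chapter), you can also print the greedy action in each grid cell as a letter, which gives a quick text view of the learned policy:

# print the greedy (highest-Q) action per cell: N, E, S, or W
arrow = {0: 'N', 1: 'E', 2: 'S', 3: 'W'}
for y in range(nrows):
    print(' '.join(arrow[int(np.argmax(Q[y, x, :]))] for x in range(ncols)))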

Finally, we will perform path planning using the preceding converged Q values. That is, we will plot the exact path the agent takes from start to finish under the final converged Q values. For this, we will create an array called path, mark the cells visited along the way, and then plot it using matplotlib as follows:

path = np.zeros((nrows,ncols), dtype=float)
x, y = go_to_start()
while(True):
    a = exploit(x,y,Q)
    print(x,y,a)
    x1, y1, state = move(x,y,a)
    if (state == 1 or state == 2):
        print("breaking ", state)
        break
    elif (state == 0):
        x = x1
        y = y1
    if (x >= 0 and x <= ncols-1 and y >= 0 and y <= nrows-1):
        path[y,x] = 100.0
plt.imshow(path)
plt.savefig('path_sarsa.png')

That's it. We have completed the coding required for the cliff walking problem with SARSA. We will now view the results. In the following screenshot, we present the Q values for each of the actions (going north, east, south, or west) at each of the locations in the grid. As per the legend, yellow represents high Q values and violet represents low Q values.

SARSA clearly tries to avoid the cliff by choosing to not go south when the agent is just one step to the north of the cliff, as is evident from the large negative Q values for the south action:

Figure 1: Q values for the cliff walking problem using SARSA

We will next plot the path taken by the agent from start to finish in the following screenshot:

Figure 2: Path taken by the agent using SARSA

The same cliff walking problem will now be investigated using Q-learning.
