- TensorFlow Reinforcement Learning Quick Start Guide
- Kaushik Balakrishnan
Cliff walking with SARSA
We will now learn how to code the aforementioned equations in Python and implement the cliff walking problem with SARSA. First, let's import the numpy, sys, and matplotlib packages in Python. If you have not used these packages in the past, there are several Packt books on these topics to help you come up to speed. Type the following command to install the required packages in a Linux Terminal:
sudo apt-get install python-numpy python-scipy python-matplotlib
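If you are not on a Debian-based system, or you prefer Python's own package manager, the equivalent pip command should also work (assuming pip is already installed):
pip install numpy scipy matplotlib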
We will now walk through the code required to solve this grid world problem. In a Terminal, use your favorite editor (for example, gedit, emacs, or vi) to write the following:
import numpy as np
import sys
import matplotlib.pyplot as plt
We will use a 3 x 12 grid for the cliff walking problem, that is, 3 rows and 12 columns. We also have 4 actions to take at any cell. You can go north, east, south, or west:
nrows = 3
ncols = 12
nact = 4
We will consider 100,000 episodes in total. For exploration, we will use ε-greedy with ε = 0.1. We will keep ε constant, although the interested reader is encouraged to try a variable ε as well, with its value slowly annealed to zero over the course of the episodes (a simple annealing sketch follows the parameter block below).
The learning rate, α, is chosen as 0.1, and the discount factor γ = 0.95 is used, which are typical values for this problem:
nepisodes = 100000
epsilon = 0.1
alpha = 0.1
gamma = 0.95
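If you would like to experiment with an annealed exploration rate, as suggested previously, a simple option is a linear schedule. The following helper is a hypothetical addition, not part of the book's listing; you could call it inside the training loop shown later, with n being the episode index, and use its return value in place of the constant epsilon:
def annealed_epsilon(n):
    # linearly decay epsilon from its initial value to zero over all episodes
    return epsilon * max(0.0, 1.0 - float(n) / nepisodes)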
We will next assign values for the rewards. For any normal step that does not fall into the cliff, the reward is -1; if the agent falls into the cliff, the reward is -100; and for reaching the destination, the reward is also -1. Feel free to explore other values for these rewards later, and investigate their effect on the final Q values and on the path taken from start to destination:
reward_normal = -1
reward_cliff = -100
reward_destination = -1
The Q values for state-action pairs are initialized to zero. We will use a NumPy array for Q of shape nrows x ncols x nact, that is, nact entries for each of the nrows x ncols cells:
Q = np.zeros((nrows, ncols, nact), dtype=float)   # plain float: np.float is deprecated in recent NumPy versions
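Note that this array is indexed as Q[y, x, a] throughout the rest of the listing, that is, row index first. A quick check of the shape confirms the layout:
print(Q.shape)    # prints (3, 12, 4), that is, nrows x ncols x nact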
We will define a function to make the agent go to the start location, which has (x, y) co-ordinates of (x=0, y=nrows):
def go_to_start():
    # start coordinates
    y = nrows
    x = 0
    return x, y
Next, we define a function to take a random action, where a = 0 means moving top/north, a = 1 moving right/east, a = 2 moving bottom/south, and a = 3 moving left/west. Specifically, we will use NumPy's random.randint() function, as follows:
def random_action():
    # a = 0 : top/north
    # a = 1 : right/east
    # a = 2 : bottom/south
    # a = 3 : left/west
    a = np.random.randint(nact)
    return a
We will now define the move function, which takes the agent's current (x, y) location and the current action, a, performs the action, and returns the agent's new location, (x1, y1), along with its state: state = 0 if the agent is OK after taking the action, state = 1 if it has reached the destination, and state = 2 if it has fallen into the cliff. If the agent would leave the domain through the left, top, or right boundary, it is kept in the same grid cell (equivalent to taking no action):
def move(x,y,a):
    # state = 0: OK
    # state = 1: reached destination
    # state = 2: fell into cliff
    state = 0
    if (x == 0 and y == nrows and a == 0):
        # start location
        x1 = x
        y1 = y - 1
        return x1, y1, state
    elif (x == ncols-1 and y == nrows-1 and a == 2):
        # reached destination
        x1 = x
        y1 = y + 1
        state = 1
        return x1, y1, state
    else:
        # inside grid; perform action
        if (a == 0):
            x1 = x
            y1 = y - 1
        elif (a == 1):
            x1 = x + 1
            y1 = y
        elif (a == 2):
            x1 = x
            y1 = y + 1
        elif (a == 3):
            x1 = x - 1
            y1 = y
        # don't allow agent to leave boundary
        if (x1 < 0):
            x1 = 0
        if (x1 > ncols-1):
            x1 = ncols-1
        if (y1 < 0):
            y1 = 0
        if (y1 > nrows-1):
            state = 2
        return x1, y1, state
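To convince yourself that the boundary handling works as described, you can call move() directly; for example (these checks are only illustrative and are not part of the listing):
print(move(0, 1, 3))    # trying to move west off the grid keeps the agent at x1 = 0, with state = 0
print(move(5, 2, 2))    # moving south from the bottom row falls into the cliff, so state = 2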
We will next define the exploit function, which will take the (x, y) location of the agent and choose the greedy action based on the Q values, that is, the action a with the highest Q value at that location. We will do this using NumPy's np.argmax(). If we are at the start location, we go north (a = 0); if we are one step away from the destination, we go south (a = 2):
def exploit(x,y,Q):
    # start location
    if (x == 0 and y == nrows):
        a = 0
        return a
    # destination location
    if (x == ncols-1 and y == nrows-1):
        a = 2
        return a
    if (x == ncols-1 and y == nrows):
        print("exploit at destination not possible ")
        sys.exit()
    # interior location
    if (x < 0 or x > ncols-1 or y < 0 or y > nrows-1):
        print("error ", x, y)
        sys.exit()
    a = np.argmax(Q[y,x,:])
    return a
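One detail worth noting: when several actions share the highest Q value (for instance, early in training when Q is still all zeros), np.argmax() returns the first such index, so ties are broken in favor of a = 0 (north):
print(np.argmax(np.zeros(nact)))    # prints 0; ties are broken toward the lowest index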
Next, we will perform the Bellman update using the following bellman() function:
def bellman(x,y,a,reward,Qs1a1,Q):
    if (y == nrows and x == 0):
        # at start location; no Bellman update possible
        return Q
    if (y == nrows and x == ncols-1):
        # at destination location; no Bellman update possible
        return Q
    # perform the Bellman update
    Q[y,x,a] = Q[y,x,a] + alpha*(reward + gamma*Qs1a1 - Q[y,x,a])
    return Q
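To make the update concrete, consider one step with the values chosen earlier (α = 0.1, γ = 0.95): if Q[y,x,a] is currently zero, the reward is -1, and Qs1a1 is zero, then the new value is:
Q[y,x,a] = 0 + 0.1 * (-1 + 0.95 * 0 - 0) = -0.1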
We will then define a function to explore or exploit, depending on whether a uniform random number is less than ε, the parameter of our ε-greedy exploration strategy. For this, we will use NumPy's np.random.uniform(), which returns a random real number between zero and one:
def explore_exploit(x,y,Q):
    # if we end up at the start location, then exploit
    if (x == 0 and y == nrows):
        a = 0
        return a
    # call a uniform random number
    r = np.random.uniform()
    if (r < epsilon):
        # explore
        a = random_action()
    else:
        # exploit
        a = exploit(x,y,Q)
    return a
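As a rough check of the exploration rate, you can compare explore_exploit() against exploit() at some interior cell; the two disagree only when a random action is drawn and it happens to differ from the greedy one, which occurs with probability ε x 3/4 ≈ 0.075 for four actions. This check is illustrative only and not part of the listing:
n_explore = 0
for _ in range(10000):
    if (explore_exploit(5, 1, Q) != exploit(5, 1, Q)):
        n_explore += 1
print(n_explore / 10000.0)    # should be roughly 0.075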
We now have all the functions required for the cliff walking problem, so we can loop over the episodes. In each episode, the agent starts at the start location, chooses an action by exploring or exploiting, and then moves one step according to that action. Here is the Python code for this:
for n in range(nepisodes+1):
    # print every 1000 episodes
    if (n % 1000 == 0):
        print("episode #: ", n)
    # start
    x, y = go_to_start()
    # explore or exploit
    a = explore_exploit(x,y,Q)
    while(True):
        # move one step
        x1, y1, state = move(x,y,a)
We perform the Bellman update based on the reward received; note that this follows the equations presented earlier in this chapter in the theory section. We stop the episode if the agent reaches the destination or falls into the cliff; otherwise, we continue the explore-or-exploit strategy for another step, and so on. In the following code, the state variable takes the value 1 on reaching the destination, 2 on falling into the cliff, and 0 otherwise:
        # Bellman update
        if (state == 1):
            reward = reward_destination
            Qs1a1 = 0.0
            Q = bellman(x,y,a,reward,Qs1a1,Q)
            break
        elif (state == 2):
            reward = reward_cliff
            Qs1a1 = 0.0
            Q = bellman(x,y,a,reward,Qs1a1,Q)
            break
        elif (state == 0):
            reward = reward_normal
            # SARSA: choose the next action a1 at the new location (x1, y1)
            a1 = explore_exploit(x1,y1,Q)
            if (x1 == 0 and y1 == nrows):
                # start location
                Qs1a1 = 0.0
            else:
                Qs1a1 = Q[y1,x1,a1]
            Q = bellman(x,y,a,reward,Qs1a1,Q)
            x = x1
            y = y1
            a = a1
The preceding code will complete all the episodes, and we now have the converged values of Q. We will now plot this using matplotlib for each of the actions:
for i in range(nact):
    plt.subplot(nact,1,i+1)
    plt.imshow(Q[:,:,i])
    plt.axis('off')
    plt.colorbar()
    if (i == 0):
        plt.title('Q-north')
    elif (i == 1):
        plt.title('Q-east')
    elif (i == 2):
        plt.title('Q-south')
    elif (i == 3):
        plt.title('Q-west')
plt.savefig('Q_sarsa.png')
plt.clf()
plt.close()
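If you would rather inspect the numbers than the heat maps, you can also print a rounded slice of Q. For example, the south action (a = 2) should show large negative values in the row just above the cliff (this check is not part of the original listing):
print(np.round(Q[:, :, 2], 1))    # Q values for the south action, one row of output per grid row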
Finally, we will do path planning using the converged Q values obtained previously. That is, we will plot the exact path the agent takes from start to finish when it acts greedily with respect to the final Q values. For this, we will create an array called path and mark the cells the agent visits; we will then plot it using matplotlib as follows:
path = np.zeros((nrows,ncols), dtype=float)   # 2D map of the grid; visited cells will be marked
x, y = go_to_start()
while(True):
    a = exploit(x,y,Q)
    print(x,y,a)
    x1, y1, state = move(x,y,a)
    if (state == 1 or state == 2):
        print("breaking ", state)
        break
    elif (state == 0):
        x = x1
        y = y1
        if (x >= 0 and x <= ncols-1 and y >= 0 and y <= nrows-1):
            path[y,x] = 100.0
plt.imshow(path)
plt.savefig('path_sarsa.png')
That's it. We have completed the coding required for the cliff walking problem with SARSA. We will now view the results. In the following screenshot, we present the Q values for each of the actions (going north, east, south, or west) at each of the locations in the grid. As per the legend, yellow represents high Q values and violet represents low Q values.
SARSA clearly tries to avoid the cliff by choosing to not go south when the agent is just one step to the north of the cliff, as is evident from the large negative Q values for the south action:

We will next plot the path taken by the agent from start to finish in the following screenshot:

The same cliff walking problem will now be investigated using Q-learning.