
Cliff walking with SARSA

We will now learn how to code the aforementioned equations in Python and implement the cliff walking problem with SARSA. First, let's import the numpy, sys, and matplotlib packages in Python. If you have not used these packages in the past, there are several Packt books on these topics to help you come up to speed. Type the following command to install the required packages in a Linux Terminal:

sudo apt-get install python-numpy python-scipy python-matplotlib

We will now walk through the code needed to solve the cliff walking problem. In a Terminal, use your favorite editor (for example, gedit, emacs, or vi) to code the following:

import numpy as np 
import sys
import matplotlib.pyplot as plt

We will use a 3 x 12 grid for the cliff walking problem, that is, 3 rows and 12 columns; the start location, the cliff, and the destination lie along an additional row just below this grid, at y = nrows. We also have 4 actions to take at any cell: you can go north, east, south, or west:

nrows = 3
ncols = 12
nact = 4

We will consider 100,000 episodes in total. For exploration, we will use ε-greedy with a value of ε = 0.1. We will keep ε constant, although the interested reader is encouraged to also try a variable ε, slowly annealed to zero over the course of the episodes (a minimal sketch of this idea follows the next code block).

The learning rate, α, is chosen as 0.1, and the discount factor γ = 0.95 is used, which are typical values for this problem:

nepisodes = 100000
epsilon = 0.1
alpha = 0.1
gamma = 0.95
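
As a minimal sketch of the annealing idea mentioned above (this snippet is illustrative and not used in the rest of the chapter; the linear schedule and the annealed_epsilon name are just one assumed choice), ε could be decayed to zero over the episodes:

# Hypothetical linear annealing schedule for epsilon (illustrative only):
# start at the constant value used above and reach zero at the final episode.
epsilon_start = 0.1

def annealed_epsilon(episode):
    return epsilon_start * (1.0 - episode / float(nepisodes))

Such a schedule could then be evaluated at the start of each episode, for example, epsilon = annealed_epsilon(n) inside the training loop shown later.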

We will next assign values to the rewards. For any normal action that does not fall into the cliff, the reward is -1; if the agent falls down the cliff, the reward is -100; for reaching the destination, the reward is also -1. Feel free to explore other values for these rewards later, and investigate their effect on the final Q values and on the path taken from start to destination:

reward_normal = -1
reward_cliff = -100
reward_destination = -1

The Q values for the state-action pairs are initialized to zero. We will use a NumPy array for Q of shape nrows x ncols x nact, that is, nact entries for each of the nrows x ncols cells:

Q = np.zeros((nrows,ncols,nact), dtype=float)
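
As a quick, purely illustrative check of this layout (not required for the rest of the code), the array is indexed as Q[y, x, a], that is, row first, then column, then action:

print(Q.shape)       # (3, 12, 4)
print(Q[1, 5, :])    # the four action values of the cell in row 1, column 5: [0. 0. 0. 0.]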

We will define a function to make the agent go to the start location, which has (x, y) co-ordinates of (x=0, y=nrows):

def go_to_start():
    # start coordinates
    y = nrows
    x = 0
    return x, y

Next, we define a function to take a random action, where we use a = 0 for moving top/north, a = 1 for moving right/east, a = 2 for moving bottom/south, and a = 3 for moving left/west. Specifically, we will use NumPy's random.randint() function, as follows:

def random_action():
    # a = 0 : top/north
    # a = 1 : right/east
    # a = 2 : bottom/south
    # a = 3 : left/west
    a = np.random.randint(nact)
    return a

We will now define the move function, which takes a given (x, y) location of the agent and the current action, a, and performs that action. It returns the new location of the agent after taking the action, (x1, y1), as well as the state of the agent, which we define as state = 0 if the agent is OK after taking the action, state = 1 for reaching the destination, and state = 2 for falling into the cliff. If the agent tries to leave the domain through the left, top, or right, it is sent back to the same grid cell (equivalent to taking no action):

def move(x,y,a):
    # state = 0: OK
    # state = 1: reached destination
    # state = 2: fell into cliff
    state = 0

    if (x == 0 and y == nrows and a == 0):
        # start location
        x1 = x
        y1 = y - 1
        return x1, y1, state
    elif (x == ncols-1 and y == nrows-1 and a == 2):
        # reached destination
        x1 = x
        y1 = y + 1
        state = 1
        return x1, y1, state
    else:
        # inside grid; perform action
        if (a == 0):
            x1 = x
            y1 = y - 1
        elif (a == 1):
            x1 = x + 1
            y1 = y
        elif (a == 2):
            x1 = x
            y1 = y + 1
        elif (a == 3):
            x1 = x - 1
            y1 = y

        # don't allow agent to leave boundary
        if (x1 < 0):
            x1 = 0
        if (x1 > ncols-1):
            x1 = ncols-1
        if (y1 < 0):
            y1 = 0
        if (y1 > nrows-1):
            # left the grid through the bottom row: fell into the cliff
            state = 2

        return x1, y1, state
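
As a small sanity check (illustrative only, not part of the listing), taking the north action from the start cell should land the agent in the bottom-left cell of the grid with state = 0:

x, y = go_to_start()
print(move(x, y, 0))   # expected: (0, 2, 0), that is, (x1, y1, state)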

We will next define the exploit function, which will take the (x, y) location of the agent and take the greedy action based on the Q values, that is, the action a that has the highest Q value at that (x, y) location. We will do this using NumPy's np.argmax(). If we are at the start location, we go north (a = 0); if we are one step away from the destination, we go south (a = 2):

def exploit(x,y,Q):
    # start location
    if (x == 0 and y == nrows):
        a = 0
        return a

    # one step north of the destination: go south
    if (x == ncols-1 and y == nrows-1):
        a = 2
        return a

    if (x == ncols-1 and y == nrows):
        print("exploit at destination not possible ")
        sys.exit()

    # interior location
    if (x < 0 or x > ncols-1 or y < 0 or y > nrows-1):
        print("error ", x, y)
        sys.exit()

    a = np.argmax(Q[y,x,:])
    return a

Next, we will perform the Bellman update using the following bellman() function:

def bellman(x,y,a,reward,Qs1a1,Q):
    if (y == nrows and x == 0):
        # at start location; no Bellman update possible
        return Q

    if (y == nrows and x == ncols-1):
        # at destination location; no Bellman update possible
        return Q

    # perform the Bellman update
    Q[y,x,a] = Q[y,x,a] + alpha*(reward + gamma*Qs1a1 - Q[y,x,a])
    return Q
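
For reference, the last line of bellman() is the SARSA update from the theory section, with Qs1a1 playing the role of Q(s', a'):

$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \, Q(s',a') - Q(s,a) \right] $$

Here, r is the reward just received, and (s', a') is the next state-action pair chosen by the same ε-greedy policy, which is what makes SARSA an on-policy method.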

We will then define a function to explore or exploit, depending on whether a random number is less than ε, the parameter we use in the ε-greedy exploration strategy. For this, we will use NumPy's np.random.uniform(), which outputs a real random number between zero and one:

def explore_exploit(x,y,Q):
    # if we end up at the start location, then exploit
    if (x == 0 and y == nrows):
        a = 0
        return a

    # draw a uniform random number in [0, 1)
    r = np.random.uniform()

    if (r < epsilon):
        # explore
        a = random_action()
    else:
        # exploit
        a = exploit(x,y,Q)
    return a

We now have all the functions required for the cliff walking problem. We will loop over the episodes; in each episode, the agent starts at the start location, chooses an action by exploring or exploiting, and then moves one step according to that action. Here is the Python code for this:

for n in range(nepisodes+1):

    # print every 1000 episodes
    if (n % 1000 == 0):
        print("episode #: ", n)

    # start
    x, y = go_to_start()

    # explore or exploit
    a = explore_exploit(x,y,Q)

    while(True):
        # move one step
        x1, y1, state = move(x,y,a)

We perform the Bellman update based on the reward received; note that this follows the equations presented earlier in this chapter, in the theory section. We end the episode if the agent reaches the destination or falls down the cliff; otherwise, we choose the next action using the same explore-or-exploit strategy and continue. In the following code, the state variable takes the value 1 for reaching the destination, the value 2 for falling down the cliff, and is 0 otherwise:

        # Bellman update
        if (state == 1):
            reward = reward_destination
            Qs1a1 = 0.0
            Q = bellman(x,y,a,reward,Qs1a1,Q)
            break
        elif (state == 2):
            reward = reward_cliff
            Qs1a1 = 0.0
            Q = bellman(x,y,a,reward,Qs1a1,Q)
            break
        elif (state == 0):
            reward = reward_normal
            # SARSA: choose the next action a1 with the same epsilon-greedy policy
            a1 = explore_exploit(x1,y1,Q)
            if (x1 == 0 and y1 == nrows):
                # start location
                Qs1a1 = 0.0
            else:
                Qs1a1 = Q[y1,x1,a1]

            Q = bellman(x,y,a,reward,Qs1a1,Q)
            x = x1
            y = y1
            a = a1

The preceding code runs through all the episodes, and we now have the converged values of Q. We will now plot them using matplotlib, one subplot for each of the four actions:

for i in range(nact):
    plt.subplot(nact,1,i+1)
    plt.imshow(Q[:,:,i])
    plt.axis('off')
    plt.colorbar()
    if (i == 0):
        plt.title('Q-north')
    elif (i == 1):
        plt.title('Q-east')
    elif (i == 2):
        plt.title('Q-south')
    elif (i == 3):
        plt.title('Q-west')
plt.savefig('Q_sarsa.png')
plt.clf()
plt.close()
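
As a complementary, purely illustrative check (a small sketch that is not needed for the rest of the chapter), you can also print the greedy action in each grid cell as a letter, which gives a quick text view of the learned policy:

# print the greedy (highest-Q) action per cell: N, E, S, or W
arrow = {0: 'N', 1: 'E', 2: 'S', 3: 'W'}
for y in range(nrows):
    print(' '.join(arrow[int(np.argmax(Q[y, x, :]))] for x in range(ncols)))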

Finally, we will perform path planning using the preceding converged Q values. That is, we will plot the exact path the agent takes from start to finish under the final converged Q values. For this, we will create an array called path, mark the cells visited along the way, and then plot it using matplotlib as follows:

path = np.zeros((nrows,ncols), dtype=float)
x, y = go_to_start()
while(True):
    a = exploit(x,y,Q)
    print(x,y,a)
    x1, y1, state = move(x,y,a)
    if (state == 1 or state == 2):
        print("breaking ", state)
        break
    elif (state == 0):
        x = x1
        y = y1
    if (x >= 0 and x <= ncols-1 and y >= 0 and y <= nrows-1):
        path[y,x] = 100.0
plt.imshow(path)
plt.savefig('path_sarsa.png')

That's it. We have completed the coding required for the cliff walking problem with SARSA. We will now view the results. In the following screenshot, we present the Q values for each of the actions (going north, east, south, or west) at each of the locations in the grid. As per the legend, yellow represents high Q values and violet represents low Q values.

SARSA clearly tries to avoid the cliff by choosing to not go south when the agent is just one step to the north of the cliff, as is evident from the large negative Q values for the south action:

Figure 1: Q values for the cliff walking problem using SARSA

We will next plot the path taken by the agent from start to finish in the following screenshot:

Figure 2: Path taken by the agent using SARSA

The same cliff walking problem will now be investigated using Q-learning.
