The Epsilon-Greedy approach

The Epsilon-Greedy approach is a widely used solution to the explore-exploit dilemma. Exploration is all about trying out new options through experimentation in order to discover their values, while exploitation is all about sticking with the options that currently look best, repeating them and refining their value estimates.

The Epsilon-Greedy approach is very simple to understand and easy to implement:

epsilon = 0.05 or 0.1 #any small value between 0 and 1
#epsilon is the probability of exploration

p = random number between 0 and 1

if p < epsilon:
    pull a random action
else:
    pull the current best action

Eventually, after several iterations, the agent discovers the best action for each state, because it gets the opportunity both to explore new random actions and to exploit the existing ones, refining their values along the way.
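To see this effect in isolation, here is a minimal, self-contained sketch of epsilon-greedy on a toy three-armed bandit. The bandit itself, its win probabilities, and all variable names here are illustrative and are not part of the FrozenLake example that follows:

import numpy as np

np.random.seed(0)

#hypothetical 3-armed bandit: true win probability of each arm (illustrative values)
true_probs = [0.3, 0.5, 0.8]
n_arms = len(true_probs)

epsilon = 0.1               #probability of exploration
Q = np.zeros(n_arms)        #estimated value of each arm
counts = np.zeros(n_arms)   #number of times each arm was pulled

for step in range(5000):
    p = np.random.uniform(low=0, high=1)
    if p < epsilon:
        a = np.random.randint(n_arms)    #explore: pull a random arm
    else:
        a = int(np.argmax(Q))            #exploit: pull the current best arm
    r = float(np.random.uniform(0, 1) < true_probs[a])   #1 if the arm pays out, else 0
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]       #incremental average of observed rewards

print("Estimated arm values :", Q)
print("Best arm found       :", int(np.argmax(Q)))   #converges to arm 2 for these probabilities

The same if p < epsilon test is what drives exploration in the FrozenLake agent below.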

Let's try to implement a basic Q-learning algorithm to make an agent learn how to navigate across this 4 x 4 frozen lake of 16 grid cells, from the start to the goal, without falling into a hole. The update rule at the heart of the listing is recalled just below.
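The code relies on the standard tabular Q-learning update, written here with the same variable names as the listing (lr is the learning rate, y is the discount factor, and s_ is the next state reached after taking action a in state s and receiving reward r):

Q[s, a] = Q[s, a] + lr * (r + y * max(Q[s_, :]) - Q[s, a])

Each update nudges Q[s, a] towards the immediate reward plus the discounted value of the best action available from the next state. Here is the full implementation: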

# importing dependency libraries
from __future__ import print_function
import gym
import numpy as np
import time

#Load the environment

env = gym.make('FrozenLake-v0')

s = env.reset()
print("initial state : ",s)
print()

env.render()
print()

print(env.action_space) #number of actions
print(env.observation_space) #number of states
print()

print("Number of actions : ",env.action_space.n)
print("Number of states : ",env.observation_space.n)
print()

#Epsilon-Greedy approach for Exploration and Exploitation of the state-action spaces
def epsilon_greedy(Q,s,na):
    epsilon = 0.3
    p = np.random.uniform(low=0,high=1)
    #print(p)
    if p > epsilon:
        return np.argmax(Q[s,:]) #exploit: for each state pick the action with the highest Q-value (the current best policy)
    else:
        return env.action_space.sample() #explore: sample a random action

# Q-Learning Implementation

#Initializing Q-table with zeros
Q = np.zeros([env.observation_space.n,env.action_space.n])

#set hyperparameters
lr = 0.5 #learning rate
y = 0.9 #discount factor gamma
eps = 100000 #total number of training episodes


for i in range(eps):
    s = env.reset()
    t = False
    while(True):
        a = epsilon_greedy(Q,s,env.action_space.n)
        s_,r,t,_ = env.step(a)
        if (r==0):
            if t==True:
                r = -5 #to give negative rewards when holes turn up
                Q[s_] = np.ones(env.action_space.n)*r #in terminal state Q value equals the reward
            else:
                r = -1 #to give negative rewards to avoid long routes
        if (r==1):
            r = 100
            Q[s_] = np.ones(env.action_space.n)*r #in terminal state Q value equals the reward
        Q[s,a] = Q[s,a] + lr * (r + y*np.max(Q[s_,:]) - Q[s,a]) #Q-learning update
        s = s_
        if (t == True) :
            break

print("Q-table")
print(Q)
print()

print("Output after learning")
print()
#learning ends when the training loop over all the episodes above finishes
#let's check how much our agent has learned
s = env.reset()
env.render()
while(True):
    a = np.argmax(Q[s])
    s_,r,t,_ = env.step(a)
    print("===============")
    env.render()
    s = s_
    if(t==True) :
        break
----------------------------------------------------------------------------------------------
<<OUTPUT>>

initial state : 0

SFFF
FHFH
FFFH
HFFG

Discrete(4)
Discrete(16)

Number of actions : 4
Number of states : 16

Q-table
[[ -9.85448046  -7.4657981   -9.59584501 -10.        ]
 [ -9.53200011  -9.54250775  -9.10115662 -10.        ]
 [ -9.65308982  -9.51359977  -7.52052469 -10.        ]
 [ -9.69762313  -9.5540111   -9.56571455 -10.        ]
 [ -9.82319854  -4.83823005  -9.56441915  -9.74234959]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.6554905   -9.44717167  -7.35077759  -9.77885057]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.66012445  -4.28223592  -9.48312882  -9.76812285]
 [ -9.59664264   9.60799515  -4.48137699  -9.61956668]
 [ -9.71057124  -5.6863911   -2.04563412  -9.75341962]
 [ -5.          -5.          -5.          -5.        ]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.54737964  22.84803205  18.17841481  -9.45516929]
 [ -9.69494035  34.16859049  72.04055782  40.62254838]
 [100.         100.         100.         100.        ]]

Output after learning

SFFF
FHFH
FFFH
HFFG
===============
(Down)
SFFF
FHFH
FFFH
HFFG
===============
(Down)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
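The rendered episode above shows one successful run, but because FrozenLake-v0 is slippery, a single rollout says little about overall performance. Here is a short, optional sketch (not part of the original listing; n_eval and successes are names introduced here) that runs the greedy policy for many episodes and counts how often it reaches the goal:

#Optional check: estimate how often the greedy policy reaches the goal
#(assumes env and Q from the listing above are still in scope)
n_eval = 1000
successes = 0
for _ in range(n_eval):
    s = env.reset()
    t = False
    while not t:
        a = np.argmax(Q[s])      #always exploit the learned Q-table
        s, r, t, _ = env.step(a)
    successes += int(r == 1)     #reward 1 is only given on reaching the goal
print("Goal reached in", successes, "out of", n_eval, "episodes")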