- Reinforcement Learning with TensorFlow
- Sayon Dutta
The Epsilon-Greedy approach
The Epsilon-Greedy approach is a widely used solution to the explore-exploit dilemma. Exploration is about trying out new options through experimentation and research in order to estimate their values, while exploitation is about sticking with the options that already look best, repeating them, and refining their value estimates.
The Epsilon-Greedy approach is very simple to understand and easy to implement:
epsilon = 0.05 or 0.1 #any small value between 0 and 1
#epsilon is the probability of exploration
p = random number between 0 and 1
if p < epsilon :
    pull a random action
else:
    pull current best action
Eventually, after several iterations, we discover the best action at each state, because the agent gets the chance both to explore new random actions and to exploit the existing actions and refine their values.
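One common refinement, not used in the listing that follows, is to decay epsilon over the course of training, so that the agent explores heavily at first and relies more and more on its learned Q-values later. A minimal sketch of such a schedule (the helper names decayed_epsilon and epsilon_greedy_decayed are illustrative, not part of the book's code):
import numpy as np

def decayed_epsilon(episode, total_episodes, eps_start=1.0, eps_end=0.05):
    #linearly anneal the exploration probability from eps_start down to eps_end
    fraction = min(float(episode) / total_episodes, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def epsilon_greedy_decayed(Q, s, episode, total_episodes):
    #illustrative epsilon-greedy variant with a decaying epsilon
    eps = decayed_epsilon(episode, total_episodes)
    if np.random.uniform(0, 1) < eps:
        return np.random.randint(Q.shape[1]) #explore: random action index
    return np.argmax(Q[s, :]) #exploit: best known action for state s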
Let's try to implement a basic Q-learning algorithm to make an agent learn how to navigate across this frozen lake of 16 grid cells, from the start to the goal, without falling into a hole:
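At every step the agent applies the standard one-step Q-learning update, written here in the notation of the code below, where lr is the learning rate, y the discount factor, s_ the next state, and r the reward:
Q[s,a] = Q[s,a] + lr * (r + y * np.max(Q[s_,:]) - Q[s,a])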
# importing dependency libraries
from __future__ import print_function
import gym
import numpy as np
import time
#Load the environment
env = gym.make('FrozenLake-v0')
s = env.reset()
print("initial state : ",s)
print()
env.render()
print()
print(env.action_space) #number of actions
print(env.observation_space) #number of states
print()
print("Number of actions : ",env.action_space.n)
print("Number of states : ",env.observation_space.n)
print()
#Epsilon-Greedy approach for Exploration and Exploitation of the state-action spaces
def epsilon_greedy(Q,s,na):
    epsilon = 0.3
    p = np.random.uniform(low=0,high=1)
    #print(p)
    if p > epsilon:
        return np.argmax(Q[s,:]) #exploit: for each state consider the action having the highest Q-value
    else:
        return env.action_space.sample() #explore: pick a random action
# Q-Learning Implementation

#Initializing Q-table with zeros
Q = np.zeros([env.observation_space.n,env.action_space.n])

#set hyperparameters
lr = 0.5 #learning rate
y = 0.9 #discount factor gamma
eps = 100000 #total number of training episodes

for i in range(eps):
    s = env.reset()
    t = False
    while(True):
        a = epsilon_greedy(Q,s,env.action_space.n)
        s_,r,t,_ = env.step(a)
        if (r==0):
            if t==True:
                r = -5 #to give negative rewards when holes turn up
                Q[s_] = np.ones(env.action_space.n)*r #in terminal state Q value equals the reward
            else:
                r = -1 #to give negative rewards to avoid long routes
        if (r==1):
            r = 100
            Q[s_] = np.ones(env.action_space.n)*r #in terminal state Q value equals the reward
        Q[s,a] = Q[s,a] + lr * (r + y*np.max(Q[s_,:]) - Q[s,a]) #one-step Q-learning update
        s = s_
        if (t == True):
            break
print("Q-table")
print(Q)
print()
print("Output after learning")
print()
#learning ends with the end of the above loop over several episodes
#let's check how much our agent has learned
s = env.reset()
env.render()
while(True):
    a = np.argmax(Q[s])
    s_,r,t,_ = env.step(a)
    print("===============")
    env.render()
    s = s_
    if(t==True):
        break
----------------------------------------------------------------------------------------------
<<OUTPUT>>
initial state : 0
SFFF
FHFH
FFFH
HFFG
Discrete(4)
Discrete(16)
Number of actions : 4
Number of states : 16
Q-table
[[ -9.85448046  -7.4657981   -9.59584501 -10.        ]
 [ -9.53200011  -9.54250775  -9.10115662 -10.        ]
 [ -9.65308982  -9.51359977  -7.52052469 -10.        ]
 [ -9.69762313  -9.5540111   -9.56571455 -10.        ]
 [ -9.82319854  -4.83823005  -9.56441915  -9.74234959]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.6554905   -9.44717167  -7.35077759  -9.77885057]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.66012445  -4.28223592  -9.48312882  -9.76812285]
 [ -9.59664264   9.60799515  -4.48137699  -9.61956668]
 [ -9.71057124  -5.6863911   -2.04563412  -9.75341962]
 [ -5.          -5.          -5.          -5.        ]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.54737964  22.84803205  18.17841481  -9.45516929]
 [ -9.69494035  34.16859049  72.04055782  40.62254838]
 [100.         100.         100.         100.        ]]
Output after learning
SFFF
FHFH
FFFH
HFFG
===============  (Down)   SFFF FHFH FFFH HFFG
===============  (Down)   SFFF FHFH FFFH HFFG
===============  (Right)  SFFF FHFH FFFH HFFG
===============  (Right)  SFFF FHFH FFFH HFFG
===============  (Right)  SFFF FHFH FFFH HFFG
===============  (Right)  SFFF FHFH FFFH HFFG
===============  (Right)  SFFF FHFH FFFH HFFG
===============  (Right)  SFFF FHFH FFFH HFFG
===============  (Right)  SFFF FHFH FFFH HFFG
===============  (Right)  SFFF FHFH FFFH HFFG
===============  (Right)  SFFF FHFH FFFH HFFG
===============  (Right)  SFFF FHFH FFFH HFFG
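Because FrozenLake-v0 is slippery, a single greedy rollout like the one above says little about how reliable the learned policy actually is. A minimal evaluation sketch, not part of the book's listing, that estimates the success rate of the greedy policy; it assumes the env and Q objects from the code above are still in scope:
#rough evaluation of the greedy policy (assumes env and Q already exist)
successes = 0
n_eval = 1000
for _ in range(n_eval):
    s = env.reset()
    while True:
        s, r, done, _ = env.step(np.argmax(Q[s]))
        if done:
            successes += (r == 1) #FrozenLake returns reward 1 only when the goal is reached
            break
print("Greedy policy success rate :", float(successes) / n_eval)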