- Python Reinforcement Learning
- Sudharsan Ravichandiran, Sean Saito, Rajalingappaa Shanmugamani, Yang Wenzhuo
Solving the frozen lake problem
If you haven't fully understood the concepts we have covered so far, don't worry; we will revisit all of them while solving the frozen lake problem.
Imagine there is a frozen lake stretching from your home to your office; you have to walk across the frozen lake to reach your office. But oops! There are holes in the lake, so you have to be careful while walking to avoid falling into them:

Look at the preceding diagram:
- S is the starting position (home)
- F is the frozen lake where you can walk
- H denotes the holes, which you have to be careful of
- G is the goal (office)
Okay, now let us use our agent instead of you to find the correct way to reach the office. The agent's goal is to find the optimal path to go from S to G without getting trapped at H. How can an agent achieve this? We give the agent a reward of +1 if it walks correctly on the frozen lake and 0 if it falls into a hole, so the agent can determine which actions are right. The agent will then try to find the optimal policy. The optimal policy is the path that maximizes the agent's reward. If the agent is maximizing the reward, it is effectively learning to avoid the holes and reach the destination.
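Before solving anything, it helps to see the environment itself. The following is a minimal sketch, assuming the OpenAI Gym implementation of frozen lake (the environment id is 'FrozenLake-v0' on older Gym versions and 'FrozenLake-v1' on newer ones; on Gym >= 0.26 or Gymnasium, reset() and step() also return extra values, so the unpacking below would need adjusting):

```python
import gym

# The environment id may be 'FrozenLake-v0' or 'FrozenLake-v1',
# depending on the Gym version installed.
env = gym.make('FrozenLake-v0')

env.reset()    # place the agent back at the starting state S
env.render()   # prints the 4 x 4 grid of S, F, H, and G characters

# Take one random action (left, down, right, or up) and observe the result.
next_state, reward, done, info = env.step(env.action_space.sample())
print(next_state, reward, done)
```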
We can model our problem as a Markov Decision Process (MDP), which we studied earlier. The MDP consists of the following:
- States: Set of states. Here we have 16 states (each little square box in the grid).
- Actions: Set of all possible actions (left, right, up, down; these are all the four possible actions our agent can take in our frozen lake environment).
- Transition probabilities: The probability of moving from one state (F) to another state (H) by performing an action a.
- Reward probabilities: The probability of receiving a reward while moving from one state (F) to another state (H) by performing an action a. Both quantities are stored by the environment itself, as the sketch after this list shows.
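To make these two quantities concrete, here is a small sketch of how the Gym frozen lake environment exposes them (again assuming 'FrozenLake-v0' or 'FrozenLake-v1'; on some wrapped environments you may need env.unwrapped.P instead of env.P):

```python
import gym

env = gym.make('FrozenLake-v0')  # use 'FrozenLake-v1' on newer Gym versions

# env.P[state][action] is a list of (probability, next_state, reward, done)
# tuples: exactly the transition and reward probabilities of our MDP.
state, action = 0, 1  # state 0 is S; action 1 is "down" in FrozenLake
for prob, next_state, reward, done in env.P[state][action]:
    print(f"P(s'={next_state} | s={state}, a={action}) = {prob}, reward = {reward}")
```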
Now our objective is to solve the MDP. Solving the MDP means finding the optimal policy. We now introduce three special functions:
- Policy function: Specifies what action to perform in each state
- Value function: Specifies how good a state is
- Q function: Specifies how good an action is in a particular state
When we say how good, what does that really mean? It means how good the state or action is in terms of the reward the agent can eventually obtain.
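As a reference, both functions can be written as expected discounted returns; this is the standard formulation, using γ for the discount factor and r_t for the reward at time step t (symbols that have not yet been introduced in this section):

```latex
% Value of state s under a policy \pi: the expected discounted return
% when starting in s and following \pi thereafter
V^{\pi}(s)    = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s \right]

% Value of taking action a in state s and then following \pi
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\ a_{0} = a \right]
```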
Then we represent the value function and Q function using a special equation called the Bellman optimality equation. If we solve this equation, we can find the optimal policy. Here, solving the equation means finding the right value function and policy. If we find the right value function and policy, that gives us the optimal path, which yields the maximum reward.
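For reference, the Bellman optimality equation can be written as follows (a standard form, using the transition probability P, the reward R, and the discount factor γ mentioned above):

```latex
% Bellman optimality equations: the optimal Q value is the expected immediate
% reward plus the discounted optimal value of the next state, and the optimal
% value of a state is the value of its best action
Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma \, V^{*}(s') \bigr]

V^{*}(s) = \max_{a} Q^{*}(s, a)
```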
We will use a special technique called dynamic programming (DP) to solve the Bellman optimality equation. To apply DP, the model dynamics have to be known in advance, which basically means the environment's transition probabilities and reward probabilities have to be known in advance. Since we know the model dynamics of the frozen lake environment, we can use DP here. We use two special DP algorithms to find the optimal policy (a sketch of value iteration follows the list below):
- Value iteration
- Policy iteration
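Both algorithms are covered in detail later. As a preview, here is a minimal sketch of value iteration on the Gym frozen lake environment, using the env.P table shown earlier; this is an illustrative implementation under the same version assumptions as above, not the book's exact code:

```python
import gym
import numpy as np

env = gym.make('FrozenLake-v0')  # or 'FrozenLake-v1' on newer Gym versions


def value_iteration(env, gamma=1.0, num_iterations=1000, threshold=1e-20):
    """Repeatedly apply the Bellman optimality backup to every state."""
    value_table = np.zeros(env.observation_space.n)
    for _ in range(num_iterations):
        updated_value_table = np.copy(value_table)
        for state in range(env.observation_space.n):
            # Q value of each action: expected reward plus discounted value
            # of the next state, using the transition probabilities in env.P
            q_values = [
                sum(prob * (reward + gamma * updated_value_table[next_state])
                    for prob, next_state, reward, done in env.P[state][action])
                for action in range(env.action_space.n)
            ]
            # The value of a state is the value of its best action
            value_table[state] = max(q_values)
        # Stop once the value table has converged
        if np.sum(np.abs(updated_value_table - value_table)) <= threshold:
            break
    return value_table


def extract_policy(env, value_table, gamma=1.0):
    """Pick the action with the highest Q value in every state."""
    policy = np.zeros(env.observation_space.n, dtype=int)
    for state in range(env.observation_space.n):
        q_values = [
            sum(prob * (reward + gamma * value_table[next_state])
                for prob, next_state, reward, done in env.P[state][action])
            for action in range(env.action_space.n)
        ]
        policy[state] = np.argmax(q_values)
    return policy


optimal_value_table = value_iteration(env)
optimal_policy = extract_policy(env, optimal_value_table)
print(optimal_policy)
```

The printed array gives, for each of the 16 states, the action (0 = left, 1 = down, 2 = right, 3 = up) that the computed policy recommends.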