- Python Reinforcement Learning
- Sudharsan Ravichandiran, Sean Saito, Rajalingappaa Shanmugamani, Yang Wenzhuo
Solving the frozen lake problem
If you haven't fully understood everything we have learned so far, don't worry; we will go over all of these concepts while solving the frozen lake problem.
Imagine there is a frozen lake stretching from your home to your office; you have to walk across the frozen lake to reach your office. But oops! There are holes in the frozen lake, so you have to be careful while walking to avoid getting trapped in them:
[Figure: a 4 x 4 grid of cells labeled S, F, H, and G]
Look at the preceding diagram:
- S is the starting position (home)
- F is the frozen lake where you can walk
- H are the holes, which you have to be careful to avoid
- G is the goal (office)
Okay, now let us use our agent instead of you to find the correct way to reach the office. The agent's goal is to find the optimal path to go from S to G without getting trapped at H. How can the agent achieve this? We give the agent a reward of +1 point if it correctly walks on the frozen lake and 0 points if it falls into a hole, so the agent can determine which actions are right. The agent will then try to find the optimal policy. The optimal policy means taking the correct path, the one that maximizes the agent's reward. If the agent is maximizing the reward, it is learning to skip the holes and reach the destination.
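As a minimal sketch of this setup, the grid can be built with the Gym library's frozen lake environment. The environment id 'FrozenLake-v1' is an assumption about the installed Gym version (older releases register it as 'FrozenLake-v0', and newer ones may require a render_mode argument):

```python
# A minimal sketch, assuming the OpenAI Gym library is installed and
# registers the frozen lake environment as 'FrozenLake-v1'
# (older releases use 'FrozenLake-v0').
import gym

env = gym.make('FrozenLake-v1')

env.reset()                   # put the agent at the starting state S
env.render()                  # print the 4 x 4 grid of S, F, H, and G cells

print(env.observation_space)  # Discrete(16) -> the 16 states
print(env.action_space)       # Discrete(4)  -> left, down, right, up
```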
We can model our problem as an MDP, which we studied earlier. The MDP consists of the following (a short code sketch of how these quantities are exposed by the Gym environment follows the list):
- States: Set of states. Here we have 16 states (each little square box in the grid).
- Actions: Set of all possible actions (left, right, up, down; these are all the four possible actions our agent can take in our frozen lake environment).
- Transition probabilities: The probability of moving from one state (F) to another state (H) by performing an action a.
- Reward probabilities: The probability of receiving a reward while moving from one state (F) to another state (H) by performing an action a.
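As a hedged sketch of what the transition and reward probabilities look like in code, the Gym frozen lake environment exposes them through a P attribute on the unwrapped environment, of the form P[state][action] = [(transition probability, next state, reward, done), ...]; the environment id is again an assumption:

```python
# Sketch of inspecting the model dynamics, assuming the Gym frozen lake
# environment exposes env.unwrapped.P[state][action] as a list of
# (transition probability, next state, reward, done) tuples.
import gym

env = gym.make('FrozenLake-v1')
state, action = 0, 1          # state S, action "down"

# The default environment is slippery, so one action can lead to
# several next states, each with its own probability and reward.
for prob, next_state, reward, done in env.unwrapped.P[state][action]:
    print(f"P(s'={next_state} | s={state}, a={action}) = {prob:.2f}, "
          f"reward = {reward}, terminal = {done}")
```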
Now our objective is to solve the MDP. Solving the MDP means finding the optimal policy. We now introduce three special functions:
- Policy function: Specifies what action to perform in each state
- Value function: Specifies how good a state is
- Q function: Specifies how good an action is in a particular state
When we say how good, what does that really mean? It means how good it is for maximizing the rewards.
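In equation form (using the standard definitions, with discount factor $\gamma$), both functions measure the expected cumulative reward when following a policy $\pi$:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s\right]$$

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s, a_0 = a\right]$$

That is, the value function scores a state by the total discounted reward we can expect from it, and the Q function scores a state-action pair in the same way.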
Then, we represent the value function and Q function using a special equation called the Bellman optimality equation. If we solve this equation, we can find the optimal policy. Here, solving the equation means finding the right value function and policy. If we find the right value function and policy, that will give us the optimal path, which yields the maximum reward.
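For reference, the Bellman optimality equation takes the standard form below, where $P(s' \mid s, a)$ and $R(s, a, s')$ are the transition and reward dynamics listed earlier:

$$V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^{*}(s')\right]$$

$$Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a')\right]$$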
We will use a special technique called dynamic programming (DP) to solve the Bellman optimality equation. To apply DP, the model dynamics have to be known in advance, which basically means the environment's transition probabilities and reward probabilities have to be known in advance. Since we know the model dynamics, we can use DP here. We use two special DP algorithms to find the optimal policy (a brief value iteration sketch follows this list):
- Value iteration
- Policy iteration
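As a brief sketch of the first of these, the following value iteration loop repeatedly applies the Bellman optimality backup over the 16 states and then reads off a greedy policy. It assumes the Gym frozen lake environment and the P attribute shown earlier; the discount factor and convergence threshold are illustrative choices. Policy iteration follows a similar structure, alternating policy evaluation and policy improvement.

```python
# Value iteration sketch, assuming a Gym frozen lake environment whose
# unwrapped object exposes P[state][action] as a list of
# (probability, next state, reward, done) tuples.
import numpy as np
import gym

def value_iteration(env, gamma=1.0, threshold=1e-10):
    """Repeat the Bellman optimality backup until the values converge."""
    value_table = np.zeros(env.observation_space.n)
    while True:
        old_values = np.copy(value_table)
        for state in range(env.observation_space.n):
            # Q(s, a) for every action, computed from the known dynamics
            q_values = [
                sum(prob * (reward + gamma * old_values[next_state])
                    for prob, next_state, reward, _ in env.P[state][action])
                for action in range(env.action_space.n)
            ]
            value_table[state] = max(q_values)
        if np.max(np.abs(old_values - value_table)) < threshold:
            return value_table

def extract_policy(env, value_table, gamma=1.0):
    """Act greedily with respect to the converged value function."""
    policy = np.zeros(env.observation_space.n, dtype=int)
    for state in range(env.observation_space.n):
        q_values = [
            sum(prob * (reward + gamma * value_table[next_state])
                for prob, next_state, reward, _ in env.P[state][action])
            for action in range(env.action_space.n)
        ]
        policy[state] = int(np.argmax(q_values))
    return policy

env = gym.make('FrozenLake-v1').unwrapped
optimal_values = value_iteration(env)
optimal_policy = extract_policy(env, optimal_values)
print(optimal_policy.reshape(4, 4))   # one action (0=left ... 3=up) per cell
```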