
Solving the frozen lake problem

If you haven't fully understood everything we have learned so far, don't worry; we will now walk through all of these concepts with a frozen lake problem.

Imagine there is a frozen lake stretching from your home to your office; you have to walk across the frozen lake to reach your office. But oops! There are holes in the frozen lake, so you have to be careful while walking to avoid falling into them:

Look at the preceding diagram:

  • S is the starting position (home)
  • F is the frozen lake where you can walk
  • H are the holes, which you have to be careful about
  • G is the goal (office)

Okay, now let us use our agent instead of you to find the correct way to reach the office. The agent's goal is to find the optimal path to go from S to G without getting trapped at H. How can the agent achieve this? We give the agent a reward of +1 point if it correctly walks on the frozen lake and 0 points if it falls into a hole, so that the agent can determine which actions are right. The agent will then try to find the optimal policy. An optimal policy means taking the correct path, the one that maximizes the agent's reward. If the agent is maximizing its reward, it is effectively learning to avoid the holes and reach the destination.
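
To make this concrete, here is a minimal sketch of the interaction loop, assuming OpenAI Gym's FrozenLake-v0 environment (which implements exactly this 4 x 4 grid) and the classic four-value step API; newer Gym/Gymnasium releases rename the environment and change the return values of step:

    import gym

    # Create the frozen lake environment (FrozenLake-v0 from OpenAI Gym is assumed here).
    env = gym.make('FrozenLake-v0')

    state = env.reset()    # the agent starts at S
    env.render()           # prints the 4 x 4 grid of S, F, H, and G

    # Take one random action (left, down, right, or up) and observe the result.
    action = env.action_space.sample()
    next_state, reward, done, info = env.step(action)
    print(next_state, reward, done)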

We can model our problem as an MDP, which we studied earlier. The MDP consists of the following (a short code sketch after this list shows what these quantities look like in the frozen lake environment):

  • States: Set of states. Here we have 16 states (each little square box in the grid).
  • Actions: Set of all possible actions (left, right, up, and down; these are the four actions our agent can take in the frozen lake environment).
  • Transition probabilities: The probability of moving from one state (say, F) to another state (say, H) by performing an action a.
  • Reward probabilities: The probability of receiving a reward while moving from one state (say, F) to another state (say, H) by performing an action a.
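
To see what the transition and reward probabilities look like in practice, here is a small sketch, again assuming Gym's FrozenLake-v0, whose unwrapped environment exposes the full MDP table through its P attribute:

    import gym

    env = gym.make('FrozenLake-v0')

    # env.unwrapped.P[state][action] is a list of
    # (transition probability, next state, reward, done) tuples,
    # i.e. the transition and reward probabilities of the MDP.
    # Here we inspect state 0 (the starting square S) and action 1 (down).
    for prob, next_state, reward, done in env.unwrapped.P[0][1]:
        print(prob, next_state, reward, done)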

Now our objective is to solve the MDP. Solving the MDP means finding the optimal policy. We now introduce three special functions:

  • Policy function: Specifies what action to perform in each state
  • Value function: Specifies how good a state is
  • Q function: Specifies how good an action is in a particular state

When we say how good, what does that really mean? It means how good the state or the action is in terms of maximizing the reward.
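
For a small environment like this one, both functions can be stored as plain tables, with one entry per state for the value function and one per state-action pair for the Q function. Here is a minimal sketch (the 16 states and 4 actions match the grid above):

    import numpy as np

    n_states, n_actions = 16, 4

    # Value function: one number per state (how good that state is).
    value_table = np.zeros(n_states)

    # Q function: one number per (state, action) pair
    # (how good that action is in that state).
    q_table = np.zeros((n_states, n_actions))

    # A policy can then be read off the Q table by picking,
    # in each state, the action with the highest Q value.
    policy = np.argmax(q_table, axis=1)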

Then, we represent the value function and the Q function using a special equation called the Bellman optimality equation. If we solve this equation, we can find the optimal policy. Here, solving the equation means finding the right value function and policy. Once we have the right value function and policy, we have our optimal path, the one that yields the maximum reward.
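
For reference, the Bellman optimality equations for the value function and the Q function take the following standard form, where \gamma is the discount factor, P the transition probabilities, and R the rewards:

    V^*(s)   = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]
    Q^*(s,a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]

The optimal policy then simply picks, in each state, the action that maximizes Q^*(s, a).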

We will use a special technique called dynamic programming (DP) to solve the Bellman optimality equation. To apply DP, the model dynamics have to be known in advance, which basically means the environment's transition probabilities and reward probabilities have to be known in advance. Since we know the model dynamics of the frozen lake environment, we can use DP here. We use two special DP algorithms to find the optimal policy (a value iteration sketch follows this list):

  • Value iteration
  • Policy iteration
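
The following is a minimal value iteration sketch, assuming Gym's FrozenLake-v0 environment and its unwrapped P attribute for the model dynamics; it only illustrates how the Bellman backup is applied repeatedly, not a final implementation:

    import gym
    import numpy as np

    # env.unwrapped.P[state][action] gives the model dynamics as
    # (transition probability, next state, reward, done) tuples.
    env = gym.make('FrozenLake-v0')
    P = env.unwrapped.P
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    gamma = 0.99                      # discount factor

    # Repeatedly back up the value of each state until the values converge.
    value_table = np.zeros(n_states)
    for _ in range(1000):
        updated = np.copy(value_table)
        for s in range(n_states):
            q_values = [sum(prob * (reward + gamma * updated[s_next])
                            for prob, s_next, reward, _ in P[s][a])
                        for a in range(n_actions)]
            value_table[s] = max(q_values)
        if np.max(np.abs(updated - value_table)) < 1e-8:
            break

    # Extract the optimal policy: in each state, pick the action whose
    # one-step lookahead value is highest.
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q_values = [sum(prob * (reward + gamma * value_table[s_next])
                        for prob, s_next, reward, _ in P[s][a])
                    for a in range(n_actions)]
        policy[s] = int(np.argmax(q_values))

    print(policy.reshape(4, 4))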