PyTorch 1.x Reinforcement Learning Cookbook
Yuxi (Hayden) Liu
Creating an MDP
Building on the Markov chain, an MDP adds an agent and a decision-making process. Let's go ahead and develop an MDP and calculate the value function under the optimal policy.
Besides a set of possible states, S = {s0, s1, ... , sm}, an MDP is defined by a set of actions, A = {a0, a1, ... , an}; a transition model, T(s, a, s'); a reward function, R(s); and a discount factor, γ. The transition matrix, T(s, a, s'), contains the probabilities of taking action a from state s and then landing in s'. The discount factor, γ, controls the tradeoff between future rewards and immediate ones.
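For reference, the state-value function we will be computing measures the expected discounted sum of rewards collected from a state onward, and it satisfies a Bellman recursion (here π denotes a policy mapping states to actions, notation introduced only for this equation):

$$ V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\middle|\, s_0 = s \right] = R(s) + \gamma \sum_{s'} T\big(s, \pi(s), s'\big)\, V^{\pi}(s') $$

It is this recursion that lets us solve for the values of a fixed policy in closed form, as we will see shortly.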
To make our MDP slightly more complicated, we extend the study and sleep process with one more state, s2 play games. Let's say we have two actions, a0 work and a1 slack. The 3 * 2 * 3 transition matrix T(s, a, s') is as follows:
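The matrix itself appears as a figure in the book. As a minimal sketch of how it could be encoded in PyTorch: only the row for taking a1 slack from s0 study, [0.1, 0.6, 0.3], is pinned down by the description in the next paragraph; the remaining rows are illustrative assumptions.

```python
import torch

# T[s, a, s']: probability of landing in s' after taking action a in state s.
# Only the (s0 study, a1 slack) row is fixed by the text; the other rows
# below are illustrative assumptions.
T = torch.tensor([[[0.8, 0.1, 0.1],    # s0 study, a0 work (assumed)
                   [0.1, 0.6, 0.3]],   # s0 study, a1 slack (from the text)
                  [[0.8, 0.1, 0.1],    # s1 sleep, a0 work (assumed)
                   [0.1, 0.6, 0.3]],   # s1 sleep, a1 slack (assumed)
                  [[0.8, 0.1, 0.1],    # s2 play games, a0 work (assumed)
                   [0.1, 0.6, 0.3]]])  # s2 play games, a1 slack (assumed)

# Each (state, action) row is a probability distribution over next states.
assert torch.allclose(T.sum(dim=-1), torch.ones(3, 2))
```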

This means, for example, that when taking the a1 slack action from state s0 study, there is a 60% chance of transitioning to s1 sleep (maybe getting tired), a 30% chance of transitioning to s2 play games (maybe wanting to relax), and a 10% chance of keeping on studying (maybe a true workaholic). We define the reward function as [+1, 0, -1] for the three states, to compensate for the hard work. Obviously, the optimal policy in this case is choosing a0 work at each step (keep on studying – no pain, no gain, right?). Also, we choose 0.5 as the discount factor to begin with. In the next section, we will compute the state-value function (also called the value function, just the value for short, or expected utility) under the optimal policy.
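Continuing the sketch above, we can add the rewards and the discount factor and preview that computation: for the all-work policy, the Bellman recursion V = R + γT_πV can be solved in closed form by matrix inversion (the variable names here are our own).

```python
import torch  # continuing from the sketch above, which defines T

R = torch.tensor([1., 0., -1.])  # rewards for s0 study, s1 sleep, s2 play games
gamma = 0.5                      # discount factor

# The optimal policy always picks a0 work, so the induced 3 x 3 transition
# matrix is the a0 slice of T.
T_work = T[:, 0, :]

# Solve V = R + gamma * T_work @ V  =>  V = (I - gamma * T_work)^(-1) @ R
V = torch.inverse(torch.eye(3) - gamma * T_work).matmul(R)
print(V)  # state values under the all-work policy
```

Inverting I - γT_π is fine here because the state space is tiny; for larger MDPs, iterative methods such as value iteration are preferred.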