
Solving MDPs with RL

RL algorithms are designed to solve exactly the type of optimization problem an MDP frames: finding an optimal decision-making policy that maximizes the rewards earned by making decisions within the environment.

The rewards offered for taking each action are shown in the preceding MDP diagram as yellow arrows. When we take action a0 and end up in state S0, we get a reward of +5; and when we take action a1 and end up in state S0, we get a reward of -1.
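To make this reward structure concrete, here is a minimal sketch of how the two transitions described above could be stored in code; the action and state names come from the diagram, and keying rewards on an (action, resulting state) pair is just one possible representation:

```python
# Rewards from the diagram, keyed on (action taken, state we end up in).
# Only the two transitions mentioned above are listed; everything else
# about the MDP is assumed.
rewards = {
    ("a0", "S0"): +5,
    ("a1", "S0"): -1,
}

def reward_for(action, next_state):
    """Look up the reward for a transition, defaulting to 0 if unlisted."""
    return rewards.get((action, next_state), 0)
```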

The Taxi-v2 environment has 500 states, as we'll see shortly, so it is not practical to represent them all in a diagram such as the previous one. Instead, we will be enumerating them in our Q-table in the next section. We'll use a state vector to represent the variables in each state that we'll be keeping track of.
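As a preview of that enumeration, a minimal sketch of setting up the environment and an empty Q-table might look like the following (this assumes the classic Gym API; newer Gym releases register the environment as Taxi-v3 instead):

```python
import gym
import numpy as np

# The Taxi-v2 environment comes with its state and action spaces predefined.
env = gym.make("Taxi-v2")

n_states = env.observation_space.n    # 500 discrete states
n_actions = env.action_space.n        # 6 discrete actions

# The Q-table enumerates every (state, action) pair, initialized to zero.
q_table = np.zeros((n_states, n_actions))
print(q_table.shape)                  # (500, 6)
```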

In general, we can keep track of any variables in a Q-learning problem that we think are relevant to our model and incorporate them into the state vector. The state vector can be treated as a set of state variables, or it can be flattened into a set of linearly numbered states, as long as the information identifying each individual state is preserved, no matter how the vector is stored:

The preceding diagram models how agents and environments interact with each other in a general way: an agent is in a state, takes an action on its environment, and then receives a reward and moves to a new state. In control process terms, the environment acts on the agent through state and reward information, and the agent acts on the environment through its actions.
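In code, that loop is just a handful of lines. The following sketch uses the classic Gym reset/step API and a purely random placeholder policy, reusing the env created earlier (newer Gym versions return a five-element tuple from step instead):

```python
# One episode of the agent-environment loop: observe a state, take an action,
# receive a reward and the next state, repeat until the episode ends.
state = env.reset()
done = False

while not done:
    action = env.action_space.sample()                   # agent acts on the environment
    next_state, reward, done, info = env.step(action)    # environment returns reward and new state
    state = next_state
```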

In the case of Taxi-v2 and other OpenAI Gym environments that we'll be using, the state space is predetermined for us, so we do not have to decide which state variables to keep track of or how to enumerate our states. In an environment we design ourselves, we would have to choose how to model these attributes as efficiently as possible.
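For Taxi-v2 specifically, each of the 500 linear state indices corresponds to a small state vector of (taxi row, taxi column, passenger location, destination). The environment happens to expose a decode helper that recovers those variables from a state index; the following sketch uses it purely to illustrate the point, and assumes the helper is reachable on the unwrapped environment as shown:

```python
# Inspect the state variables hidden behind a single linear state index.
state = env.reset()
taxi_row, taxi_col, passenger_loc, destination = env.unwrapped.decode(state)
print(state, (taxi_row, taxi_col, passenger_loc, destination))
```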

We will also see that, in the problems we are working with, we don't need knowledge of any previous states to determine which action to take in the current state; this is the Markov property that gives MDPs their name. Every state can be represented by a state vector, and every action is drawn from an action space that the agent can choose from using only its knowledge of the current state.
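A short sketch of what choosing from that action space looks like, reusing the q_table and env from earlier: the epsilon-greedy rule shown here is one common choice, and it consults nothing but the current state's row of the Q-table:

```python
import random

def choose_action(q_table, state, epsilon=0.1):
    """Pick an action using only the current state, with no memory of earlier states."""
    if random.random() < epsilon:
        return env.action_space.sample()       # explore: try a random action
    return int(np.argmax(q_table[state]))      # exploit: best known action for this state
```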
