
Monte Carlo methods

The Monte Carlo (MC) methods for estimating the value function and discovering good policies do not require a model of the environment. They learn from the agent's experience alone, that is, from samples of state sequences, actions, and rewards obtained through the interaction between the agent and the environment. This experience can be gathered by the agent during the learning process itself or taken from a previously populated dataset. The possibility of gaining experience during learning (online learning) is interesting because it allows good behavior to be obtained even without a priori knowledge of the environment's dynamics. Learning from an already populated experience dataset is also interesting because, when combined with online learning, it makes it possible to improve the policy automatically from experience gathered by others.

In general, MC methods rely on repeated random sampling to obtain numerical results; they use randomness to solve problems that are deterministic in principle. In our case, we will randomly sample states and state-action pairs, observe the rewards obtained, and revise the policy iteratively. Repeating this process converges to the optimal policy as every state-action pair is explored.
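
As a quick illustration of this general principle, before returning to reinforcement learning, the following minimal Python sketch (not taken from this chapter) uses repeated random sampling to estimate π, a perfectly deterministic quantity:

    import random

    def estimate_pi(num_samples=100_000):
        # Draw random points in the unit square and count how many fall
        # inside the quarter circle of radius 1 centred at the origin.
        inside = 0
        for _ in range(num_samples):
            x, y = random.random(), random.random()
            if x * x + y * y <= 1.0:
                inside += 1
        # The fraction of hits approximates the area of the quarter circle, pi/4.
        return 4 * inside / num_samples

    print(estimate_pi())   # prints a value close to 3.14159

The more samples we draw, the closer the estimate gets to the true value; the same principle of averaging over many random samples underlies MC reinforcement learning.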

For example, in the reinforcement learning setting we could adopt the following procedure:

  • We will assign a reward of +1 to a right action, -1 to a wrong action, and 0 to a draw.
  • We will establish a table in which each key corresponds to a particular state-action pair and each value is the value of that pair, that is, the average reward received for that action in that state (a minimal sketch of such a table follows this list).
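
A minimal Python sketch of such a table, assuming a running average of the observed rewards per state-action pair (the states and actions used here are purely hypothetical placeholders), might look like this:

    from collections import defaultdict

    # Map each (state, action) pair to the running average of the rewards
    # observed when taking that action in that state.
    value_table = defaultdict(float)   # average reward per (state, action)
    visit_counts = defaultdict(int)    # number of times each pair was seen

    def update(state, action, reward):
        # Incrementally update the average reward for this state-action pair.
        key = (state, action)
        visit_counts[key] += 1
        value_table[key] += (reward - value_table[key]) / visit_counts[key]

    # Hypothetical usage: a right action earns +1, a wrong one -1, a draw 0.
    update("s0", "a1", +1)
    update("s0", "a1", -1)
    print(value_table[("s0", "a1")])   # 0.0, the average of +1 and -1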

To solve reinforcement learning problems, MC methods estimate the value function as the average of the total rewards obtained in past episodes. This assumes that the experience is divided into episodes and that all episodes consist of a finite number of transitions, because in MC methods the estimation of new values and the modification of the policy take place at the end of each episode. MC methods iteratively estimate the policy and the value function; in this case, however, each iteration cycle corresponds to completing an episode, so the new estimates of the policy and the value function are produced episode by episode.

The following is pseudocode for MC policy evaluation:

Initialize:
    an arbitrary policy π
    an arbitrary state-value function V
Repeat:
    generate an episode using π
    for each state s in the episode:
        add the return R received after s to the set of returns obtained so far for s
        estimate the value function V(s) as the average of the returns obtained for s
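
Under the assumption of an episodic task for which we can call a generate_episode(policy) function returning a list of (state, reward) pairs (a hypothetical interface, not part of the chapter), a first-visit version of this pseudocode can be sketched in Python as follows:

    from collections import defaultdict

    def mc_policy_evaluation(policy, generate_episode, num_episodes=1000, gamma=1.0):
        """Estimate V(s) as the average of the returns observed after the
        first visit to s in each episode generated with `policy`."""
        returns = defaultdict(list)   # returns collected for each state
        V = defaultdict(float)        # current state-value estimates

        for _ in range(num_episodes):
            # Each (state, reward) pair holds the reward received after
            # leaving that state.
            episode = generate_episode(policy)

            # Walk the episode backwards to compute the return following
            # every time step.
            G = 0.0
            returns_after = [0.0] * len(episode)
            for t in reversed(range(len(episode))):
                _, reward = episode[t]
                G = gamma * G + reward
                returns_after[t] = G

            # First-visit update: only the return that follows the first
            # occurrence of each state in the episode is averaged.
            seen = set()
            for t, (state, _) in enumerate(episode):
                if state not in seen:
                    seen.add(state)
                    returns[state].append(returns_after[t])
                    V[state] = sum(returns[state]) / len(returns[state])

        return V

The policy itself is left untouched here; combining this evaluation step with an episode-by-episode improvement step gives the full iterative scheme described earlier.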

Usually, the term MC is used for estimation methods whose operations involve random components. In this case, the term MC refers to reinforcement learning methods based on averaging total rewards. Unlike the DP methods, which calculate a value for each state, the MC methods calculate a value for each state-action pair because, in the absence of a model, state values alone are not sufficient to decide which action is best to perform in a given state.
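
For instance, once the state-action values are stored in a table such as a Python dictionary Q (the dictionary and the action list below are assumed for illustration), the best action in a state can be read off directly, without any model of the transition dynamics:

    def greedy_action(Q, state, actions):
        # Pick the action with the highest estimated value in this state.
        # No model of the environment is needed: we only compare the
        # stored state-action values.
        return max(actions, key=lambda a: Q[(state, a)])

    # Hypothetical usage with a dict mapping (state, action) to a value:
    Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
    print(greedy_action(Q, "s0", ["left", "right"]))   # prints: right

With only state values V(s), choosing the best action would require knowing which state each action leads to, which is exactly the model that MC methods do not assume.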
