Monte Carlo methods

Monte Carlo (MC) methods for estimating the value function and discovering good policies do not require a model of the environment. They learn from the agent's experience alone, that is, from sampled sequences of states, actions, and rewards obtained through interaction between the agent and the environment. This experience can be acquired by the agent during the learning process or drawn from a previously collected dataset. The possibility of gaining experience during learning (online learning) is attractive because it allows good behavior to emerge even without a priori knowledge of the dynamics of the environment. Learning from an already collected experience dataset is also useful because, when combined with online learning, it makes it possible to improve a policy automatically from experience gathered by others.

In general, MC methods rely on repeated random sampling to obtain numerical results; in other words, they use randomness to solve problems that are in principle deterministic. In our case, we will randomly sample states and state-action pairs, observe the rewards they yield, and then revise the policy iteratively. Repeating this process converges on the optimal policy as every possible state-action pair is explored.
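As a simple, self-contained illustration of this idea outside reinforcement learning, the following sketch estimates π by random sampling; the function name and sample count are arbitrary choices, not part of the original text:

import random

def estimate_pi(num_samples=100_000, seed=0):
    """Estimate pi by sampling points uniformly in the unit square and
    counting how many fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = sum(
        1
        for _ in range(num_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    # inside / num_samples approximates the quarter-circle area, pi / 4.
    return 4.0 * inside / num_samples

print(estimate_pi())   # roughly 3.14 for a large number of samples

The deterministic quantity π is recovered, to within sampling error, from purely random draws; MC reinforcement learning applies the same principle to value estimates.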

For example, we could adopt the following procedure:

  • We will assign a reward of +1 to a correct action, -1 to a wrong action, and 0 to a draw.
  • We will establish a table in which each key corresponds to a particular state-action pair and each value is the value of that pair, that is, the average reward received for taking that action in that state (a minimal sketch of such a table follows this list).
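
A minimal sketch of such a table, assuming Python with hashable state representations; the names returns_sum, returns_count, q_value, and update_pair are illustrative and not part of the original text:

from collections import defaultdict

# Running statistics per (state, action) pair; keys are assumed to be hashable.
returns_sum = defaultdict(float)   # total reward accumulated for the pair
returns_count = defaultdict(int)   # number of times the pair has been observed
q_value = defaultdict(float)       # current average reward for the pair

def update_pair(state, action, reward):
    """Keep q_value[(state, action)] equal to the average of all rewards
    received so far for that action in that state."""
    key = (state, action)
    returns_sum[key] += reward
    returns_count[key] += 1
    q_value[key] = returns_sum[key] / returns_count[key]

# Example: a correct action (+1), a wrong one (-1), and a draw (0) for the same pair.
update_pair("s0", "a1", +1)
update_pair("s0", "a1", -1)
update_pair("s0", "a1", 0)
print(q_value[("s0", "a1")])   # 0.0, the average of the three rewards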

To solve reinforcement learning problems, MC methods estimate the value function as the average of the total rewards observed over past episodes. This assumes that experience is divided into episodes and that every episode consists of a finite number of transitions, because in MC methods the estimation of new values and the modification of the policy take place at the end of each episode. MC methods iteratively estimate both the policy and the value function; in this case, however, each iteration cycle corresponds to completing an episode, so the new estimates of the policy and the value function are produced episode by episode.
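
Written out for the first-visit variant (the symbols below are introduced here for notation and do not appear in the original text), the estimate for a state s is simply the sample mean of the returns observed after the first visit to s in each episode:

V(s) \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i

where N(s) is the number of episodes in which s was visited and G_i is the return, that is, the total reward accumulated from the first visit to s until the end of episode i.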

The following is pseudocode for MC policy evaluation:

Initialize:
    an arbitrary policy π
    an arbitrary state-value function
    an empty list of returns for each state
Repeat (for each episode):
    generate an episode using π
    for each state s in the episode:
        append the return R received after s to the list of returns for s
        re-estimate the value of s as the average of the returns collected so far
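
The same procedure can be sketched in runnable Python. This is only a minimal first-visit variant under stated assumptions: generate_episode(policy) is a hypothetical helper (not defined in the original text) that plays one full episode and returns a list of (state, reward) pairs, and the discount factor gamma defaults to 1.

from collections import defaultdict

def mc_policy_evaluation(generate_episode, policy, num_episodes=1000, gamma=1.0):
    """First-visit Monte Carlo evaluation of a fixed policy.

    `generate_episode(policy)` is assumed to return one complete episode
    as a list of (state, reward) pairs, where `reward` is the reward
    received after leaving `state`.
    """
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    value = defaultdict(float)         # current value estimate per state

    for _ in range(num_episodes):
        episode = generate_episode(policy)

        # Return following each time step, computed backwards through the episode.
        returns_at = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns_at[t] = G

        # Index of the first visit to each state in this episode.
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)

        # Add the first-visit return of each state to its running average.
        for state, t in first_visit.items():
            returns_sum[state] += returns_at[t]
            returns_count[state] += 1
            value[state] = returns_sum[state] / returns_count[state]

    return value

Because the tables are updated only after generate_episode returns, the function naturally respects the episode-by-episode structure described above.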

Usually, the term MC is used for estimation methods whose operations involve random components. In this case, the term refers to reinforcement learning methods based on averages of total rewards. Unlike DP methods, which calculate a value for each state, MC methods calculate a value for each state-action pair because, in the absence of a model, state values alone are not sufficient to decide which action should be taken in a given state.
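
To make the last point concrete: when action values are available, the greedy action in a state can be chosen by a simple table lookup, whereas state values alone would also require the transition model to see where each action leads. A minimal sketch, with an illustrative q_value table like the one above (the names and numbers are assumptions, not from the original text):

def greedy_action(q_value, state, actions):
    """Pick the highest-valued action in `state` using only the
    state-action value table; no model of the environment is needed."""
    return max(actions, key=lambda a: q_value.get((state, a), 0.0))

# Example: with these estimates, "a2" is selected in state "s0".
q_value = {("s0", "a1"): 0.0, ("s0", "a2"): 0.5}
print(greedy_action(q_value, "s0", ["a1", "a2"]))   # prints a2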
