
Theoretical background of the cross-entropy method

This section is optional and is included for readers who are interested in why the method works. If you wish, you can refer to the original paper on the cross-entropy method, which is cited at the end of the section.

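The cross-entropy method is based on the importance sampling theorem, which states the following:

$$\mathbb{E}_{x \sim p(x)}[H(x)] = \int_x p(x) H(x)\,dx = \int_x q(x) \frac{p(x)}{q(x)} H(x)\,dx = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)} H(x)\right]$$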
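
As a rough numerical illustration of importance sampling, the sketch below estimates an expectation under p(x) using only samples drawn from a different distribution q(x), reweighting each sample by p(x)/q(x). The choices p = N(0, 1), q = N(0, 2^2), and H(x) = x^2 are assumptions made purely for this example, as are the function names.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_estimate(h, n_samples=50_000, seed=0):
    """Estimate E_{x~p}[h(x)] for p = N(0, 1) from samples of q = N(0, 2^2).

    Both distributions are illustrative assumptions, not part of the method.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 2.0)  # sample from q, not from p
        weight = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)  # p(x) / q(x)
        total += weight * h(x)
    return total / n_samples


if __name__ == "__main__":
    # The exact value of E_{x~N(0,1)}[x^2] is 1, so the estimate should be close to 1.
    print(importance_sampling_estimate(lambda x: x * x))
```

The wider q is chosen here so that the ratio p(x)/q(x) stays bounded, which keeps the variance of the estimate small.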

In our RL case, H(x) is the reward value obtained by some policy x, and p(x) is a distribution over all possible policies. We don't want to maximize the reward by searching all possible policies; instead, we want to approximate p(x)H(x) by q(x), iteratively minimizing the distance between them. The distance between two probability distributions is measured by the Kullback-Leibler (KL) divergence, which is defined as follows:

$$KL(p_1(x) \,\|\, p_2(x)) = \mathbb{E}_{x \sim p_1(x)}\left[\log \frac{p_1(x)}{p_2(x)}\right] = \mathbb{E}_{x \sim p_1(x)}[\log p_1(x)] - \mathbb{E}_{x \sim p_1(x)}[\log p_2(x)]$$

The first term in KL is the entropy of p_1(x) (up to sign) and doesn't depend on p_2(x), so it can be omitted during the minimization. The second term, including its minus sign, is called cross-entropy and is a very common optimization objective in DL.
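
The split of KL into an entropy part and a cross-entropy part can be checked numerically. The following is a minimal sketch for discrete distributions given as lists of probabilities; the function names are hypothetical, and it assumes the second distribution is nonzero wherever the first one is.

```python
import math


def entropy(p):
    """Entropy of a discrete distribution: -sum p_i * log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy between p and q: -sum p_i * log q_i (assumes q_i > 0 where p_i > 0)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) written as cross-entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)


if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.9, 0.1]
    print(kl_divergence(p, q))  # positive, since q differs from p
    print(kl_divergence(p, p))  # zero up to rounding, since the distributions match
```

Because `entropy(p)` does not involve `q`, minimizing `kl_divergence(p, q)` over `q` is the same as minimizing `cross_entropy(p, q)`.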

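Combining both formulas, we get an iterative algorithm that starts with $q_0(x) = p(x)$ and improves on every step. Each step produces a better approximation of p(x)H(x) with the following update:

$$q_{i+1}(x) = \underset{q_{i+1}(x)}{\arg\min} \; -\mathbb{E}_{x \sim q_i(x)}\left[\frac{p(x)}{q_i(x)} H(x) \log q_{i+1}(x)\right]$$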
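
To make the iterative update more concrete, here is a small sketch under strong simplifying assumptions: q is restricted to one-dimensional Gaussians, p = N(0, 1), and H(x) = exp(-(x - 2)^2) is a hypothetical score function chosen only because the answer is then known in closed form (p(x)H(x) is proportional to a Gaussian with mean 4/3 and variance 1/3). For a Gaussian family, minimizing the weighted negative log-likelihood reduces to taking the weighted mean and variance of the samples.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of the normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def h(x):
    """Hypothetical non-negative score function, used only for illustration."""
    return math.exp(-((x - 2.0) ** 2))


def cross_entropy_fit(n_iters=5, n_samples=10_000, seed=0):
    """Iteratively fit a Gaussian q(x) to p(x) * h(x), with p = N(0, 1)."""
    rng = random.Random(seed)
    mean, var = 0.0, 1.0  # q_0(x) = p(x)
    for _ in range(n_iters):
        # Sample from the current approximation q_i.
        xs = [rng.gauss(mean, math.sqrt(var)) for _ in range(n_samples)]
        # Weight of each sample: p(x) / q_i(x) * H(x).
        ws = [normal_pdf(x, 0.0, 1.0) / normal_pdf(x, mean, var) * h(x) for x in xs]
        total = sum(ws)
        # Weighted maximum likelihood for a Gaussian: weighted mean and variance.
        new_mean = sum(w * x for w, x in zip(ws, xs)) / total
        new_var = sum(w * (x - new_mean) ** 2 for w, x in zip(ws, xs)) / total
        mean, var = new_mean, new_var
    return mean, var


if __name__ == "__main__":
    # Should land near mean 4/3 and variance 1/3 for this particular p and h.
    print(cross_entropy_fit())
```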

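This is the generic cross-entropy method, which can be significantly simplified in our RL case. We replace H(x) with an indicator function, which is 1 when the reward for the episode is above the threshold and 0 when it is below. Our policy update then looks like this:

$$\pi_{i+1}(a|s) = \underset{\pi_{i+1}}{\arg\min} \; -\mathbb{E}_{z \sim \pi_i(a|s)}\big[R(z) \ge \psi_i\big] \log \pi_{i+1}(a|s)$$

Here z is an episode sampled with the current policy, R(z) is its total reward, $\psi_i$ is the reward threshold on step i, and the square brackets denote the indicator function.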

Strictly speaking, the preceding formula omits the normalization term, but it still works in practice without it. So, the method is quite simple: we sample episodes using our current policy (starting with some random initial policy) and minimize the negative log-likelihood of the most successful samples under our policy.
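
The procedure just described can be sketched on a toy problem. Everything in the example below is an assumption made for illustration: the "environment" is an episode of a few binary actions whose reward is the number of 1s, the policy is one independent probability per step, and the threshold is a percentile of the batch rewards. For such a tabular policy, the minimizer of the negative log-likelihood over the elite episodes is simply the empirical action frequency, so no gradient steps are needed; a neural network policy would instead be trained with a cross-entropy loss.

```python
import random


def sample_episode(probs, rng):
    """Sample one binary action per step; the reward is the number of 1s chosen."""
    actions = [1 if rng.random() < p else 0 for p in probs]
    return actions, sum(actions)


def cross_entropy_rl(n_steps=5, n_iters=20, batch_size=100, percentile=0.7, seed=0):
    """Toy cross-entropy method: keep elite episodes and refit the policy to them."""
    rng = random.Random(seed)
    probs = [0.5] * n_steps  # random initial policy
    for _ in range(n_iters):
        # 1. Sample a batch of episodes with the current policy.
        batch = [sample_episode(probs, rng) for _ in range(batch_size)]
        # 2. Pick the reward threshold as a percentile of the batch rewards.
        rewards = sorted(reward for _, reward in batch)
        threshold = rewards[int(percentile * batch_size)]
        # 3. Keep only the episodes whose reward reaches the threshold (indicator = 1).
        elite = [actions for actions, reward in batch if reward >= threshold]
        # 4. Minimize the negative log-likelihood of the elite actions. For this
        #    tabular Bernoulli policy the minimizer is the empirical frequency.
        for t in range(n_steps):
            freq = sum(actions[t] for actions in elite) / len(elite)
            probs[t] = min(max(freq, 0.02), 0.98)  # clip so the policy keeps exploring
    return probs


if __name__ == "__main__":
    # The probabilities should drift toward always choosing action 1.
    print(cross_entropy_rl())
```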

There is a whole book dedicated to this method, written by Dirk P. Kroese. A shorter description can be found in the Cross-Entropy Method paper by Dirk P. Kroese (https://people.smp.uq.edu.au/DirkKroese/ps/eormsCE.pdf).
