
Theoretical background of the cross-entropy method

This section is optional and is included for readers who are interested in why the method works. The original paper on the cross-entropy method is referenced at the end of the section.

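The basis of the cross-entropy method lies in the importance sampling theorem, which states:

$$
\mathbb{E}_{x \sim p(x)}[H(x)] = \int_x p(x)\,H(x)\,dx = \int_x q(x)\,\frac{p(x)}{q(x)}\,H(x)\,dx = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)}\,H(x)\right]
$$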
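
As a numerical sanity check of importance sampling, the sketch below estimates an expectation under one distribution using samples drawn from another, weighting each sample by the ratio p(x)/q(x). The specific choices here are assumptions made for illustration only: p is a standard normal, q is a wider, shifted normal, and H(x) = x^2, for which the exact expectation under p is 1. The helper names `normal_pdf` and `importance_estimate` are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_estimate(h, p_pdf, q_pdf, q_sampler, n, rng):
    """Estimate E_{x~p}[h(x)] from n samples drawn from q."""
    total = 0.0
    for _ in range(n):
        x = q_sampler(rng)
        # Each sample is reweighted by p(x) / q(x).
        total += p_pdf(x) / q_pdf(x) * h(x)
    return total / n


rng = random.Random(0)
# Assumed setup: target p = N(0, 1), proposal q = N(1, 2^2), H(x) = x^2.
estimate = importance_estimate(
    h=lambda x: x * x,
    p_pdf=lambda x: normal_pdf(x, 0.0, 1.0),
    q_pdf=lambda x: normal_pdf(x, 1.0, 2.0),
    q_sampler=lambda r: r.gauss(1.0, 2.0),
    n=100_000,
    rng=rng,
)
print(f"importance-sampling estimate of E_p[x^2]: {estimate:.3f} (exact value: 1)")
```

The proposal q is deliberately wider than p so that the weights p(x)/q(x) stay bounded; a proposal with lighter tails than p can make the estimate very noisy.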

In our RL case, H(x) is the reward value obtained by some policy x, and p(x) is a distribution over all possible policies. We don't want to maximize the reward by searching through all possible policies. Instead, we want to approximate p(x)H(x) by q(x), iteratively minimizing the distance between them. The distance between two probability distributions is measured by the Kullback-Leibler (KL) divergence, which is defined as follows:

$$
KL(p_1(x) \,\|\, p_2(x)) = \mathbb{E}_{x \sim p_1(x)} \log \frac{p_1(x)}{p_2(x)} = \mathbb{E}_{x \sim p_1(x)}[\log p_1(x)] - \mathbb{E}_{x \sim p_1(x)}[\log p_2(x)]
$$

The first term in KL is the entropy term (strictly, the negative entropy of $p_1$). It doesn't depend on $p_2(x)$, so it can be omitted during the minimization. The second term is called cross-entropy and is a very common optimization objective in deep learning (DL).
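
To make the decomposition concrete, the following sketch checks numerically, for small discrete distributions, that the KL divergence equals the cross-entropy minus the entropy of the first distribution. The distributions `p1` and the two candidates for the second distribution are made-up example values, and the function names are illustrative.

```python
import math


def entropy(p):
    """Shannon entropy H(p) = -sum p log p (natural logarithm)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy -sum p log q; assumes q > 0 wherever p > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) = sum p log(p / q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Example values chosen for illustration only.
p1 = [0.7, 0.2, 0.1]
candidates = {"close": [0.6, 0.3, 0.1], "far": [0.1, 0.2, 0.7]}

results = {}
for name, p2 in candidates.items():
    kl = kl_divergence(p1, p2)
    ce = cross_entropy(p1, p2)
    results[name] = (kl, ce)
    print(f"{name}: KL={kl:.4f}  CE={ce:.4f}  CE - H(p1)={ce - entropy(p1):.4f}")
```

Because the entropy of `p1` is the same for every candidate, ranking candidates by cross-entropy gives the same order as ranking them by KL divergence, which is why the entropy term can be dropped during minimization.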

Combining both formulas, we get an iterative algorithm that starts with $q_0(x) = p(x)$ and improves the approximation of p(x)H(x) on every step, using the update:

$$
q_{i+1}(x) = \underset{q_{i+1}(x)}{\arg\min}\; -\mathbb{E}_{x \sim q_i(x)}\left[\frac{p(x)}{q_i(x)}\,H(x)\,\log q_{i+1}(x)\right]
$$

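This is the generic cross-entropy method, which can be significantly simplified in our RL case. First, we replace H(x) with an indicator function, which is 1 when the reward for the episode is above the threshold and 0 when it is below. Our policy update then looks like this:

$$
\pi_{i+1}(a|s) = \underset{\pi_{i+1}}{\arg\min}\; -\mathbb{E}_{z \sim \pi_i(a|s)}\,[R(z) \ge \psi_i]\,\log \pi_{i+1}(a|s)
$$

Here, z is an episode sampled with the current policy $\pi_i$, R(z) is its reward, $\psi_i$ is the reward threshold on iteration i, and the square brackets denote the indicator function.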

Strictly speaking, the preceding formula omits the normalization term, but it still works in practice without it. So, the method is quite clear: we sample episodes using our current policy (starting with some random initial policy) and minimize the negative log likelihood of the most successful samples under our policy.
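
The loop described above can be sketched on a toy problem. Everything specific in this example is an assumption made for illustration: the environment (the state is just the step index, and an action earns reward 1 when it matches a hidden `TARGET` action), the tabular policy, the 70th-percentile threshold, and the smoothing constant. For a tabular categorical policy, minimizing the negative log likelihood of the elite samples has a closed-form solution (per-state action frequencies), so no gradient descent is needed here; a neural network policy would instead take gradient steps on the same objective.

```python
import random

N_STEPS = 6                      # episode length; the state is the step index
N_ACTIONS = 3
TARGET = [2, 0, 1, 1, 2, 0]      # hypothetical best action for each state


def play_episode(policy, rng):
    """Sample one episode; return ([(state, action), ...], total reward)."""
    steps, reward = [], 0.0
    for state in range(N_STEPS):
        action = rng.choices(range(N_ACTIONS), weights=policy[state])[0]
        steps.append((state, action))
        reward += 1.0 if action == TARGET[state] else 0.0
    return steps, reward


def fit_policy(elite_steps, smoothing=1.0):
    """Maximum-likelihood tabular policy for the elite (state, action) pairs.

    The smoothing count keeps every action probability above zero so the
    policy keeps exploring.
    """
    counts = [[smoothing] * N_ACTIONS for _ in range(N_STEPS)]
    for state, action in elite_steps:
        counts[state][action] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def cross_entropy_method(iterations=30, batch_size=100, percentile=0.7, seed=0):
    rng = random.Random(seed)
    # Start from a uniformly random policy.
    policy = [[1.0 / N_ACTIONS] * N_ACTIONS for _ in range(N_STEPS)]
    mean_rewards = []
    for _ in range(iterations):
        batch = [play_episode(policy, rng) for _ in range(batch_size)]
        rewards = sorted(r for _, r in batch)
        threshold = rewards[int(percentile * batch_size)]
        # Indicator function: keep only episodes at or above the threshold.
        elite_steps = [s for steps, r in batch if r >= threshold for s in steps]
        policy = fit_policy(elite_steps)
        mean_rewards.append(sum(rewards) / batch_size)
    return policy, mean_rewards


policy, mean_rewards = cross_entropy_method()
print(f"mean reward: first batch {mean_rewards[0]:.2f}, last batch {mean_rewards[-1]:.2f}")
```

On a problem this small, the mean batch reward should climb from roughly the random-policy level toward the episode length within a few iterations, although the exact numbers depend on the seed and the smoothing constant.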

There is a whole book dedicated to this method, co-authored by Dirk P. Kroese. A shorter description can be found in the Cross-Entropy Method paper by Dirk P. Kroese (https://people.smp.uq.edu.au/DirkKroese/ps/eormsCE.pdf).
