Theoretical background of the cross-entropy method

This section is optional and is included for readers who are interested in why the method works. If you wish, you can refer to the original paper on the cross-entropy method, which is given at the end of the section.

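The basis of the cross-entropy method lies in the importance sampling theorem, which states this:

$$
\mathbb{E}_{x \sim p(x)}[H(x)] = \int_x p(x) H(x)\, dx = \int_x q(x) \frac{p(x)}{q(x)} H(x)\, dx = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)} H(x)\right]
$$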
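As a quick numerical illustration of importance sampling, the sketch below estimates an expectation under a target distribution p while drawing samples only from a different proposal distribution q. The specific choices here (a standard normal p, a wider shifted normal q, and H(x) = x²) are assumptions made for the example and do not come from the text; the helper names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_estimate(h, p_pdf, q_pdf, q_sampler, n_samples, rng):
    """Estimate E_{x~p}[h(x)] using samples drawn from q only.

    Each sample is reweighted by p(x) / q(x), as in the identity above.
    """
    total = 0.0
    for _ in range(n_samples):
        x = q_sampler(rng)
        total += p_pdf(x) / q_pdf(x) * h(x)
    return total / n_samples


rng = random.Random(0)

# Illustrative assumptions: p = N(0, 1), q = N(1, 2^2), H(x) = x^2.
# The exact value of E_{x~p}[x^2] is 1 (the variance of a standard normal).
estimate = importance_sampling_estimate(
    h=lambda x: x * x,
    p_pdf=lambda x: normal_pdf(x, 0.0, 1.0),
    q_pdf=lambda x: normal_pdf(x, 1.0, 2.0),
    q_sampler=lambda r: r.gauss(1.0, 2.0),
    n_samples=200_000,
    rng=rng,
)
print(f"importance sampling estimate: {estimate:.4f} (exact value: 1)")
```

The proposal q is deliberately wider than p, so the weights p(x)/q(x) stay bounded and the estimate has low variance; a proposal with lighter tails than p would make the same estimator much noisier.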

In our RL case, H(x) is a reward value obtained by some policy x, and p(x) is a distribution over all possible policies. We don't want to maximize our reward by searching all possible policies; instead, we want to find a way to approximate p(x)H(x) by q(x), iteratively minimizing the distance between them. The distance between two probability distributions is calculated by the Kullback-Leibler (KL) divergence, which is as follows:

$$
KL(p_1(x) \,\|\, p_2(x)) = \mathbb{E}_{x \sim p_1(x)}\left[\log \frac{p_1(x)}{p_2(x)}\right] = \mathbb{E}_{x \sim p_1(x)}[\log p_1(x)] - \mathbb{E}_{x \sim p_1(x)}[\log p_2(x)]
$$

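The first term in KL is called entropy (strictly speaking, it is the negative entropy of the first distribution). It doesn't depend on the second distribution, which is the one being optimized, so it can be omitted during the minimization. The second term is called cross-entropy and is a very common optimization objective in DL.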
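The decomposition can be checked numerically on small discrete distributions. The following sketch is only an illustration: the three distributions and the function names are made up for the example. It shows that KL equals the entropy term plus the cross-entropy, and that ranking candidate distributions by cross-entropy gives the same order as ranking them by KL, because the entropy term is shared.

```python
import math


def entropy_term(p1):
    """First KL term, E_{x~p1}[log p1(x)] (the negative Shannon entropy)."""
    return sum(p * math.log(p) for p in p1 if p > 0)


def cross_entropy(p1, p2):
    """Second KL term, -E_{x~p1}[log p2(x)]."""
    return -sum(p * math.log(q) for p, q in zip(p1, p2) if p > 0)


def kl_divergence(p1, p2):
    """KL(p1 || p2) computed directly from its definition."""
    return sum(p * math.log(p / q) for p, q in zip(p1, p2) if p > 0)


# Illustrative distributions over three outcomes (assumed for this example).
target = [0.7, 0.2, 0.1]
candidate_a = [0.6, 0.3, 0.1]
candidate_b = [0.2, 0.3, 0.5]

kl_a = kl_divergence(target, candidate_a)
kl_b = kl_divergence(target, candidate_b)
ce_a = cross_entropy(target, candidate_a)
ce_b = cross_entropy(target, candidate_b)

# KL = (entropy term) + (cross-entropy); the entropy term is the same for
# both candidates, so it cannot change which candidate is closer.
print(f"KL(a)={kl_a:.4f}  entropy term + CE(a)={entropy_term(target) + ce_a:.4f}")
print(f"KL(b)={kl_b:.4f}  entropy term + CE(b)={entropy_term(target) + ce_b:.4f}")
print("same ranking:", (kl_a < kl_b) == (ce_a < ce_b))
```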

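Combining both formulas, we can get an iterative algorithm, which starts with $q_0(x) = p(x)$ and improves on every step. Each step is an approximation of p(x)H(x) with the update:

$$
q_{i+1}(x) = \underset{q_{i+1}(x)}{\arg\min}\; -\mathbb{E}_{x \sim q_i(x)}\left[\frac{p(x)}{q_i(x)} H(x) \log q_{i+1}(x)\right]
$$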

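This is the generic cross-entropy method, which can be significantly simplified in our RL case. First, we replace our H(x) with an indicator function, which is 1 when the reward for the episode is above the threshold and 0 when the reward is below it. Our policy update will look like this:

$$
\pi_{i+1}(a|s) = \underset{\pi_{i+1}}{\arg\min}\; -\mathbb{E}_{z \sim \pi_i(a|s)}\left[R(z) \ge \psi_i\right] \log \pi_{i+1}(a|s)
$$

Here, $z$ is an episode sampled with the current policy $\pi_i$, $R(z)$ is its total reward, $\psi_i$ is the reward threshold on step $i$, and $[R(z) \ge \psi_i]$ is the indicator function.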

Strictly speaking, the preceding formula misses the normalization term, but it still works in practice without it. So, the method is quite clear: we sample episodes using our current policy (starting with some random initial policy) and minimize the negative log-likelihood of the most successful samples under our policy.
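The loop described above can be sketched on a toy problem. Everything in this example is an assumption made for illustration and is not from the text: a five-cell corridor environment, a tabular policy that stores the probability of moving right in each cell, a 70th-percentile reward threshold, and Laplace smoothing of the action counts. For a tabular policy, minimizing the negative log-likelihood of the elite samples has a closed-form solution (the action frequencies observed in the elite episodes), so no gradient descent is needed here; a neural network policy would instead be trained on the same elite samples with a cross-entropy loss.

```python
import random

N_STATES = 5      # corridor cells 0..4; the episode ends at cell 4 (hypothetical toy task)
MAX_STEPS = 20    # episodes are cut off after this many steps
LEFT, RIGHT = 0, 1


def run_episode(p_right, rng):
    """Sample one episode; return (total reward, list of (state, action) pairs)."""
    state, steps, trajectory = 0, 0, []
    while state != N_STATES - 1 and steps < MAX_STEPS:
        action = RIGHT if rng.random() < p_right[state] else LEFT
        trajectory.append((state, action))
        if action == RIGHT:
            state = min(state + 1, N_STATES - 1)
        else:
            state = max(state - 1, 0)
        steps += 1
    return -steps, trajectory  # reward of -1 per step: shorter episodes are better


def cross_entropy_step(p_right, rng, batch_size=100, percentile=0.7, smoothing=1.0):
    """One iteration: sample a batch, keep the elite episodes, refit the policy."""
    batch = [run_episode(p_right, rng) for _ in range(batch_size)]
    rewards = sorted(reward for reward, _ in batch)
    threshold = rewards[int(percentile * (batch_size - 1))]
    # Indicator function: keep only episodes whose reward reaches the threshold.
    elite = [traj for reward, traj in batch if reward >= threshold]
    # Maximum-likelihood fit of a tabular policy = smoothed action frequencies.
    counts = [[smoothing, smoothing] for _ in range(N_STATES)]
    for traj in elite:
        for state, action in traj:
            counts[state][action] += 1
    new_p_right = [c[RIGHT] / (c[LEFT] + c[RIGHT]) for c in counts]
    return new_p_right, sum(rewards) / batch_size


rng = random.Random(0)
p_right = [0.5] * N_STATES   # random initial policy
history = []
for _ in range(15):
    p_right, mean_reward = cross_entropy_step(p_right, rng)
    history.append(mean_reward)

print("mean reward, first vs last iteration:", history[0], history[-1])
print("P(right) per cell:", [round(p, 2) for p in p_right])
```

The best possible reward in this corridor is -4 (four moves to the right), so the mean batch reward should climb toward that value as the policy concentrates on moving right. The goal cell is never acted in, so its entry simply stays at the smoothing prior of 0.5.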

There is a whole book dedicated to this method, written by Dirk P. Kroese. A shorter description can be found in The Cross-Entropy Method paper by Dirk P. Kroese (https://people.smp.uq.edu.au/DirkKroese/ps/eormsCE.pdf).