- Deep Reinforcement Learning Hands-On
- Maxim Lapan
Theoretical background of the cross-entropy method
This section is optional and included for readers who are interested in why the method works. If you wish, you can refer to the original paper on cross-entropy, which will be given at the end of the section.
The basis of the cross-entropy method lies in the importance sampling theorem, which states this:

$$\mathbb{E}_{x \sim p(x)}[H(x)] = \int_x p(x) H(x)\,dx = \int_x q(x)\,\frac{p(x)}{q(x)}\,H(x)\,dx = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)}\,H(x)\right]$$
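To make the identity concrete, here is a small NumPy sketch (my own illustration, not from the book) that estimates the expectation of an arbitrary function H(x) under a target distribution p(x) in two ways: directly, by sampling from p, and via importance sampling, by sampling from a different distribution q and reweighting by p(x)/q(x). The specific Gaussians and the choice of H are assumptions made only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution p(x): N(0, 1); sampling distribution q(x): N(1, 2).
# H(x) is an arbitrary "reward" function used only for illustration.
def p_pdf(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def q_pdf(x):
    return np.exp(-(x - 1) ** 2 / (2 * 4)) / np.sqrt(2 * np.pi * 4)

def H(x):
    return np.maximum(x, 0.0)

# Direct Monte Carlo estimate of E_{x~p}[H(x)]
x_p = rng.normal(0.0, 1.0, size=100_000)
direct = H(x_p).mean()

# Importance-sampling estimate: sample from q, weight each sample by p(x)/q(x)
x_q = rng.normal(1.0, 2.0, size=100_000)
weights = p_pdf(x_q) / q_pdf(x_q)
importance = (weights * H(x_q)).mean()

print(direct, importance)  # the two estimates should be close
```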
In our RL case, H(x) is a reward value obtained by some policy x, and p(x) is a distribution of all possible policies. We don't want to maximize our reward by searching all possible policies; instead, we want to find a way to approximate p(x)H(x) by q(x), iteratively minimizing the distance between them. The distance between two probability distributions is calculated by the Kullback-Leibler (KL) divergence, which is as follows:

$$KL\bigl(p_1(x) \,\|\, p_2(x)\bigr) = \mathbb{E}_{x \sim p_1(x)} \log \frac{p_1(x)}{p_2(x)} = \mathbb{E}_{x \sim p_1(x)}\bigl[\log p_1(x)\bigr] - \mathbb{E}_{x \sim p_1(x)}\bigl[\log p_2(x)\bigr]$$
The first term in KL is called entropy and doesn't depend on p2(x), the distribution we are optimizing, so it can be omitted during the minimization. The second term is called cross-entropy, and it is a very common optimization objective in DL.
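As a quick numerical check (again my own illustration, not from the book), the following sketch computes the KL divergence between two small discrete distributions and verifies that it splits into a term depending only on p1 plus the cross-entropy term that actually depends on p2:

```python
import numpy as np

p1 = np.array([0.1, 0.6, 0.3])  # the "target" distribution
p2 = np.array([0.2, 0.5, 0.3])  # the distribution we are optimizing

first_term    = np.sum(p1 * np.log(p1))       # E_{x~p1} log p1(x): independent of p2
cross_entropy = -np.sum(p1 * np.log(p2))      # -E_{x~p1} log p2(x): the DL objective
kl            = np.sum(p1 * np.log(p1 / p2))  # KL(p1 || p2)

# KL decomposes into the p2-independent term plus the cross-entropy
assert np.isclose(kl, first_term + cross_entropy)
```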
Combining both formulas, we get an iterative algorithm that starts with $q_0(x) = p(x)$ and improves on every step. This is an approximation of p(x)H(x) with the update:

$$q_{i+1}(x) = \underset{q_{i+1}(x)}{\operatorname{argmin}} \; -\mathbb{E}_{x \sim q_i(x)} \left[ \frac{p(x)}{q_i(x)}\, H(x) \log q_{i+1}(x) \right]$$
This is a generic cross-entropy method, which can be significantly simplified in our RL case. Firstly, we replace our H(x) with an indicator function, which is 1 when the reward for the episode is above the threshold and 0 if the reward is below it. Our policy update will look like this:

$$\pi_{i+1}(a \mid s) = \underset{\pi_{i+1}}{\operatorname{argmin}} \; -\mathbb{E}_{z \sim \pi_i(a \mid s)} \bigl[ R(z) \ge \psi_i \bigr] \log \pi_{i+1}(a \mid s)$$

where $\psi_i$ is the reward threshold on the ith iteration.
Strictly speaking, the preceding formula misses the normalization term, but it still works in practice without it. So, the method is quite clear: we sample episodes using our current policy (starting with some random initial policy) and minimize the negative log likelihood of the actions from the most successful episodes under our policy.
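In code, this indicator-based update boils down to a simple supervised training step. Below is a condensed PyTorch sketch of one such step, under my own assumptions about the surrounding setup: `policy_net` maps observations to action logits, `episodes` is a list of `(total_reward, observations, actions)` tuples collected with the current policy, and the 70th percentile is a conventional choice for the elite threshold.

```python
import numpy as np
import torch
import torch.nn as nn

def cross_entropy_step(policy_net, episodes, optimizer, percentile=70):
    """One cross-entropy method update: keep elite episodes, fit the policy to them."""
    rewards = [reward for reward, _, _ in episodes]
    threshold = np.percentile(rewards, percentile)  # psi_i: the elite reward boundary

    # Keep only "elite" episodes, i.e. those where the indicator [R(z) >= psi_i] is 1
    train_obs, train_acts = [], []
    for reward, obs, acts in episodes:
        if reward >= threshold:
            train_obs.extend(obs)
            train_acts.extend(acts)

    obs_t = torch.as_tensor(np.asarray(train_obs), dtype=torch.float32)
    acts_t = torch.as_tensor(train_acts, dtype=torch.long)

    # Minimize the cross-entropy (negative log likelihood) of the elite actions
    optimizer.zero_grad()
    logits = policy_net(obs_t)
    loss = nn.functional.cross_entropy(logits, acts_t)
    loss.backward()
    optimizer.step()
    return loss.item(), threshold
```

Calling this function in a loop, with fresh episodes sampled from the updated policy on every iteration, is the whole training procedure: the reward threshold rises as the policy improves, so the elite set keeps pulling the policy toward better behavior.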
There is a whole book dedicated to this method, written by Dirk P. Kroese. A shorter description can be found in the Cross-Entropy Method paper by Dirk P. Kroese (https://people.smp.uq.edu.au/DirkKroese/ps/eormsCE.pdf).