- Deep Reinforcement Learning Hands-On
- Maxim Lapan
- 2021-06-25 20:46:56
Theoretical background of the cross-entropy method
This section is optional and is included for readers who are interested in why the method works. If you wish, you can refer to the original paper on the cross-entropy method, which is given at the end of the section.
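The basis of the cross-entropy method lies in the importance sampling theorem, which states this:
$$\mathbb{E}_{x \sim p(x)}[H(x)] = \int_x p(x) H(x)\,dx = \int_x q(x) \frac{p(x)}{q(x)} H(x)\,dx = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)} H(x)\right]$$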
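As a numerical illustration of importance sampling, the following minimal sketch estimates the expectation of H(x) under p(x) in two ways: by sampling from p(x) directly, and by sampling from a different distribution q(x) and reweighting by p(x)/q(x). The concrete choices here are assumptions made only for the example: p(x) is a standard normal, q(x) is a normal centered at 2, and H(x) is a hypothetical reward equal to 1 when x > 2 and 0 otherwise.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def H(x):
    # Hypothetical reward function: 1 when x exceeds 2, otherwise 0.
    return 1.0 if x > 2.0 else 0.0


def direct_estimate(n, rng):
    # Plain Monte Carlo: sample x from p = N(0, 1) and average H(x).
    return sum(H(rng.gauss(0.0, 1.0)) for _ in range(n)) / n


def importance_estimate(n, rng, q_mean=2.0, q_std=1.0):
    # Sample x from q = N(q_mean, q_std^2) and average (p(x) / q(x)) * H(x).
    total = 0.0
    for _ in range(n):
        x = rng.gauss(q_mean, q_std)
        weight = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, q_mean, q_std)
        total += weight * H(x)
    return total / n


rng = random.Random(0)
# Closed-form value of P(x > 2) under N(0, 1), used as the reference.
exact = 0.5 * math.erfc(2.0 / math.sqrt(2.0))
direct = direct_estimate(20000, rng)
weighted = importance_estimate(20000, rng)
print(f"exact={exact:.5f} direct={direct:.5f} importance={weighted:.5f}")
```

Both estimates target the same value; the reweighted one tends to be less noisy here because q(x) places most of its samples where H(x) is nonzero.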
In our RL case, H(x) is a reward value obtained by some policy x, and p(x) is a distribution over all possible policies. We don't want to maximize our reward by searching all possible policies; instead, we want to find a way to approximate p(x)H(x) by q(x), iteratively minimizing the distance between them. The distance between two probability distributions is calculated by the Kullback-Leibler (KL) divergence, which is as follows:
$$KL(p_1(x) \,\|\, p_2(x)) = \mathbb{E}_{x \sim p_1(x)} \log \frac{p_1(x)}{p_2(x)} = \mathbb{E}_{x \sim p_1(x)}[\log p_1(x)] - \mathbb{E}_{x \sim p_1(x)}[\log p_2(x)]$$
The first term in KL is the negative entropy of $p_1(x)$ and doesn't depend on $p_2(x)$, so it can be omitted during the minimization. The second term, $-\mathbb{E}_{x \sim p_1(x)}[\log p_2(x)]$, is called cross-entropy and is a very common optimization objective in DL.
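The split of KL into an entropy part and a cross-entropy part can be checked numerically. The sketch below uses small discrete distributions chosen purely for illustration (the names `target` and `candidates` are hypothetical): it verifies that KL equals cross-entropy minus entropy, and that ranking candidate distributions by KL or by cross-entropy gives the same winner, because the entropy of the fixed target is a constant.

```python
import math


def entropy(p):
    """Entropy of a discrete distribution p: -sum p log p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy of q relative to p: -sum p log q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) = sum p log(p / q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Fixed target distribution (plays the role of the first argument of KL).
target = [0.7, 0.2, 0.1]

# Candidate approximations (play the role of the second argument of KL).
candidates = {
    "a": [0.6, 0.3, 0.1],
    "b": [0.2, 0.5, 0.3],
    "c": [0.7, 0.2, 0.1],
}

for name, q in candidates.items():
    kl = kl_divergence(target, q)
    ce = cross_entropy(target, q)
    # KL differs from cross-entropy only by the constant entropy(target).
    print(f"{name}: KL={kl:.4f} cross-entropy={ce:.4f} "
          f"difference={ce - kl:.4f}")

best_by_kl = min(candidates, key=lambda n: kl_divergence(target, candidates[n]))
best_by_ce = min(candidates, key=lambda n: cross_entropy(target, candidates[n]))
```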
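Combining both formulas, we can get an iterative algorithm, which starts with $q_0(x) = p(x)$ and improves on every step. This is an approximation of p(x)H(x) with an update:
$$q_{i+1}(x) = \arg\min_{q_{i+1}(x)} -\mathbb{E}_{x \sim q_i(x)} \frac{p(x)}{q_i(x)} H(x) \log q_{i+1}(x)$$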
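To make the iterative scheme concrete, here is a minimal sketch under strong simplifying assumptions that are not part of the text: p(x) is a standard normal, H(x) is a hypothetical indicator of x > 2, and q(x) is restricted to normal distributions with unit variance, so only the mean is learned. For this family, minimizing the weighted negative log likelihood has a closed form: the new mean is the weighted average of the samples, with weights p(x)H(x)/q(x).

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def H(x):
    # Hypothetical score function: 1 when x exceeds 2, otherwise 0.
    return 1.0 if x > 2.0 else 0.0


def cross_entropy_fit(iterations=5, batch=20000, seed=0):
    """Iteratively fit q = N(mean, 1) to the unnormalized target p(x) * H(x)."""
    rng = random.Random(seed)
    mean = 0.0  # start with q_0 = p = N(0, 1)
    history = []
    for _ in range(iterations):
        # Sample from the current approximation q_i.
        xs = [rng.gauss(mean, 1.0) for _ in range(batch)]
        # Importance weights p(x) / q_i(x), multiplied by H(x).
        ws = [normal_pdf(x, 0.0, 1.0) / normal_pdf(x, mean, 1.0) * H(x)
              for x in xs]
        total = sum(ws)
        if total > 0.0:
            # Closed-form minimizer of the weighted negative log likelihood
            # for a unit-variance normal: the weighted sample mean.
            mean = sum(w * x for w, x in zip(ws, xs)) / total
        history.append(mean)
    return history


history = cross_entropy_fit()
print([round(m, 3) for m in history])
```

In this toy setting the mean settles near the average of x under p(x) restricted to x > 2, which is roughly 2.37.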
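This is a generic cross-entropy method, which can be significantly simplified in our RL case. First, we replace our H(x) with an indicator function, which is 1 when the reward for the episode is above the threshold and 0 if the reward is below. Our policy update will look like this:
$$\pi_{i+1}(a|s) = \arg\min_{\pi_{i+1}} -\mathbb{E}_{z \sim \pi_i(a|s)} [R(z) \geq \psi_i] \log \pi_{i+1}(a|s)$$
Here, z is an episode sampled with the current policy, R(z) is its reward, $\psi_i$ is the reward threshold, and the bracket $[R(z) \geq \psi_i]$ is the indicator function.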
Strictly speaking, the preceding formula misses the normalization term, but it still works in practice without it. So, the method is quite clear: we sample episodes using our current policy (starting with some random initial policy) and minimize the negative log likelihood of the most successful samples under our policy.
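The sample-filter-fit loop described above can be sketched on a toy problem. Everything concrete below is an assumption made for illustration, not the book's implementation: the environment is a hypothetical stateless task with three actions where an episode of five steps earns one point per use of action 2, the policy is a plain table of action probabilities, the threshold is the 70th percentile of batch rewards, and a small smoothing term keeps every action reachable. For such a tabular policy, minimizing the negative log likelihood of the elite actions reduces to counting their frequencies.

```python
import random

N_ACTIONS = 3
EPISODE_LEN = 5


def play_episode(probs, rng):
    """Sample one episode from the policy; reward counts uses of action 2."""
    actions = rng.choices(range(N_ACTIONS), weights=probs, k=EPISODE_LEN)
    reward = sum(1.0 for a in actions if a == 2)
    return actions, reward


def percentile_threshold(rewards, percentile):
    """Reward value at the given percentile of the batch."""
    ordered = sorted(rewards)
    index = min(len(ordered) - 1, int(len(ordered) * percentile / 100.0))
    return ordered[index]


def cross_entropy_step(probs, rng, batch=100, percentile=70, smoothing=0.1):
    """One iteration: sample episodes, keep the elite, refit the policy."""
    episodes = [play_episode(probs, rng) for _ in range(batch)]
    threshold = percentile_threshold([r for _, r in episodes], percentile)
    # Indicator function: keep only episodes whose reward reaches the threshold.
    elite_actions = [a for actions, r in episodes if r >= threshold
                     for a in actions]
    # For a categorical policy, the negative log likelihood of the elite
    # actions is minimized by their empirical frequencies.
    freqs = [elite_actions.count(a) / len(elite_actions)
             for a in range(N_ACTIONS)]
    # Blend with a uniform distribution so no action probability hits zero.
    new_probs = [(1.0 - smoothing) * f + smoothing / N_ACTIONS for f in freqs]
    return new_probs, threshold


rng = random.Random(0)
policy = [1.0 / N_ACTIONS] * N_ACTIONS  # random initial policy
for step in range(10):
    policy, threshold = cross_entropy_step(policy, rng)
    print(step, threshold, [round(p, 3) for p in policy])
```

With a neural network policy, the frequency count would be replaced by a few gradient steps on the cross-entropy loss between the network's action distribution and the elite actions.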
There is a whole book dedicated to this method, written by Dirk P. Kroese. A shorter description can be found in the Cross-Entropy Method paper by Dirk P. Kroese (https://people.smp.uq.edu.au/DirkKroese/ps/eormsCE.pdf).