
Combinatorial entropy

The information gain criterion is based on the notion of Shannon entropy. Shannon entropy is an important concept in information theory, physics, and other domains. Mathematically, it is expressed as:

H = -\sum_{i=1}^{N} p_i \log_2 p_i

Where i is a state of the system, N is the total number of possible states, and p_i is the probability of the system being in state i. Entropy describes the amount of uncertainty in the system: the more order there is in the system, the less entropy it has.
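As a quick illustrative sketch (plain Python, not code from the book), the formula translates directly into a small helper function:

```python
import math

def entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin toss is maximally uncertain
print(entropy([1.0]))       # 0.0 bits: a fully ordered system has no uncertainty
```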

For a visual introduction to information theory, check out Visual Information Theory by Christopher Olah at: http://colah.github.io/posts/2015-09-Visual-Information/.

If you want to learn more about entropy, check out the nice interactive blog post Entropy Explained, With Sheep by Aatish Bhatia at: https://aatishb.com/entropy/.

Let's look at a simple example of how entropy can be useful for decision tree construction. For this, we'll simplify the task of alien creature classification, assuming that we can measure only one feature: body length. We have 10 individuals (6 platyhogs and 4 rabbosauruses) with the following body lengths:

If we take one random individual from the group, it is a platyhog with a probability of 0.6, or a rabbosaurus with a probability of 0.4. The system therefore has two possible states, one for each outcome. Let's calculate its entropy:

H = -(0.6 \log_2 0.6 + 0.4 \log_2 0.4) \approx 0.97

So, the amount of uncertainty in this dataset is about 0.97 bits. Is it a lot, or a little? We don't have anything to compare it with yet, so let's divide the set at the middle (> 5 meters), and calculate the entropy of both subsets:

H_{left} = 0 (the left subset contains individuals of a single species)

H_{right} = -(\frac{1}{5} \log_2 \frac{1}{5} + \frac{4}{5} \log_2 \frac{4}{5}) \approx 0.72
H_{left} and H_{right} are both less than the original H. This demonstrates how you can reduce the entropy by splitting the dataset in the right place. This idea lies at the foundation of decision tree learning algorithms.
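To double-check these numbers, here is a brief Python sketch (my own check, assuming, as the text implies, that the left subset is the homogeneous one):

```python
import math

h = -(0.6 * math.log2(0.6) + 0.4 * math.log2(0.4))        # ~0.97, the whole group
h_left = 0.0                                              # a single-species subset
h_right = -(1/5 * math.log2(1/5) + 4/5 * math.log2(4/5))  # ~0.72, a 1:4 mixture
print(round(h, 2), round(h_left, 2), round(h_right, 2))   # 0.97 0.0 0.72
```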

We can calculate how effectively we reduced the entropy by splitting the set using the Information Gain (IG) criterion:

Information Gain = Entropy(parent) - Weighted Sum of Entropy(children), or:

IG = H - \sum_{i=1}^{q} \frac{N_i}{N} H_i

Where q is the number of groups after the split, N_i is the count of elements in the i-th group, and N is the total count of elements before the split. In our example, q = 2, N = 10, and N_1 = N_2 = 5:

IG = 0.97 - (\frac{5}{10} \times 0 + \frac{5}{10} \times 0.72) \approx 0.61
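The same criterion as a compact Python sketch (the function and label names are mine, and the subset composition is the one implied by the text rather than stated explicitly in it):

```python
import math

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    """IG = H(parent) - sum over children of (N_i / N) * H(child_i)."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["platyhog"] * 6 + ["rabbosaurus"] * 4
left = ["platyhog"] * 5                      # the homogeneous subset
right = ["platyhog"] + ["rabbosaurus"] * 4   # the mixed subset
print(round(information_gain(parent, [left, right]), 2))  # 0.61
```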

This means that asking the question Is the body length greater than 5? gives us an information gain of 0.61. Is it a lot, or a little? Let's compare it to the information gain of a split at a different point, for instance length > 7:

Apparently, the choice of the middle point was lucky, because all other splits don't look promising. But you are free to check them if you want.

There is no sense in splitting the left part further, but we can continue splitting the right subset until the entropy of each of its children equals zero (see Figure 2.9).

So, this is our decision tree, and a recursive algorithm for building it. But now comes an interesting question: how do we know which split yields the maximum information gain? The simplest way is a greedy search: just check all the possible variants.
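As an illustration of that greedy search (a sketch under my own assumptions, not the book's implementation), for a single numeric feature we can test every candidate threshold and keep the one with the highest information gain:

```python
import math

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_split(values, labels):
    """Greedily test every candidate threshold and return the one
    that maximizes information gain, together with the gain itself."""
    parent_entropy = entropy(labels)
    best_threshold, best_gain = None, 0.0
    for threshold in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        if not left or not right:
            continue  # a split that leaves one side empty is useless
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = parent_entropy - weighted
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain
```

Applying the same search recursively to each resulting subset, until a subset becomes pure, gives exactly the tree-building procedure described above.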

Information gain is only one possible heuristic; there are more of them. For instance, in our scikit-learn decision tree learner, we used Gini impurity as the heuristic. According to Michigan State University (http://www.cse.msu.edu/~cse802/DecisionTrees.pdf):

"Gini impurity is the expected error rate at node N if the category label is selected randomly from the class distribution present at N."

Check the documentation on the criterion parameter of DecisionTreeClassifier for more information about the different heuristics available for tree learning in scikit-learn. In practice, Gini impurity works very similarly to information gain. A historical fact to dilute the theoretical exposition: Corrado Gini was an Italian statistician and the author of The Scientific Basis of Fascism (1927).
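For reference, switching between the two heuristics in scikit-learn is just a matter of the criterion argument; a minimal sketch (the toy data here is invented for illustration, it is not the book's dataset):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy single-feature dataset: body lengths in meters and species labels (0 or 1).
X = [[1.5], [2.0], [3.1], [4.2], [4.8], [5.5], [6.1], [7.3], [8.0], [9.4]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# criterion="entropy" corresponds to information gain; "gini" (the default) uses Gini impurity.
tree_ig = DecisionTreeClassifier(criterion="entropy").fit(X, y)
tree_gini = DecisionTreeClassifier(criterion="gini").fit(X, y)
```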

Figure 2.9: Building a decision tree. H stands for entropy in each group. Picture by Mykola Sosnovshchenko.