- Hands-On Machine Learning with Microsoft Excel 2019
- Julio Cesar Rodriguez Martino
Comparing the entropy differences (information gain)
To know which variable to choose for the first split, we calculate the information gain G when going from the original data to the corresponding subsets as the difference between the entropy values:

G(f1,f2) = S(f1) - S(f1,f2)

Here, S(f1) is the entropy of the target variable and S(f1,f2) is the entropy of the target variable with respect to feature f2, that is, the weighted average of the target's entropy over the subsets defined by f2. The entropy values were calculated in the previous subsections, so we use them here:
- If we choose Outlook as the first variable to split the tree, the information gain is as follows:
G(Train outside,Outlook) = S(Train outside) - S(Train outside,Outlook)
= 0.94 - 0.693 = 0.247
- If we choose Temperature, the information gain is as follows:
G(Train outside,Temperature) = S(Train outside) - S(Train outside,Temperature)
= 0.94 - 0.911 = 0.029
- If we choose Humidity, the information gain is as follows:
G(Train outside,Humidity) = S(Train outside) - S(Train outside,Humidity)
= 0.94 - 0.788 = 0.152
- Finally, choosing Windy gives the following information gain:
G(Train outside,Windy) = S(Train outside) - S(Train outside,Windy)
= 0.94 - 0.892 = 0.048
The variable to choose for the first split of the tree is the one showing the largest information gain, that is, Outlook. If we do this, we will notice that one of the resulting subsets (Overcast) has zero entropy, so we don't need to split it further.
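If you want to double-check these numbers outside the spreadsheet, the same calculation takes only a few lines of Python. The sketch below assumes the familiar 14-day weather table used in the previous subsections (the lists and label spellings are illustrative, not taken from the workbook); it defines entropy and information-gain helpers and prints the four gains computed above.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature, target):
    """G = S(target) minus the weighted average entropy of the target within each feature subset."""
    total = len(target)
    conditional = 0.0
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(target) - conditional

# The 14-day weather table (assumed to match the data used in the previous subsections).
outlook     = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
               "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy"]
temperature = ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
               "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"]
humidity    = ["High", "High", "High", "High", "Normal", "Normal", "Normal",
               "High", "Normal", "Normal", "Normal", "High", "Normal", "High"]
windy       = ["No", "Yes", "No", "No", "No", "Yes", "Yes",
               "No", "No", "No", "Yes", "Yes", "No", "Yes"]
train       = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
               "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

print(f"S(Train outside) = {entropy(train):.3f}")  # about 0.940
for name, feature in [("Outlook", outlook), ("Temperature", temperature),
                      ("Humidity", humidity), ("Windy", windy)]:
    print(f"G(Train outside, {name}) = {information_gain(feature, train):.3f}")
# Expected (rounded): Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048
```

Running it reproduces S(Train outside) = 0.940 and the four gains 0.247, 0.029, 0.152, and 0.048 from the list above.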
To continue building the tree following a similar procedure, the steps to take are as follows:
- Calculate S(Sunny), S(Sunny,Temperature), S(Sunny,Humidity), and S(Sunny,Windy).
- Calculate G(Sunny,Temperature), G(Sunny,Humidity), and G(Sunny,Windy).
- The largest value will tell us which feature to use to split the Sunny subset.
- Calculate G(Rainy,Temperature), G(Rainy,Humidity), and G(Rainy,Windy), using S(Rainy), S(Rainy,Temperature), S(Rainy,Humidity), and S(Rainy,Windy).
- The largest value will tell us which feature to use to split the Rainy subset.
- Continue iterating until every subset has zero entropy or there are no features left to use (see the sketch after this list).
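Although the book carries out each of these steps in the spreadsheet, the loop itself can be summarized in a short recursive sketch. The function below is a minimal illustration of this greedy procedure, not the book's workbook steps verbatim; it reuses the entropy and information_gain helpers and the data lists from the previous snippet, and the column names are again assumptions for the example.

```python
from collections import Counter  # also used by the helpers in the previous snippet

def build_tree(rows, features, target):
    """Recursively split a list of row dictionaries, choosing at each node the
    feature with the largest information gain on the current subset."""
    labels = [row[target] for row in rows]
    # Stop when the subset is pure or no features remain; return the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Pick the feature with the largest information gain for this subset.
    best = max(features, key=lambda f: information_gain([row[f] for row in rows], labels))
    remaining = [f for f in features if f != best]
    branches = {}
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}

# Build the rows from the lists defined in the previous snippet and grow the tree.
rows = [{"Outlook": o, "Temperature": t, "Humidity": h, "Windy": w, "Train outside": y}
        for o, t, h, w, y in zip(outlook, temperature, humidity, windy, train)]
print(build_tree(rows, ["Outlook", "Temperature", "Humidity", "Windy"], "Train outside"))
```

With the example data, the root split is Outlook and the Overcast branch is already a pure leaf, matching the result obtained above.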
As we will see later in this book, trees are never built by hand in practice. It is nonetheless important to understand how they work and which calculations are involved, and Excel makes it easy to follow the full process step by step. Following the same principle, we will work through an unsupervised learning example in the next section.