
Classification trees

Classification trees operate under the same principle as regression trees, except that the splits aren't determined by the RSS but by an error measure. That measure isn't what you might expect, that is, simply the number of misclassified observations divided by the total number of observations. As it turns out, when it comes to tree-splitting, the misclassification rate by itself can leave you in a situation where a further split gains information but doesn't improve the misclassification rate. Let's look at an example.

Suppose we have a node, let's call it N0, where you have seven observations labeled No and three observations labeled Yes. We can say that the misclassification rate is 30%. With this in mind, let's calculate a common alternative error measure called the Gini index. The formula for a single-node Gini index is as follows:

Gini = 1 - Σᵢ (pᵢ)², where pᵢ is the proportion of the node's observations in class i

Then, for N0, the Gini index is 1 - (0.7)² - (0.3)², which is equal to 0.42, versus the misclassification rate of 30%.
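To make the arithmetic concrete, here is a minimal R sketch of a single-node Gini calculation; the gini() helper is purely illustrative, not part of any package:

```r
# Gini impurity for a single node, given a vector of class counts
gini <- function(counts) {
  p <- counts / sum(counts)   # class proportions within the node
  1 - sum(p^2)                # Gini = 1 - sum of squared proportions
}

gini(c(no = 7, yes = 3))      # N0: 1 - 0.7^2 - 0.3^2 = 0.42
```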

Taking this example further, we'll now create node N1 with three observations from the No class and none from the Yes class, along with N2, which has four No observations and three Yes observations. The overall misclassification rate for this branch of the tree is still 30% (the three Yes observations in N2 are still misclassified), but look at how the overall Gini index has improved, as verified in the sketch after this list:

  • Gini(N1) = 1 - (3/3)² - (0/3)² = 0
  • Gini(N2) = 1 - (4/7)² - (3/7)² = 0.49
  • New Gini index = (proportion of N1 x Gini(N1)) + (proportion of N2 x Gini(N2)), which is equal to (0.3 x 0) + (0.7 x 0.49), or 0.343
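
Reusing the illustrative gini() helper from the earlier sketch, the weighted calculation can be reproduced in R:

```r
n1 <- c(no = 3, yes = 0)            # pure node: Gini = 0
n2 <- c(no = 4, yes = 3)            # mixed node: Gini is roughly 0.49
total <- sum(n1) + sum(n2)

# Weight each node's Gini by its share of the observations
split_gini <- (sum(n1) / total) * gini(n1) +
              (sum(n2) / total) * gini(n2)
split_gini                          # 0.343, down from 0.42 at N0
```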

By splitting on this surrogate error measure, we actually improved our model's impurity, reducing it from 0.42 to 0.343, whereas the misclassification rate didn't change. This is the methodology used by the rpart package, which we'll be using in this chapter.
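
As a brief, self-contained illustration, a classification tree can be fit with rpart on the kyphosis dataset that ships with the package; the Gini index is rpart's default splitting criterion for classification, so the parms argument below only makes that choice explicit:

```r
library(rpart)

# Fit a classification tree; method = "class" requests classification
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis,
             method = "class",
             parms = list(split = "gini"))  # split on the Gini index (the default)
print(fit)
```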
