
Classification trees

Classification trees operate under the same principle as regression trees, except that the splits aren't determined by the RSS but by an error rate. That error rate isn't what you might expect, that is, simply the number of misclassified observations divided by the total number of observations. As it turns out, when it comes to tree-splitting, the misclassification rate by itself can lead to a situation where you could gain information with a further split yet see no improvement in the misclassification rate. Let's look at an example.

Suppose we have a node, let's call it N0, where we have seven observations labeled No and three observations labeled Yes. We can say that the misclassification rate is 30%. With this in mind, let's calculate a common alternative error measure called the Gini index. The formula for the Gini index of a single node is as follows:

Gini = 1 - Σ(p_k)^2

Here, p_k is the proportion of the node's observations that belong to class k, and the sum runs over the classes.

Then, for N0, the Gini index is 1 - (0.7)^2 - (0.3)^2, which is equal to 0.42, versus the misclassification rate of 30%.
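To make the arithmetic concrete, here is a minimal R sketch; gini_node() is a hypothetical helper written for this example, not a function from any package:

    # Gini index of a single node from its vector of class counts
    gini_node <- function(counts) {
      p <- counts / sum(counts)  # class proportions
      1 - sum(p^2)
    }

    gini_node(c(No = 7, Yes = 3))  # 0.42, matching the calculation above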

Taking this example further, we'll now create node N1 with three observations labeled No and none labeled Yes, along with N2, which has four observations labeled No and three labeled Yes. The overall misclassification rate for this branch of the tree is still 30% (N2's majority vote of No still misclassifies the same three Yes observations), but look at how the overall Gini index has improved:

  • Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
  • Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.49
  • New Gini index = (proportion of N1 x Gini(N1)) + (proportion of N2 x Gini(N2)), which is equal to (0.3 x 0) + (0.7 x 0.49), or 0.343, as the short sketch after this list verifies
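Reusing the same hypothetical gini_node() helper, we can verify the weighted calculation in R:

    # Gini index of a node from its class counts (as defined earlier)
    gini_node <- function(counts) 1 - sum((counts / sum(counts))^2)

    g1 <- gini_node(c(No = 3, Yes = 0))  # 0
    g2 <- gini_node(c(No = 4, Yes = 3))  # ~0.49
    (3 / 10) * g1 + (7 / 10) * g2        # ~0.343, the new weighted Gini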

By splitting on this surrogate error measure, we actually improved our model's impurity, reducing it from 0.42 to 0.343, whereas the misclassification rate didn't change. This is the methodology that's used by the rpart package, which we'll be using in this chapter.
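As a quick, illustrative sketch of that methodology in rpart (using the built-in iris data as a stand-in, which is my assumption, not this chapter's dataset), note that rpart() defaults to the Gini index for classification splits, and the criterion can also be stated explicitly:

    library(rpart)

    # method = "class" requests a classification tree;
    # split = "gini" is the default impurity measure for classification
    fit <- rpart(Species ~ ., data = iris, method = "class",
                 parms = list(split = "gini"))
    print(fit)  # shows the splits chosen by minimizing Gini impurity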
