
Classification trees

Classification trees operate under the same principle as regression trees, except that the splits aren't determined by the RSS but by an error rate. The error rate used isn't what you might expect: it is not simply the number of misclassified observations divided by the total number of observations. As it turns out, when it comes to tree-splitting, the misclassification rate by itself may lead to a situation where you can gain information with a further split without improving the misclassification rate. Let's look at an example.

Suppose we have a node, let's call it N0, where you have seven observations labeled No and three observations labeled Yes. We can say that the misclassification rate is 30%. With this in mind, let's calculate a common alternative error measure called the Gini index. The formula for the Gini index of a single node is as follows:

Gini = 1 - (proportion of Class 1)² - (proportion of Class 2)²

Then, for N0, the Gini index is 1 - (0.7)² - (0.3)², which is equal to 0.42, versus the misclassification rate of 30%.
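
To verify the arithmetic, here is a minimal R sketch; the gini() helper is our own illustration, not a function from any package:

    > # Gini index for a single node, given its vector of class proportions
    > gini <- function(p) {
    +   1 - sum(p^2)
    + }
    > gini(c(0.7, 0.3))  # seven No and three Yes observations in N0
    [1] 0.42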

Taking this example further, we'll now split N0 into node N1, with three observations from Class 1 and none from Class 2, and node N2, which has four observations from Class 1 and three from Class 2 (here, Class 1 corresponds to the No label and Class 2 to the Yes label). The overall misclassification rate for this branch of the tree is still 30%, because N2 still misclassifies its three Class 2 observations, but look at how the overall Gini index has improved:

  • Gini(N1) = 1 - (3/3)² - (0/3)² = 0
  • Gini(N2) = 1 - (4/7)² - (3/7)² = 0.49
  • New Gini index = (proportion of N1 × Gini(N1)) + (proportion of N2 × Gini(N2)), which is equal to (0.3 × 0) + (0.7 × 0.49), or 0.343 (verified in the sketch below)
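
We can double-check these numbers with a few lines of base R:

    > gini_n1 <- 1 - (3/3)^2 - (0/3)^2    # N1 is pure, so its impurity is 0
    > gini_n2 <- 1 - (4/7)^2 - (3/7)^2    # N2 is mixed, roughly 0.49
    > # weight each node's Gini by its share of the ten observations
    > 0.3 * gini_n1 + 0.7 * gini_n2
    [1] 0.3428571

The printed value rounds to the 0.343 reported above.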

By splitting on this surrogate error rate, we actually improved our model's impurity, reducing it from 0.42 to 0.343, whereas the misclassification rate didn't change. This is the methodology used by the rpart package, which we'll be using in this chapter.
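
As a preview, here is a minimal sketch of fitting a classification tree with rpart; it uses the kyphosis data frame that ships with the package, and the parms argument simply makes the default Gini splitting criterion explicit:

    > library(rpart)
    > # grow a classification tree; splits are chosen by the Gini index
    > tree_fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
    +                   method = "class", parms = list(split = "gini"))

Passing split = "information" instead would split on entropy; we'll work through full examples with rpart later in the chapter.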
