- Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
- Tarek Amr
How do decision trees learn?
It's time to find out how decision trees actually learn in order to configure them. In the internal structure we just printed, the tree decided to use a petal width of 0.8 as its initial splitting decision. This was done because decision trees try to build the smallest possible tree using the following technique.
The tree went through all the features, trying to find a feature (petal width, here) and a value within that feature (0.8, here) so that if we split all our training data into two parts (one for petal width ≤ 0.8, and one for petal width > 0.8), we get the purest split possible. In other words, it tried to find a condition where we can separate our classes as much as possible. Then, for each side, it repeated the same technique to split the data further.
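To see this in action, here is a minimal sketch (not taken from the book's own listings) that fits a classifier on the Iris data and reads the root node's splitting feature and threshold from the fitted tree_ attribute. Note that the exact feature picked can vary with random_state, since more than one split separates class 0 perfectly:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)
clf.fit(iris.data, iris.target)

# Node 0 is the root; tree_.feature and tree_.threshold describe its split
root_feature = iris.feature_names[clf.tree_.feature[0]]
root_threshold = clf.tree_.threshold[0]
print(f"Root split: {root_feature} <= {root_threshold:.2f}")
```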
Splitting criteria
If we only had two classes, an ideal split would put members of one class on one side and members of the other on the other side. In our case, we succeeded in putting members of class 0 on one side and members of classes 1 and 2 on the other. Obviously, we are not always guaranteed to get such a pure split. As we can see in the other branches further down the tree, we still had a mix of samples from classes 1 and 2 on each side.
Having said that, we need a way to measure purity. We need a criterion for judging whether one split is purer than another. There are two criteria that scikit-learn uses for classifiers' purity, gini and entropy, with the gini criterion as its default option. When it comes to decision tree regression, there are other criteria that we will come across later on.
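For intuition, here is a hedged sketch of the two impurity measures. The helper functions below are illustrative only and are not part of scikit-learn's API; the criterion itself is chosen via DecisionTreeClassifier(criterion='gini') or criterion='entropy':

```python
import numpy as np

def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    p = np.asarray(class_counts, dtype=float) / np.sum(class_counts)
    return 1.0 - np.sum(p ** 2)

def entropy(class_counts):
    """Entropy: minus the sum of p * log2(p) over the non-empty classes."""
    p = np.asarray(class_counts, dtype=float) / np.sum(class_counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([50, 0, 0]), entropy([50, 0, 0]))  # a pure node scores 0 under both
print(gini([0, 49, 5]), entropy([0, 49, 5]))  # a mixed node scores roughly 0.17 and 0.45
```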
Preventing overfitting
After the first split, the tree went on to try to separate the remaining classes: the Versicolor and the Virginica irises. However, are we really sure that our training data is detailed enough to explain all the nuances that differentiate the two classes? Isn't it possible that all those branches are driving the algorithm to learn things that happen to exist in the training data, but that will not generalize well when faced with future data? Allowing a tree to grow that much results in what is called overfitting. The tree tries to fit the training data perfectly, forgetting that the data it encounters in the future may be different. To prevent overfitting, the following settings may be used to limit the growth of a tree (a short sketch showing them together follows the list):
- max_depth: This is the maximum depth a tree can grow to. A lower number means that the tree will stop branching earlier. Setting it to None means that the tree will continue to grow until all the leaves are pure or until all the leaves contain fewer than min_samples_split samples.
- min_samples_split: The minimum number of samples a node must contain to allow further splitting there. A higher number means that the tree will stop branching earlier.
- min_samples_leaf: The minimum number of samples a node must contain to be allowed to become a leaf node. A leaf node is a node where there are no further splits and where decisions are made. A higher number may have the effect of smoothing the model, especially in regression.
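A minimal sketch, assuming the same Iris data as before (the parameter values below are arbitrary), of how these three settings are passed when the tree is created:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(
    max_depth=3,           # never grow more than three levels deep
    min_samples_split=10,  # a node needs at least 10 samples to be split further
    min_samples_leaf=4,    # every leaf must end up with at least 4 samples
    random_state=42,
)
clf.fit(iris.data, iris.target)
print(clf.get_depth(), clf.get_n_leaves())  # the resulting tree is smaller
```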
If max_depth is not set at training time to limit the tree's growth, you can alternatively prune the tree after it has been built. Curious readers can check the cost_complexity_pruning_path() method of the decision tree and find out how to use it to prune an already-grown tree.
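As a rough illustration (the choice of alpha below is arbitrary), cost_complexity_pruning_path() returns the candidate ccp_alpha values at which branches would be pruned, and refitting with one of them via the ccp_alpha parameter yields a smaller tree:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Effective alphas along the pruning path of a fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)

# Refit with one of the alphas; larger alphas prune more aggressively
pruned = DecisionTreeClassifier(
    ccp_alpha=path.ccp_alphas[len(path.ccp_alphas) // 2], random_state=0
).fit(X, y)
print(pruned.get_n_leaves())
```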
Predictions
At the end of the training process, nodes that aren't split any further are called leaf nodes. Within a leaf node, we may have five samples—four of them from class 1, one from class 2, and none from class 0. Then, at prediction time, if a sample ends up in the same leaf node, we can easily decide that the new sample belongs to class 1 since this leaf node had a 4:1 ratio of its training samples from class 1 compared to the other two classes.
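Here is a minimal sketch (again on the Iris data, with an arbitrary max_depth) of how a prediction reflects the class ratios inside the leaf a sample lands in: predict() returns the leaf's majority class, and predict_proba() returns the ratios themselves.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

sample = X[60:61]                 # one sample, kept two-dimensional
print(clf.apply(sample))          # the id of the leaf node the sample falls into
print(clf.predict(sample))        # the majority class of that leaf
print(clf.predict_proba(sample))  # the class ratios among the leaf's training samples
```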
When we make predictions on the test set, we can evaluate the classifier's accuracy versus the actual labels we have in the test set. Nevertheless, the manner in which we split our data may affect the reliability of the scores we get. In the next section, we will see how to get more reliable scores.
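A minimal sketch of that evaluation, assuming a single, arbitrarily chosen train/test split; the next section discusses why a single split like this can give misleading scores:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```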