- Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
- Tarek Amr
- 2021-06-18 18:24:29
How do decision trees learn?
It's time to find out how decision trees actually learn in order to configure them. In the internal structure we just printed, the tree decided to use a petal width of 0.8 as its initial splitting decision. This was done because decision trees try to build the smallest possible tree using the following technique.
The tree goes through all the features, looking for a feature (petal width, here) and a value within that feature (0.8, here) such that splitting the training data into two parts (one for petal width ≤ 0.8, and one for petal width > 0.8) gives the purest split possible. In other words, it tries to find a condition that separates the classes as much as possible. Then, on each side, it recursively splits the data further using the same technique.
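The search described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not scikit-learn's actual implementation: the helper names `gini` and `best_split` are hypothetical, the toy rows are made up (they are not the iris data), and Gini impurity is assumed as the purity measure.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Try every feature and every midpoint between consecutive distinct values.

    Returns (feature_index, threshold, weighted_impurity) for the candidate
    whose two sides have the lowest size-weighted impurity.
    """
    best = (None, None, float("inf"))
    n = len(rows)
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (feature, threshold, score)
    return best

# Made-up toy data: each row is [sepal width, petal width].
rows = [[3.5, 0.2], [3.0, 0.3], [3.2, 0.4], [3.1, 1.2], [2.8, 1.3], [3.4, 1.6]]
labels = [0, 0, 0, 1, 1, 1]

feature, threshold, score = best_split(rows, labels)
print(feature, round(threshold, 2), score)
```

On this toy data, only the second column (petal width) separates the two classes cleanly, so the search settles on it with a threshold of about 0.8. A full tree builder would then call the same search again on the left and right subsets.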
Splitting criteria
If we only had two classes, an ideal split would put members of one class on one side and members of the other class on the other side. In our case, we succeeded in putting members of class 0 on one side and members of classes 1 and 2 on the other. Obviously, we are not always guaranteed to get such a pure split. As we can see in the other branches further down the tree, we always had a mix of samples from classes 1 and 2 on each side.
Having said that, we need a way to measure purity, that is, a criterion for deciding whether one split is purer than another. scikit-learn offers two such criteria for classifiers, gini and entropy, with gini as the default. When it comes to decision tree regression, there are other criteria that we will come across later on.
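To make the two criteria concrete, here is a minimal sketch that computes both from the class counts in a node, using the standard textbook definitions (Gini impurity as one minus the sum of squared class proportions, entropy as the negative sum of each proportion times its base-2 logarithm). The function names and the example counts are assumptions chosen for illustration.

```python
import math

def gini(counts):
    """Gini impurity from per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits from per-class sample counts (empty classes are skipped)."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# Illustrative nodes: three balanced classes, two balanced classes, one class only.
for counts in ([50, 50, 50], [0, 50, 50], [50, 0, 0]):
    print(counts, round(gini(counts), 3), round(entropy(counts), 3))
```

Both measures are zero for a pure node and grow as the classes become more evenly mixed. In scikit-learn, the choice between them is made through the `criterion` parameter of `DecisionTreeClassifier`.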
Preventing overfitting
After the first split, the tree went on to try to separate the remaining classes: the Versicolor and the Virginica irises. However, are we really sure that our training data is detailed enough to explain all the nuances that differentiate the two classes? Isn't it possible that all those branches are driving the algorithm to learn things that happen to exist in the training data, but will not generalize well enough when faced with future data? Allowing a tree to grow so much results in what is called overfitting. The tree tries to perfectly fit the training data, forgetting that the data it may encounter in the future may be different. To prevent overfitting, the following settings may be used to limit the growth of a tree:
- max_depth: The maximum depth the tree can reach. A lower number means that the tree will stop branching earlier. Setting it to None means that the tree will continue to grow until all the leaves are pure or until all the leaves contain fewer than min_samples_split samples.
- min_samples_split: The minimum number of samples a node must contain before it can be split further. A higher number means that the tree will stop branching earlier.
- min_samples_leaf: The minimum number of samples that must end up in each leaf node; a split is only considered if it leaves at least this many samples on each side. A leaf node is a node where there are no further splits and where decisions are made. A higher number may have the effect of smoothing the model, especially in regression.
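The effect of these three settings can be seen by comparing an unrestricted tree with a restricted one. The sketch below assumes the iris dataset bundled with scikit-learn; the specific values (2, 10, and 5) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Default settings: the tree keeps splitting until its leaves are pure.
unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y)

# Illustrative limits on depth, on node size before a split, and on leaf size.
restricted = DecisionTreeClassifier(
    max_depth=2,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=0,
).fit(X, y)

for name, tree in [("unrestricted", unrestricted), ("restricted", restricted)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

The restricted tree ends up shallower and with fewer leaves. Whether that helps on unseen data depends on the dataset, so these values are usually tuned rather than fixed in advance.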
If max_depth is not set at training time to limit the tree's growth, you can alternatively prune the tree after it has been built. Curious readers can check the decision tree's cost_complexity_pruning_path() method, which computes the candidate pruning strengths, and the ccp_alpha parameter, which applies one of them to prune a fully grown tree.
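For readers who want to try pruning, here is a minimal sketch. It assumes the iris dataset and a train/test split that are not part of the text above. `cost_complexity_pruning_path()` returns the candidate alpha values, and each one is passed back through the `ccp_alpha` parameter to fit a progressively smaller tree.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate pruning strengths, computed from a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

leaf_counts = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    leaf_counts.append(pruned.get_n_leaves())
    print(f"alpha={alpha:.4f} leaves={pruned.get_n_leaves()} "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```

Larger alpha values prune more aggressively. In practice, the alpha would be chosen on validation data rather than on the test set used here for brevity.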
Predictions
At the end of the training process, nodes that aren't split any further are called leaf nodes. Within a leaf node, we may have five samples: four of them from class 1, one from class 2, and none from class 0. Then, at prediction time, if a new sample ends up in this leaf node, we decide that it belongs to class 1, because four of the leaf's five training samples came from class 1.
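The arithmetic behind that decision is simple enough to write out. The sketch below uses the hypothetical leaf from the paragraph above (zero, four, and one training samples for classes 0, 1, and 2); the variable names are illustrative.

```python
# Training samples per class that ended up in one leaf (the example above).
leaf_counts = {0: 0, 1: 4, 2: 1}

total = sum(leaf_counts.values())
probabilities = {label: count / total for label, count in leaf_counts.items()}

# The predicted class is the one with the largest share of the leaf's samples.
predicted = max(probabilities, key=probabilities.get)

print(probabilities)  # {0: 0.0, 1: 0.8, 2: 0.2}
print(predicted)      # 1
```

scikit-learn's `predict_proba()` reports class fractions of this kind for the leaf a sample lands in, and `predict()` returns the class with the highest fraction.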
When we make predictions on the test set, we can evaluate the classifier's accuracy versus the actual labels we have in the test set. Nevertheless, the manner in which we split our data may affect the reliability of the scores we get. In the next section, we will see how to get more reliable scores.
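As a small illustration of how the split can affect the score, the sketch below repeats the train/test split with different seeds and measures accuracy each time. The iris dataset, the 30% test size, and the five seeds are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(5):
    # A different seed puts different samples in the training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    scores.append(accuracy_score(y_test, clf.predict(X_test)))

print([round(score, 3) for score in scores])
```

Any variation between the printed scores comes only from which samples landed in the test set, which is why a single split can give a misleading picture.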