Minimum Sample in Leaf
Previously, we learned how to reduce or increase the depth of the trees in Random Forest and saw how it affects the model's performance and its tendency to overfit. Now we will go through another important hyperparameter: min_samples_leaf.
This hyperparameter, as its name implies, is related to the leaf nodes of the trees. We saw earlier that the RandomForest algorithm builds nodes that clearly separate observations into two different groups. If we look at the tree example in Figure 4.15, the top node splits the data into two groups: the left-hand group contains mainly observations of the bending_1 class, while the right-hand group contains observations from any class. This sounds like a reasonable split, but are we sure it is not increasing the risk of overfitting? For instance, what if this split leads to only one observation falling on the left-hand side? That rule would be very specific (applying to only a single case), and we can't say it is generic enough for unseen data. It may be an edge case in the training set that will never happen again.
It would be great if we could tell the model not to create such specific rules that occur quite infrequently. Luckily, RandomForest has a hyperparameter for exactly this and, you guessed it, it is min_samples_leaf. This hyperparameter specifies the minimum number of observations (or samples) that must fall under a leaf node for it to be considered in the tree. For instance, if we set min_samples_leaf to 3, then RandomForest will only consider a split that leads to at least three observations on both the left and right leaf nodes. If a split does not meet this condition, the model will exclude it from the tree. The default value in sklearn for this hyperparameter is 1. Let's try to find the optimal value of min_samples_leaf for the Activity Recognition dataset:
rf_model7 = RandomForestClassifier(random_state=1, \
                                   n_estimators=50, \
                                   max_depth=10, \
                                   min_samples_leaf=3)
rf_model7.fit(X_train, y_train)
preds7 = rf_model7.predict(X_train)
test_preds7 = rf_model7.predict(X_test)
print(accuracy_score(y_train, preds7))
print(accuracy_score(y_test, test_preds7))
The output will be as follows:

Figure 4.29: Accuracy scores for the training and testing sets for min_samples_leaf=3
With min_samples_leaf=3, the accuracy for both the training and testing sets didn't change much compared to the best model we found in the previous section. Let's try increasing it to 10:
rf_model8 = RandomForestClassifier(random_state=1, \
                                   n_estimators=50, \
                                   max_depth=10, \
                                   min_samples_leaf=10)
rf_model8.fit(X_train, y_train)
preds8 = rf_model8.predict(X_train)
test_preds8 = rf_model8.predict(X_test)
print(accuracy_score(y_train, preds8))
print(accuracy_score(y_test, test_preds8))
The output will be as follows:

Figure 4.30: Accuracy scores for the training and testing sets for min_samples_leaf=10
Now the training accuracy dropped a bit while the testing accuracy increased, and the difference between them is smaller. So, our model is overfitting less. Let's try another value for this hyperparameter, 25:
rf_model9 = RandomForestClassifier(random_state=1, \
                                   n_estimators=50, \
                                   max_depth=10, \
                                   min_samples_leaf=25)
rf_model9.fit(X_train, y_train)
preds9 = rf_model9.predict(X_train)
test_preds9 = rf_model9.predict(X_test)
print(accuracy_score(y_train, preds9))
print(accuracy_score(y_test, test_preds9))
The output will be as follows:

Figure 4.31: Accuracy scores for the training and testing sets for min_samples_leaf=25
Both the training and testing accuracies decreased, but they are now quite close to each other. So, we will keep this value (25) as the optimal one for this dataset, as the performance is still good and we are not overfitting too much.
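One way to quantify the overfitting we have been eyeballing is to compute the gap between the training and testing accuracies for each model. The following snippet is our own addition (it is not part of the original code) and simply reuses the predictions we computed above:
# Train-test accuracy gap for each model: a larger gap means more overfitting
results = [('min_samples_leaf=3', preds7, test_preds7),
           ('min_samples_leaf=10', preds8, test_preds8),
           ('min_samples_leaf=25', preds9, test_preds9)]
for name, train_p, test_p in results:
    gap = accuracy_score(y_train, train_p) - accuracy_score(y_test, test_p)
    print(name, round(gap, 4))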
When choosing the optimal value for this hyperparameter, you need to be careful: a value that is too low increases the chance of the model overfitting, but on the other hand, a value that is too high leads to underfitting (the model will not be able to predict the right outcome accurately).
For instance, if you have a dataset of 1,000 rows and you set min_samples_leaf to 400, the model will not be able to find good splits to predict 5 different classes. In this case, each tree can only create a single split, and the model will only be able to predict two different classes instead of 5. It is good practice to start with low values first and then progressively increase them until you reach satisfactory performance.
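To follow that practice, you could wrap the steps we have been repeating into a small loop that tries increasing values of min_samples_leaf and prints the training and testing accuracies for each one. The sketch below is our own illustration (the list of candidate values is an arbitrary choice) and reuses the same split and hyperparameters as before:
# Progressively increase min_samples_leaf and compare accuracies
for leaf in [1, 3, 10, 25, 50]:
    model = RandomForestClassifier(random_state=1, n_estimators=50,
                                   max_depth=10, min_samples_leaf=leaf)
    model.fit(X_train, y_train)
    acc_train = accuracy_score(y_train, model.predict(X_train))
    acc_test = accuracy_score(y_test, model.predict(X_test))
    print(leaf, round(acc_train, 4), round(acc_test, 4))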
Exercise 4.04: Tuning min_samples_leaf
In this exercise, we will keep tuning our Random Forest classifier that predicts animal type by trying two different values for the min_samples_leaf hyperparameter.
We will be using the same zoo dataset as in the previous exercise.
- Open a new Colab notebook.
- Import the pandas package, train_test_split, RandomForestClassifier, and accuracy_score from sklearn:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
- Create a variable called file_url that contains the URL to the dataset:
file_url = 'https://raw.githubusercontent.com'\
           '/PacktWorkshops/The-Data-Science-Workshop'\
           '/master/Chapter04/Dataset/openml_phpZNNasq.csv'
- Load the dataset into a DataFrame using the .read_csv() method from pandas:
df = pd.read_csv(file_url)
- Remove the animal column using .drop() and then extract the type target variable into a new variable called y using .pop():
df.drop(columns='animal', inplace=True)
y = df.pop('type')
- Split the data into training and testing sets with train_test_split() and the parameters test_size=0.4 and random_state=188:
X_train, X_test, \
y_train, y_test = train_test_split(df, y, test_size=0.4, \
                                   random_state=188)
- Instantiate RandomForestClassifier with random_state=42, n_estimators=30, max_depth=2, and min_samples_leaf=3, and then fit the model with the training set:
rf_model = RandomForestClassifier(random_state=42, \
                                  n_estimators=30, \
                                  max_depth=2, \
                                  min_samples_leaf=3)
rf_model.fit(X_train, y_train)
You should get the following output:
Figure 4.32: Logs of RandomForest
- Make predictions on the training and testing sets with .predict() and save the results into two new variables called train_preds and test_preds:
train_preds = rf_model.predict(X_train)
test_preds = rf_model.predict(X_test)
- Calculate the accuracy score for the training and testing sets and save the results in two new variables called train_acc and test_acc:
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
- Print the accuracy scores, train_acc and test_acc:
print(train_acc)
print(test_acc)
You should get the following output:
Figure 4.33: Accuracy scores for the training and testing sets
The accuracy scores decreased for both the training and testing sets compared to the best result we got in the previous exercise. However, the difference between the training and testing accuracy scores is now much smaller, so our model is overfitting less.
- Instantiate another RandomForestClassifier with random_state=42, n_estimators=30, max_depth=2, and min_samples_leaf=7, and then fit the model with the training set:
rf_model2 = RandomForestClassifier(random_state=42, \
                                   n_estimators=30, \
                                   max_depth=2, \
                                   min_samples_leaf=7)
rf_model2.fit(X_train, y_train)
You should get the following output:
Figure 4.34: Logs of RandomForest with max_depth=2
- Make predictions on the training and testing sets with .predict() and save the results into two new variables called train_preds2 and test_preds2:
train_preds2 = rf_model2.predict(X_train)
test_preds2 = rf_model2.predict(X_test)
- Calculate the accuracy score for the training and testing sets and save the results in two new variables called train_acc2 and test_acc2:
train_acc2 = accuracy_score(y_train, train_preds2)
test_acc2 = accuracy_score(y_test, test_preds2)
- Print the accuracy scores, train_acc2 and test_acc2:
print(train_acc2)
print(test_acc2)
You should get the following output:
Figure 4.35: Accuracy scores for the training and testing sets
Increasing the value of min_samples_leaf to 7 has stopped the model from overfitting. We got very similar accuracy scores for the training and testing sets, at around 0.8. We will choose this value as the optimal one for min_samples_leaf for this dataset.
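If you want to double-check this choice, you can compare the train-test accuracy gap of the two models you just built. This short snippet is our own addition and simply reuses the variables from the previous steps:
# Difference between training and testing accuracy for each model
print('min_samples_leaf=3 gap:', round(train_acc - test_acc, 4))
print('min_samples_leaf=7 gap:', round(train_acc2 - test_acc2, 4))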
Note
To access the source code for this specific section, please refer to https://packt.live/3kUYVZa.
You can also run this example online at https://packt.live/348bv0W.