
Minimum Sample in Leaf

Previously, we learned how to reduce or increase the depth of the trees in a Random Forest and saw how this affects its performance and its tendency to overfit. Now we will go through another important hyperparameter: min_samples_leaf.

This hyperparameter, as its name implies, is related to the leaf nodes of the trees. We saw earlier that the RandomForest algorithm builds nodes that split observations into two separate groups. If we look at the tree example in Figure 4.15, the top node splits the data into two groups: the left-hand group contains mainly observations of the bending_1 class, while the right-hand group can contain observations from any class. This sounds like a reasonable split, but are we sure it doesn't increase the risk of overfitting? For instance, what if this split leads to only one observation falling on the left-hand side? That rule would be very specific (it applies to only a single case), and we can't say it is generic enough to work on unseen data. It may be an edge case in the training set that will never occur again.

It would be great if we could tell the model not to create such specific rules that apply to so few observations. Luckily, RandomForest has a hyperparameter for this and, you guessed it, it is min_samples_leaf. This hyperparameter specifies the minimum number of observations (or samples) that must fall under a leaf node for a split to be kept in the tree. For instance, if we set min_samples_leaf to 3, RandomForest will only consider splits that leave at least three observations in both the left and right leaf nodes. If a split does not meet this condition, the model will not consider it and will exclude it from the tree. The default value of this hyperparameter in sklearn is 1. Let's try to find the optimal value of min_samples_leaf for the Activity Recognition dataset:

rf_model7 = RandomForestClassifier(random_state=1, \
                                   n_estimators=50, \
                                   max_depth=10, \
                                   min_samples_leaf=3)
rf_model7.fit(X_train, y_train)
preds7 = rf_model7.predict(X_train)
test_preds7 = rf_model7.predict(X_test)
print(accuracy_score(y_train, preds7))
print(accuracy_score(y_test, test_preds7))

The output will be as follows:

Figure 4.29: Accuracy scores for the training and testing sets for min_samples_leaf=3

With min_samples_leaf=3, the accuracy for both the training and testing sets didn't change much compared to the best model we found in the previous section. Let's try increasing it to 10:

rf_model8 = RandomForestClassifier(random_state=1, \
                                   n_estimators=50, \
                                   max_depth=10, \
                                   min_samples_leaf=10)
rf_model8.fit(X_train, y_train)
preds8 = rf_model8.predict(X_train)
test_preds8 = rf_model8.predict(X_test)
print(accuracy_score(y_train, preds8))
print(accuracy_score(y_test, test_preds8))

The output will be as follows:

Figure 4.30: Accuracy scores for the training and testing sets for min_samples_leaf=10

Now the accuracy on the training set has dropped a bit while the accuracy on the testing set has increased, and the difference between them is smaller, so our model is overfitting less. Let's try another value for this hyperparameter – 25:

rf_model9 = RandomForestClassifier(random_state=1, \
                                   n_estimators=50, \
                                   max_depth=10, \
                                   min_samples_leaf=25)
rf_model9.fit(X_train, y_train)
preds9 = rf_model9.predict(X_train)
test_preds9 = rf_model9.predict(X_test)
print(accuracy_score(y_train, preds9))
print(accuracy_score(y_test, test_preds9))

The output will be as follows:

Figure 4.31: Accuracy scores for the training and testing sets for min_samples_leaf=25

Both the training and testing accuracies decreased, but they are now quite close to each other. So, we will keep this value (25) as the optimal one for this dataset: the performance is still acceptable and we are not overfitting too much.
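
If you want to see this constraint at work, you can inspect the fitted trees directly. The following is a minimal sketch (it is not part of the original example) that reads the number of training samples falling into each leaf of rf_model9, using the tree_ structure sklearn exposes on each fitted tree; it assumes rf_model9 has already been fitted as shown above:

min_leaf_sizes = []
for estimator in rf_model9.estimators_:
    tree = estimator.tree_
    # A node with no left child is a leaf
    is_leaf = tree.children_left == -1
    min_leaf_sizes.append(tree.n_node_samples[is_leaf].min())
# The smallest leaf across all trees; it should not be below min_samples_leaf=25
print(min(min_leaf_sizes))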

When choosing the optimal value for this hyperparameter, you need to be careful: a value that's too low will increase the chance of the model overfitting, but on the other hand, setting a very high value will lead to underfitting (the model will not accurately predict the right outcome).

For instance, if you have a dataset of 1,000 rows and you set min_samples_leaf to 400, the model will not be able to find good splits to predict, say, 5 different classes: each tree can create at most one split, so it will only ever be able to separate 2 different classes instead of 5. It is good practice to start with low values first and then progressively increase them until you reach satisfactory performance.
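
One simple way to apply this practice is to loop over a few candidate values and compare the training and testing accuracies for each. The following is a rough sketch, assuming the same X_train, X_test, y_train, and y_test splits and the other hyperparameters used above; the list of candidate values is arbitrary:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Start with low values and progressively increase them
for leaf_size in [1, 3, 10, 25, 50]:
    model = RandomForestClassifier(random_state=1, n_estimators=50, \
                                   max_depth=10, min_samples_leaf=leaf_size)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    # Look for a value where the two scores are close and still high
    print(leaf_size, train_acc, test_acc)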

Exercise 4.04: Tuning min_samples_leaf

In this exercise, we will keep tuning our Random Forest classifier, which predicts animal type, by trying two different values for the min_samples_leaf hyperparameter. We will be using the same zoo dataset as in the previous exercise:

  1. Open a new Colab notebook.
  2. Import the pandas package, train_test_split, RandomForestClassifier, and accuracy_score from sklearn:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

  3. Create a variable called file_url that contains the URL to the dataset:

    file_url = 'https://raw.githubusercontent.com'\
               '/PacktWorkshops/The-Data-Science-Workshop'\
               '/master/Chapter04/Dataset/openml_phpZNNasq.csv'

  4. Load the dataset into a DataFrame using the .read_csv() method from pandas:

    df = pd.read_csv(file_url)

  5. Remove the animal column using .drop() and then extract the type target variable into a new variable called y using .pop():

    df.drop(columns='animal', inplace=True)
    y = df.pop('type')

  6. Split the data into training and testing sets with train_test_split() and the parameters test_size=0.4 and random_state=188:

    X_train, X_test, \
    y_train, y_test = train_test_split(df, y, test_size=0.4, \
                                       random_state=188)

  7. Instantiate RandomForestClassifier with random_state=42, n_estimators=30, max_depth=2, and min_samples_leaf=3, and then fit the model with the training set:

    rf_model = RandomForestClassifier(random_state=42, \
                                      n_estimators=30, \
                                      max_depth=2, \
                                      min_samples_leaf=3)
    rf_model.fit(X_train, y_train)

    You should get the following output:

    Figure 4.32: Logs of RandomForest

  8. Make predictions on the training and testing sets with .predict() and save the results into two new variables called train_preds and test_preds:

    train_preds = rf_model.predict(X_train)
    test_preds = rf_model.predict(X_test)

  9. Calculate the accuracy score for the training and testing sets and save the results in two new variables called train_acc and test_acc:

    train_acc = accuracy_score(y_train, train_preds)
    test_acc = accuracy_score(y_test, test_preds)

  10. Print the accuracy scores – train_acc and test_acc:

    print(train_acc)
    print(test_acc)

    You should get the following output:

    Figure 4.33: Accuracy scores for the training and testing sets

    The accuracy score decreased for both the training and testing sets compared to the best result we got in the previous exercise. But now the difference between the training and testing accuracy scores is much smaller, so our model is overfitting less.

  11. Instantiate another RandomForestClassifier with random_state=42, n_estimators=30, max_depth=2, and min_samples_leaf=7, and then fit the model with the training set:

    rf_model2 = RandomForestClassifier(random_state=42, \
                                       n_estimators=30, \
                                       max_depth=2, \
                                       min_samples_leaf=7)
    rf_model2.fit(X_train, y_train)

    You should get the following output:

    Figure 4.34: Logs of RandomForest with max_depth=2

  12. Make predictions on the training and testing sets with .predict() and save the results into two new variables called train_preds2 and test_preds2:

    train_preds2 = rf_model2.predict(X_train)
    test_preds2 = rf_model2.predict(X_test)

  13. Calculate the accuracy score for the training and testing sets and save the results in two new variables called train_acc2 and test_acc2:

    train_acc2 = accuracy_score(y_train, train_preds2)
    test_acc2 = accuracy_score(y_test, test_preds2)

  14. Print the accuracy scores – train_acc2 and test_acc2:

    print(train_acc2)
    print(test_acc2)

    You should get the following output:

Figure 4.35: Accuracy scores for the training and testing sets

Increasing min_samples_leaf to 7 has stopped the model from overfitting. We got very similar accuracy scores for the training and testing sets, at around 0.8. We will choose this value as the optimal one for min_samples_leaf for this dataset.
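
To see why the larger value overfits less, you can compare how many leaves each tree ends up with under the two settings. This is a small illustrative sketch (it is not one of the exercise steps), assuming rf_model and rf_model2 are the two classifiers fitted in steps 7 and 11:

import numpy as np

# Average number of leaves per tree in each forest; a larger min_samples_leaf
# forces the trees to keep fewer, larger leaves
print(np.mean([est.get_n_leaves() for est in rf_model.estimators_]))
print(np.mean([est.get_n_leaves() for est in rf_model2.estimators_]))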

Note

To access the source code for this specific section, please refer to https://packt.live/3kUYVZa.

You can also run this example online at https://packt.live/348bv0W.
