
Maximum Depth

In the previous section, we learned how Random Forest builds multiple trees to make predictions. Increasing the number of trees does improve model performance, but it usually doesn't do much to reduce the risk of overfitting. Our model in the previous example is still performing much better on the training set (data it has already seen) than on the testing set (unseen data).

So, we are not yet confident enough to say the model will perform well in production. Several hyperparameters can help lower the risk of overfitting for Random Forest, and one of them is called max_depth.

This hyperparameter defines the maximum depth of the trees built by Random Forest. Basically, it tells the Random Forest model how many nodes (questions) it can create before making predictions. But how will that help to reduce overfitting, you may ask? Well, let's say you built a single tree and set the max_depth hyperparameter to 50. This would mean that there would be some cases where you could ask 49 different questions (the value of 50 includes the final leaf node) before making a prediction. So, the logic would be IF X1 > value1 AND X2 > value2 AND X1 <= value3 AND … AND X3 > value49 THEN predict class A.

As you can imagine, this is a very specific rule. In the end, it may apply to only a few observations in the training set, and such cases will appear very infrequently. Your model would therefore be overfitting. By default, the value of the max_depth hyperparameter is None, which means no limit is set on the depth of the trees.
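
To see how this plays out in code, here is a minimal sketch (using scikit-learn's DecisionTreeClassifier on a synthetic dataset, not the Activity Recognition data) showing that a tree left at the default max_depth=None keeps splitting until it has memorized the training set, while setting max_depth caps the number of questions it can ask:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# max_depth=None (the default): the tree grows until its leaves are pure
deep_tree = DecisionTreeClassifier(random_state=1).fit(X, y)
print(deep_tree.get_depth())     # typically much deeper than 3

# max_depth=3: at most 3 questions are asked before a prediction is made
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
print(shallow_tree.get_depth())  # capped at 3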

What you really want is to find rules that are generic enough to apply to larger groups of observations. This is why it is recommended not to build deep trees with Random Forest. Let's try several values for this hyperparameter on the Activity Recognition dataset: 3, 10, and 50:

rf_model4 = RandomForestClassifier(random_state=1, \
                                   n_estimators=50, max_depth=3)
rf_model4.fit(X_train, y_train)
preds4 = rf_model4.predict(X_train)
test_preds4 = rf_model4.predict(X_test)
print(accuracy_score(y_train, preds4))
print(accuracy_score(y_test, test_preds4))

You should get the following output:

Figure 4.22: Accuracy scores for the training and testing sets and a max_depth of 3

For a max_depth of 3, we got extremely similar results for the training and testing sets, but the overall performance decreased drastically to 0.61. Our model is no longer overfitting, but it is now underfitting; that is, it is not predicting the target variable very well (correct in only 61% of cases). Let's increase max_depth to 10:

rf_model5 = RandomForestClassifier(random_state=1, \
                                   n_estimators=50, \
                                   max_depth=10)
rf_model5.fit(X_train, y_train)
preds5 = rf_model5.predict(X_train)
test_preds5 = rf_model5.predict(X_test)
print(accuracy_score(y_train, preds5))
print(accuracy_score(y_test, test_preds5))

You should get the following output:

Figure 4.23: Accuracy scores for the training and testing sets and a max_depth of 10

The accuracy for the training set increased and is now relatively close to that of the testing set. We are starting to get some good results, but the model is still slightly overfitting. Now, let's see the results for max_depth = 50:

rf_model6 = RandomForestClassifier(random_state=1, \
                                   n_estimators=50, \
                                   max_depth=50)
rf_model6.fit(X_train, y_train)
preds6 = rf_model6.predict(X_train)
test_preds6 = rf_model6.predict(X_test)
print(accuracy_score(y_train, preds6))
print(accuracy_score(y_test, test_preds6))

The output will be as follows:

Figure 4.24: Accuracy scores for the training and testing sets and a max_depth of 50

The accuracy jumped to 0.99 for the training set, but it didn't improve much for the testing set. So, the model is overfitting with max_depth = 50. It seems the sweet spot for getting good predictions without much overfitting is around 10 for the max_depth hyperparameter on this dataset.
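
Rather than fitting each model by hand, you can also scan a few candidate depths in a loop. The following sketch assumes the same X_train, X_test, y_train, and y_test split and the imports used in the examples above:

# Compare training and testing accuracy for several candidate depths
for depth in [3, 10, 50]:
    model = RandomForestClassifier(random_state=1, \
                                   n_estimators=50, \
                                   max_depth=depth)
    model.fit(X_train, y_train)
    print(depth, \
          accuracy_score(y_train, model.predict(X_train)), \
          accuracy_score(y_test, model.predict(X_test)))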

Exercise 4.03: Tuning max_depth to Reduce Overfitting

In this exercise, we will keep tuning our Random Forest classifier that predicts animal type by trying two different values for the max_depth hyperparameter:

We will be using the same zoo dataset as in the previous exercise.

  1. Open a new Colab notebook.
  2. Import the pandas package, train_test_split, RandomForestClassifier, and accuracy_score from sklearn:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

  3. Create a variable called file_url that contains the URL to the dataset:

    file_url = 'https://raw.githubusercontent.com/'\
               'PacktWorkshops/The-Data-Science-Workshop'\
               '/master/Chapter04/Dataset'\
               '/openml_phpZNNasq.csv'

  4. Load the dataset into a DataFrame using the .read_csv() method from pandas:

    df = pd.read_csv(file_url)

  5. Remove the animal column using .drop() and then extract the type target variable into a new variable called y using .pop():

    df.drop(columns='animal', inplace=True)
    y = df.pop('type')

  6. Split the data into training and testing sets with train_test_split() and the parameters test_size=0.4 and random_state=188:

    X_train, X_test, y_train, y_test = train_test_split\
                                       (df, y, test_size=0.4, \
                                        random_state=188)

  7. Instantiate RandomForestClassifier with random_state=42, n_estimators=30, and max_depth=5, and then fit the model with the training set:

    rf_model = RandomForestClassifier(random_state=42, \
                                      n_estimators=30, \
                                      max_depth=5)
    rf_model.fit(X_train, y_train)

    You should get the following output:

    Figure 4.25: Logs of RandomForest

  8. Make predictions on the training and testing sets with .predict() and save the results into two new variables called train_preds and test_preds:

    train_preds = rf_model.predict(X_train)
    test_preds = rf_model.predict(X_test)

  9. Calculate the accuracy score for the training and testing sets and save the results in two new variables called train_acc and test_acc:

    train_acc = accuracy_score(y_train, train_preds)
    test_acc = accuracy_score(y_test, test_preds)

  10. Print the accuracy scores: train_acc and test_acc:

    print(train_acc)
    print(test_acc)

    You should get the following output:

    Figure 4.26: Accuracy scores for the training and testing sets

    We got exactly the same accuracy scores as the best result we obtained in the previous exercise. This value of the max_depth hyperparameter hasn't impacted the model's performance.

  11. Instantiate another RandomForestClassifier with random_state=42, n_estimators=30, and max_depth=2, and then fit the model with the training set:

    rf_model2 = RandomForestClassifier(random_state=42, \
                                       n_estimators=30, \
                                       max_depth=2)
    rf_model2.fit(X_train, y_train)

    You should get the following output:

    Figure 4.27: Logs of RandomForestClassifier with max_depth = 2

  12. Make predictions on the training and testing sets with .predict() and save the results into two new variables called train_preds2 and test_preds2:

    train_preds2 = rf_model2.predict(X_train)
    test_preds2 = rf_model2.predict(X_test)

  13. Calculate the accuracy scores for the training and testing sets and save the results in two new variables called train_acc2 and test_acc2:

    train_acc2 = accuracy_score(y_train, train_preds2)
    test_acc2 = accuracy_score(y_test, test_preds2)

  14. Print the accuracy scores: train_acc2 and test_acc2:

    print(train_acc2)
    print(test_acc2)

    You should get the following output:

Figure 4.28: Accuracy scores for training and testing sets

You learned how to tune the max_depth hyperparameter in this exercise. Reducing its value to 2 decreased the accuracy score for the training set to 0.9, but it also narrowed the gap between the training (0.9) and testing (0.83) scores, reducing the overfitting, so we will keep this value as the optimal one and proceed to the next step.
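
As a quick optional check, you can quantify the overfitting directly as the gap between the training and testing scores, using the variables defined in this exercise:

# The smaller the gap, the less the model is overfitting
print(train_acc - test_acc)    # gap for max_depth=5
print(train_acc2 - test_acc2)  # gap for max_depth=2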

Note

To access the source code for this specific section, please refer to https://packt.live/31YXkIY.

You can also run this example online at https://packt.live/2CCkxYX.
