Maximum Features
We are getting close to the end of this chapter. You have already learned how to tune several of the most important hyperparameters for RandomForest. In this section, we will present you with another extremely important one: max_features.
Earlier, we learned that RandomForest builds multiple trees and takes the average to make predictions. This is why it is called a forest, but we haven't really discussed the "random" part yet. Going through this chapter, you may have asked yourself: how does building multiple trees help to get better predictions, and won't all the trees look the same given that the input data is the same?
Before answering these questions, let's use the analogy of a court trial. In some countries, the final decision of a trial is made by either a judge or a jury. A judge is a person who knows the law in detail and can decide whether a person has broken it or not. A jury, on the other hand, is composed of people from different backgrounds who don't know each other or any of the parties involved in the trial and who have limited knowledge of the legal system. In this case, we are asking random people who are not experts in the law to decide the outcome of a case. This sounds very risky at first: the risk of one person making the wrong decision is quite high. But the risk of 10 or 20 people all making the wrong decision at the same time is relatively low.
But there is one condition that needs to be met for this to work: randomness. If all the people in the jury come from the same background, work in the same industry, or live in the same area, they may share the same way of thinking and make similar decisions. For instance, if a group of people were all raised in a community where only hot chocolate is drunk at breakfast, and one day you asked them whether it is OK to drink coffee at breakfast, they would all say no.
On the other hand, say you gathered another group of people from different backgrounds with different habits: some drink coffee, others tea, a few drink orange juice, and so on. If you asked them the same question, the majority would say yes. Because these people were picked at random, they have less bias as a group, which lowers the risk of them making a wrong decision.
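To put a rough number on this intuition, we can work out the probability that a majority of independent voters gets it wrong. The short sketch below is purely illustrative and assumes a made-up error rate of 30% per voter; it is not part of this chapter's exercises:
from math import comb

def prob_majority_wrong(n_voters, p_wrong):
    # Probability that a strict majority of independent voters is wrong
    majority = n_voters // 2 + 1
    return sum(comb(n_voters, k) * p_wrong**k * (1 - p_wrong)**(n_voters - k)
               for k in range(majority, n_voters + 1))

print(prob_majority_wrong(1, 0.3))   # 0.3: a single voter is wrong 30% of the time
print(prob_majority_wrong(21, 0.3))  # roughly 0.026: a majority of 21 is rarely wrong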
RandomForest applies the same logic: it builds a number of trees independently of each other by randomly sampling the data. One tree may see 60% of the training data, another 70%, and so on. As a result, the trees are very likely to be different from each other and not share the same bias. This is the secret of RandomForest: building multiple random trees leads to higher accuracy.
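In sklearn, this row sampling is controlled by the bootstrap and max_samples parameters of RandomForestClassifier (max_samples requires scikit-learn 0.22 or later). The snippet below is only a sketch of how you could cap each tree at roughly 70% of the training rows; it reuses the X_train and y_train variables from this chapter:
rf_sampled = RandomForestClassifier(random_state=1, \
                                    n_estimators=50, \
                                    bootstrap=True, \
                                    max_samples=0.7)
rf_sampled.fit(X_train, y_train)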
But this is not the only way RandomForest creates randomness. It also randomly samples columns: each tree will only see a subset of the features rather than all of them. This is exactly what the max_features hyperparameter is for: it sets the maximum number of features a tree is allowed to see.
In sklearn, you can specify the value of this hyperparameter as follows (a short sketch after this list shows what each option resolves to):
- The maximum number of features, as an integer.
- A ratio (a float between 0 and 1), as the fraction of features allowed.
- The sqrt option (the default value in sklearn), which stands for square root and will use the square root of the number of features as the maximum value. If, for a dataset, there are 25 features, its square root will be 5 and this will be the value for max_features.
- The log2 option, which will use the log base 2 of the number of features as the maximum value. If, for a dataset, there are eight features, its log2 will be 3 and this will be the value for max_features.
- The None value, which means Random Forest will use all the features available.
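To make these options concrete, here is a small sketch (not part of the exercise code) that shows how many features each setting would allow for a dataset with six features, roughly mirroring the way sklearn rounds these values down:
import math

def resolve_max_features(setting, n_features):
    # Number of features a single tree is allowed to consider
    if setting is None:
        return n_features
    if setting == 'sqrt':
        return int(math.sqrt(n_features))
    if setting == 'log2':
        return int(math.log2(n_features))
    if isinstance(setting, float):
        return max(1, int(setting * n_features))
    return setting  # already an integer

for setting in [2, 0.7, 'sqrt', 'log2', None]:
    print(setting, '->', resolve_max_features(setting, 6))
# 2 -> 2, 0.7 -> 4, sqrt -> 2, log2 -> 2, None -> 6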
Let's try three different values on the activity dataset. First, we will specify the maximum number of features as two:
rf_model10 = RandomForestClassifier(random_state=1, \
n_estimators=50, \
max_depth=10, \
min_samples_leaf=25, \
max_features=2)
rf_model10.fit(X_train, y_train)
preds10 = rf_model10.predict(X_train)
test_preds10 = rf_model10.predict(X_test)
print(accuracy_score(y_train, preds10))
print(accuracy_score(y_test, test_preds10))
The output will be as follows:

Figure 4.36: Accuracy scores for the training and testing sets for max_features=2
We got results similar to those of the best model we trained in the previous section. This is not really surprising, as we were using the default value of max_features at that time, which is sqrt. This dataset has six features, and the square root of 6 is roughly 2.45, which is quite close to 2. This time, let's try with the ratio 0.7:
rf_model11 = RandomForestClassifier(random_state=1, \
n_estimators=50, \
max_depth=10, \
min_samples_leaf=25, \
max_features=0.7)
rf_model11.fit(X_train, y_train)
preds11 = rf_model11.predict(X_train)
test_preds11 = rf_model11.predict(X_test)
print(accuracy_score(y_train, preds11))
print(accuracy_score(y_test, test_preds11))
The output will be as follows:

Figure 4.37: Accuracy scores for the training and testing sets for max_features=0.7
With this ratio, the accuracy scores increased for both the training and testing sets, and the difference between them is smaller. Our model is overfitting less now and its predictive power has slightly improved. Let's give the log2 option a shot:
rf_model12 = RandomForestClassifier(random_state=1, \
n_estimators=50, \
max_depth=10, \
min_samples_leaf=25, \
max_features='log2')
rf_model12.fit(X_train, y_train)
preds12 = rf_model12.predict(X_train)
test_preds12 = rf_model12.predict(X_test)
print(accuracy_score(y_train, preds12))
print(accuracy_score(y_test, test_preds12))
The output will be as follows:

Figure 4.38: Accuracy scores for the training and testing sets for max_features='log2'
We got results similar to those for the default value (sqrt) and for max_features=2. Again, this is expected: the log2 of 6 is roughly 2.58, which, like the square root, rounds down to 2 features. So, the optimal value we found for the max_features hyperparameter on this dataset is 0.7.
Exercise 4.05: Tuning max_features
In this exercise, we will keep tuning our RandomForest classifier that predicts animal type by trying two different values for the max_features hyperparameter. We will be using the same zoo dataset as in the previous exercise.
- Open a new Colab notebook.
- Import the pandas package, train_test_split, RandomForestClassifier, and accuracy_score from sklearn:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
- Create a variable called file_url that contains the URL to the dataset:
file_url = 'https://raw.githubusercontent.com'\
'/PacktWorkshops/The-Data-Science-Workshop'\
'/master/Chapter04/Dataset/openml_phpZNNasq.csv'
- Load the dataset into a DataFrame using the .read_csv() method from pandas:
df = pd.read_csv(file_url)
- Remove the animal column using .drop() and then extract the type target variable into a new variable called y using .pop():
df.drop(columns='animal', inplace=True)
y = df.pop('type')
- Split the data into training and testing sets with train_test_split() and the parameters test_size=0.4 and random_state=188:
X_train, X_test, \
y_train, y_test = train_test_split(df, y, test_size=0.4, \
random_state=188)
- Instantiate RandomForestClassifier with random_state=42, n_estimators=30, max_depth=2, min_samples_leaf=7, and max_features=10, and then fit the model with the training set:
rf_model = RandomForestClassifier(random_state=42, \
n_estimators=30, \
max_depth=2, \
min_samples_leaf=7, \
max_features=10)
rf_model.fit(X_train, y_train)
You should get the following output:
Figure 4.39: Logs of RandomForest
- Make predictions on the training and testing sets with .predict() and save the results into two new variables called train_preds and test_preds:
train_preds = rf_model.predict(X_train)
test_preds = rf_model.predict(X_test)
- Calculate the accuracy scores for the training and testing sets and save the results in two new variables called train_acc and test_acc:
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
- Print the accuracy scores: train_acc and test_acc:
print(train_acc)
print(test_acc)
You should get the following output:
Figure 4.40: Accuracy scores for the training and testing sets
- Instantiate another RandomForestClassifier with random_state=42, n_estimators=30, max_depth=2, min_samples_leaf=7, and max_features=0.2, and then fit the model with the training set:
rf_model2 = RandomForestClassifier(random_state=42, \
n_estimators=30, \
max_depth=2, \
min_samples_leaf=7, \
max_features=0.2)
rf_model2.fit(X_train, y_train)
You should get the following output:
Figure 4.41: Logs of RandomForest with max_features = 0.2
- Make predictions on the training and testing sets with .predict() and save the results into two new variables called train_preds2 and test_preds2:
train_preds2 = rf_model2.predict(X_train)
test_preds2 = rf_model2.predict(X_test)
- Calculate the accuracy score for the training and testing sets and save the results in two new variables called train_acc2 and test_acc2:
train_acc2 = accuracy_score(y_train, train_preds2)
test_acc2 = accuracy_score(y_test, test_preds2)
- Print the accuracy scores: train_acc2 and test_acc2:
print(train_acc2)
print(test_acc2)
You should get the following output:
Figure 4.42: Accuracy scores for the training and testing sets
The values 10 and 0.2, which we tried in this exercise for the max_features hyperparameter, did improve the accuracy on the training set but not on the testing set. With these values, the model starts to overfit again. The optimal value for max_features on this dataset is therefore the default one (sqrt). In the end, we succeeded in building a model with a 0.8 accuracy score that is not overfitting. This is a pretty good result given that the dataset wasn't big: we had only 6 features and 41,759 observations.
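If you want to compare more values without copying and pasting the same code, one possible approach (not part of the official solution) is to loop over the candidates, reusing the X_train, X_test, y_train, and y_test variables from this exercise:
for max_features in ['sqrt', 'log2', 0.2, 10]:
    model = RandomForestClassifier(random_state=42, \
                                   n_estimators=30, \
                                   max_depth=2, \
                                   min_samples_leaf=7, \
                                   max_features=max_features)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    # Compare training and testing accuracy to spot overfitting
    print(max_features, train_acc, test_acc)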
Note
To access the source code for this specific section, please refer to https://packt.live/3g8nTk7.
You can also run this example online at https://packt.live/324quGv.
Activity 4.01: Train a Random Forest Classifier on the ISOLET Dataset
You are working for a technology company that is planning to launch a new voice assistant product. You have been tasked with building a classification model that will recognize the letters spelled out by a user based on the signal frequencies captured. Each sound can be captured and represented as a signal composed of multiple frequencies.
Note
This activity uses the ISOLET dataset, taken from the UCI Machine Learning Repository from the following link: https://packt.live/2QFOawy.
The CSV version of this dataset can be found here: https://packt.live/36DWHpi.
The following steps will help you to complete this activity (a rough sketch of the helper functions is shown after this list):
- Download and load the dataset using .read_csv() from pandas.
- Extract the response variable using .pop() from pandas.
- Split the dataset into training and test sets using train_test_split() from sklearn.model_selection.
- Create a function that will instantiate and fit a RandomForestClassifier using .fit() from sklearn.ensemble.
- Create a function that will predict the outcome for the training and testing sets using .predict().
- Create a function that will print the accuracy score for the training and testing sets using accuracy_score() from sklearn.metrics.
- Train and get the accuracy score for a range of different hyperparameters. Here are some options you can try:
- n_estimators = 20 and 50
- max_depth = 5 and 10
- min_samples_leaf = 10 and 50
- max_features = 0.5 and 0.3
- Select the best hyperparameter value.
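If you are unsure how to structure these helper functions, here is one possible sketch. It assumes you have already loaded the data and created X_train, X_test, y_train, and y_test in the earlier steps; the function names and hyperparameter values are just suggestions, not the official solution:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def fit_rf(X_train, y_train, **hyperparams):
    # Instantiate and fit a RandomForestClassifier with the given hyperparameters
    model = RandomForestClassifier(random_state=1, **hyperparams)
    model.fit(X_train, y_train)
    return model

def get_preds(model, X_train, X_test):
    # Predict the outcome for the training and testing sets
    return model.predict(X_train), model.predict(X_test)

def print_accuracy(y_train, y_test, train_preds, test_preds):
    # Print the accuracy scores for the training and testing sets
    print(accuracy_score(y_train, train_preds))
    print(accuracy_score(y_test, test_preds))

model = fit_rf(X_train, y_train, n_estimators=20, max_depth=5, \
               min_samples_leaf=10, max_features=0.5)
train_preds, test_preds = get_preds(model, X_train, X_test)
print_accuracy(y_train, y_test, train_preds, test_preds)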
These are the accuracy scores for the best model we trained:

Figure 4.43: Accuracy scores for the Random Forest classifier
Note
The solution to the activity can be found here: https://packt.live/2GbJloz.