
Book title: The Data Science Workshop
Authors: Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, Dr. Samuel Asare
Last updated: 2021-06-11 18:27:25

Evaluating the Model's Performance

Now that we know how to train a Random Forest classifier, it is time to check whether we did a good job or not. What we want is to get a model that makes extremely accurate predictions, so we need to assess its performance using some kind of metric.

For a classification problem, multiple metrics can be used to assess the model's predictive power, such as F1 score, precision, recall, or ROC AUC. Each of them has its own strengths, and the right choice depends on the project and the dataset.

In this chapter, we will use a metric called accuracy score. It calculates the ratio between the number of correct predictions and the total number of predictions made by the model:

Figure 4.5: Formula for accuracy score

For instance, if your model made 950 correct predictions out of 1,000 cases, then the accuracy score would be 950/1000 = 0.95. This would mean that your model was 95% accurate on that dataset. The sklearn package provides a function called accuracy_score() that calculates this score automatically. We need to import it first:

from sklearn.metrics import accuracy_score
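Before applying the function to the model's predictions, the arithmetic above can be checked by hand. The following is a minimal sketch, assuming hypothetical label lists (y_true and y_pred are made up for illustration and are not the chapter's data) in which 950 out of 1,000 predictions are correct:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 1,000 cases, of which 950 are predicted correctly.
y_true = [1] * 1000
y_pred = [1] * 950 + [0] * 50

# Accuracy by hand: correct predictions divided by total predictions.
n_correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = n_correct / len(y_true)

# The same ratio computed by scikit-learn.
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
```

Both values should agree, since accuracy_score() computes the same ratio.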

Then, we just need to provide the true values of the target variable for some observations and the corresponding predictions. Here, we will use the y_train and preds variables, which respectively contain the response variable (also known as the target) for the training set and the predictions made by the Random Forest model in the previous section:

accuracy_score(y_train, preds)

The output will be as follows:

Figure 4.6: Accuracy score on the training set

We achieved an accuracy score of 0.988 on our training data. This means we correctly predicted more than 98% of these cases. Unfortunately, this doesn't mean you will be able to achieve such a high score on new, unseen data. Your model may have just learned patterns that are only relevant to this training set, in which case the model is overfitting.

If we take the analogy of a student learning a subject for a semester, they could memorize by heart all the textbook exercises but when given a similar but unseen exercise, they wouldn't be able to solve it. Ideally, the student should understand the underlying concepts of the subject and be able to apply that learning to any similar exercises. This is exactly the same for our model: we want it to learn the generic patterns that will help it to make accurate predictions even on unseen data.

But how can we assess the performance of a model for unseen data? Is there a way to get that kind of assessment? The answer to these questions is yes.

Remember that, in the last section, we split the dataset into training and testing sets. We used the training set to fit the model and then assessed its predictive power on that same set. The model has not seen the observations from the testing set at all, so we can use them to assess whether it is capable of generalizing to unseen data. Let's calculate the accuracy score for the testing set:

test_preds = rf_model.predict(X_test)

accuracy_score(y_test, test_preds)

The output will be as follows:

Figure 4.7: Accuracy score on the testing set

The accuracy has now dropped drastically to 0.77. The gap between the training and testing scores (0.988 versus 0.77) is large. This tells us that our model is overfitting: it mostly learned patterns that are specific to the training set. Ideally, the performance of your model should be equal, or very close, on the two sets.
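The comparison between the two scores can be wrapped into a few lines. The following is a minimal sketch, assuming synthetic stand-in data generated with make_classification (not the chapter's dataset) and a hypothetical gap variable; the exact numbers it prints will differ from the figures above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the chapter's dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_train, y_train)

# Score the model on the data it was fitted on and on held-out data.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

# A large positive gap suggests patterns specific to the training set.
gap = train_acc - test_acc
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

How large a gap is acceptable is a judgment call that depends on the project; the sketch only makes the difference explicit.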

In the next sections, we will look at tuning some Random Forest hyperparameters in order to reduce overfitting.

Exercise 4.01: Building a Model for Classifying Animal Type and Assessing Its Performance

In this exercise, we will train a Random Forest classifier to predict the type of an animal based on its attributes and check its accuracy score:

Note

The dataset we will be using is the Zoo Data Set shared by Richard S. Forsyth: https://packt.live/36DpRVK. The CSV version of this dataset can be found here: https://packt.live/37RWGhF.

Open a new Colab notebook.
Import the pandas package:
import pandas as pd
Create a variable called file_url that contains the URL of the dataset:
file_url = 'https://raw.githubusercontent.com'\
           '/PacktWorkshops/The-Data-Science-Workshop'\
           '/master/Chapter04/Dataset'\
           '/openml_phpZNNasq.csv'
Load the dataset into a DataFrame using the .read_csv() method from pandas:
df = pd.read_csv(file_url)
Print the first five rows of the DataFrame:
df.head()
You should get the following output:

Figure 4.8: First five rows of the DataFrame
We will be using the type column as our target variable. We will need to remove the animal column from the DataFrame and only use the remaining columns as features.
Remove the 'animal' column using the .drop() method from pandas and specify the columns='animal' and inplace=True parameters (to directly update the original DataFrame):
df.drop(columns='animal', inplace=True)
Extract the 'type' column using the .pop() method from pandas:
y = df.pop('type')
Print the first five rows of the updated DataFrame:
df.head()
You should get the following output:

Figure 4.9: First five rows of the DataFrame
Import the train_test_split function from sklearn.model_selection:
from sklearn.model_selection import train_test_split
Split the dataset into training and testing sets with the df, y, test_size=0.4, and random_state=188 parameters:
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.4, random_state=188)
Import RandomForestClassifier from sklearn.ensemble:
from sklearn.ensemble import RandomForestClassifier
Instantiate the RandomForestClassifier object with random_state equal to 42. Set the n_estimators value to an initial value of 10. We'll discuss later how changing this value affects the result:
rf_model = RandomForestClassifier(random_state=42, n_estimators=10)
Fit RandomForestClassifier with the training set:
rf_model.fit(X_train, y_train)
You should get the following output:

Figure 4.10: Logs of RandomForestClassifier
Predict the outcome of the training set with the .predict() method, save the results in a variable called train_preds, and print its value:
train_preds = rf_model.predict(X_train)
train_preds
You should get the following output:

Figure 4.11: Predictions on the training set
Import the accuracy_score function from sklearn.metrics:
from sklearn.metrics import accuracy_score
Calculate the accuracy score on the training set, save the result in a variable called train_acc, and print its value:
train_acc = accuracy_score(y_train, train_preds)
print(train_acc)
You should get the following output:

Figure 4.12: Accuracy score on the training set
Our model achieved an accuracy of 1 on the training set, which means it perfectly predicted the target variable on all of those observations. Let's check the performance on the testing set.
Predict the outcome of the testing set with the .predict() method and save the results into a variable called test_preds:
test_preds = rf_model.predict(X_test)
Calculate the accuracy score on the testing set, save the result in a variable called test_acc, and print its value:
test_acc = accuracy_score(y_test, test_preds)
print(test_acc)
You should get the following output:

Figure 4.13: Accuracy score on the testing set

In this exercise, we trained a Random Forest classifier to predict the type of animals based on their key attributes. Our model achieved a perfect accuracy score of 1 on the training set but only 0.88 on the testing set. This means our model is overfitting and does not generalize well enough. The ideal situation would be for the model to achieve very similar, high accuracy scores on both the training and testing sets.

Note

To access the source code for this specific section, please refer to https://packt.live/2Q4jpQK.

You can also run this example online at https://packt.live/3h6JieL.

Number of Trees Estimator

Now that we know how to fit a Random Forest classifier and assess its performance, it is time to dig into the details. In the coming sections, we will learn how to tune some of the most important hyperparameters for this algorithm. As mentioned in Chapter 1, Introduction to Data Science in Python, hyperparameters are parameters that are not learned automatically by machine learning algorithms. Their values have to be set by data scientists. These hyperparameters can have a huge impact on the performance of a model, its ability to generalize to unseen data, and the time taken to learn patterns from the data.

The first hyperparameter you will look at in this section is called n_estimators. This hyperparameter is responsible for defining the number of trees that will be trained by the RandomForest algorithm.

Before looking at how to tune this hyperparameter, we need to understand what a tree is and why it is so important for the RandomForest algorithm.

A tree is a logical graph that maps a decision and its outcomes at each of its nodes. Simply speaking, it is a series of yes/no (or true/false) questions that lead to different outcomes.

A leaf is a special type of node where the model will make a prediction. There will be no split after a leaf. A single node split of a tree may look like this:

Figure 4.14: Example of a single tree node

A tree node is composed of a question and two outcomes, depending on whether the condition defined by the question is met or not. In the preceding example, the question is: is avg_rss12 > 41? If the answer is yes, the outcome is the bending_1 leaf; if not, it is the sitting leaf.
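A single node of this kind behaves like an if/else statement. The following is a minimal sketch of the node described above; the function name predict_single_node is hypothetical, and a real tree stores this logic internally rather than as hand-written code:

```python
def predict_single_node(avg_rss12):
    """Mimic the single node described above: one question, two leaves."""
    # Question asked at the node: is avg_rss12 > 41?
    if avg_rss12 > 41:
        return "bending_1"  # "yes" branch
    return "sitting"  # "no" branch


print(predict_single_node(45.0))
print(predict_single_node(30.0))
```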

A tree is just a series of nodes and leaves combined together:

Figure 4.15: Example of a tree

In the preceding example, the tree is composed of three nodes with different questions. Now, for an observation to be predicted as sitting, it will need to meet the conditions: avg_rss13 <= 41, var_rss > 0.7, and avg_rss13 <= 16.25.

The RandomForest algorithm builds this kind of tree from the training data it sees. We will not go through the mathematical details of how it defines the split for each node but, basically, it goes through the columns of the dataset and looks for the split value that best separates the data into two groups of similar classes. Taking the preceding example, the first node with the avg_rss13 > 41 condition helps to put most of the observations from the bending_1 class in the group on the left-hand side. The RandomForest algorithm usually builds several trees of this kind, which is why it is called a forest.
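To give an intuition for what "best separates the data" can mean, the following sketch searches for a split value on a single column using Gini impurity, which is the default split criterion of scikit-learn's RandomForestClassifier. It is a simplified illustration under stated assumptions, not the library's actual implementation: the helper names gini and best_threshold and the toy values are hypothetical, and only one column is examined.

```python
def gini(labels):
    """Gini impurity: 0.0 when all labels in the group are identical."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Try each observed value as a threshold; keep the purest split."""
    best = (None, float("inf"))
    for threshold in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        if not left or not right:
            continue  # a split must produce two non-empty groups
        # Impurity of the two groups, weighted by their sizes.
        score = (len(left) * gini(left)
                 + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy values for one column and their classes.
values = [10, 20, 30, 45, 50, 60]
labels = ["sitting"] * 3 + ["bending_1"] * 3
print(best_threshold(values, labels))
```

On this toy data, splitting at 30 puts each class in its own group, so the weighted impurity of that split is 0.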

As you may have guessed by now, the n_estimators hyperparameter specifies the number of trees the RandomForest algorithm will build. For example (as in the previous exercise), say we ask it to build 10 trees. For a given observation, it will ask each tree to make a prediction and then combine those predictions into the final one. Conceptually, this works like a majority vote: if 8 out of 10 trees predict the outcome sitting, then the RandomForest algorithm will use this outcome as the final prediction. (In scikit-learn, the forest averages the class probabilities predicted by each tree and returns the class with the highest mean probability.)
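The 8-out-of-10 example can be sketched as a simple majority vote. This is an illustration only, assuming a hypothetical list of per-tree predictions; scikit-learn itself averages the class probabilities predicted by each tree rather than counting hard votes:

```python
from collections import Counter

# Hypothetical predictions from 10 trees for one observation.
tree_predictions = ["sitting"] * 8 + ["bending_1"] * 2

# Count the votes and keep the most common class.
votes = Counter(tree_predictions)
final_prediction, n_votes = votes.most_common(1)[0]

print(final_prediction, n_votes)
```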

Note

If you don't pass in a specific n_estimators hyperparameter, it will use the default value. The default depends on the version of scikit-learn you're using. In early versions, the default value is 10. From version 0.22 onwards, the default is 100. You can find out which version you are using by executing the following code:

import sklearn

sklearn.__version__

For more information, see here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In general, the higher the number of trees, the better the performance you will get, although the gains tend to level off as more trees are added. Let's see what happens with n_estimators = 2 on the Activity Recognition dataset:

rf_model2 = RandomForestClassifier(random_state=1, n_estimators=2)
rf_model2.fit(X_train, y_train)
preds2 = rf_model2.predict(X_train)
test_preds2 = rf_model2.predict(X_test)
print(accuracy_score(y_train, preds2))
print(accuracy_score(y_test, test_preds2))

The output will be as follows:

Figure 4.16: Accuracy of RandomForest with n_estimators = 2

As expected, the accuracy is significantly lower than the previous example with n_estimators = 10. Let's now try with 50 trees:

rf_model3 = RandomForestClassifier(random_state=1, n_estimators=50)
rf_model3.fit(X_train, y_train)
preds3 = rf_model3.predict(X_train)
test_preds3 = rf_model3.predict(X_test)
print(accuracy_score(y_train, preds3))
print(accuracy_score(y_test, test_preds3))

The output will be as follows:

Figure 4.17: Accuracy of RandomForest with n_estimators = 50

With n_estimators = 50, we gained 1% and 2% on the accuracy scores for the training and testing sets, respectively, which is great. The main drawback of increasing the number of trees is that it requires more computational power, so the model takes longer to train. In a real project, you will need to find the right balance between performance and training duration.
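The cost side of this trade-off can be measured directly. The following is a rough sketch, assuming synthetic stand-in data from make_classification (not the Activity Recognition dataset) and a hypothetical fit_seconds dictionary; timings depend on the machine, so no particular values should be expected:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the chapter's dataset).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

fit_seconds = {}
for n_trees in (2, 10, 50):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=1)
    start = time.perf_counter()
    model.fit(X, y)
    # Wall-clock time spent fitting this forest.
    fit_seconds[n_trees] = time.perf_counter() - start

for n_trees, seconds in fit_seconds.items():
    print(f"n_estimators={n_trees}: {seconds:.3f}s")
```

Comparing these timings with the accuracy gained at each setting is one way to decide where adding more trees stops being worthwhile.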

Exercise 4.02: Tuning n_estimators to Reduce Overfitting

In this exercise, we will train a Random Forest classifier to predict the type of an animal based on its attributes and will try two different values for the n_estimators hyperparameter:

We will be using the same zoo dataset as in the previous exercise.

Open a new Colab notebook.
Import the pandas package, train_test_split, RandomForestClassifier, and accuracy_score from sklearn:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
Create a variable called file_url that contains the URL to the dataset:
file_url = 'https://raw.githubusercontent.com'\
           '/PacktWorkshops/The-Data-Science-Workshop'\
           '/master/Chapter04/Dataset'\
           '/openml_phpZNNasq.csv'
Load the dataset into a DataFrame using the .read_csv() method from pandas:
df = pd.read_csv(file_url)
Remove the animal column using .drop() and then extract the type target variable into a new variable called y using .pop():
df.drop(columns='animal', inplace=True)
y = df.pop('type')
Split the data into training and testing sets with train_test_split() and the test_size=0.4 and random_state=188 parameters:
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.4, random_state=188)
Instantiate RandomForestClassifier with random_state=42 and n_estimators=1, and then fit the model with the training set:
rf_model = RandomForestClassifier(random_state=42, n_estimators=1)
rf_model.fit(X_train, y_train)
You should get the following output:

Figure 4.18: Logs of RandomForestClassifier
Make predictions on the training and testing sets with .predict() and save the results into two new variables called train_preds and test_preds:
train_preds = rf_model.predict(X_train)
test_preds = rf_model.predict(X_test)
Calculate the accuracy score for the training and testing sets and save the results in two new variables called train_acc and test_acc:
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
Print the accuracy scores: train_acc and test_acc:
print(train_acc)
print(test_acc)
You should get the following output:

Figure 4.19: Accuracy scores for the training and testing sets
The accuracy score decreased for both the training and testing sets. But now the difference is smaller compared to the results from Exercise 4.01, Building a Model for Classifying Animal Type and Assessing Its Performance.
Instantiate another RandomForestClassifier with random_state=42 and n_estimators=30, and then fit the model with the training set:
rf_model2 = RandomForestClassifier(random_state=42, n_estimators=30)
rf_model2.fit(X_train, y_train)
You should get the following output:

Figure 4.20: Logs of RandomForest with n_estimators = 30
Make predictions on the training and testing sets with .predict() and save the results into two new variables called train_preds2 and test_preds2:
train_preds2 = rf_model2.predict(X_train)
test_preds2 = rf_model2.predict(X_test)
Calculate the accuracy score for the training and testing sets and save the results in two new variables called train_acc2 and test_acc2:
train_acc2 = accuracy_score(y_train, train_preds2)
test_acc2 = accuracy_score(y_test, test_preds2)
Print the accuracy scores, train_acc2 and test_acc2:
print(train_acc2)
print(test_acc2)
You should get the following output:

Figure 4.21: Accuracy scores for the training and testing sets

This output shows us that the model is overfitting less than in the previous step while still performing very well on the training set.

In the previous exercise, we achieved an accuracy score of 1 for the training set and 0.88 for the testing one. In this exercise, we trained two additional Random Forest models with n_estimators = 1 and 30. The model with the lowest number of trees has the lowest accuracy: 0.92 (training) and 0.8 (testing). On the other hand, increasing the number of trees to 30, we achieved a higher accuracy: 1 and 0.9. Our model is overfitting slightly less now. It is not perfect, but it is a good start.
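Rather than instantiating one model per value by hand, the comparison can be run in a loop. The following is a minimal sketch, assuming a hypothetical helper called compare_n_estimators and the wine dataset bundled with scikit-learn as stand-in data (not the zoo dataset), so its scores will differ from those reported in this exercise:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def compare_n_estimators(X_train, X_test, y_train, y_test, values):
    """Return {n_estimators: (train_acc, test_acc)} for each value."""
    results = {}
    for n_trees in values:
        model = RandomForestClassifier(random_state=42,
                                       n_estimators=n_trees)
        model.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, model.predict(X_train))
        test_acc = accuracy_score(y_test, model.predict(X_test))
        results[n_trees] = (train_acc, test_acc)
    return results


# Stand-in data bundled with scikit-learn (an assumption).
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=188)

results = compare_n_estimators(X_train, X_test, y_train, y_test,
                               [1, 10, 30])
for n_trees, (train_acc, test_acc) in results.items():
    print(n_trees, round(train_acc, 2), round(test_acc, 2))
```

Reading the training and testing scores side by side for each value makes it easier to see where the gap between them narrows.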

Note

To access the source code for this specific section, please refer to https://packt.live/322x8gz.

You can also run this example online at https://packt.live/313gUV8.
