佛龛

書名： The Data Science Workshop
作者名： Anthony So Thomas V. Joseph Robert Thas John Andrew Worsley Dr. Samuel Asare
本章字數： 1158字
更新時間： 2021-06-11 18:27:25

Training a Random Forest Classifier

In this chapter, we will use the Random Forest algorithm for multiclass classification. There are other algorithms on the market, but Random Forest is probably one of the most popular for such types of projects.

The Random Forest methodology was first proposed in 1995 by Tin Kam Ho but it was first developed by Leo Breiman in 2001.

So Random Forest is not really a recent algorithm per se. It has been in use for almost two decades already. But its popularity hasn't faded, thanks to its performance and simplicity.

For the examples in this chapter, we will be using a dataset called "Activity Recognition system based on Multisensor data." It was originally shared by F. Palumbo, C. Gallicchio, R. Pucci, and A. Micheli, Human activity recognition using multisensor data fusion based on Reservoir Computing, Journal of Ambient Intelligence and Smart Environments, 2016, 8 (2), pp. 87-107.

Note

The complete dataset can be found here: https://packt.live/3a5FI1s

Let's see how we can train a Random Forest classifier on this dataset. First, we need to load the data from the GitHub repository using pandas and then we will print its first five rows using the head() method.

Note

All the example code given outside of Exercises in this chapter relates to this Activity Recognition dataset. It is recommended that all code from these examples is entered and run in a single Google Colab Notebook, and kept separate from your Exercise Notebooks.

import pandas as pd

file_url = 'https://raw.githubusercontent.com/PacktWorkshops'\

'/The-Data-Science-Workshop/master/Chapter04/'\

'Dataset/activity.csv'

df = pd.read_csv(file_url)

df.head()

The output will be as follows:

Figure 4.1: First five rows of the dataset

Each row represents an activity that was performed by a person and the name of the activity is stored in the Activity column. There are seven different activities in this variable: bending1, bending2, cycling, lying, sitting, standing, and Walking. The other six columns are different measurements taken from sensor data.

In this example, you will accurately predict the target variable ('Activity') from the features (the six other columns) using Random Forest. For example, for the first row of the preceding example, the model will receive the following features as input and will predict the 'bending1' class:

Figure 4.2: Features for the first row of the dataset

But before that, we need to do a bit of data preparation. The sklearn package (we will use it to train Random Forest model) requires the target variable and the features to be separated. So, we need to extract the response variable using the .pop() method from pandas. The .pop() method extracts the specified column and removes it from the DataFrame:

target = df.pop('Activity')

Now the response variable is contained in the variable called target and all the features are in the DataFrame called df.

Now we are going to split the dataset into training and testing sets. The model uses the training set to learn relevant parameters in predicting the response variable. The test set is used to check whether a model can accurately predict unseen data. We say the model is overfitting when it has learned the patterns relevant only to the training set and makes incorrect predictions about the testing set. In this case, the model performance will be much higher for the training set compared to the testing one. Ideally, we want to have a very similar level of performance for the training and testing sets. This topic will be covered in more depth in Chapter 7, The Generalization of Machine Learning Models.

The sklearn package provides a function called train_test_split() to randomly split the dataset into two different sets. We need to specify the following parameters for this function: the feature and target variables, the ratio of the testing set (test_size), and random_state in order to get reproducible results if we have to run the code again:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split\

(df, target, test_size=0.33, \

random_state=42)

There are four different outputs to the train_test_split() function: the features for the training set, the target variable for the training set, the features for the testing set, and its target variable.

Now that we have got our training and testing sets, we are ready for modeling. Let's first import the RandomForestClassifier class from sklearn.ensemble:

from sklearn.ensemble import RandomForestClassifier

Now we can instantiate the Random Forest classifier with some hyperparameters. Remember from Chapter 1, Introduction to Data Science in Python, a hyperparameter is a type of parameter the model can't learn but is set by data scientists to tune the model's learning process. This topic will be covered more in depth in Chapter 8, Hyperparameter Tuning. For now, we will just specify the random_state value. We will walk you through some of the key hyperparameters in the following sections:

rf_model = RandomForestClassifier(random_state=1, \

n_estimators=10)

The next step is to train (also called fit) the model with the training data. During this step, the model will try to learn the relationship between the response variable and the independent variables and save the parameters learned. We need to specify the features and target variables as parameters:

rf_model.fit(X_train, y_train)

The output will be as follows:

Figure 4.3: Logs of the trained RandomForest

Now that the model has completed its training, we can use the parameters it learned to make predictions on the input data we will provide. In the following example, we are using the features from the training set:

preds = rf_model.predict(X_train)

Now we can print these predictions:

preds

The output will be as follows:

Figure 4.4: Predictions of the RandomForest algorithm on the training set

This output shows us the model predicted, respectively, the values lying, bending1, and cycling for the first three observations and cycling, bending1, and standing for the last three observations. Python, by default, truncates the output for a long list of values. This is why it shows only six values here.

These are basically the key steps required for training a Random Forest classifier. This was quite straightforward, right? Training a machine learning model is incredibly easy but getting meaningful and accurate results is where the challenges lie. In the next section, we will learn how to assess the performance of a trained model.

官术网_书友最值得收藏!

The Data Science Workshop

Training a Random Forest Classifier