Scikit-Learn
Scikit-learn (also referred to as sklearn) is another extremely popular package used by data scientists. The main purpose of sklearn is to provide APIs for processing data and training machine learning algorithms. But before moving ahead, we need to know what a model is.
What Is a Model?
A machine learning model learns patterns from data and creates a mathematical function to generate predictions. A supervised learning algorithm will try to find the relationship between a response variable and the given features.
Have a look at the following example.
A mathematical function can be represented as a function, f(), that is applied to some input variables, X (which is composed of multiple features), and will calculate an output (or prediction), ŷ:

ŷ = f(X)

Figure 1.37: Function f(X)
The function, f(), can be quite complex and have different numbers of parameters. If we take a linear regression (this will be presented in more detail in Chapter 2, Regression) as an example, the model parameters can be represented as W = (w1, w2, ..., wn). So, the function we saw earlier will become as follows:

ŷ = f(X) = w1x1 + w2x2 + ... + wnxn

Figure 1.38: Function for linear regression
A machine learning algorithm will receive some examples of input X with the relevant output, y, and its goal will be to find the values of (w1, w2, ..., wn) that will minimize the difference between its prediction, ŷ, and the true output, y.
The previous formulas can be a bit intimidating, but this is actually quite simple. Let's say we have a dataset composed of only one target variable y and one feature X, such as the following one:

Figure 1.39: Example of a dataset with one target variable and one feature
If we fit a linear regression on this dataset, the algorithm will try to find a solution for the following equation:

ŷ = f(x) = w0 + w1x

Figure 1.40: Function f(x) for linear regression fitting on a dataset
So, it just needs to find the values of the w0 and w1 parameters that will approximate the data as closely as possible. In this case, the algorithm may come up with w0 = 0 and w1 = 10. So, the function the model learns will be as follows:

ŷ = f(x) = 0 + 10x = 10x

Figure 1.41: Function f(x) using estimated values
We can visualize this on the same graph as for the data:

Figure 1.42: Fitted linear model on the example dataset
We can see that the fitted model (the orange line) approximates the original data quite closely. So, if we predict the outcome for a new data point, it will be very close to the true value. For example, if we take a point close to x = 5 (let's say its values are x = 5.1 and y = 48), the model will predict the following:

ŷ = f(5.1) = 10 × 5.1 = 51

Figure 1.43: Model prediction
This value is actually very close to the ground truth, 48 (red circle). So, our model prediction is quite accurate.
This is it. It is quite simple, right? In general, a dataset will have more than one feature, but the logic will be the same: the trained model will try to find the best parameters for each variable to get predictions as close as possible to the true values.
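If you would like to see this mechanic in action, here is a minimal sketch (using sklearn's LinearRegression on synthetic data generated with w0 = 0 and w1 = 10, the values from the example above):

import numpy as np
from sklearn.linear_model import LinearRegression

# Generate synthetic data that follows y = 10x, plus a little noise
rng = np.random.RandomState(1)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 10 * X.ravel() + rng.normal(0, 2, size=50)

# Fit the model: it learns the parameters w0 (intercept) and w1 (slope)
lin_model = LinearRegression()
lin_model.fit(X, y)
print(lin_model.intercept_, lin_model.coef_)  # close to 0 and 10
print(lin_model.predict([[5.1]]))             # close to 51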
We just saw an example of a linear model, but there are other types of machine learning algorithms, such as tree-based models or neural networks, that can find more complex patterns in data.
Model Hyperparameters
On top of the model parameters that are learned automatically by the algorithm (now you understand why we call it machine learning), there is also another type of parameter called the hyperparameter. Hyperparameters cannot be learned by the model. They are set by data scientists in order to define some specific conditions for the algorithm learning process. These hyperparameters are different for each family of algorithms and they can, for instance, help fast-track the learning process or limit the risk of overfitting. In this book, you will learn how to tune some of these machine learning hyperparameters.
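To make this concrete (jumping ahead slightly to an algorithm you will meet in a moment, and with arbitrary illustrative values), this is how hyperparameters are passed when instantiating a model in sklearn:

from sklearn.ensemble import RandomForestClassifier

# These values are illustrative only; each one shapes how the
# algorithm learns rather than being learned from the data itself
rf_model = RandomForestClassifier(
    n_estimators=50,  # number of trees to build
    max_depth=3,      # maximum depth of each tree; limits overfitting
    random_state=1    # fixes the randomness for reproducible results
)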
The sklearn API
As mentioned before, the scikit-learn (or sklearn) package implements an incredible number of machine learning algorithms, such as logistic regression, k-nearest neighbors, k-means, and random forest.
Note
Do not worry about these terms—you are not expected to know what these algorithms involve just yet. You will see a simple random forest example in this chapter, but all of these algorithms will be explained in detail in later chapters of the book.
sklearn groups algorithms by family. For instance, RandomForest and GradientBoosting are part of the ensemble module. In order to make use of an algorithm, you will need to import it first like this:
from sklearn.ensemble import RandomForestClassifier
Another reason why sklearn is so popular is that all the algorithms follow the exact same API structure. So, once you have learned how to train one algorithm, it is extremely easy to train another one with very minimal code changes. With sklearn, there are four main steps to train a machine learning model:
- Instantiate a model with specified hyperparameters: this will configure the machine learning model you want to train.
- Train the model with training data: during this step, the model will learn the best parameters to get predictions as close as possible to the actual values of the target.
- Predict the outcome from input data: using the learned parameters, the model will predict the outcome for new data.
- Assess the performance of the model predictions: this step checks whether the model has learned the right patterns to get accurate predictions.
Note
In a real project, there might be more steps depending on the situation, but for simplicity, we will stick with these four for now. You will learn the remaining ones in the following chapters.
As mentioned before, each algorithm has its own specific hyperparameters that can be tuned. To instantiate a model, you just need to create a new instance of the class you imported previously and specify the values of its hyperparameters. If you leave the hyperparameters blank, the model will use the default values specified by sklearn.
It is recommended to at least set the random_state hyperparameter so that you get reproducible results every time you run the same code:
rf_model = RandomForestClassifier(random_state=1)
The second step is to train the model with some data. In this example, we will use a simple dataset that classifies 178 instances of Italian wine into 3 categories based on 13 features. This dataset is one of the example datasets that sklearn provides within its API. We need to load the data first:
from sklearn.datasets import load_wine
features, target = load_wine(return_X_y=True)
Then, to train the model, use the .fit() method and provide the features and the target variable as input:
rf_model.fit(features, target)
You should get the following output:

Figure 1.44: Logs of the trained Random Forest model
In the preceding output, we can see a Random Forest model with the default hyperparameters. You will be introduced to some of them in Chapter 4, Multiclass Classification with RandomForest.
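If your version of sklearn displays an abbreviated version of this output, you can list every hyperparameter value (the defaults plus anything you set explicitly) with the get_params() method:

# Show all hyperparameters of the instantiated model as a dictionary
print(rf_model.get_params())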
Once trained, we can use the .predict() method to predict the target for one or more observations. Here we will use the same data as for the training step:
preds = rf_model.predict(features)
preds
You should get the following output:

Figure 1.45: Predictions of the trained Random Forest model
From the preceding output, you can see that the 178 different wines in the dataset have been classified into one of the three different wine categories. The first lot of wines have been classified as being in category 0, the second lot are category 1, and the last lot are category 2. At this point, we do not know what classes 0, 1, or 2 represent (in the context of the "type" of wine in each category), but finding this out would form part of the larger data science project.
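Rather than eyeballing the array, you can count how many wines were assigned to each category; one quick way (using NumPy, which sklearn depends on) is shown below:

import numpy as np

# Count the number of predictions per wine category
classes, counts = np.unique(preds, return_counts=True)
print(dict(zip(classes, counts)))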
Finally, we want to assess the model's performance by comparing its predictions to the actual values of the target variable. There are a lot of different metrics that can be used for assessing model performance, and you will learn more about them later in this book. For now, though, we will just use a metric called accuracy. This metric calculates the ratio of correct predictions to the total number of observations:
from sklearn.metrics import accuracy_score
accuracy_score(target, preds)
You should get the following output:

Figure 1.46: Accuracy of the trained Random Forest model
In this example, the Random Forest model correctly predicted every observation in this dataset; it achieved an accuracy score of 1 (that is, 100% of the predictions matched the actual true values).
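Because accuracy is just the proportion of matching predictions, you can verify the score by hand with a one-liner (a sanity check, not a substitute for sklearn's metric):

import numpy as np

# Fraction of predictions that equal the true target values
print(np.mean(preds == target))  # should match accuracy_score(target, preds)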
It's as simple as that! This may be too good to be true. In the following chapters, you will learn how to check whether the trained models are able to accurately predict unseen or future data points or if they have only learned the specific patterns of this input data (also called overfitting).
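As a small preview of that idea (the 70/30 split below is a common convention, not something prescribed by this chapter), here is a sketch of evaluating the same model on data it has never seen:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hold back 30% of the data; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=1
)

rf_model = RandomForestClassifier(random_state=1)
rf_model.fit(X_train, y_train)          # train on the remaining 70%
test_preds = rf_model.predict(X_test)   # predict on the unseen 30%
print(accuracy_score(y_test, test_preds))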
Exercise 1.03: Predicting Breast Cancer from a Dataset Using sklearn
In this exercise, we will build a machine learning classifier using RandomForest from sklearn to predict whether the breast cancer of a patient is malignant (harmful) or benign (not harmful).
The dataset we will use is the Breast Cancer Wisconsin (Diagnostic) dataset, which is available directly from the sklearn package at https://packt.live/2FcOTim.
The following steps will help you complete the exercise:
- Open a new Colab notebook.
- Import the load_breast_cancer function from sklearn.datasets:
from sklearn.datasets import load_breast_cancer
- Load the dataset from the load_breast_cancer function with the return_X_y=True parameter to return the features and response variable only:
features, target = load_breast_cancer(return_X_y=True)
- Print the variable features:
print(features)
You should get the following output:
Figure 1.47: Output of the variable features
The preceding output shows the values of the features. (You can learn more about the features from the link given previously.)
- Print the target variable:
print(target)
You should get the following output:
Figure 1.48: Output of the variable target
The preceding output shows the values of the target variable. Each instance in the dataset belongs to one of two classes, 0 and 1, indicating whether the cancer is malignant or benign.
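If you want to confirm which label maps to which diagnosis, the dataset object itself carries the class names; loading the full object (instead of using return_X_y=True) exposes them:

from sklearn.datasets import load_breast_cancer

# target_names[0] is the diagnosis encoded as 0, target_names[1] as 1
print(load_breast_cancer().target_names)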
- Import the RandomForestClassifier class from sklearn.ensemble:
from sklearn.ensemble import RandomForestClassifier
- Create a new variable called seed, which will take the value 888 (chosen arbitrarily):
seed = 888
- Instantiate RandomForestClassifier with the random_state=seed parameter and save it into a variable called rf_model:
rf_model = RandomForestClassifier(random_state=seed)
- Train the model with the .fit() method with features and target as parameters:
rf_model.fit(features, target)
You should get the following output:
Figure 1.49: Logs of RandomForestClassifier
- Make predictions with the trained model using the .predict() method and features as a parameter and save the results into a variable called preds:
preds = rf_model.predict(features)
- Print the preds variable:
print(preds)
You should get the following output:
Figure 1.50: Predictions of the Random Forest model
The preceding output shows the predictions for the training set. You can compare this with the actual target variable values shown in Figure 1.48.
- Import the accuracy_score method from sklearn.metrics:
from sklearn.metrics import accuracy_score
- Calculate accuracy_score() with target and preds as parameters:
accuracy_score(target, preds)
You should get the following output:
Figure 1.51: Accuracy of the model
Note
To access the source code for this specific section, please refer to https://packt.live/3aBso5i.
You can also run this example online at https://packt.live/316OiKA.
You just trained a Random Forest model using sklearn APIs and achieved an accuracy score of 1 in classifying breast cancer observations.
Activity 1.01: Train a Spam Detector Algorithm
You are working for an email service provider and have been tasked with training an algorithm that recognizes whether or not an email is spam, using a given dataset, and with checking its performance.
In this dataset, the authors have already created 57 different features based on some statistics for relevant keywords in order to classify whether an email is spam or not.
Note
The dataset was originally shared by Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt: https://packt.live/35fdUUA.
You can download it from the Packt GitHub at https://packt.live/2MPmnrl.
The following steps will help you to complete this activity; a minimal code skeleton follows the list:
- Import the required libraries.
- Load the dataset using pd.read_csv().
- Extract the response variable using .pop() from pandas. This method will extract the column provided as a parameter from the DataFrame. You can then assign it a variable name, for example, target = df.pop('class').
- Instantiate RandomForestClassifier.
- Train a Random Forest model to predict the outcome with .fit().
- Predict the outcomes from the input data using .predict().
- Calculate the accuracy score using accuracy_score.
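If you get stuck, here is a minimal skeleton of these steps (the filename 'spambase.csv' is a placeholder for wherever you saved the downloaded file, the 'class' column name follows the example in the step above, and random_state=1 is an arbitrary choice):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Placeholder path: point this at the CSV downloaded from the Packt GitHub
df = pd.read_csv('spambase.csv')

# .pop() removes the response column from df and returns it
target = df.pop('class')

rf_model = RandomForestClassifier(random_state=1)
rf_model.fit(df, target)
preds = rf_model.predict(df)
print(accuracy_score(target, preds))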
The output will be similar to the following:
Figure 1.52: Accuracy score for spam detector
Note
The solution to this activity can be found at the following address: https://packt.live/2GbJloz.