
Correlation Matrix and Visualization

Correlation, as you know, is a measure that indicates how two variables fluctuate together. A correlation value near +1 or -1 indicates that the two variables are strongly correlated. Highly correlated variables can sometimes be damaging for the veracity of models and, in many circumstances, we make the decision to eliminate such variables or to combine them to form composite or interactive variables.

Let's look at how data correlation can be generated and then visualized in the following exercise.

Exercise 3.05: Finding the Correlation in Data to Generate a Correlation Plot Using Bank Data

In this exercise, we will be creating a correlation plot and analyzing the results of the bank dataset.

The following steps will help you to complete the exercise:

  1. Open a new Colab notebook, import the pandas package, and load the banking data:

    import pandas as pd

    file_url = 'https://raw.githubusercontent.com'\

               '/PacktWorkshops/The-Data-Science-Workshop'\

               '/master/Chapter03/bank-full.csv'

    bankData = pd.read_csv(file_url, sep=";")

  2. Now, import the set_option function from pandas, as mentioned here:

    from pandas import set_option

    The set_option function is used to define the display options for many operations.

  3. Next, create a DataFrame containing only the numerical variables 'age','balance','day','duration','campaign','pdays','previous', as mentioned in the following code snippet. A correlation matrix can be computed only on numerical data, which is why these columns are extracted separately:

    bankNumeric = bankData[['age','balance','day','duration',\

                            'campaign','pdays','previous']]

  4. Now, use the .corr() function to find the correlation matrix for the dataset:

    set_option('display.width',150)

    set_option('display.precision', 3)

    bankCorr = bankNumeric.corr(method = 'pearson')

    bankCorr

    You should get the following output:

    Figure 3.30: Correlation matrix

    The method we use for correlation is the Pearson correlation coefficient. We can see from the correlation matrix that the diagonal elements all have a correlation of 1. This is because each diagonal entry is the correlation of a variable with itself, which is always 1.

  5. Now, plot the data:

    from matplotlib import pyplot

    corFig = pyplot.figure()

    figAxis = corFig.add_subplot(111)

    corAx = figAxis.matshow(bankCorr,vmin=-1,vmax=1)

    corFig.colorbar(corAx)

    pyplot.show()

    You should get the following output:

Figure 3.31: Correlation plot

We used several plotting parameters in this code block. pyplot.figure() instantiates a new figure. .add_subplot() adds a subplot to the figure; for example, the argument 111 means a 1 x 1 grid, first subplot. The .matshow() function displays the correlation matrix as an image, and the vmin and vmax arguments fix the color scale between -1 and 1.
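Reading the plot is easier when the axes carry the variable names. The following is a minimal, optional sketch (not part of the original exercise) that rebuilds the same plot with tick labels added; it assumes bankCorr from the previous step is still available:

# Labeled version of the correlation plot (optional sketch)
import numpy as np
from matplotlib import pyplot
corFig = pyplot.figure()
figAxis = corFig.add_subplot(111)
corAx = figAxis.matshow(bankCorr, vmin=-1, vmax=1)
corFig.colorbar(corAx)
ticks = np.arange(len(bankCorr.columns))
figAxis.set_xticks(ticks)
figAxis.set_yticks(ticks)
figAxis.set_xticklabels(bankCorr.columns, rotation=90)
figAxis.set_yticklabels(bankCorr.columns)
pyplot.show()

Each cell can now be read off directly against its row and column variable names.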

Let's look at the plot of the correlation matrix to identify correlated variables more quickly. Note that this plot covers only the seven raw numeric variables selected earlier; if you also include the transformed variables created in the earlier exercise, you will see high correlations such as the one between 'balance' and 'balanceTran', and between the asset index and the variables it was built from. Among the raw numeric variables themselves, there aren't many that are highly correlated.

Note

To access the source code for this specific section, please refer to https://packt.live/3kXr9SK.

You can also run this example online at https://packt.live/3gbfbkR.

In this exercise, we developed a correlation plot that allows us to visualize the correlation between variables.

Skewness of Data

Another area for feature engineering is skewness. Skewed data is data whose distribution is shifted towards one side rather than being symmetric. Skewness can cause machine learning models to underperform: many machine learning models assume normally distributed data, that is, data following the familiar Gaussian bell curve, and deviations from that assumption can affect model performance. A very effective feature engineering step is therefore to measure the skewness of the data and then correct it through normalization of the data. Skewness can be visualized by plotting the data using histograms and density plots. We will investigate each of these techniques.

Let's take a look at the following example. Here, we use the .skew() function to find the skewness in data. For instance, to find the skewness of data in our bank-full.csv dataset, we perform the following:

# Skewness of numeric attributes

bankNumeric.skew()

Note

This code refers to the bankNumeric data, so you should ensure you are working in the same notebook as the previous exercise.

You should get the following output:

Figure 3.32: Degree of skewness

The preceding output lists the skewness of each numerical variable. Any value close to 0 indicates a low degree of skewness. Positive values indicate right skew and negative values indicate left skew. Variables with large positive or negative skewness values are candidates for further feature engineering through normalization. Let's now visualize the skewness by plotting histograms and density plots.
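As an illustration of how a strong right skew can be corrected, one common approach for non-negative variables is a log transformation. The following is a minimal sketch (not part of the exercise) applied to the balance column; because balance contains negative values, the sketch first shifts it by its minimum, which is an assumption made purely for demonstration:

# Sketch: reducing right skew with a log transform
import numpy as np
# balance contains negative values, so shift it to be non-negative first
shiftedBalance = bankNumeric['balance'] - bankNumeric['balance'].min()
balanceLog = np.log1p(shiftedBalance)
print('Skewness before:', bankNumeric['balance'].skew())
print('Skewness after :', balanceLog.skew())

Comparing the two printed values shows how much the transformation pulls the distribution back towards symmetry.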

Histograms

Histograms are an effective way to plot the distribution of data and to identify any skewness in it. The histograms of two columns of bankNumeric are shown here. The histogram is plotted with the pyplot package from matplotlib using the .hist() function. The number of subplots we want to include is controlled by the .subplots() function; (1,2) would mean one row and two columns. The titles are set with the set_title() function:

# Histograms

from matplotlib import pyplot as plt

fig, axs = plt.subplots(1,2)

axs[0].hist(bankNumeric['age'])

axs[0].set_title('Distribution of age')

axs[1].hist(bankNumeric['balance'])

axs[1].set_title('Distribution of Balance')

# Ensure plots do not overlap

plt.tight_layout()

You should get the following output:

Figure 3.33: Code showing the generation of histograms

From the histograms, we can see that the age variable has a distribution closer to the bell curve, with a lower degree of skewness. In contrast, the balance variable shows a relatively high right skew, which makes it a more probable candidate for normalization.

Density Plots

Density plots help in visualizing the distribution of data. A density plot can be created using the kind = 'density' parameter:

from matplotlib import pyplot as plt

# Density plots

bankNumeric['age'].plot(kind = 'density', subplots = False, \

                        layout = (1,1))

plt.title('Age Distribution')

plt.xlabel('Age')

plt.ylabel('Normalised age distribution')

plt.show()

You should get the following output:

Figure 3.34: Code showing the generation of a density plot

Density plots help in getting a smoother visualization of the distribution of the data. From the density plot of Age, we can see that it has a distribution similar to a bell curve.

Other Feature Engineering Methods

So far, we have looked at various descriptive statistics and visualizations that are precursors to applying many feature engineering techniques on data structures. We investigated one such feature engineering technique in Exercise 3.02, Business Hypothesis Testing for Age versus Propensity for a Term Loan, where we applied the min-max scaler to normalize data.

We will now look into two other similar data transformation techniques, namely, standard scaler and normalizer. Standard scaler standardizes data to a mean of 0 and standard deviation of 1. The mean is the average of the data and the standard deviation is a measure of the spread of data. By standardizing to the same mean and standard deviation, comparison across different distributions of data is enabled.

The normalizer function normalizes the length of each row. This means that every value in a row is divided by the norm (length) of the row vector, so that each row becomes a unit vector. The normalizer is applied row-wise, while the standard scaler is applied column-wise. The normalizer and standard scaler functions are important feature engineering steps that are applied to the data before downstream modeling steps. Let's look at both of these techniques:

# Standardize data (0 mean, 1 stdev)

from sklearn.preprocessing import StandardScaler

from numpy import set_printoptions

scaling = StandardScaler().fit(bankNumeric)

rescaledNum = scaling.transform(bankNumeric)

set_printoptions(precision = 3)

print(rescaledNum)

You should get the following output:

Figure 3.35: Output from standardizing the data

The following code applies the normalizer data transformation technique:

# Normalizing Data (Length of 1)

from sklearn.preprocessing import Normalizer

normaliser = Normalizer().fit(bankNumeric)

normalisedNum = normaliser.transform(bankNumeric)

set_printoptions(precision = 3)

print(normalisedNum)

You should get the following output:

Figure 3.36 Output by the normalizer

The output from the standard scaler is normalized along the columns. It has seven columns corresponding to the seven numeric columns (age, balance, day, duration, campaign, pdays, and previous). If we observe the output, we can see that each value along a column is normalized so as to have a mean of 0 and a standard deviation of 1. By transforming data in this way, we can easily compare across columns.

For instance, in the age variable, we have data ranging from 18 up to 95. In contrast, for the balance data, we have data ranging from -8,019 to 102,127. We can see that both of these variables have different ranges of data that cannot be compared. The standard scaler function converts these data points at very different scales into a common scale so as to compare the distribution of data. Normalizer rescales each row so as to have a vector with a length of 1.
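You can verify this behavior with a quick check. The following minimal sketch (assuming rescaledNum and normalisedNum from the previous code blocks are still in memory) prints the column means and standard deviations of the standardized data and the lengths of the first few normalized rows:

# Sketch: verifying the effect of StandardScaler and Normalizer
import numpy as np
print(rescaledNum.mean(axis=0).round(3))          # column means, approximately 0
print(rescaledNum.std(axis=0).round(3))           # column standard deviations, approximately 1
print(np.linalg.norm(normalisedNum, axis=1)[:5])  # row lengths, all equal to 1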

The big question we have to think about is why we have to standardize or normalize data at all. Many machine learning algorithms converge faster when the features are of a similar scale or are normally distributed. Standardizing is more useful in algorithms that assume the input variables have a Gaussian structure; algorithms such as linear regression, logistic regression, and linear discriminant analysis fall under this genre. Normalization tends to be more suitable for sparse datasets (datasets with lots of zeros) when using algorithms such as k-nearest neighbors or neural networks.

Summarizing Feature Engineering

In this section, we investigated the process of feature engineering from both a business perspective and a data structure perspective. Feature engineering is a very important step in the life cycle of a data science project and helps determine the veracity of the models that we build. As seen in Exercise 3.02, Business Hypothesis Testing for Age versus Propensity for a Term Loan, we translated our understanding of the domain and our intuitions into intelligent features. Let's summarize the process that we followed:

  1. We obtained intuitions from a business perspective through EDA.
  2. Based on the business intuitions, we devised a new feature that is a combination of three other variables.
  3. We verified the influence of constituent variables of the new feature and devised an approach for weights to be applied.
  4. Converted ordinal data into corresponding weights.
  5. Transformed numerical data by normalizing them using an appropriate normalizer.
  6. Combined all three variables into a new feature.
  7. Observed the relationship between the composite index and the propensity to purchase term deposits and derived our intuitions.
  8. Explored techniques for visualizing and extracting summary statistics from data.
  9. Identified techniques for transforming data into feature engineered data structures.

Now that we have completed the feature engineering step, the next question is where we go from here and what the relevance of the new feature we created is. As you will see in the subsequent sections, the new features that we created will be used in the modeling process. The preceding exercises are one example of a trail we can follow in creating new features. There will be multiple trails like this, each suggested by additional domain knowledge and understanding. The veracity of the models that we build will depend on all of the intelligent features we can build by translating business knowledge into data.

Building a Binary Classification Model Using the Logistic Regression Function

The essence of data science is about mapping a business problem into its data elements and then transforming those data elements to get our desired business outcomes. In the previous sections, we discussed how we do the necessary transformation on the data elements. The right transformation of the data elements can highly influence the generation of the right business outcomes by the downstream modeling process.

Let's look at the business outcome generation process from the perspective of our use case. The desired business outcome, in our use case, is to identify those customers who are likely to buy a term deposit. To correctly identify which customers are likely to buy a term deposit, we first need to learn the traits or features that, when present in a customer, help in the identification process. This learning of traits is what is achieved through machine learning.

By now, you may have realized that the goal of machine learning is to estimate a mapping function (f) between an output variable and input variables. In mathematical form, this can be written as follows:

Figure 3.37: A mapping function in mathematical form

Let's look at this equation from the perspective of our use case.

Y is the dependent variable, which is our prediction of whether a customer will buy a term deposit or not.

X represents the independent variables, which are attributes such as age, education, and marital status that are part of the dataset.

f() is a function that connects the various attributes of the data to the probability that a customer will buy a term deposit. This function is learned during the machine learning process. It is a combination of different coefficients or parameters applied to each of the attributes to get the probability of a term deposit purchase. Let's unravel this concept using a simple example from our bank data use case.

For simplicity, let's assume that we have only two attributes, age and bank balance. Using these, we have to predict whether a customer is likely to buy a term deposit or not. Let the age be 40 years and the balance $1,000. With all of these attribute values, let's assume that the mapping equation is as follows:

Figure 3.38: Updated mapping equation

Using the preceding equation, we get the following:

Y = 0.1 + 0.4 * 40 + 0.002 * 1000

Y = 18.1

Now, you might be wondering: we are getting a real number, so how does this represent a decision about whether a customer will buy a term deposit or not? This is where the concept of a decision boundary comes in. Let's assume that, on analyzing the data, we have also identified that if the value of Y goes above 15 (an assumed value in this case), then the customer is likely to buy the term deposit; otherwise, they will not. This means that, as per this example, the customer is likely to buy a term deposit.
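The following minimal sketch reproduces this toy calculation in Python, using the assumed coefficients and the assumed decision threshold of 15:

# Toy example: assumed coefficients and an assumed decision threshold of 15
age, balance = 40, 1000
Y = 0.1 + 0.4 * age + 0.002 * balance
print(Y)  # 18.1
print('Likely to buy a term deposit' if Y > 15 else 'Not likely to buy')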

Let's now look at the dynamics in this example and try to decipher the concepts. The values such as 0.1, 0.4, and 0.002, which are applied to each of the attributes, are the coefficients. These coefficients, along with the equation that connects them to the variables, form the function that we are learning from the data. The essence of machine learning is to learn all of this from the provided data. The coefficients, together with the function, are commonly referred to as the model. A model is an approximation of the data generation process. During machine learning, we try to get as close as possible to the real model that generated the data we are analyzing. To learn or estimate data-generating models, we use different machine learning algorithms.

Machine learning models can be broadly classified into two types, parametric models and non-parametric models. Parametric models are where we assume the form of the function we are trying to learn and then learn the coefficients from the training data. By assuming a form for the function, we simplify the learning process.

To understand the concept better, let's take the example of a linear model. For a linear model, the mapping function takes the following form:

Figure 3.39: Linear model mapping function

The terms C0, M1, and M2 are the coefficients of the line, which determine its intercept and slope. X1 and X2 are the input variables. What we are doing here is assuming that the data-generating model is a linear model and then, using the data, estimating the coefficients, which enables the generation of predictions. By assuming a form for the data-generating model, we have simplified the whole learning process. However, this simplification also comes with pitfalls: only if the underlying function is linear, or close to linear, will we get good results. If the assumption about the form is wrong, we are bound to get bad results.
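As a quick illustration of what estimating the coefficients means in practice, the following minimal sketch fits scikit-learn's LinearRegression on a tiny synthetic dataset and inspects the learned intercept (C0) and slopes (M1 and M2); the data is made up purely for demonstration:

# Sketch: a parametric (linear) model estimates its coefficients from data
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]])  # two input variables, X1 and X2
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1]                 # generated by a known linear rule
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)                    # recovers approximately 1.0 and [2.0, 0.5]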

Some examples of parametric models include:

  • Linear and logistic regression
  • Naïve Bayes
  • Linear support vector machines
  • Perceptron

Machine learning models that do not make strong assumptions about the form of the function are called non-parametric models. In the absence of an assumed form, non-parametric models are free to learn any functional form from the data. Non-parametric models generally require a lot of training data to estimate the underlying function. Some examples of non-parametric models include the following:

  • Decision trees
  • K-nearest neighbors
  • Neural networks
  • Support vector machines with Gaussian kernels

Logistic Regression Demystified

Logistic regression is a linear model similar to the linear regression covered in the previous chapter. At the core of logistic regression is the sigmoid function, which squashes any real-valued number to a value between 0 and 1, making it ideal for predicting probabilities. The mathematical equation for the logistic regression function can be written as follows:

Figure 3.40: Logistic regression function

Here, Y is the probability of whether a customer is likely to buy a term deposit or not.

The terms C0 + M1 * X1 + M2 * X2 are very similar to the ones we have seen in the linear regression function, covered in an earlier chapter. As you would have learned by now, a linear regression function gives a real-valued output. To transform the real-valued output into a probability, we use the logistic function, which has the following form:

Figure 3.41: An expression to transform the real-valued output to a probability

Here, e is the base of the natural logarithm (Euler's number). We will not dive deep into the math behind this; however, note that, using the logistic function, we can transform the real-valued output into a probability.

Let's now look at the logistic regression function from the business problem that we are trying to solve. In the business problem, we are trying to predict the probability of whether a customer would buy a term deposit or not. To do that, let's return to the example we derived from the problem statement:

Figure 3.42: The logistic regression function updated with the business problem statement

Substituting the values, we get Y = 0.1 + 0.4 * 40 + 0.002 * 1000 = 18.1, as before.

To get the probability, we must transform this real-valued output using the logistic function, as follows:

Figure 3.43: Transformed problem statement to find the probability of using the logistic function

In applying this, we get a value of Y that is effectively 1, corresponding to close to a 100% probability that the customer will buy the term deposit. As discussed in the previous example, the coefficients of the model, such as 0.1, 0.4, and 0.002, are what we learn using the logistic regression algorithm during the training process.
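To see the numbers, the following minimal sketch applies the logistic (sigmoid) function to the real-valued output of 18.1 from the earlier example:

# Sketch: squashing the real-valued output into a probability
import math
def sigmoid(z):
    return 1 / (1 + math.exp(-z))
Y_linear = 0.1 + 0.4 * 40 + 0.002 * 1000  # 18.1
print(sigmoid(Y_linear))                  # very close to 1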

Metrics for Evaluating Model Performance

As a data scientist, you always have to make decisions on the models you build. These evaluations are done based on various metrics on the predictions. In this section, we introduce some of the important metrics that are used for evaluating the performance of models.

Note

Model performance will be covered in much more detail in Chapter 6, How to Assess Performance. This section provides an introduction to working with classification models.

Confusion Matrix

As you will have learned, we evaluate a model based on its performance on a test set. A test set will have its labels, which we call the ground truth, and, using the model, we also generate predictions for the test set. The evaluation of model performance is all about comparison of the ground truth and the predictions. Let's see this in action with a dummy test set:

Figure 3.44: Confusion matrix generation

The preceding table shows a dummy dataset with seven examples. The second column is the ground truth, that is, the actual labels, and the third column contains our predictions. From the data, we can see that four examples have been correctly classified and three have been misclassified.

A confusion matrix generates the resultant comparison between prediction and ground truth, as represented in the following table:

Figure 3.45: Confusion matrix

As you can see from the table, there are five examples whose label (ground truth) is Yes, and the remaining two examples have the label No.

The first row of the confusion matrix is the evaluation of the label Yes. True positive shows those examples whose ground truth and predictions are Yes (examples 1, 3, and 5). False negative shows those examples whose ground truth is Yes and who have been wrongly predicted as No (examples 2 and 7).

Similarly, the second row of the confusion matrix evaluates the performance of the label No. False positive are those examples whose ground truth is No and who have been wrongly classified as Yes (example 6). True negative examples are those examples whose ground truth and predictions are both No (example 4).

The confusion matrix forms the basis for many evaluation metrics, such as accuracy and the metrics reported in the classification report (precision, recall, and the F1 score), which are explained later. When comparing candidate models, we generally pick the ones for which these metrics are highest.
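The following minimal sketch reconstructs the dummy example described above using scikit-learn's confusion_matrix function; passing labels=['Yes', 'No'] puts the Yes row and column first, matching the layout of the table:

# Sketch: confusion matrix for the dummy test set described above
from sklearn.metrics import confusion_matrix
ground_truth = ['Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']
predictions  = ['Yes', 'No',  'Yes', 'No', 'Yes', 'Yes', 'No']
print(confusion_matrix(ground_truth, predictions, labels=['Yes', 'No']))
# [[3 2]   -> 3 true positives, 2 false negatives
#  [1 1]]  -> 1 false positive, 1 true negative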

Accuracy

Accuracy is the first level of evaluation, which we will resort to in order to have a quick check on model performance. Referring to the preceding table, accuracy can be represented as follows:

Figure 3.46: A function that represents accuracy

Accuracy is the proportion of correct predictions out of all of the predictions.

Classification Report

A classification report outputs three key metrics: precision, recall, and the F1 score.

Precision is the ratio of true positives to the sum of true positives and false positives:

Figure 3.47: The precision ratio

Precision is the indicator that tells you, out of all of the positives that were predicted, how many were true positives.

Recall is the ratio of true positives to the sum of true positives and false negatives:

Figure 3.48: The recall ratio

Recall measures the ability of the model to find all of the actual positive examples.

The F1 score is the harmonic mean of precision and recall. An F1 score of 1 indicates the best performance and 0 indicates the worst performance.
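Continuing with the same dummy example, the following minimal sketch computes these three metrics for the Yes class; it reuses the ground_truth and predictions lists from the earlier confusion matrix sketch:

# Sketch: precision, recall, and F1 for the Yes class of the dummy example
from sklearn.metrics import precision_score, recall_score, f1_score
precision = precision_score(ground_truth, predictions, pos_label='Yes')  # 3 / (3 + 1) = 0.75
recall = recall_score(ground_truth, predictions, pos_label='Yes')        # 3 / (3 + 2) = 0.60
f1 = f1_score(ground_truth, predictions, pos_label='Yes')                # about 0.67
print(precision, recall, f1)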

In the next section, let's take a look at data preprocessing, which is an important process to work with data and come to conclusions in data analysis.

Data Preprocessing

Data preprocessing has an important role to play in the life cycle of data science projects. These processes are often the most time-consuming part of the data science life cycle. Careful implementation of the preprocessing steps is critical and will have a strong bearing on the results of the data science project.

The various preprocessing steps include the following:

  • Data loading: This involves loading the data from different sources into the notebook.
  • Data cleaning: The data cleaning process entails removing anomalies such as special characters and duplicate data, and identifying missing data in the available dataset. Data cleaning is one of the most time-consuming steps in the data science process.
  • Data imputation: Data imputation is filling in missing data with new data points.
  • Converting data types: Datasets will have different types of data, such as numerical data, categorical data, and character data. Running models necessitates transforming the data types; a short sketch illustrating imputation and type conversion follows this list.

    Note

    Data processing will be covered in depth in the following chapters of this book.

We will implement some of these preprocessing steps in the subsequent sections and in Exercise 3.06, A Logistic Regression Model for Predicting the Propensity of Term Deposit Purchases in a Bank.
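As a small illustration of two of these steps, the following minimal sketch (not part of the exercise) imputes a missing value in a toy DataFrame and converts a column's data type; the column names and values are made up purely for demonstration:

# Sketch: simple imputation and type conversion with pandas
import pandas as pd
toy = pd.DataFrame({'age': [25, None, 40], 'income': ['1000', '2000', '1500']})
toy['age'] = toy['age'].fillna(toy['age'].mean())  # impute missing age with the column mean
toy['income'] = toy['income'].astype(int)          # convert the string column to integers
print(toy.dtypes)
print(toy)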

Exercise 3.06: A Logistic Regression Model for Predicting the Propensity of Term Deposit Purchases in a Bank

In this exercise, we will build a logistic regression model that will be used for predicting the propensity of term deposit purchases. This exercise has three parts: the first part covers the preprocessing of the data, the second part deals with the training process, and the last part covers prediction, analysis of the metrics, and the derivation of strategies for further improving the model.

You begin with data preprocessing.

In this part, we will first load the data, convert the ordinal data into dummy data, and then split the data into training and test sets for the subsequent training phase:

  1. Open a Colab notebook, import the necessary packages, and load the data, as in previous exercises:

    import pandas as pd

    import altair as alt

    file_url = 'https://raw.githubusercontent.com'\

               '/PacktWorkshops/The-Data-Science-Workshop'\

               '/master/Chapter03/bank-full.csv'

    bankData = pd.read_csv(file_url, sep=";")

  2. Now, import the required library functions:

    from sklearn.linear_model import LogisticRegression

    from sklearn.model_selection import train_test_split

  3. Now, find the data types:

    bankData.dtypes

    You should get the following output:

    Figure 3.49: Data types

  4. Convert the ordinal data into dummy data.

    As you can see in the dataset, we have two types of data: numerical data and categorical (ordinal) data. Machine learning algorithms need a numerical representation of data and, therefore, we must convert the categorical data into numerical form by creating dummy variables. The function we use for this conversion is pd.get_dummies(). This function spreads the data into a wide form: if a variable has three categories, three new dummy variables are created, one for each category.

    The value of each dummy variable is either 1 or 0, depending on whether that category is present for the given example. Let's look at the code for doing that:

    """

    Converting all the categorical variables to dummy variables

    """

    bankCat = pd.get_dummies\

              (bankData[['job','marital',\

                         'education','default','housing',\

                         'loan','contact','month','poutcome']])

    bankCat.shape

    You should get the following output:

    (45211, 44)

    We now have a new subset of the data corresponding to the categorical data that was converted into numerical form. Also, we had some numerical variables in the original dataset, which did not need any transformation. The transformed categorical data and the original numerical data have to be combined to get all of the original features. To combine both, let's first extract the numerical data from the original DataFrame.

  5. Now, separate the numerical variables:

    bankNum = bankData[['age','balance','day','duration',\

                        'campaign','pdays','previous']]

    bankNum.shape

    You should get the following output:

    (45211, 7)

  6. Now, prepare the X and Y variables and print the Y shape. The X variable is the concatenation of the transformed categorical variable and the separated numerical data:

    # Preparing the X variables

    X = pd.concat([bankCat, bankNum], axis=1)

    print(X.shape)

    # Preparing the Y variable

    Y = bankData['y']

    print(Y.shape)

    X.head()

    The output shown below is truncated:

    Figure 3.50 Combining categorical and numerical DataFrames

    Once the DataFrame is created, we can split the data into training and test sets. We specify the proportion in which the DataFrame must be split into training and test sets.

  7. Split the data into training and test sets:

    # Splitting the data into train and test sets

    X_train, X_test, y_train, y_test = train_test_split\

                                       (X, Y, test_size=0.3, \

                                        random_state=123)

    Now, the data is all prepared for the modeling task. Next, we begin with modeling.

    In this part, we will train the model using the training set we created in the earlier step. First, we call the logistic regression function and then fit the model with the training set data.

  8. Define the LogisticRegression function:

    bankModel = LogisticRegression()

    bankModel.fit(X_train, y_train)

    You should get the following output:

    Figure 3.51: Parameters of the model that fits

  9. Now that the model is created, use it to predict on the test set and then obtain the accuracy of the predictions:

    pred = bankModel.predict(X_test)

    print('Accuracy of Logistic regression model' \

          'prediction on test set: {:.2f}'\

          .format(bankModel.score(X_test, y_test)))

    You should get the following output:

    Figure 3.52: Prediction with the model

  10. From an initial look, an accuracy metric of 90% gives us the impression that the model has done a decent job of approximating the data-generating process. Or has it? Let's take a closer look at the details of the predictions by generating further metrics for the model. We will use two metric-generating functions: the confusion matrix and the classification report:

    # Confusion Matrix for the model

    from sklearn.metrics import confusion_matrix

    confusionMatrix = confusion_matrix(y_test, pred)

    print(confusionMatrix)

    You should get the following output in the following format; however, the values can vary as the modeling task will involve variability:

    Figure 3.53: Generation of the confusion matrix

    Note

    The end results that you get will be different from what you see here as it depends on the system you are using. This is because the modeling part is stochastic in nature and there will always be differences.

  11. Next, let's generate a classification_report:

    from sklearn.metrics import classification_report

    print(classification_report(y_test, pred))

    You should get a similar output; however, with different values due to variability in the modeling process:

Figure 3.54: Confusion matrix and classification report

Note

To access the source code for this specific section, please refer to https://packt.live/2CGFYYU.

You can also run this example online at https://packt.live/3aDq8KX.

From the metrics, we can see that, out of the total 11,998 examples of no, 11,754 were correctly classified as no and the remaining 244 were classified as yes. This gives a recall value of 11,754/11,998, which is nearly 98%. From a precision perspective, out of the total 12,996 examples that were predicted as no, only 11,754 were actually no, which puts the precision at 11,754/12,996, or about 90%.

However, the metrics for yes give a different picture. Out of the total 1,566 cases of yes, only 324 were correctly identified as yes. This gives us a recall of 324/1,566 = 21%. The precision is 324 / (324 + 244) = 57%.

From an overall accuracy level, this can be calculated as follows: correctly classified examples / total examples = (11754 + 324) / 13564 = 89%.
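The following minimal sketch verifies these numbers directly from the confusion matrix counts reported above (your own counts may differ slightly because of the variability mentioned earlier):

# Sketch: recall, precision, and accuracy from the reported confusion matrix counts
tn, fp = 11754, 244   # actual no: predicted no / predicted yes
fn, tp = 1242, 324    # actual yes: predicted no / predicted yes
print('Recall (no)    :', tn / (tn + fp))                   # about 0.98
print('Precision (no) :', tn / (tn + fn))                   # about 0.90
print('Recall (yes)   :', tp / (tp + fn))                   # about 0.21
print('Precision (yes):', tp / (tp + fp))                   # about 0.57
print('Accuracy       :', (tn + tp) / (tn + fp + fn + tp))  # about 0.89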

The metrics might seem good when you look only at the accuracy level. However, looking at the details, we can see that the classifier, in fact, is doing a poor job of classifying the yes cases. The classifier has been trained to predict mostly no values, which from a business perspective is useless. From a business perspective, we predominantly want the yes estimates, so that we can target those cases for focused marketing to try to sell term deposits. However, with the results we have, we don't seem to have done a good job in helping the business to increase revenue from term deposit sales.

In this exercise, we preprocessed the data, trained the model, and finally generated predictions, analyzed the metrics, and derived strategies for further improving the model.

What we have now built is the first model or a benchmark model. The next step is to try to improve on the benchmark model through different strategies. One such strategy is to feature engineer variables and build new models with new features. Let's achieve that in the next activity.

Activity 3.02: Model Iteration 2 – Logistic Regression Model with Feature Engineered Variables

As the data scientist of the bank, you created a benchmark model to predict which customers are likely to buy a term deposit. However, management wants to improve the results you got in the benchmark model. In Exercise 3.04, Feature Engineering – Creating New Features from Existing Ones, you discussed the business scenario with the marketing and operations teams and created a new variable, assetIndex, by feature engineering three raw variables. You are now fitting another logistic regression model on the feature engineered variables and are trying to improve the results.

In this activity, you will be feature engineering some of the variables to verify their effects on the predictions.

The steps are as follows:

  1. Open the Colab notebook used for the feature engineering in Exercise 3.04, Feature Engineering – Creating New Features from Existing Ones, and execute all of the steps from that exercise.
  2. Create dummy variables for the categorical variables using the pd.get_dummies() function. Exclude original raw variables such as loan and housing, which were used to create the new variable, assetIndex.
  3. Select the numerical variables including the new feature engineered variable, assetIndex, that was created.
  4. Transform some of the numerical variables by normalizing them using the MinMaxScaler() function.
  5. Concatenate the numerical variables and categorical variables using the pd.concat() function and then create X and Y variables.
  6. Split the dataset using the train_test_split() function and then fit a new model using the LogisticRegression() model on the new features.
  7. Analyze the results after generating the confusion matrix and classification report.

    You should get the following output:

Figure 3.55: Expected output with the classification report

The classification report will be similar to the one shown here. However, the values can differ due to the variability in the modeling process.

Note

The solution to this activity can be found at the following address: https://packt.live/2GbJloz.

Let's now discuss the next steps that need to be adopted in order to improve upon the metrics we got from our two iterations.

Next Steps

The next obvious question is where we go from here, after all of the processes we have implemented in this chapter. Let's discuss strategies that we can adopt for further improvement:

  • Class imbalance: Class imbalance refers to use cases where one class outnumbers the other class(es) in the dataset. In the dataset that we used for training, out of the total 31,647 examples, 27,953, or 88%, of them belonged to the no class. When there is class imbalance, there is a high likelihood that the classifier overfits to the majority class. This is what we saw in our example, and it is also why we shouldn't draw conclusions about the performance of our classifier by looking only at the accuracy value.

    Class imbalance is very prevalent in many use cases such as fraud detection, medical diagnostics, and customer churn, to name a few. There are different strategies for addressing class imbalance, and we will deal with class imbalance scenarios in Chapter 13, Imbalanced Datasets. A minimal sketch of one simple mitigation, class weighting, is shown after this list.

  • Feature engineering: Data science is an iterative science. Getting the desired outcome will depend on the variety of experiments we undertake. One big area to make improvements in the initial model is to make changes to the raw variables through feature engineering. We dealt with feature engineering and built a model using feature engineered variables. In building the new features, we followed a trail of creating a new feature related to the asset portfolio. Similarly, there would be other trails that we could follow from a business perspective, which have the potential to yield more features similar to what we created. Identification of such trails would depend on extending the business knowledge we apply through the hypotheses we formulate and the exploratory analysis we do to validate those business hypotheses. A very potent way to improve the veracity of the models is to identify more business trails and then build models through innovative feature engineering.
  • Model selection strategy: When we discussed parametric and non-parametric models, we touched upon the point that if the real data generation process is not similar to the model that we have assumed, we will get poor results. In our case, we assumed linearity and, therefore, adopted a linear model. What if the real data generation process is not linear? Or, what if there are other parametric or non-parametric models that are much more suitable for this use case? These are all considerations when we try to analyze results and try to improve the model. We must adopt a strategy called model spot checking, which entails working out the use case with different models and checking the initial metrics before adopting a model for the use case. In subsequent chapters, we will discuss other modeling techniques and it will be advisable to try out this use case with other types of models to spot check which modeling technique is more apt for this use case.
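As mentioned under class imbalance above, one simple mitigation is to re-weight the classes so that errors on the minority yes class are penalized more heavily. The following is a minimal sketch, not the approach covered in Chapter 13, and it assumes that X_train, y_train, X_test, and y_test from Exercise 3.06 are still available:

# Sketch: re-weighting classes to counter class imbalance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
weightedModel = LogisticRegression(class_weight='balanced', max_iter=1000)
weightedModel.fit(X_train, y_train)
print(classification_report(y_test, weightedModel.predict(X_test)))

Typically, this trades some overall accuracy for better recall on the yes class, which is the class the business actually cares about.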