- The Data Science Workshop
- Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, Dr. Samuel Asare
Feature Engineering
In the previous section, we walked through the process of EDA. As part of that process, we tested our business hypotheses by slicing and dicing the data and through visualizations. You might be wondering where we will use the intuitions derived from all of that analysis. That question will be answered in this section.
Feature engineering is the process of transforming raw variables to create new variables; it will be covered in more detail later in the chapter. Feature engineering is one of the most important steps influencing the accuracy of the models we build.
There are two broad types of feature engineering:
- Business-driven feature engineering: here, we transform raw variables based on intuitions from a business perspective. These intuitions are what we build during the exploratory analysis.
- Data-driven feature engineering: here, the transformation of raw variables is done from a statistical and data normalization perspective.
We will look into each type of feature engineering next.
Note
Feature engineering will be covered in much more detail in Chapter 12, Feature Engineering. In this section, you will see how it supports the classification problem we are working on.
Business-Driven Feature Engineering
Business-driven feature engineering is the process of transforming raw variables based on business intuitions that were derived during the exploratory analysis. It entails transforming data and creating new variables based on business factors or drivers that influence a business problem.
In the previous exercises on exploratory analysis, we explored the relationship of a single variable with the dependent variable. In this exercise, we will combine multiple variables and then derive new features. We will explore the relationship between an asset portfolio and the propensity for term deposit purchases. An asset portfolio is the combination of all assets and liabilities the customer has with the bank. We will combine assets and liabilities such as bank balance, home ownership, and loans to get a new feature called an asset index.
These feature engineering steps will be split into two exercises. In Exercise 3.03, Feature Engineering – Exploration of Individual Features, we explore individual variables such as balance, housing, and loans to understand their relationship to a propensity for term deposits.
In Exercise 3.04, Feature Engineering – Creating New Features from Existing Ones, we will transform individual variables and then combine them to form a new feature.
Exercise 3.03: Feature Engineering – Exploration of Individual Features
In this exercise, we will explore the relationship between two variables, namely whether an individual owns a house and whether an individual has a loan, and the propensity for term deposit purchases by these individuals.
The following steps will help you to complete this exercise:
- Open a new Colab notebook.
- Import the pandas package.
import pandas as pd
- Assign the link to the dataset to a variable called file_url:
file_url = 'https://raw.githubusercontent.com'\
'/PacktWorkshops/The-Data-Science-Workshop'\
'/master/Chapter03/bank-full.csv'
- Read the banking dataset using the .read_csv() function:
# Reading the banking data
bankData = pd.read_csv(file_url, sep=";")
- Next, we will find a relationship between housing and the propensity for term deposits, as mentioned in the following code snippet:
# Relationship between housing and propensity for term deposits
bankData.groupby(['housing', 'y'])['y']\
.agg(houseTot='count').reset_index()
You should get the following output:
Figure 3.14: Housing status versus propensity to buy term deposits
The first part of the code groups customers by whether or not they own a house. The count of customers in each category is then calculated with the .agg() method. From the values, we can see that the propensity to buy term deposits is much higher for people who do not own a house than for those who do: 3354 / (3354 + 16727) = 17% versus 1935 / (1935 + 23195) = 8%.
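The proportions worked out by hand above can also be read off in a single call with pd.crosstab using normalize='index'. Below is a minimal sketch on a hypothetical five-row frame standing in for bankData's housing and y columns:

```python
import pandas as pd

# Hypothetical stand-in for bankData's 'housing' and 'y' columns
toy = pd.DataFrame({'housing': ['no', 'no', 'no', 'yes', 'yes'],
                    'y':       ['yes', 'yes', 'no', 'yes', 'no']})

# normalize='index' converts each row of counts into proportions
prop = pd.crosstab(toy['housing'], toy['y'], normalize='index') * 100
print(prop)
```

Each row of the result sums to 100, so the 'yes' column is exactly the propensity figure computed by hand in the text.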
- Explore the 'loan' variable to find its relationship with the propensity for a term deposit, as mentioned in the following code snippet:
"""
Relationship between having a loan and propensity for term
deposits
"""
bankData.groupby(['loan', 'y'])['y']\
.agg(loanTot='count').reset_index()
Note
The triple-quotes ( """ ) shown in the code snippet above are used to denote the start and end points of a multi-line code comment. This is an alternative to using the # symbol.
You should get the following output:
Figure 3.15: Loan versus term deposit propensity
In the case of loan portfolios, the propensity to buy term deposits is higher for customers without loans: 4805 / (4805 + 33162) = 12% versus 484 / (484 + 6760) = 6%.
Housing and loan are categorical variables, so finding a relationship was straightforward. However, bank balance data is numerical, and to analyze it we need a different strategy. One common strategy is to convert the continuous numerical data into ordinal data and look at how the propensity varies across each category.
- To convert numerical values into ordinal values, we first find the quantile values and take them as threshold values. The quantiles are obtained using the following code snippet:
#Taking the quantiles for 25%, 50% and 75% of the balance data
import numpy as np
np.quantile(bankData['balance'],[0.25,0.5,0.75])
You should get the following output:
Figure 3.16: Quantiles for bank balance data
Quantile values represent threshold values in a data distribution. For example, when we talk about the 25th percentile, we mean the value below which 25% of the data lies. Quantiles can be calculated using the np.quantile() function in NumPy. In the code snippet in Step 4, we calculated the 25th, 50th, and 75th percentiles, which resulted in 72, 448, and 1428.
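To see np.quantile() in action on data where the answer is easy to check, take the integers 1 through 100; under NumPy's default linear interpolation the quartiles land at 25.75, 50.5, and 75.25:

```python
import numpy as np

# Quartiles of the integers 1..100
vals = np.arange(1, 101)
q = np.quantile(vals, [0.25, 0.5, 0.75])
print(q)  # 25.75, 50.5, 75.25 under the default linear interpolation
```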
- Now, convert the numerical values of bank balances into categorical values, as mentioned in the following code snippet:
bankData['balanceClass'] = 'Quant1'
bankData.loc[(bankData['balance'] > 72) \
& (bankData['balance'] <= 448), \
'balanceClass'] = 'Quant2'
bankData.loc[(bankData['balance'] > 448) \
& (bankData['balance'] <= 1428), \
'balanceClass'] = 'Quant3'
bankData.loc[bankData['balance'] > 1428, \
'balanceClass'] = 'Quant4'
bankData.head()
You should get the following output:
Figure 3.17: New features from bank balance data
We did this by taking the quantile thresholds from Step 4 and categorizing the numerical data into the corresponding quantile class. For example, all values up to the 25th percentile value, 72, were classified as Quant1; values between 72 and 448 were classified as Quant2, and so on. To store the quantile categories, we created a new feature in the bank dataset called balanceClass and set its default value to Quant1. After this, based on each value threshold, the data points were classified into the respective quantile class.
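The chain of .loc assignments can also be written as a single pd.cut call, which assigns every value to a bin in one pass and leaves no boundary value unclassified. A sketch on a few hypothetical balance values, reusing the Step 4 thresholds of 72, 448, and 1428:

```python
import numpy as np
import pandas as pd

balances = pd.Series([10, 72, 300, 1000, 5000])  # hypothetical balance values
bins = [-np.inf, 72, 448, 1428, np.inf]          # thresholds from Step 4
labels = ['Quant1', 'Quant2', 'Quant3', 'Quant4']

# Each interval is open on the left and closed on the right, e.g. (72, 448]
balance_class = pd.cut(balances, bins=bins, labels=labels)
print(balance_class.tolist())  # ['Quant1', 'Quant1', 'Quant2', 'Quant3', 'Quant4']
```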
- Next, we need to find the propensity of term deposit purchases based on each quantile the customers fall into. This task is similar to what we did in Exercise 3.02, Business Hypothesis Testing for Age versus Propensity for a Term Loan:
# Calculating the customers under each quantile
balanceTot = bankData.groupby(['balanceClass'])['y']\
.agg(balanceTot='count').reset_index()
balanceTot
You should get the following output:
Figure 3.18: Classification based on quantiles
- Calculate the total number of customers categorized by quantile and propensity classification, as mentioned in the following code snippet:
"""
Calculating the total customers categorised as per quantile
and propensity classification
"""
balanceProp = bankData.groupby(['balanceClass', 'y'])['y']\
.agg(balanceCat='count').reset_index()
balanceProp
You should get the following output:
Figure 3.19: Total number of customers categorized by quantile and propensity classification
- Now, merge both DataFrames:
# Merging both the data frames
balanceComb = pd.merge(balanceProp, balanceTot, \
on = ['balanceClass'])
balanceComb['catProp'] = (balanceComb.balanceCat \
/ balanceComb.balanceTot)*100
balanceComb
You should get the following output:
Figure 3.20: Propensity versus balance category
From the distribution of the data, we can see that, as we move from Quantile 1 to Quantile 4, the proportion of customers who buy term deposits keeps increasing. For instance, of all the customers who belong to Quant1, 7.25% have bought term deposits (this percentage comes from catProp). The proportion increases to 10.87% for Quant2, and then to 12.52% and 16.15% for Quant3 and Quant4, respectively. From this trend, we can conclude that individuals with higher balances have a higher propensity for term deposits.
In this exercise, we explored the relationship of each variable to the propensity for term deposit purchases. The overall trend that we can observe is that people with more cash in hand (no loans and a higher balance) have a higher propensity to buy term deposits.
Note
To access the source code for this specific section, please refer to https://packt.live/3g7rK0w.
You can also run this example online at https://packt.live/2PZbcNV.
In the next exercise, we will use these intuitions to derive a new feature.
Exercise 3.04: Feature Engineering – Creating New Features from Existing Ones
In this exercise, we will combine the individual variables we analyzed in Exercise 3.03, Feature Engineering – Exploration of Individual Features to derive a new feature called an asset index. One methodology for creating an asset index is to assign weights based on the assets or liabilities of the customer.
For instance, a higher bank balance or home ownership has a positive bearing on the overall asset index and is therefore assigned a higher weight. In contrast, a loan is a liability and is therefore assigned a lower weight. Let's give a weight of 5 if the customer has a house and 1 in its absence. Similarly, we can give a weight of 1 if the customer has a loan and 5 in the case of no loans:
- Open a new Colab notebook.
- Import the pandas and numpy package:
import pandas as pd
import numpy as np
- Assign the link to the dataset to a variable called file_url:
file_url = 'https://raw.githubusercontent.com'\
'/PacktWorkshops/The-Data-Science-Workshop'\
'/master/Chapter03/bank-full.csv'
- Read the banking dataset using the .read_csv() function:
# Reading the banking data
bankData = pd.read_csv(file_url,sep=";")
- The first step we will follow is to normalize the numerical variables. This is implemented using the following code snippet:
# Normalizing data
from sklearn import preprocessing
x = bankData[['balance']].values.astype(float)
- As the bank balance data contains numerical values, we need to normalize it first. The purpose of normalization is to bring all of the variables that we are using to create the new feature onto a common scale. One effective method we can use here is MinMaxScaler(), which scales all of the numerical data into the range 0 to 1. The MinMaxScaler class is available within the preprocessing module of sklearn:
minmaxScaler = preprocessing.MinMaxScaler()
- Transform the balance data by normalizing it with minmaxScaler:
bankData['balanceTran'] = minmaxScaler.fit_transform(x)
In this step, we created a new feature called 'balanceTran' to store the normalized bank balance values.
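Under the hood, MinMaxScaler applies (x - min) / (max - min) to each column. The following sketch, on a few hypothetical balance values rather than the full dataset, confirms that the hand-computed version matches:

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical balance column (2D, as the scaler expects)
x = np.array([[-500.0], [0.0], [1500.0], [4500.0]])
scaled = preprocessing.MinMaxScaler().fit_transform(x)

# The same transformation computed by hand
manual = (x - x.min()) / (x.max() - x.min())
print(np.allclose(scaled, manual))  # True
```

Note how the smallest value maps exactly to 0 and the largest exactly to 1, which is why the next step adds a small offset before multiplying.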
- Print the head of the data using the .head() function:
bankData.head()
You should get the following output:
Figure 3.21: Normalizing the bank balance data
- After creating the normalized variable, add a small value of 0.00001 so as to eliminate the 0 values in the variable. This is mentioned in the following code snippet:
# Adding a small numerical constant to eliminate 0 values
bankData['balanceTran'] = bankData['balanceTran'] + 0.00001
The reason for adding this small value is that, in the subsequent steps, we will multiply three transformed variables together to form a composite index. The small constant prevents the product from collapsing to 0 when the scaled balance is exactly 0.
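A quick numerical illustration of why the offset matters, using hypothetical transformed values (the weights 5 and 1 follow the scheme described at the start of the exercise):

```python
import numpy as np

# Hypothetical customers; the first one's scaled balance is exactly 0
balance_tran = np.array([0.0, 0.3, 0.8])
loan_tran = np.array([5, 5, 1])
house_tran = np.array([5, 1, 5])

raw_index = balance_tran * loan_tran * house_tran
safe_index = (balance_tran + 0.00001) * loan_tran * house_tran
print(raw_index)   # the first customer's index collapses to 0
print(safe_index)  # the offset keeps every index positive
```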
- Now, add two additional columns for introducing the transformed variables for loans and housing, as per the weighting approach discussed at the start of this exercise:
# Let us transform values for loan data
bankData['loanTran'] = 1
# Giving a weight of 5 if there is no loan
bankData.loc[bankData['loan'] == 'no', 'loanTran'] = 5
bankData.head()
You should get the following output:
Figure 3.22: Additional columns with the transformed variables
We transformed the values of the loan data as per the weighting approach: when a customer has a loan, a weight of 1 is assigned, and when there is no loan, a weight of 5 is assigned. The values 1 and 5 are intuitive weights we are choosing; the values you assign can vary based on the business context you are given.
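The same default-then-override pattern can be expressed in one line with np.where, which is handy when there are only two outcomes. A sketch on a hypothetical slice of the loan column:

```python
import numpy as np
import pandas as pd

loans = pd.Series(['no', 'yes', 'no'])        # hypothetical 'loan' values
loan_tran = np.where(loans == 'no', 5, 1)     # 5 for no loan, 1 otherwise
print(loan_tran.tolist())  # [5, 1, 5]
```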
- Now, transform values for the Housing data, as mentioned here:
# Let us transform values for Housing data
bankData['houseTran'] = 5
- Give a weight of 1 if the customer has a house and print the results, as mentioned in the following code snippet:
bankData.loc[bankData['housing'] == 'no', 'houseTran'] = 1
print(bankData.head())
You should get the following output:
Figure 3.23: Transforming loan and housing data
Once all the transformed variables are created, we can multiply all of the transformed variables together to create a new index called assetIndex. This is a composite index that represents the combined effect of all three variables.
- Now, create a new variable, which is the product of all of the transformed variables:
"""
Let us now create the new variable which is a product of all
these
"""
bankData['assetIndex'] = bankData['balanceTran'] \
* bankData['loanTran'] \
* bankData['houseTran']
bankData.head()
You should get the following output:
Figure 3.24: Creating a composite index
- Explore the propensity with respect to the composite index.
We observe the relationship between the asset index and the propensity for term deposit purchases. We adopt a similar strategy of converting the numerical values of the asset index into ordinal values by taking the quantiles and then mapping the quantiles to the propensity for term deposit purchases, as in Exercise 3.03, Feature Engineering – Exploration of Individual Features:
# Finding the quantile
np.quantile(bankData['assetIndex'],[0.25,0.5,0.75])
You should get the following output:
Figure 3.25: Conversion of numerical values into ordinal values
- Next, create quantiles from the assetIndex data, as mentioned in the following code snippet:
bankData['assetClass'] = 'Quant1'
bankData.loc[(bankData['assetIndex'] > 0.38) \
& (bankData['assetIndex'] <= 0.57), \
'assetClass'] = 'Quant2'
bankData.loc[(bankData['assetIndex'] > 0.57) \
& (bankData['assetIndex'] <= 1.9), \
'assetClass'] = 'Quant3'
bankData.loc[bankData['assetIndex'] > 1.9, \
'assetClass'] = 'Quant4'
bankData.head()
You should get the following output:
Figure 3.26: Quantiles for the asset index
- Calculate the total of each asset class and the category-wise counts, as mentioned in the following code snippet:
# Calculating total of each asset class
assetTot = bankData.groupby('assetClass')['y']\
.agg(assetTot='count').reset_index()
# Calculating the category wise counts
assetProp = bankData.groupby(['assetClass', 'y'])['y']\
.agg(assetCat='count').reset_index()
- Next, merge both DataFrames:
# Merging both the data frames
assetComb = pd.merge(assetProp, assetTot, on = ['assetClass'])
assetComb['catProp'] = (assetComb.assetCat \
/ assetComb.assetTot)*100
assetComb
You should get the following output:
Figure 3.27: Composite index relationship mapping
From the new feature we created, we can see that 18.88% of customers in Quant2 have bought term deposits (this percentage comes from catProp), compared to 10.57% for Quant1, 8.78% for Quant3, and 9.28% for Quant4. Since Quant2 has the highest proportion of customers who have bought term deposits, we can conclude that customers in Quant2 have a higher propensity to purchase term deposits than all other customers.
Note
To access the source code for this specific section, please refer to https://packt.live/316hUrO.
You can also run this example online at https://packt.live/3kVc7Ny.
Similar to the exercise that we just completed, you should think of new variables that can be created from the existing variables based on business intuitions. Creating new features based on business intuitions is the essence of business-driven feature engineering. In the next section, we will look at another type of feature engineering called data-driven feature engineering.