
Understanding the Business Context

The best way to understand a concept is through an example you can relate to. To understand the business context, let's consider the following example.

The marketing head of the bank where you are a data scientist approaches you with a problem they would like to be addressed. The marketing team recently completed a marketing campaign where they have collated a lot of information on existing customers. They require your help to identify which of these customers are likely to buy a term deposit plan. Based on your assessment of the customer base, the marketing team will chalk out strategies for target marketing. The marketing team has provided access to historical data of past campaigns and their outcomes—that is, whether the targeted customers really bought the term deposits or not. Equipped with the historical data, you have set out on the task to identify the customers with the highest propensity (an inclination) to buy term deposits.

Business Discovery

The first step when embarking on a data science problem like the preceding one is the business discovery process. This entails understanding the various drivers influencing the business problem. Getting to know the business drivers is important, as it will help in formulating hypotheses about the business problem, which can be verified during exploratory data analysis (EDA). The verification of these hypotheses will help in formulating intuitions for feature engineering, which will be critical to the accuracy of the models that we build.

Let's understand this process in detail in the context of our use case. The problem statement is to identify those customers who have a propensity to buy term deposits. As you might be aware, term deposits are bank instruments where your money is locked in for a certain period, assuring higher interest rates than savings accounts or interest-bearing checking accounts. From an investment propensity perspective, term deposits are generally popular among risk-averse customers. Equipped with the business context, let's look at some questions on business factors influencing the propensity to buy term deposits:

  • Would age be a factor, with more propensity shown by the elderly?
  • Is there any relationship between employment status and the propensity to buy term deposits?
  • Would the asset portfolio of a customer—that is, house, loan, or higher bank balance—influence the propensity to buy?
  • Will demographics such as marital status and education influence the propensity to buy term deposits? If so, how are demographics correlated to a propensity to buy?

Formulating questions on the business context is critical as this will help in arriving at various trails that we can take when we do exploratory analysis. We will deal with that in the next section. First, let's explore the data related to the preceding business problem.

Exercise 3.01: Loading and Exploring the Data from the Dataset

In this exercise, we will load the dataset into our Colab notebook and do some basic explorations, such as printing the dimensions of the dataset using the .shape attribute and generating summary statistics of the dataset using the .describe() function.

Note

The dataset for this exercise is the bank dataset, courtesy of S. Moro, P. Cortez and P. Rita: A Data-Driven Approach to Predict the Success of Bank Telemarketing.

It is from the UCI Machine Learning Repository: https://packt.live/2MItXEl and can be downloaded from our GitHub at: https://packt.live/2Wav1nJ.

The following steps will help you to complete this exercise:

  1. Open a new Colab notebook.
  2. Now, import pandas as pd in your Colab notebook:

    import pandas as pd

  3. Assign the link to the dataset to a variable called file_url:

    file_url = 'https://raw.githubusercontent.com/PacktWorkshops'\

               '/The-Data-Science-Workshop/master/Chapter03'\

               '/bank-full.csv'

  4. Now, read the file using the pd.read_csv() function from pandas:

    # Loading the data using pandas

    bankData = pd.read_csv(file_url, sep=";")

    bankData.head()

    Note

    The # symbol in the code snippet above denotes a code comment. Comments are added into code to help explain specific bits of logic.

    The pd.read_csv() function's arguments are the file path as a string and the field separator used in the CSV file, which here is ";". After reading the file, the first few rows of the DataFrame are printed using the .head() function.

    You should get the following output:

    Figure 3.2: Loading data into a Colab notebook

    Here, we loaded the CSV file and then stored it as a pandas DataFrame for further analysis.

  5. Next, print the shape of the dataset, as mentioned in the following code snippet:

    # Printing the shape of the data

    print(bankData.shape)

    The .shape attribute is used to find the overall shape (rows, columns) of the dataset.

    You should get the following output:

    (45211, 17)

  6. Now, find the summary of the numerical raw data as a table output using the .describe() function in pandas, as mentioned in the following code snippet:

    # Summarizing the statistics of the numerical raw data

    bankData.describe()

    You should get the following output:

Figure 3.3: Summary statistics of the dataset

As seen from the shape of the data, the dataset has 45,211 examples with 17 variables. The variable set has both categorical and numerical variables. The preceding summary statistics are derived only for the numerical columns.
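
If you also want a quick profile of the categorical columns, which .describe() skips by default, pandas can summarize them separately. The following is a minimal sketch; the include argument is standard pandas, not part of this exercise:

    # Summarizing the categorical (object) columns separately
    bankData.describe(include='object')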

Note

To access the source code for this specific section, please refer to https://packt.live/31UQhAU.

You can also run this example online at https://packt.live/2YdiSAF.

You have completed the first tasks required before embarking on our journey. In this exercise, you learned how to load data and derive basic statistics, such as summary statistics, from the dataset. In the subsequent sections, we will take a deeper dive into the loaded dataset.

Testing Business Hypotheses Using Exploratory Data Analysis

In the previous section, you approached the problem statement from a domain perspective, thereby identifying some of the business drivers. Once business drivers are identified, the next step is to evolve some hypotheses about the relationship of these business drivers and the business outcome you have set out to achieve. These hypotheses need to be verified using the data you have. This is where exploratory data analysis (EDA) plays a big part in the data science life cycle.

Let's return to the problem statement we are trying to analyze. From the previous section, we identified some business drivers such as age, demographics, employment status, and asset portfolio, which we feel will influence the propensity for buying a term deposit. Let's go ahead and formulate our hypotheses on some of these business drivers and then verify them using EDA.

Visualization for Exploratory Data Analysis

Visualization is imperative for EDA. Effective visualization helps in deriving business intuitions from the data. In this section, we will introduce some of the visualization techniques that will be used for EDA:

  • Line graphs: Line graphs are one of the simplest forms of visualization. Line graphs are the preferred method for revealing trends in the data. These types of graphs are mostly used for continuous data. We will be generating this graph in Exercise 3.02, Business Hypothesis Testing for Age versus Propensity for a Term Loan.

    Here is what a line graph looks like:

Figure 3.4: Example of a line graph
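
    As a quick illustration, here is a minimal matplotlib sketch that draws a line graph from dummy data (the values below are made up purely for illustration):

    # A minimal line graph from dummy data
    import matplotlib.pyplot as plt

    months = [1, 2, 3, 4, 5, 6]          # dummy x values
    deposits = [12, 15, 14, 18, 21, 25]  # dummy y values

    plt.plot(months, deposits)
    plt.xlabel('Month')
    plt.ylabel('Term deposits sold')
    plt.title('Trend of term deposit sales')
    plt.show()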

  • Histograms: Histograms plot the frequency of data across specified intervals (bins). They are mostly used for visualizing the distribution of data. Histograms are very effective for identifying whether a distribution is symmetric and for identifying outliers in the data. We will be looking at histograms in much more detail later in this chapter.

    Here is what a histogram looks like:

Figure 3.5: Example of a histogram
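
    For reference, here is a minimal histogram sketch using matplotlib on synthetic, normally distributed data (generated purely for illustration):

    # A minimal histogram from synthetic data
    import matplotlib.pyplot as plt
    import numpy as np

    np.random.seed(42)
    ages = np.random.normal(loc=40, scale=10, size=1000)  # synthetic ages

    plt.hist(ages, bins=20)
    plt.xlabel('Age')
    plt.ylabel('Frequency')
    plt.title('Distribution of ages')
    plt.show()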

  • Density plots: Like histograms, density plots are also used for visualizing the distribution of data. However, density plots give a smoother representation of the distribution. We will be looking at this later in this chapter.

    Here is what a density plot looks like:

Figure 3.6: Example of a density plot
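
    A density plot can be drawn directly from a pandas Series via .plot.density(), which uses SciPy's Gaussian kernel density estimate under the hood. Here is a minimal sketch on synthetic data (again, generated purely for illustration):

    # A minimal density plot from synthetic data
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    np.random.seed(42)
    balances = pd.Series(np.random.normal(loc=1500, scale=300, size=1000))

    balances.plot.density()  # smooth estimate of the distribution
    plt.xlabel('Account balance')
    plt.title('Density of account balances')
    plt.show()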

  • Stacked bar charts: A stacked bar chart helps you visualize the various categories of data, one on top of the other, to give you a sense of the proportion of each category; for instance, if you want to plot the values Yes and No on a single bar, a stacked bar chart can do this, which the other chart types cannot.

    Let's create some dummy data and generate a stacked bar chart to check the proportion of jobs in different sectors.

    Note

    Do not execute any of the following code snippets until the final step. Enter all the code in the same cell.

    Import the library files required for the task:

    # Importing library files

    import matplotlib.pyplot as plt

    import numpy as np

    Next, create some sample data detailing a list of jobs:

    # Create a simple list of categories

    jobList = ['admin','scientist','doctor','management']

    Each job will have two categories to be plotted, Yes and No, with some proportion between them. These are detailed as follows:

    # Getting two categories ('Yes', 'No') for each of the jobs

    jobYes = [20,60,70,40]

    jobNo = [80,40,30,60]

    In the next step, the length of the job list is taken to position the x axis labels, and their indexes are generated using the np.arange() function:

    # Get the length of x axis labels and arranging its indexes

    xlabels = len(jobList)

    ind = np.arange(xlabels)

    Next, let's define the width of each bar and do the plotting. In the plot p2, we pass bottom=jobYes so that, when stacking, Yes sits at the bottom and No on top:

    # Get width of each bar

    width = 0.35

    # Getting the plots

    p1 = plt.bar(ind, jobYes, width)

    p2 = plt.bar(ind, jobNo, width, bottom=jobYes)

    Define the labels for the Y axis and the title of the plot:

    # Getting the labels for the plots

    plt.ylabel('Proportion of Jobs')

    plt.title('Job')

    The indexes for the X and Y axes are defined next. For the X axis, the list of jobs are given, and, for the Y axis, the indices are in proportion from 0 to 100 with an increment of 10 (0, 10, 20, 30, and so on):

    # Defining the x label indexes and y label indexes

    plt.xticks(ind, jobList)

    plt.yticks(np.arange(0, 101, 10))

    The last step is to define the legends and to rotate the axis labels to 90 degrees. The plot is finally displayed:

    # Defining the legends

    plt.legend((p1[0], p2[0]), ('Yes', 'No'))

    # To rotate the axis labels

    plt.xticks(rotation=90)

    plt.show()

Here is what a stacked bar chart looks like based on the preceding example:

Figure 3.7: Example of a stacked bar plot

Let's use these graphs in the following exercises and activities.

Exercise 3.02: Business Hypothesis Testing for Age versus Propensity for a Term Loan

The goal of this exercise is to define a hypothesis relating an individual's age to their propensity to purchase a term deposit plan, and then to check it against the data. We will be using a line graph for this exercise.

The following steps will help you to complete this exercise:

  1. Begin by defining the hypothesis.

    The first step in the verification process will be to define a hypothesis about the relationship. A hypothesis can be based on your experiences, domain knowledge, some published pieces of knowledge, or your business intuitions.

    Let's first define our hypothesis on age and propensity to buy term deposits:

    Elderly customers show a greater propensity to buy term deposits than younger ones. This is our hypothesis.

    Now that we have defined our hypothesis, it is time to verify its veracity with the data. One of the best ways to get business intuitions from data is by taking cross-sections of our data and visualizing them.

  2. Import the pandas and altair packages:

    import pandas as pd

    import altair as alt

  3. Next, you need to load the dataset, just like you loaded the dataset in Exercise 3.01, Loading and Exploring the Data from the Dataset:

    file_url = 'https://raw.githubusercontent.com/'\

               'PacktWorkshops/The-Data-Science-Workshop/'\

               'master/Chapter03/bank-full.csv'

    bankData = pd.read_csv(file_url, sep=";")

    Note

    Steps 2-3 will be repeated in the following exercises for this chapter.

    We will be verifying how the purchased term deposits are distributed by age.

  4. Next, we will count the number of records for each age group. We will be using a combination of the .groupby(), .agg(), and .reset_index() methods from pandas.

    Note

    You will see further details of these methods in Chapter 12, Feature Engineering.

    filter_mask = bankData['y'] == 'yes'

    bankSub1 = bankData[filter_mask]\

               .groupby('age')['y'].agg(agegrp='count')\

               .reset_index()

    We first take the pandas DataFrame, bankData, which we loaded in Exercise 3.01, Loading and Exploring the Data from the Dataset, and then filter it for all cases where the term deposit is yes using the mask bankData['y'] == 'yes'. These cases are grouped with the groupby() method and then aggregated by age with the agg() method. Finally, we use .reset_index() to get a well-structured DataFrame, which is stored in a new DataFrame called bankSub1.
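
    To see what this chain does in isolation, here is a minimal sketch on a toy DataFrame (dummy data, not the bank dataset):

    # Toy data mimicking the structure of bankData
    import pandas as pd
    toy = pd.DataFrame({'age': [25, 25, 30, 30, 30],
                        'y': ['yes', 'no', 'yes', 'yes', 'no']})
    counts = toy[toy['y'] == 'yes']\
             .groupby('age')['y'].agg(agegrp='count')\
             .reset_index()
    print(counts)  # age 25 has one 'yes'; age 30 has two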

  5. Now, plot a line chart using altair's .Chart().mark_line().encode() chain of methods, defining the x and y variables, as shown in the following code snippet:

    # Visualising the relationship using altair

    alt.Chart(bankSub1).mark_line().encode(x='age', y='agegrp')

    You should get the following output:

    Figure 3.8: Relationship between age and propensity to purchase

    From the plot, we can see that the highest number of term deposit purchases is made by customers in the age range of 25 to 40, with the propensity to buy tapering off with age.

    This relationship is quite counterintuitive from our assumptions in the hypothesis, right? But, wait a minute, aren't we missing an important point here? We are taking the data based on the absolute count of customers in each age range. If the proportion of banking customers is higher within the age range of 25 to 40, then we are very likely to get a plot like the one that we have got. What we really should plot is the proportion of customers, within each age group, who buy a term deposit.

    Let's look at how we can represent the data by taking the proportion of customers. Just like you did in the earlier steps, we will aggregate the customer propensity with respect to age, and then divide each category of buying propensity by the total number of customers in that age group to get the proportion.

  6. Group the data per age using the groupby() method and find the total number of customers under each age group using the agg() method:

    # Getting another perspective

    ageTot = bankData.groupby('age')['y']\

             .agg(ageTot='count').reset_index()

    ageTot.head()

    The output is as follows:

    Figure 3.9: Customers per age group

  7. Now, group the data by both age and propensity of purchase and find the total counts under each category of propensity, which are yes and no:

    # Getting all the details in one place

    ageProp = bankData.groupby(['age','y'])['y']\

              .agg(ageCat='count').reset_index()

    ageProp.head()

    The output is as follows:

    Figure 3.10: Propensity by age group

  8. Merge both of these DataFrames based on the age variable using the pd.merge() function, and then divide each category of propensity within each age group by the total customers in the respective age group to get the proportion of customers, as shown in the following code snippet:

    # Merging both the data frames

    ageComb = pd.merge(ageProp, ageTot,left_on = ['age'], \

                       right_on = ['age'])

    ageComb['catProp'] = (ageComb.ageCat/ageComb.ageTot)*100

    ageComb.head()

    The output is as follows:

    Figure 3.11: Merged DataFrames with proportion of customers by age group

  9. Now, display the proportion where you plot both categories (yes and no) as separate plots. This can be achieved through a method within altair called facet():

    # Visualising the relationship using altair

    alt.Chart(ageComb).mark_line()\

       .encode(x='age', y='catProp').facet(column='y')

    This function makes as many plots as there are categories within the variable. Here, we give the 'y' variable, which holds the yes and no categories, to the facet() function, and we get two different plots: one for yes and another for no.

    You should get the following output:

Figure 3.12: Visualizing normalized relationships

By the end of this exercise, you were able to get two meaningful plots showing the propensity of people to buy term deposit plans. The final output for this exercise shows two graphs in which the left graph shows the proportion of people who do not buy term deposits and the right one shows those customers who buy term deposits.

We can see, in the first graph, that individuals in the age group from 22 to 60 are not inclined to purchase term deposits. However, in the second graph, we see the opposite: customers aged 60 and over are much more inclined to purchase the term deposit plan.

Note

To access the source code for this specific section, please refer to https://packt.live/3iOw7Q4.

This section does not currently have an online interactive example, but can be run as usual on Google Colab.

In the following section, we will begin to analyze our plots based on our intuitions.

Intuitions from the Exploratory Analysis

What are the intuitions we can take from the exercise that we have done so far? We have seen two contrasting plots, one without and one with taking the proportion of users. As you can see, taking the proportion of users is the right approach, as it gives the correct perspective from which to view the data. This is more in line with the hypothesis that we formulated. We can see from the plots that the propensity to buy term deposits is low for age groups from 22 to around 60.

After 60, we see a rising trend in the demand for term deposits. Another interesting fact we can observe is the higher proportion of term deposit purchases for ages younger than 20.

In Exercise 3.02, Business Hypothesis Testing for Age versus Propensity for a Term Loan, we discovered how to develop a hypothesis and then verify it using EDA. After the following activity, we will delve into another important step in the journey: feature engineering.

Activity 3.01: Business Hypothesis Testing to Find Employment Status versus Propensity for Term Deposits

You are working as a data scientist for a bank. You are provided with historical data from the management of the bank and are asked to try to formulate a hypothesis between employment status and the propensity to buy term deposits.

In Exercise 3.02, Business Hypothesis Testing for Age versus Propensity for a Term Loan, we worked on a problem to find the relationship between age and the propensity to buy term deposits. In this activity, we will take a similar route and verify the relationship between employment status and term deposit purchase propensity.

The steps are as follows:

  1. Formulate the hypothesis between employment status and the propensity for term deposits. Let the hypothesis be as follows: Highly paid employees are more likely to prefer term deposits than other categories of employees.
  2. Open a Colab notebook file similar to what was used in Exercise 3.02, Business Hypothesis Testing for Age versus Propensity for a Term Loan and install and import the necessary libraries such as pandas and altair.
  3. From the banking DataFrame, bankData, find the distribution of employment status using the .groupby(), .agg() and .reset_index() methods.

    Group the data with respect to employment status using the .groupby() method and find the total count of propensities for each employment status using the .agg() method.

  4. Now, merge both DataFrames using the pd.merge() function and then find the propensity proportion by dividing the propensity count for each employment status by the total count for that status, storing the result in a new variable.
  5. Plot the data and summarize intuitions from the plot using matplotlib. Use the stacked bar chart for this activity.

    Note

    The bank-full.csv dataset to be used in this activity can be found at https://packt.live/2Wav1nJ.

Expected output: The final plot of the propensity to buy with respect to employment status will be similar to the following plot:

Figure 3.13: Visualizing propensity of purchase by job

Note

The solution to this activity can be found at the following address: https://packt.live/2GbJloz.

Now that we have seen EDA, let's dive into feature engineering.
