
Predicting a credit card dataset 

Let's work through an example that uses a credit card dataset. This dataset comes from a financial institution in Taiwan and can be found here: https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset. The following screenshot shows the dataset's information and its features:

Here, we have the following detailed information about each customer:

  • The limit balance, that is, the credit limit granted to the customer who uses the credit card
  • A few features with personal information about each customer, such as gender, education, marital status, and age
  • The history of past payments
  • The bill statement amounts
  • The customer's bill amounts and previous payment amounts for each month, from the previous month back to six months prior

With this information, we are going to predict each customer's payment status for the next month. We will first transform these features slightly to make them easier to interpret.

In this case, the positive class is default: the number 1 represents customers who fall under the default status category, and the number 0 represents customers who have paid their credit card dues.

Now, before we start, we need to import the required libraries by running a few commands, as shown in the following code snippet:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

The following screenshot shows the line of code that was used to prepare the credit card dataset:

Let's produce dummy features for education: grad_school, university, and high_school. Instead of sex, we use the male dummy feature, and instead of marriage, we use the married feature, which is given a value of 1 when the person is married and 0 otherwise. For the payment-status features such as pay_1, we apply a small simplification. A positive number i here means that the customer was i months late with that payment. For example, the customer with an ID of 1 delayed the payment for the first two months, whereas 3 months ago this customer's payment was not delayed. This is what the dataset looks like:
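The preparation described above can be sketched roughly as follows. This is a minimal sketch, not the original code. It assumes the raw column names of the Kaggle file (SEX, EDUCATION, MARRIAGE, PAY_0, and so on), the codes 1 = graduate school, 2 = university, 3 = high school for education, 1 = male for sex, and 1 = married for marriage, and it assumes that the simplification sets every non-positive payment status to 0. The prepare helper and the three sample rows are hypothetical and only illustrate the transformation.

```python
import pandas as pd

# Made-up sample rows that mimic the assumed raw column layout.
raw = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "LIMIT_BAL": [20000, 120000, 90000],
        "SEX": [2, 2, 1],
        "EDUCATION": [2, 1, 3],
        "MARRIAGE": [1, 2, 1],
        "AGE": [24, 26, 34],
        "PAY_0": [2, -1, 0],
        "PAY_2": [2, 2, 0],
        "PAY_3": [-1, 0, 0],
    }
).set_index("ID")


def prepare(df):
    """Hypothetical helper: build dummy features and simplify pay columns."""
    # Lowercase the names; calling the first status column pay_1 is an assumption.
    out = df.rename(columns=str.lower).rename(columns={"pay_0": "pay_1"})

    # Education dummies (assumed codes: 1, 2, 3); other values are the base case.
    out["grad_school"] = (out["education"] == 1).astype(int)
    out["university"] = (out["education"] == 2).astype(int)
    out["high_school"] = (out["education"] == 3).astype(int)

    # Binary features that replace sex and marriage (assumed codes: 1 = male, 1 = married).
    out["male"] = (out["sex"] == 1).astype(int)
    out["married"] = (out["marriage"] == 1).astype(int)
    out = out.drop(columns=["education", "sex", "marriage"])

    # Payment-status columns only (pay_amt* columns are payment amounts, so skip them).
    pay_cols = [
        c for c in out.columns
        if c.startswith("pay_") and not c.startswith("pay_amt")
    ]
    # Assumed simplification: anything that is not a delay becomes 0.
    out[pay_cols] = out[pay_cols].clip(lower=0)
    return out


prepared = prepare(raw)
print(prepared)
```

In this sample, the customer with an ID of 1 keeps a value of 2 in pay_1 and pay_2, while the negative value in pay_3 becomes 0, meaning no delay.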

Before fitting our models, the last thing we will do is rescale all the features because, as we can see here, the features are on very different scales. For example, limit_bal is on a very different scale than age.

This is why we will use the RobustScaler class from scikit-learn to transform all the features to a similar scale:
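A minimal sketch of the rescaling step, followed by a split into training and testing sets, might look like this. The sample values are made up for illustration, and the test_size, random_state, and stratify settings are arbitrary assumptions rather than the values used in the original code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Made-up illustration data; the column names follow the text, the values do not
# come from the dataset.
data = pd.DataFrame(
    {
        "limit_bal": [20000, 120000, 90000, 50000, 500000, 100000, 140000, 20000],
        "age": [24, 26, 34, 37, 29, 23, 28, 35],
        "default": [1, 1, 0, 0, 0, 0, 0, 1],
    }
)

target_name = "default"
X = data.drop(columns=target_name)
y = data[target_name]

# With its default settings, RobustScaler subtracts each column's median and
# divides by its interquartile range, so extreme values affect the scale less.
scaler = RobustScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

# Hold out a quarter of the rows for testing (assumed settings).
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=0, stratify=y
)

print(X_scaled.round(2))
print(len(X_train), len(X_test))
```

After scaling, every feature has a median of 0, so limit_bal and age become comparable in magnitude.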

In the last line of code in the preceding screenshot, we partition our dataset into a training set and a testing set. Below that is the CMatrix function, which is used to print the confusion matrix for each model. This function is shown in the following code snippet:

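def CMatrix(CM, labels=['pay', 'default']):
    # Wrap the raw confusion matrix in a labeled DataFrame
    df = pd.DataFrame(data=CM, index=labels, columns=labels)
    df.index.name = 'TRUE'
    df.columns.name = 'PREDICTION'
    # Add column totals as a final row, then row totals as a final column
    df.loc['Total'] = df.sum()
    df['Total'] = df.sum(axis=1)
    return df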
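As a short usage sketch, the function can be called on any 2x2 array of counts whose rows are the true classes and whose columns are the predicted classes. This is the layout returned by scikit-learn's confusion_matrix(y_true, y_pred). The counts below are made up purely for illustration, and the function is repeated so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def CMatrix(CM, labels=['pay', 'default']):
    df = pd.DataFrame(data=CM, index=labels, columns=labels)
    df.index.name = 'TRUE'
    df.columns.name = 'PREDICTION'
    df.loc['Total'] = df.sum()
    df['Total'] = df.sum(axis=1)
    return df


# Made-up counts: rows are true classes, columns are predicted classes.
cm = np.array([[50, 10],
               [5, 35]])

table = CMatrix(cm)
print(table)
```

The Total row holds the number of predictions per class, the Total column holds the number of true observations per class, and the bottom-right cell is the overall number of observations (100 in this made-up case).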
