- Neural Networks with Keras Cookbook
- V Kishore Ayyadevara
How to do it...
We'll code up the strategy as follows (please refer to the Credit default prediction.ipynb file on GitHub while implementing the code):
- Import the relevant packages and the dataset:
import pandas as pd
data = pd.read_csv('...') # Please add path to the file you downloaded
The first three rows of the dataset we downloaded are as follows:

The preceding screenshot shows a subset of the variables in the original dataset. The variable named Defaultin2yrs is the output variable that we need to predict, based on the rest of the variables present in the dataset.
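Since the screenshot is not reproduced here, you can generate the same preview yourself and also check how imbalanced the output variable is (the recipe later notes that only around 6% of customers default):
print(data.head(3)) # first three rows of the dataset
print(data['Defaultin2yrs'].mean()) # fraction of defaulters; roughly 0.06 for this dataset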
- Summarize the dataset to understand the variables better:
data.describe()
Once you look at the output you will notice the following:
- Certain variables have a small range (age), while others have a much bigger range (Income).
- Certain variables have missing values (Income).
- Certain variables have outlier values (Debt_income_ratio).
In the next steps, we will go ahead and correct all of the issues flagged previously (a quick programmatic check is shown below).
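If you prefer to surface these issues with code rather than by eyeballing the describe() output, a short pandas check along the following lines works (the column names follow the dataset described above):
print(data.isnull().sum()) # columns such as Income show missing values
print(data.describe().loc[['min', 'max']].T) # ranges differ widely across columns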
- Impute missing values in a variable with the variable's median value:
vars = data.columns[1:]
import numpy as np
for var in vars:
    data[var] = np.where(data[var].isnull(), data[var].median(), data[var])
In the preceding code, we excluded the first column, as it is the output variable that we are trying to predict, and then imputed the missing values in the remaining variables with each variable's median (wherever a value was missing).
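A one-line sanity check that the imputation left no missing values behind:
print(data[vars].isnull().sum().sum()) # expected to print 0 after imputation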
- Cap each variable to its corresponding 95th percentile value so that we do not have outliers in our input variables:
for var in vars:
    x = data[var].quantile(0.95)
    data[var + "outlier_flag"] = np.where(data[var] > x, 1, 0)
    data[var] = np.where(data[var] > x, x, data[var])
In the preceding code, we identified the 95th percentile value of each variable, created a new flag variable that has a value of one if a row contains an outlier in the given variable and zero otherwise, and capped the variable's values at that 95th percentile value.
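A quick way to see how many rows were affected by the capping is to sum the flag columns created above and re-inspect the maxima:
print(data[[var + "outlier_flag" for var in vars]].sum()) # number of capped rows per variable
print(data[vars].max()) # post-cap maximum of each variable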
- Once we summarize the modified data, we notice that, except for the Debt_income_ratio variable, no other variable seems to have outliers anymore. Hence, let's constrain Debt_income_ratio further to a more limited range by capping it at its 80th percentile value (a cap of 1 in the code that follows):
data['Debt_income_ratio_outlier'] = np.where(data['Debt_income_ratio'] > 1, 1, 0)
data['Debt_income_ratio'] = np.where(data['Debt_income_ratio'] > 1, 1, data['Debt_income_ratio'])
- Normalize all variables to the same scale for a value between zero and one:
for var in vars:
    data[var] = data[var] / data[var].max()
In the preceding code, we rescale all the variables to the same range, between zero and one, by dividing each input variable's values by that column's maximum value.
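Note that dividing by the maximum only maps values into the zero-to-one range when they are non-negative, which appears to hold for this dataset; scikit-learn's MinMaxScaler is a common alternative when it does not. A minimal sanity check on the rescaled columns:
print(data[vars].max().max()) # expected: 1.0
print(data[vars].min().min()) # expected: no negative values for this dataset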
- Create the input and the output dataset:
X = data.iloc[:,1:]
Y = data['Defaultin2yrs']
- Split the datasets into train and test datasets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state= 42)
In the preceding step, we used the train_test_split method to split the input and output arrays into train and test datasets, where the test dataset contains 30% of the total data points in the input and the corresponding output arrays.
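A quick check of the resulting shapes and of the class balance in each split follows; note that train_test_split also accepts a stratify argument (for example, stratify=Y) if you want both splits to preserve the overall default rate, although the recipe above does not use it:
print(X_train.shape, X_test.shape) # roughly 70% and 30% of the rows
print(y_train.mean(), y_test.mean()) # fraction of defaulters in each split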
- Now that the datasets are created, let's define the neural network model, as follows:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.utils import np_utils
model = Sequential()
model.add(Dense(1000, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
A summary of the model is as follows:

In the preceding architecture, we connect the input variables to a hidden layer with 1,000 hidden units, which in turn feeds a single-unit output layer with a sigmoid activation.
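As a sanity check on the summary, the parameter counts can be worked out by hand; the exact numbers depend on how many input columns (including the outlier-flag columns) the preprocessed dataset has:
n_features = X_train.shape[1] # number of input columns after preprocessing
hidden_params = (n_features + 1) * 1000 # one weight per input plus a bias, for each of the 1,000 hidden units
output_params = 1000 + 1 # weights from the 1,000 hidden units plus one bias
print(hidden_params + output_params) # should match the total parameter count in model.summary()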
- Compile the model. We shall employ binary cross entropy as the loss, since the output variable has only two classes. Additionally, we will specify Adam as the optimizer:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
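For intuition, binary cross entropy for a single example with true label y (zero or one) and predicted probability p is -(y*log(p) + (1-y)*log(1-p)), averaged over the batch. A tiny NumPy sketch with illustrative values (not part of the recipe):
import numpy as np
def binary_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps) # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(binary_crossentropy(np.array([1, 0]), np.array([0.9, 0.1]))) # confident and correct: ~0.105
print(binary_crossentropy(np.array([1, 0]), np.array([0.1, 0.9]))) # confident and wrong: ~2.303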
- Fit the model:
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, batch_size=1024, verbose=1)
The variation of training and test loss and accuracy over increasing epochs is as follows:

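The plot itself is not reproduced here; a minimal matplotlib sketch to recreate it from the history object is shown below (note that older Keras versions log the accuracy under 'acc'/'val_acc', while newer ones use 'accuracy'/'val_accuracy'):
import matplotlib.pyplot as plt
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(history.history['loss'], label='train loss')
ax1.plot(history.history['val_loss'], label='test loss')
ax1.set_xlabel('epoch')
ax1.legend()
ax2.plot(history.history[acc_key], label='train accuracy')
ax2.plot(history.history['val_' + acc_key], label='test accuracy')
ax2.set_xlabel('epoch')
ax2.legend()
plt.show()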
- Make predictions on the test dataset:
pred = model.predict(X_test)
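The predictions are probabilities between zero and one (the output of the sigmoid unit). If you also want hard class labels, one option is a simple 0.5 cut-off (the threshold choice here is an illustrative assumption, not part of the recipe, and a different threshold is often preferable for such an imbalanced dataset):
pred_labels = (pred > 0.5).astype(int) # convert probabilities into 0/1 class labels
print(pred_labels[:5].ravel())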
- Check the number of actual defaulters captured in the top 10% of the test dataset when it is ranked by decreasing predicted probability:
test_data = pd.DataFrame([y_test]).T
test_data['pred'] = pred
test_data = test_data.reset_index(drop=True)
test_data = test_data.sort_values(by='pred', ascending=False)
print(test_data[:4500]['Defaultin2yrs'].sum())
In the preceding code, we placed the predicted probabilities alongside the actual values, sorted the dataset by predicted probability in decreasing order, and counted the actual defaulters captured in the top 10% of the test dataset (the first 4,500 rows).
We should note that we captured 1,580 actual defaulters within these 4,500 high-probability customers. This is a good prediction, as on average only about 6% of all customers default; in this case, ~35% of the customers flagged as having a high probability of default actually defaulted.
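To make the arithmetic behind that claim explicit, and to generalize the check to any top fraction of customers, a small helper along the following lines can be used (the function name and the 10% default are illustrative):
def capture_stats(test_data, top_frac=0.1, label_col='Defaultin2yrs'):
    # Number and share of actual defaulters among the top-ranked customers by predicted probability
    n_top = int(len(test_data) * top_frac)
    top = test_data.sort_values(by='pred', ascending=False)[:n_top]
    return top[label_col].sum(), top[label_col].sum() / n_top
print(capture_stats(test_data)) # 1,580 defaulters out of 4,500 customers is 1580/4500, or about 35%
Compared with the roughly 6% base rate of default, this corresponds to roughly a six-fold lift in the top decile.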