Credit Card Default Prediction Using Machine Learning Techniques
This project develops and assesses machine learning models for predicting customer defaults, helping businesses evaluate the risk that a customer will fail to repay a loan or credit balance. It covers data preparation, handling imbalanced classes with techniques such as SMOTE oversampling and majority-class downsampling, and feature engineering steps such as Box-Cox transformations to reduce skewness. To make the models more interpretable, we use SHAP and LIME to explain how they arrive at their predictions. Together these steps improve both the performance and the transparency of the default prediction models.
Project Overview
The project begins with data preprocessing: handling missing values, converting categorical variables to numeric form with Label Encoding, and applying the Box-Cox transformation to skewed features where applicable. Next comes feature engineering, where additional features are generated to capture relationships among the existing ones; for example, the delinquency columns are merged and financial ratios are computed. A range of classification models is then implemented, including Random Forest, XGBoost, Logistic Regression, and LightGBM. Hyperparameter tuning is applied alongside class balancing techniques such as SMOTE and class weight adjustment to mitigate the effects of the imbalanced dataset. The models are assessed on key metrics including accuracy, precision, recall, F1 score, and AUC-ROC to determine how effective they are. Additionally, to make the predictions easier to interpret, LIME and SHAP are used to show how individual features contribute to each prediction. Finally, the project presents a case study demonstrating how these algorithms and preprocessing techniques can be combined to predict customer defaults for the benefit of financial institutions.
Prerequisites
- Python Programming: Ability to use basic Python programming skills to implement algorithms and manipulate data.
- Machine Learning Basics: Basic concepts of classification algorithms, evaluation metrics, and how to prevent overfitting.
- Pandas and NumPy: Skills in data manipulation and basic numerical computation on datasets.
- Scikit-learn: Knowledge of widely used machine learning frameworks for training and evaluation and data preprocessing.
- SMOTE and Downsampling/Undersampling Techniques: Skills in approaches to address imbalanced learning in order to enhance the performance of models.
- SHAP and LIME: Knowledge of tools that can be used to understand and explain prediction and feature contribution.
- LightGBM, XGBoost, Random Forest, Logistic Regression: Knowledge of some widespread classification algorithms employed in machine learning.
Approach
The procedure starts with data preprocessing, where the dataset is cleaned and prepared: missing values are handled, categorical variables are encoded with Label Encoding, and skewed numerical features are corrected with the Box-Cox transformation so the data is ready for machine learning algorithms. Feature engineering then creates additional features from interactions among the existing ones, such as merging the delinquency columns and computing financial metrics like the Debt Ratio and Revolving Utilization.
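As a quick illustration of this kind of feature engineering (the project's actual implementation appears later in the walkthrough), the minimal sketch below assumes a pandas DataFrame df containing the delinquency and income columns used in this dataset and derives a combined past-due count plus two simple financial quantities; the helper name is hypothetical.
import pandas as pd
def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper sketching the derived features described above
    df = df.copy()
    # Total past-due events across the three delinquency buckets
    df['CombinedPastDue'] = (df['NumberOfTime30-59DaysPastDueNotWorse']
                             + df['NumberOfTime60-89DaysPastDueNotWorse']
                             + df['NumberOfTimes90DaysLate'])
    # Income spread over the household (dependents plus the borrower)
    df['MonthlyIncomePerPerson'] = df['MonthlyIncome'] / (df['NumberOfDependents'] + 1)
    # Monthly debt implied by the debt-to-income ratio
    df['MonthlyDebt'] = df['MonthlyIncome'] * df['DebtRatio']
    return df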
To resolve the class imbalance in the data, we oversample the minority class with SMOTE (Synthetic Minority Over-sampling Technique) and reduce the size of the majority class through downsampling. Once the data is prepared, we select several machine learning algorithms, including Random Forest, XGBoost, Logistic Regression, and LightGBM, and train them on the processed data.
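The balancing step can be sketched with imbalanced-learn, which is installed later in this project: SMOTE synthesizes new minority samples and RandomUnderSampler trims the majority class, and the two can be chained in a pipeline. The toy dataset and the sampling ratios below are illustrative only, not the values used later in the notebook.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
# Toy imbalanced dataset standing in for the credit data
X, y = make_classification(n_samples=10000, weights=[0.93, 0.07], random_state=42)
over = SMOTE(sampling_strategy=0.5, random_state=42)                # grow minority to 50% of majority
under = RandomUnderSampler(sampling_strategy=0.8, random_state=42)  # then trim the majority class
resampler = Pipeline(steps=[('over', over), ('under', under)])
X_res, y_res = resampler.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))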
Subsequently, the trained models are assessed against performance measures such as accuracy, precision, recall, F1 score, and AUC-ROC to establish how well they predict the likelihood of a customer defaulting; the strongest candidates are further improved through hyperparameter tuning, and the best model is chosen on the basis of these results.
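For reference, these evaluation metrics map directly onto scikit-learn functions; the sketch below is a hedged illustration that assumes you already have true labels, hard predictions, and predicted probabilities for the positive class, and the helper name is hypothetical.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
def summarize_performance(y_true, y_pred, y_prob):
    # Hypothetical helper: y_pred are hard labels, y_prob are positive-class probabilities
    return {
        'accuracy':  accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall':    recall_score(y_true, y_pred),
        'f1':        f1_score(y_true, y_pred),
        'roc_auc':   roc_auc_score(y_true, y_prob),  # AUC-ROC uses scores, not hard labels
    }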
To keep the models interpretable, we use SHAP and LIME for model explainability, which helps in understanding how individual features drive the predictions. Finally, the results are reviewed and conclusions are drawn that translate into practical recommendations an organization can use to mitigate the risk of customer default.
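Since the explainability step is not shown in the code excerpt that follows, here is a hedged sketch of how SHAP and LIME are typically applied to a fitted tree-based classifier; fitted_model, X_train, and X_test are assumed to be a trained model and pandas DataFrames, and the helper name is hypothetical.
import shap
from lime.lime_tabular import LimeTabularExplainer
def explain_model(fitted_model, X_train, X_test):
    # SHAP: feature attributions for a fitted tree-based model (e.g. LightGBM or XGBoost)
    explainer = shap.TreeExplainer(fitted_model)
    shap_values = explainer.shap_values(X_test)
    shap.summary_plot(shap_values, X_test)  # global view of feature importance
    # LIME: explain a single prediction
    lime_explainer = LimeTabularExplainer(
        X_train.values,
        feature_names=list(X_train.columns),
        class_names=['no default', 'default'],
        mode='classification'
    )
    exp = lime_explainer.explain_instance(
        X_test.values[0], fitted_model.predict_proba, num_features=8
    )
    exp.show_in_notebook()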
Workflow and Methodology
- Data Collection and Preprocessing:
- The project begins by gathering and preparing the dataset for analysis.
- Missing values are handled appropriately, categorical variables are encoded using Label Encoding, and skewed features are transformed using the Box-Cox technique.
- Feature Engineering:
- New features are engineered to capture meaningful interactions among the existing features, for example by combining the delinquency columns and computing metrics such as the Debt Ratio and Revolving Utilization.
- Class imbalance is further moderated through SMOTE (to over-sample the under-represented class) and plain down-sampling (of the over-represented class).
- Model Selection and Training:
- Several classifiers such as Random Forest, XGBoost, Logistic Regression, and LightGBM are opted for the purpose.
- Every model is trained on the preprocessed data with its hyperparameters tuned for the best fit, while class balancing methods such as class weights or SMOTE are applied to improve predictions on the rare class.
- Model Evaluation:
- Finally, each model fitted to the training data is evaluated with standard metrics, namely accuracy, precision, recall, F1 score, and AUC-ROC, to gauge its effectiveness.
- It's worth mentioning that cross-validation is also done to check how robust the models are.
- Model Explainability:
- The predictions and the importance of each feature are then explained using SHAP and LIME, showing how different features affect the predictions and making the models more transparent.
- Performance Evaluation:
- The models are empirically analyzed based on the evaluation measures, and the most accurate models are selected for implementation or enhancement.
- Conclusion and Recommendations:
- The analysis gives a picture of which models and strategies seem to yield the best results for predicting customer defaults.
Data Collection and Preparation
Data Collection:
In this project, we collected the dataset from a public repository. If you are looking to work on a real-world problem, you can get these kinds of datasets from publicly available repositories such as Kaggle, UCI Machine Learning Repository, or company-specific data. We will provide the dataset in this project so that you can work on the same dataset.
Data Preparation Workflow:
- Loading the Dataset: Downloading the dataset and examining its contents to gain insight into the types of data and descriptive statistics present.
- Handling Missing Values: Assess the extent of missing data and either impute or drop it, depending on the nature and distribution of each feature.
- Handling Categorical Variables: Apply Label Encoding or One-Hot Encoding to represent categorical variables as numerical variables.
- Feature Engineering: Derive additional variables from the existing ones, including ratios deemed important.
- Addressing Class Imbalance: Apply SMOTE, downsampling, or class weight adjustments to remedy the disproportionate classes.
- Feature Transformation: Adjust skewed distributions by normalizing or standardizing the data or applying the Box-Cox transformation.
- Dataset Partitioning: Use either train_test_split or cross-validation to partition the available data into training datasets and test datasets.
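As a compact, hedged sketch of the partitioning step, the helper below (hypothetical name) performs an 80/20 split; the stratify option, which keeps the default rate similar in both parts, is an optional addition rather than something the notebook later uses verbatim.
from sklearn.model_selection import train_test_split
def split_features_target(df, target='SeriousDlqin2yrs', test_size=0.2, seed=42):
    # Hypothetical helper: split a DataFrame into stratified train/test feature and target sets
    X = df.drop(columns=[target])
    y = df[target]
    return train_test_split(X, y, test_size=test_size, random_state=seed, stratify=y)
# Example (assuming the DataFrame loaded in the steps below):
# data_train, data_test, y_train, y_test = split_features_target(Aionlinecourse_data)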
Code Explanation
STEP 1:
Mounting Google Drive
First, mount Google Drive to access the dataset that is stored in the cloud.
from google.colab import drive
drive.mount('/content/drive')
Library Installation
This piece of code installs various libraries: SHAP for understanding the model, LIME for understanding each prediction, Keras for building deep learning models, XGBoost and LightGBM for gradient boosting techniques, and Imbalanced-learn for the problem of imbalanced datasets.
!pip install shap
!pip install lime
!pip install keras
!pip install xgboost
!pip install lightgbm
!pip install imblearn
Ignore Warning
The filterwarnings('ignore') function prevents any warnings from being shown during the execution of the program. This can be useful when you don't want the warnings to clutter the output, but keep in mind that ignoring warnings can sometimes hide important information about potential issues in your code.
# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
Importing necessary Libraries
This code imports the libraries needed throughout the project: SHAP and LIME for explainability, Keras/TensorFlow for building the neural network, NumPy and Pandas for data handling, XGBoost and LightGBM for gradient boosting, scikit-learn and imbalanced-learn for modeling and resampling, and Matplotlib and Seaborn for plotting.
import shap
import math
import keras
import numpy as np
import pandas as pd
import xgboost as xgb
import seaborn as sns
import tensorflow as tf
import keras.backend as K
import matplotlib.pyplot as plt
from keras import models
from keras import layers
from sklearn.svm import SVC
from scipy.stats import skew
from matplotlib import pyplot
from collections import Counter
from scipy.stats import kurtosis
from scipy import stats, special
from xgboost import XGBClassifier
from sklearn.utils import resample
from lightgbm import LGBMClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from lime.lime_tabular import LimeTabularExplainer
from sklearn.metrics import precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import KFold,StratifiedKFold
from keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import mean_squared_error, accuracy_score, confusion_matrix, roc_curve, auc, classification_report, recall_score, precision_score, f1_score, roc_auc_score
STEP 2:
Loading Data and Checking Dimensions:
This code loads the CSV file. After loading the dataset it prints the dataset’s shape to check the number of rows and columns.
Aionlinecourse_data = pd.read_csv("/content/drive/MyDrive/New 90 Projects/Project_5/Data/cs-training.csv")
print(Aionlinecourse_data.shape)
The purpose of the code below is to provide a summary of the DataFrame Aionlinecourse_data by displaying the number of records, the column names and types, the counts of non-null values, and the memory usage.
Aionlinecourse_data.info()
Calculating missing values percentage
This script determines, for each column in the Aionlinecourse_data dataset, the percentage of missing values, rounded to two decimal places, making it easy to spot features with missing data.
round(Aionlinecourse_data.isnull().sum(axis=0)/len(Aionlinecourse_data)*100, 2)
Previewing Data
This block of code displays the first few rows of the dataset to have a quick overview of the structure of the dataset.
Aionlinecourse_data.head()
Checking Unique Borrowers
This code computes the ratio of unique borrowers (denoted by the column 'Unnamed: 0') to all the records contained in the data set to understand the uniqueness of the data set.
# Checking the unique number of borrowers
Aionlinecourse_data['Unnamed: 0'].nunique()/len(Aionlinecourse_data)
Analyzing Target Variable
In this section, we analyze the target variable, the column ‘SeriousDlqin2yrs’, which indicates serious delinquency. The code below also prints the percentage of borrowers who fall into serious delinquency, giving insight into the distribution of the target class.
# Target Variable
print(Aionlinecourse_data['SeriousDlqin2yrs'].unique())
print()
print('{}% of the borrowers falling in the serious delinquency '.format((Aionlinecourse_data['SeriousDlqin2yrs'].sum()/len(Aionlinecourse_data))*100))
STEP 3:
Visualizing Target Variable
This snippet generates two visualizations for the target variable ‘SeriousDlqin2yrs’: a pie chart showing the share of serious delinquency and a count plot showing the distribution of the two classes, giving a clear view of the target class distribution in the sample.
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
# Pie chart
Aionlinecourse_data['SeriousDlqin2yrs'].value_counts().plot.pie(
explode=[0, 0.1],
autopct='%1.1f%%',
ax=axes[0],
colors=['skyblue', 'lightcoral']
)
axes[0].set_title('SeriousDlqin2yrs')
# Count plot
sns.countplot(
x='SeriousDlqin2yrs',
data=Aionlinecourse_data,
ax=axes[1],
palette=['skyblue', 'lightcoral'],
hue='SeriousDlqin2yrs'
)
axes[1].set_title('SeriousDlqin2yrs')
# axes[1].legend_.remove() # Removing legend if it's not needed
plt.show()
Counting Target Variable Occurrences
This code counts the occurrences of each unique value in the 'SeriousDlqin2yrs' column, showing how many borrowers fall into each category of serious delinquency.
Aionlinecourse_data['SeriousDlqin2yrs'].value_counts()
Descriptive Statistics
This code displays a summary of the numerical variables contained in the Data Frame, including mean, standard deviation, minimum, maximum, and quartiles, etc.
Aionlinecourse_data.describe()
Dividing Train and Test dataset
In this code, the target variable “SeriousDlqin2yrs” is separated from the feature columns, and then the dataset is divided into training and testing for model training and evaluation purposes (80% training: 20% testing). Furthermore, the size of each of the datasets corresponding to the split is displayed.
data = Aionlinecourse_data.drop(columns = ['SeriousDlqin2yrs'], axis=1)
y = Aionlinecourse_data['SeriousDlqin2yrs']
data_train, data_test, y_train, y_test = train_test_split(data, y, test_size=0.2, random_state=42)
data_train.shape, data_test.shape, y_train.shape, y_test.shape
Calculating Event Rate
The purpose of this code is to determine the event rate (the percentage of borrowers with serious delinquency) for the training set, the test set, and the overall dataset. This helps compare the prevalence of the target variable across the splits.
print('Event rate in the training dataset : ',np.mean(y_train))
print()
print('Event rate in the test dataset : ',np.mean(y_test))
print()
print('Event rate in the entire dataset : ',np.mean(y))
Combining Features and Target for Training
In this snippet, we concatenate the target variable y_train with the feature set data_train to form the complete training dataset, making it ready for model training, and then print the shape of the combined dataset.
train = pd.concat([data_train, y_train], axis=1)
train.shape
Combining Features and Target for Test Dataset
In this snippet, we concatenate the target variable y_test with the feature set data_test to form the complete testing dataset, making it ready for model evaluation, and then print the shape of the combined dataset.
test = pd.concat([data_test, y_test], axis=1)
test.shape
Creating Histogram and Box plot for Distribution
This function generates two plots for a specified column: a histogram to show the distribution and a boxplot to help spot outliers. It also computes and prints the skewness and kurtosis of the data, which describe the shape and the tails of the distribution.
def plot_hist_boxplot(column):
    fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(12, 5))
    # histplot replaces the deprecated distplot; kde=True overlays a density estimate
    sns.histplot(train[train[column].notnull()][column], kde=True, ax=ax1)
    sns.boxplot(y=train[train[column].notnull()][column], ax=ax2)
    print("skewness : ", skew(train[train[column].notnull()][column]))
    print("kurtosis : ", kurtosis(train[train[column].notnull()][column]))
    plt.show()
Plotting Count and Boxplot for Categorical Data
This function provides two illustrations based on a certain categorical column. A countplot is employed to represent how often each category occurs. A boxplot is used to look at how the data is spread and to highlight any outlying values. It also computes and displays skewness and kurtosis statistics that describe the asymmetry and the heaviness of the tails of the distribution respectively.
def plot_count_boxplot(column):
    fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(12, 6))
    sns.countplot(x=train[train[column].notnull()][column], ax=ax1)
    sns.boxplot(y=train[train[column].notnull()][column], ax=ax2)
    print("skewness : ", skew(train[train[column].notnull()][column]))
    print("kurtosis : ", kurtosis(train[train[column].notnull()][column]))
    plt.show()
Creating Histograms and Boxplots for Several Columns
The code applies the plot_hist_boxplot function to a number of columns in the training dataset. For each column it produces a histogram and a boxplot to show how the data is distributed and how many outliers are present, and it reports skewness and kurtosis to describe the shape and tails of the distribution. These distributional properties help characterize the features in the data.
plot_hist_boxplot('RevolvingUtilizationOfUnsecuredLines')
plot_hist_boxplot('age')
plot_hist_boxplot('DebtRatio')
plot_hist_boxplot('MonthlyIncome')
plot_hist_boxplot('NumberOfOpenCreditLinesAndLoans')
plot_hist_boxplot('NumberRealEstateLoansOrLines')
plot_hist_boxplot('NumberOfDependents')
plot_hist_boxplot('NumberOfTimes90DaysLate')
plot_hist_boxplot('NumberOfTime30-59DaysPastDueNotWorse')
Calculating Skewness and Kurtosis for Multiple Columns
This code computes the skewness and kurtosis for several columns and collects the results in a DataFrame, which is then sorted by skewness in descending order to highlight the most heavily skewed features. This shows what the feature distributions look like and how heavy their tails are.
cols_for_stats = ['RevolvingUtilizationOfUnsecuredLines', 'age',
'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome',
'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate',
'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse',
'NumberOfDependents']
skewness = []; kurt = []
for column in cols_for_stats:
    skewness.append(skew(train[train[column].notnull()][column]))
    kurt.append(kurtosis(train[train[column].notnull()][column]))
# Use a distinct name so the scipy `stats` module imported above is not shadowed
stats_df = pd.DataFrame({'Skewness': skewness, 'Kurtosis': kurt}, index=cols_for_stats)
stats_df.sort_values(by=['Skewness'], ascending=False)
Analyzing Unique Values in Training Data
This code checks and prints the distinct values in the '30-59 Days', '60-89 Days', and '90 Days' columns, both when the '30-59 Days' value is >= 90 and when it is < 90, and reports the proportion of the positive class among the special 96/98 values. It aids in understanding how these delinquency columns behave in the training dataset.
print("Unique values in '30-59 Days' values that are more than or equal to 90:",np.unique(train[train['NumberOfTime30-59DaysPastDueNotWorse']>=90]
['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in '60-89 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(train[train['NumberOfTime30-59DaysPastDueNotWorse']>=90]
['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in '90 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(train[train['NumberOfTime30-59DaysPastDueNotWorse']>=90]
['NumberOfTimes90DaysLate']))
print("Unique values in '30-59 Days' values that are less than 90:",np.unique(train[train['NumberOfTime30-59DaysPastDueNotWorse']<90]
['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in '60-89 Days' when '30-59 Days' values are less than 90:",np.unique(train[train['NumberOfTime30-59DaysPastDueNotWorse']<90]
['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in '90 Days' when '30-59 Days' values are less than 90:",np.unique(train[train['NumberOfTime30-59DaysPastDueNotWorse']<90]
['NumberOfTimes90DaysLate']))
print("Proportion of positive class with special 96/98 values:",
round(train[train['NumberOfTime30-59DaysPastDueNotWorse']>=90]['SeriousDlqin2yrs'].sum()*100/
len(train[train['NumberOfTime30-59DaysPastDueNotWorse']>=90]['SeriousDlqin2yrs']),2),'%')
Changing Delinquency Values and Validating Uniqueness of Values
This code modifies the delinquency columns: values in the '30-59 Days', '60-89 Days', and '90 Days' columns that are greater than or equal to 90 (the special 96/98 codes) are replaced with the predetermined values 12, 11, and 17 respectively. It then prints the unique values in these columns to confirm the change and keep the data consistent for analysis.
train.loc[train['NumberOfTime30-59DaysPastDueNotWorse'] >= 90, 'NumberOfTime30-59DaysPastDueNotWorse'] = 12
train.loc[train['NumberOfTime60-89DaysPastDueNotWorse'] >= 90, 'NumberOfTime60-89DaysPastDueNotWorse'] = 11
train.loc[train['NumberOfTimes90DaysLate'] >= 90, 'NumberOfTimes90DaysLate'] = 17
print("Unique values in 30-59Days", np.unique(train['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in 60-89Days", np.unique(train['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in 90Days", np.unique(train['NumberOfTimes90DaysLate']))
Analyzing Unique Values in Test Data
This code checks and prints the distinct values in the '30-59 Days', '60-89 Days', and '90 Days' columns, both when the '30-59 Days' value is >= 90 and when it is < 90. It aids in understanding how these delinquency columns behave in the test dataset.
print("Unique values in '30-59 Days' values that are more than or equal to 90:",np.unique(test[test['NumberOfTime30-59DaysPastDueNotWorse']>=90]
['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in '60-89 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(test[test['NumberOfTime30-59DaysPastDueNotWorse']>=90]
['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in '90 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(test[test['NumberOfTime30-59DaysPastDueNotWorse']>=90]
['NumberOfTimes90DaysLate']))
print("Unique values in '30-59 Days' values that are less than 90:",np.unique(test[test['NumberOfTime30-59DaysPastDueNotWorse']<90]
['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in '60-89 Days' when '30-59 Days' values are less than 90:",np.unique(test[test['NumberOfTime30-59DaysPastDueNotWorse']<90]
['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in '90 Days' when '30-59 Days' values are less than 90:",np.unique(test[test['NumberOfTime30-59DaysPastDueNotWorse']<90]
['NumberOfTimes90DaysLate']))
Changing Delinquency Values and Validating Uniqueness of Values
This code modifies the delinquency columns of the test set: values in the '30-59 Days', '60-89 Days', and '90 Days' columns that are greater than or equal to 90 (the special 96/98 codes) are replaced with the predetermined values 13, 7, and 15 respectively. It then prints the unique values in these columns to confirm the change and keep the data consistent for analysis.
test.loc[test['NumberOfTime30-59DaysPastDueNotWorse'] >= 90, 'NumberOfTime30-59DaysPastDueNotWorse'] = 13
test.loc[test['NumberOfTime60-89DaysPastDueNotWorse'] >= 90, 'NumberOfTime60-89DaysPastDueNotWorse'] = 7
test.loc[test['NumberOfTimes90DaysLate'] >= 90, 'NumberOfTimes90DaysLate'] = 15
print("Unique values in 30-59Days", np.unique(test['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in 60-89Days", np.unique(test['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in 90Days", np.unique(test['NumberOfTimes90DaysLate']))
Summarizing Key Features
The code generates the summary statistics (mean, min, max, etc.) for 'DebtRatio' and 'RevolvingUtilizationOfUnsecuredLines' present in the training dataset. It aids in visualizing the distribution and extent of these significant variables for further evaluation.
print('Debt Ratio: \n',train['DebtRatio'].describe())
print('\nRevolving Utilization of Unsecured Lines: \n',train['RevolvingUtilizationOfUnsecuredLines'].describe())
Analyzing High Debt Ratio Values
This code filters the training set to rows where ‘DebtRatio’ is at or above the 95th percentile and then prints summary statistics for ‘SeriousDlqin2yrs’ (the target) and ‘MonthlyIncome’. This helps analyze the borrowers with the highest debt ratios, in particular their income and delinquency status.
train[train['DebtRatio'] >= train['DebtRatio'].quantile(0.95)][['SeriousDlqin2yrs','MonthlyIncome']].describe()
Counting Specific Conditions
This code counts the records where ‘DebtRatio’ exceeds its 95th percentile and ‘SeriousDlqin2yrs’ is equal to ‘MonthlyIncome’, a combination that most likely reflects data-entry errors. Knowing how many such records exist helps identify inconsistencies in the dataset.
train[(train["DebtRatio"] > train["DebtRatio"].quantile(0.95)) & (train['SeriousDlqin2yrs'] == train['MonthlyIncome'])].shape[0]
Elimination of Certain Conditions from the Dataset
This code removes records from the training dataset in which ‘DebtRatio’ exceeds the 95th percentile and ‘SeriousDlqin2yrs’ equals ‘MonthlyIncome’, creating a cleaned dataset (new_train). It also prints the shape of the new dataset to confirm that the questionable records have been dropped.
new_train = train[~((train["DebtRatio"] > train["DebtRatio"].quantile(0.95)) & (train['SeriousDlqin2yrs'] == train['MonthlyIncome']))]
new_train.shape
This code inspects the cleaned training dataset (new_train) by selecting rows where ‘RevolvingUtilizationOfUnsecuredLines’ is greater than 10 and printing their descriptive statistics. This helps profile borrowers with abnormally high revolving utilization, which may be outliers or genuinely risky financial behavior.
new_train[new_train['RevolvingUtilizationOfUnsecuredLines']>10].describe()
Elimination of High-Revolving Utilization Values
The following code snippet filters the dataset (new_train) to exclude the cases with ‘RevolvingUtilizationOfUnsecuredLines’ of more than 10. It further shows the shape of the output dataset, hence ensuring that it contains only the rows with reasonable values of revolving utilization.
new_train = new_train[new_train['RevolvingUtilizationOfUnsecuredLines'] <= 10]
new_train.shape
Cleaning Test Data for Revolving Utilization
This program serves to refine the test dataset (test) such that only those records where revolving utilization is less than or equal to 10 are retained. It also displays the dimensions of the cleaned test dataset (new_test) making sure that rows with very high revolving utilization values have been removed.
new_test = test[test['RevolvingUtilizationOfUnsecuredLines'] <= 10]
new_test.shape
Descriptive Statistics of Age and Serious Delinquency
This code calculates the descriptive statistics (mean, min, max, and so on) about the ‘age’ and ‘SeriousDlqin2yrs’ columns of the processed training dataset (new_train). This analysis assists in understanding the range of ages and how it relates to serious delinquency within the context of the dataset.
new_train[['age', 'SeriousDlqin2yrs']].describe()
Detecting Incorrect Age Rows
This code inspects rows in the cleaned training data (new_train) whose 'age' value is less than 18, which helps surface logically implausible ages (such as 0) that need to be fixed or removed.
new_train[new_train['age'] < 18]
Dealing with Invalid Age Values
This code turns 'age' values which are equal to 0, to the most occurring value (mode) of the 'age' column. It also checks for the smallest value in the 'age' column to confirm that there are no more invalid values left.
new_train.loc[new_train['age'] == 0, 'age'] = new_train.age.mode()[0]
new_train['age'].min()
Dealing with Missing Data
This code defines a MissingHandler function that computes the percentage of missing values in each column of a dataset and ranks the columns by that percentage. The missing data is then handled by filling ‘MonthlyIncome’ with its median and ‘NumberOfDependents’ with 0 in both new_train and new_test. Finally, MissingHandler is called on new_train to confirm that no missing values remain.
def MissingHandler(df):
    DataMissing = df.isnull().sum()*100/len(df)
    DataMissingByColumn = pd.DataFrame({'Percentage Nulls': DataMissing})
    DataMissingByColumn.sort_values(by='Percentage Nulls', ascending=False, inplace=True)
    return DataMissingByColumn[DataMissingByColumn['Percentage Nulls'] > 0]
#MissingHandler(new_train)
# Assignment form avoids pandas chained-assignment warnings from inplace fillna on a column
new_train['MonthlyIncome'] = new_train['MonthlyIncome'].fillna(new_train['MonthlyIncome'].median())
new_train['NumberOfDependents'] = new_train['NumberOfDependents'].fillna(0)
new_test['MonthlyIncome'] = new_test['MonthlyIncome'].fillna(new_test['MonthlyIncome'].median())
new_test['NumberOfDependents'] = new_test['NumberOfDependents'].fillna(0)
MissingHandler(new_train)
Verifying Test Data for Missing Values
The MissingHandler function is applied to the new_test dataset in the code to determine and present the missing value percentage per column. This helps to confirm that all the identified gaps in the test data have been remedied.
MissingHandler(new_test)
STEP 4:
Checking Missing Values in the Original Dataset
This code applies the MissingHandler function to the original dataset Aionlinecourse_data to calculate and display the percentage of missing values in each column. It helps assess which columns would require further cleaning or imputation before training the models.
MissingHandler(Aionlinecourse_data)
Creating a Correlation Matrix Heatmap
This script generates a heatmap that helps illustrate the correlations of the new_train data. It uses the heatmap function provided by Seaborn to show the feature correlations in a correlation matrix where color intensity represents the correlation value and assists in visualizing correlation relationships among features.
plt.figure(figsize=(10, 6))
sns.heatmap(new_train.corr(), annot=True, cmap='coolwarm')
plt.show()
Creating Box and Violin Plots for Various Features
The function creates both box and violin plots for different features in the new_train dataset split according to the target variable ‘SeriousDlqin2yrs’. It is useful in enabling analysis of the distribution and outliers of each feature against the target class giving an understanding of the spread of the data and the features related to serious delinquency. This allows the users to perform an easy visual comparison of the plots.
import matplotlib.pyplot as plt
import seaborn as sns
def boxplot_violinplot_all():
    # Define columns for x and y values
    x_col = 'SeriousDlqin2yrs'
    y_cols = [
        'age', 'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans',
        'NumberRealEstateLoansOrLines', 'RevolvingUtilizationOfUnsecuredLines',
        'NumberOfDependents', 'NumberOfTime30-59DaysPastDueNotWorse',
        'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTimes90DaysLate',
        'DebtRatio'
    ]
    fig, axes = plt.subplots(5, 4, figsize=(20, 20))
    axes = axes.flatten()  # Flatten the array to iterate easily
    for i, y_col in enumerate(y_cols):
        sns.boxplot(x=x_col, y=y_col, data=new_train, palette='Set3', ax=axes[2*i])
        sns.violinplot(x=x_col, y=y_col, data=new_train, palette='Set3', ax=axes[2*i + 1])
        axes[2*i].set_title(f'Boxplot of {y_col}')
        axes[2*i + 1].set_title(f'Violin Plot of {y_col}')
    plt.tight_layout()
    plt.show()
# Call the function to plot all 10 graphs
boxplot_violinplot_all()
Creating Combined Features
The following code generates 2 additional features; one is 'CombinedPastDue', which is the sum of the '30-59 Days', '60-89 Days', and '90 Days' delinquency columns, while the other is 'CombinedCreditLoans' which combines both 'NumberOfOpenCreditLinesAndLoans' and 'NumberRealEstateLoansOrLines'. These modifications concern new_train as well as new_test. Thereafter, the columns of the new_train dataset are shown to illustrate which new features have been incorporated.
# Making combined features
dataset = [new_train, new_test]
for data in dataset:
data['CombinedPastDue'] = data['NumberOfTime30-59DaysPastDueNotWorse'] + data['NumberOfTime60-89DaysPastDueNotWorse'] + data['NumberOfTimes90DaysLate']
data['CombinedCreditLoans'] = data['NumberOfOpenCreditLinesAndLoans'] + data['NumberRealEstateLoansOrLines']
new_train.columns
Summary of Interaction Features
This code generates new interaction features in the new_train and new_test data sets, such as:
‘MonthlyIncomePerPerson’: Monthly income divided across the household (number of dependents plus the borrower).
‘MonthlyDebt’: The estimated monthly debt, computed from the income and the debt-to-income ratio.
‘isRetired’: Flags customers older than 65 years as likely retired.
‘RevolvingLines’: The number of open credit lines and loans minus real estate loans.
‘hasRevolvingLines’: A binary flag for customers with at least one revolving line.
‘hasMultipleRealEstates’: A binary flag for customers with two or more real estate loans.
‘IsAlone’: Flags customers with no dependents.
These features assist in the understanding of various customer characteristics along with financial tendencies.
# Interaction of the features
for data in dataset:
data['MonthlyIncomePerPerson'] = data['MonthlyIncome']/(data['NumberOfDependents']+1)
data['MonthlyDebt'] = data['MonthlyIncome']*data['DebtRatio']
data['isRetired'] = np.where((data['age'] > 65), 1, 0)
data['RevolvingLines'] = data['NumberOfOpenCreditLinesAndLoans']-data['NumberRealEstateLoansOrLines']
data['hasRevolvingLines'] = np.where((data['RevolvingLines']>0),1,0)
data['hasMultipleRealEstates'] = np.where((data['NumberRealEstateLoansOrLines']>=2),1,0)
data['IsAlone'] = np.where((data['NumberOfDependents']==0),1,0)
new_train.columns
Checking the shape of the datasets
The command provides the dimensions or shapes of the new_train and new_test datasets giving the row and column counts respectively after the completion of all data cleaning and feature engineering stages.
new_train.shape, new_test.shape
Computing the proportion of serious delinquency
This code computes and displays the proportion of serious delinquency (target variable ‘SeriousDlqin2yrs’) in both the new_train and new_test datasets. It serves to check the target class balance, showing the delinquency rates in the training and testing data.
print(new_train['SeriousDlqin2yrs'].sum()/len(new_train))
print()
print(new_test['SeriousDlqin2yrs'].sum()/len(new_test))
Stage of Data Preparation and Class Balancing
The purpose of this code is to separate the features and target variable for the train and test datasets (dropping the target and the borrower ID column, referred to here as 'Customer_ID'; in the raw CSV this ID is the 'Unnamed: 0' column, so it is assumed to have been renamed earlier) and then balance the classes by upsampling the minority class (defaulters) toward the size of the majority class. The upsampled dataset (df_upsampled) removes most of the class distribution bias, and its class counts are checked at the end.
df_train = new_train.drop(columns=['Customer_ID', 'SeriousDlqin2yrs'],axis=1)
y_train = new_train['SeriousDlqin2yrs']
df_test = new_test.drop(columns=['Customer_ID', 'SeriousDlqin2yrs'],axis=1)
y_test = new_test['SeriousDlqin2yrs']
df_majority = new_train[new_train['SeriousDlqin2yrs']==0]
df_minority = new_train[new_train['SeriousDlqin2yrs']==1]
# replacing the samples keeping 100000 as the defaulters to keep in line with the non defaulters
df_minority_upsampled = resample(df_minority, replace=True, n_samples=100000, random_state=42)
df_upsampled = pd.concat([df_majority,df_minority_upsampled])
df_upsampled['SeriousDlqin2yrs'].value_counts()
Concluding the Upsampled Training Data
This script extracts the target variable 'SeriousDlqin2yrs' from the upsampled training set (df_upsampled), stores it in y_train_upsampled, and then drops 'Customer_ID' and 'SeriousDlqin2yrs' from df_upsampled. Finally, it prints the shapes of the upsampled training data, the test data, and the corresponding target variables, confirming the sizes of the finished datasets.
y_train_upsampled = df_upsampled['SeriousDlqin2yrs']
df_upsampled.drop(columns=['Customer_ID', 'SeriousDlqin2yrs'],axis=1, inplace=True)
df_upsampled.shape, df_test.shape, y_train_upsampled.shape, y_test.shape
Checking the Target Variable's Distribution.
This code displays the frequency of different values found in the ‘SeriousDlqin2yrs’ column of the new_train dataset classifying the customers into defaulters (1) and non-defaulters (0) to get a sense of the distribution of the target variable.
new_train['SeriousDlqin2yrs'].value_counts()
Dealing with Class Imbalance through Downsampling
This code downsamples the majority class (non-defaulters) in the training dataset to 8,000 samples to roughly match the number of defaulters in the minority class. It then combines the downsampled majority class with the minority class to produce a balanced dataset (df_downsampled) and displays the class counts to check the balance.
# keeping 8000 as non defaulters to keep in line with the defaulters
df_majority_downsampled = resample(df_majority, n_samples=8000, random_state=42)
df_downsampled = pd.concat([df_minority,df_majority_downsampled])
df_downsampled['SeriousDlqin2yrs'].value_counts()
This code assigns the ‘SeriousDlqin2yrs’ target column of the downsampled dataset (df_downsampled) to y_train_downsampled, separating out the target variable, and drops the ‘Customer_ID’ and ‘SeriousDlqin2yrs’ columns from df_downsampled. Finally, the shapes of the downsampled training data, the test data, and the target variables are displayed to verify the final dataset sizes.
y_train_downsampled = df_downsampled['SeriousDlqin2yrs']
df_downsampled.drop(columns=['Customer_ID', 'SeriousDlqin2yrs'],axis=1, inplace=True)
df_downsampled.shape, df_test.shape, y_train_downsampled.shape, y_test.shape
Utilizing SMOTE Technique For Balancing Classes
This script applies SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic samples and equalize the class distribution. It runs SMOTE on the training data (df_train and y_train) with the k_neighbors parameter set to 2. Finally, it displays the shape of the oversampled feature set (os_data_X) and the proportion of the minority class in the resampled target (os_data_y), confirming the balance.
smote = SMOTE(sampling_strategy = 'minority',k_neighbors = 2,random_state=42)
os_data_X,os_data_y=smote.fit_resample(df_train,y_train)
os_data_X.shape, sum(os_data_y)/len(os_data_y)
Creating Transformed Datasets
The following code creates copies of several datasets for later transformation or standardization:
os_data_X_tranformed: a copy of the SMOTE-resampled feature set.
df_test_transformed: a copy of the original test dataset.
df_test_standaradized: a copy of the test dataset reserved for standardization.
df_downsampled_transformed: a copy of the downsampled training dataset.
df_upsampled_transformed: a copy of the upsampled training dataset.
These copies allow additional processing and transformations while leaving the original datasets untouched.
os_data_X_tranformed = os_data_X.copy()
df_test_transformed = df_test.copy()
df_test_standaradized = df_test.copy()
df_downsampled_transformed = df_downsampled.copy()
df_upsampled_transformed = df_upsampled.copy()
Assessing the Skewness of the Numeric Features
The SkewMeasure function measures the skewness of all the numerical columns of the dataframe (df). It screens and provides only those columns with skewness values more than 0.5 or less than -0.5, which implies a strong skewness of the data. Here, the function is applied to os_data_X_transformed in order to locate features that may need transformations due to excessive skewness.
def SkewMeasure(df):
    nonObjectColList = df.dtypes[df.dtypes != 'object'].index
    skewM = df[nonObjectColList].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
    skewM = pd.DataFrame({'skew': skewM})
    return skewM[abs(skewM) > 0.5].dropna()
SkewMeasure(os_data_X_tranformed)
Using Box-Cox Transformation to Deal with Skewness
The script applies the Box-Cox transformation with λ = 0.15 to the columns of os_data_X_tranformed and df_test_transformed that the SkewMeasure function flags as highly skewed. This reduces skewness and brings the data closer to a normal distribution, which can help model performance. After the transformation, the skewness is rechecked to validate the change.
skewM = SkewMeasure(os_data_X_tranformed)
for i in skewM.index:
    os_data_X_tranformed[i] = special.boxcox1p(os_data_X_tranformed[i], 0.15)  # lambda = 0.15
    df_test_transformed[i] = special.boxcox1p(df_test_transformed[i], 0.15)    # lambda = 0.15
SkewMeasure(os_data_X_tranformed)
Understanding Feature Distributions
The following code produces a grid of histograms with KDE overlays for the features in os_data_X_tranformed. It loops over each column, plotting its distribution with a slightly widened KDE bandwidth for a smoother curve. The histograms are arranged in a 6-by-3 grid with a different color per feature, which helps reveal the distributions and potential problems such as skewness and outliers.
import matplotlib.pyplot as plt
import seaborn as sns
columnList = list(df_test_transformed.columns)
colors = ['purple', 'blue', 'green', 'orange', 'red', 'cyan', 'brown', 'pink', 'gray', 'yellow', 'lime', 'magenta', 'navy', 'teal', 'olive', 'coral', 'indigo', 'maroon']
fig = plt.figure(figsize=[20, 20])
for col, i, color in zip(columnList, range(1, 19), colors):
    axes = fig.add_subplot(6, 3, i)
    sns.histplot(os_data_X_tranformed[col], ax=axes, kde=True, kde_kws={'bw_adjust': 1.5}, color=color)
    axes.set_title(col)  # Adding title for better readability
plt.tight_layout()
plt.show()
Addressing the Skewness in the Training Data
The current implementation copies the df_train dataset, then applies a Box-Cox transformation, setting the λ parameter to 0.15, to columns that are found to be significantly skewed by the SkewMeasure function. The overall goal of the transformation is to reduce the skewness in the data and make it more normal which will help in fitting the machine learning models.
df_train_transformed = df_train.copy()
skewM = SkewMeasure(df_train)
for i in skewM.index:
    df_train_transformed[i] = special.boxcox1p(df_train_transformed[i], 0.15)  # lambda = 0.15
Standardized the Data
This code uses StandardScaler to standardize the os_data_X (training) and df_test (test) feature sets. The scaler is fitted on the training data only and then used to transform both the training and test sets, so that every feature ends up with a mean of 0 and a standard deviation of 1 without leaking information from the test set.
scaler = StandardScaler().fit(os_data_X)
X_train_scaled = scaler.transform(os_data_X)
X_test_scaled = scaler.transform(df_test)
Standardizing Data After Processing Upsampling and Test Data
This script standardizes df_upsampled_transformed (the upsampled training set) and df_test_standaradized (the test-set copy) using StandardScaler. The scaler is fitted on the upsampled training set and then used to transform both that set and the test copy, centering the data (mean 0) and scaling it to unit standard deviation. This keeps the feature scaling consistent between the training and testing datasets.
scaler = StandardScaler().fit(df_upsampled_transformed)
X_train_scaled_upsampled = scaler.transform(df_upsampled_transformed)
X_test_scaled_upsampled = scaler.transform(df_test_standaradized)
STEP 5:
Normalizing the Downsampled and Testing Data
This code uses StandardScaler to standardize df_downsampled_transformed (the downsampled training set) and df_test_standaradized (the test-set copy). The scaler is fitted on the downsampled training data first, and both the downsampled training set and the test set are then transformed to mean 0 and standard deviation 1, which helps models that are sensitive to feature scale.
scaler = StandardScaler().fit(df_downsampled_transformed)
X_train_scaled_downsampled = scaler.transform(df_downsampled_transformed)
X_test_scaled_downsampled = scaler.transform(df_test_standaradized)
Custom Metrics of Precision and Recall
The following code defines custom precision and recall metrics using Keras backend operations. Precision is the number of true positives divided by the number of predicted positives, while recall is the number of true positives divided by the number of actual positives. In both cases a small epsilon is added to avoid division-by-zero errors.
import tensorflow.keras.backend as K
# custom metrics
def precision(y_true, y_pred):
    # Cast y_true to float32 to match the data type of y_pred
    y_true = K.cast(y_true, dtype='float32')
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision
def recall(y_true, y_pred):
    # Cast y_true to float32 to match the data type of y_pred
    y_true = K.cast(y_true, dtype='float32')
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall
Model Training Function
The fit_model function outlines a neural network that features dense layers, and dropout to avoid overfitting and utilizes a sigmoid function at the output layer for binary class prediction. It employs the Adam optimizer alongside a binary cross entropy loss and specifies the precision and recall as the metrics to be monitored. The best model is saved via ModelCheckpoint and overfitting during training is avoided using EarlyStopping.
def fit_model(data, labels, test_data, test_label, epochs, batch_size):
    n_inputs = data.shape[1]
    model = keras.Sequential()
    model.add(layers.Dense(16, activation='relu', input_shape=(n_inputs, )))
    model.add(layers.Dropout(0.25))
    model.add(layers.Dense(32, activation='relu'))
    model.add(layers.Dropout(0.25))
    model.add(layers.Dense(1, activation='sigmoid'))
    model_file_name = 'MLP_predict_default_case_study.keras'
    ckpt = ModelCheckpoint(model_file_name, monitor='val_precision',
                           save_best_only=True, mode='max')
    early = EarlyStopping(monitor="val_recall", mode="max", patience=15)
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=[precision, recall])
    history = model.fit(data,
                        labels,
                        epochs=epochs,
                        batch_size=batch_size,
                        callbacks=[ckpt, early],
                        validation_data=(test_data, test_label))
    return model
Calculating metrics and making plots at different thresholds for Precision, Recall, and F1 measures
The compute_precisions_thresolds function first fits the model with fit_model and then evaluates the quality of its predictions at different thresholds. It generates predictions on the test set and computes precision, recall, and F1 score, with both macro and micro averaging, at threshold values from 0.25 to 0.96. The results are collected in a DataFrame sorted by threshold, making it possible to analyze performance over a range of decision cutoffs.
def compute_precisions_thresolds(data, labels, test_data, test_label, epochs, batch_size):
    trained_model = fit_model(data, labels, test_data, test_label, epochs=epochs, batch_size=batch_size)
    y_test_pred = trained_model.predict(test_data)
    P_macro = []; P_micro = []; R_macro = []; R_micro = []; F1_macro = []; F1_micro = []; cut_off = []; metrics = pd.DataFrame()
    threshold_list = [0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.96]
    for thres in threshold_list:
        cut_off.append(thres)
        y_test_pred_new = [1 if el > thres else 0 for el in y_test_pred]
        prec_macro = round(precision_score(test_label, y_test_pred_new, pos_label=1, average='macro'), 2)
        P_macro.append(prec_macro)
        prec_micro = round(precision_score(test_label, y_test_pred_new, pos_label=1, average='micro'), 2)
        P_micro.append(prec_micro)
        rec_macro = round(recall_score(test_label, y_test_pred_new, pos_label=1, average='macro'), 2)
        R_macro.append(rec_macro)
        rec_micro = round(recall_score(test_label, y_test_pred_new, pos_label=1, average='micro'), 2)
        R_micro.append(rec_micro)
        f1_macro = round(f1_score(test_label, y_test_pred_new, average='macro'), 2)
        F1_macro.append(f1_macro)
        f1_micro = round(f1_score(test_label, y_test_pred_new, average='micro'), 2)
        F1_micro.append(f1_micro)
    metrics = pd.DataFrame({'Threshold': cut_off, 'Precision Macro': P_macro, 'Precision Micro': P_micro, 'Recall Macro': R_macro, 'Recall Micro': R_micro, 'F1 Score Macro': F1_macro, 'F1 Score Micro': F1_micro})
    return metrics.sort_values(by=['Threshold'], ascending=False)
Assessing the Performance of Box-Cox Transformed Data Models
This code calls the compute_precisions_thresolds function with the Box-Cox transformed training data, the target labels, the test data, and the test labels. The model is trained for 15 epochs with a batch size of 128, and the precision, recall, and F1 scores at the different thresholds are recorded in the box_cox_metrics DataFrame. This makes it easy to assess the model at different decision thresholds.
box_cox_metrics = compute_precisions_thresolds(os_data_X_tranformed, os_data_y, df_test_transformed, y_test,epochs=15, batch_size=128)
This will display the metrics for each threshold, helping you assess the model’s performance at various decision cutoffs.
box_cox_metrics
Assessing the Performance of Standardized Data
This code calls the compute_precisions_thresolds function with the standardized training data (X_train_scaled), target labels (os_data_y), standardized test data (X_test_scaled), and test labels (y_test). The model is trained for 15 epochs with a batch size of 128, and the precision, recall, and F1 scores at the different thresholds are recorded in the standardized_metrics DataFrame.
standardized_metrics = compute_precisions_thresolds(X_train_scaled, os_data_y, X_test_scaled, y_test,epochs=15, batch_size=128)
This will display the metrics for each threshold, helping you assess the model’s performance at various decision cutoffs.
standardized_metrics
Normalizing the Train and Test Set
This code uses StandardScaler to standardize the df_train and df_test datasets. The scaler is fitted on the training data, and then both the training and test sets are transformed to have a mean of zero and a standard deviation of one. This is an important step for algorithms that are sensitive to feature scaling.
scaler = StandardScaler().fit(df_train)
df_train_scaled = scaler.transform(df_train)
df_test_scaled = scaler.transform(df_test)
Assessing Metrics of the Model Under the Standardized Data
This code calls the compute_precisions_thresolds function with the standardized training data (df_train_scaled), training targets (y_train), standardized test data (df_test_scaled), and test targets (y_test). The model is trained for 10 epochs with a batch size of 128, and the precision, recall, and F1 scores computed at several thresholds are collected in the base_metrics DataFrame.
base_metrics = compute_precisions_thresolds(df_train_scaled, y_train, df_test_scaled, y_test, epochs=10, batch_size=128)
This will display the metrics for each threshold, helping you assess the model’s performance at various decision cutoffs.
base_metrics
Assessing the Performance of the Model on Upsampled Data
This code calls the compute_precisions_thresolds function with the scaled upsampled training data (X_train_scaled_upsampled), the corresponding labels (y_train_upsampled), the correspondingly scaled test data (X_test_scaled_upsampled), and the test labels (y_test). Training runs for 10 epochs with a batch size of 128, and precision, recall, and F1 scores are computed at the different thresholds and recorded in the upsampled_metrics DataFrame. This assesses the model trained on upsampled data across various decision cutoffs.
upsampled_metrics = compute_precisions_thresolds(X_train_scaled_upsampled, y_train_upsampled, X_test_scaled_upsampled, y_test, epochs=10, batch_size=128)
This will display the metrics for each threshold, helping you assess the model’s performance at various decision cutoffs.
upsampled_metrics
Assessing the Performance of the Model on Downsampled Data
This code calls the compute_precisions_thresolds function with the scaled downsampled training data (X_train_scaled_downsampled), the corresponding labels (y_train_downsampled), the correspondingly scaled test data (X_test_scaled_downsampled), and the test labels (y_test). Training runs for 10 epochs with a batch size of 128, and precision, recall, and F1 scores are computed at the different thresholds and recorded in the downsampled_metrics DataFrame. This assesses the model trained on downsampled data across various decision cutoffs.
downsampled_metrics = compute_precisions_thresolds(X_train_scaled_downsampled, y_train_downsampled, X_test_scaled_downsampled, y_test, epochs=10, batch_size=128)
This will display the metrics for each threshold, helping you assess the model’s performance at various decision cutoffs.
downsampled_metrics
cal_score: It calculates performance measures such as precision, recall, F1 score and a confusion matrix.
metrics_calculation: Evaluates a classifier via cross-validation on the training data, then fits it and computes precision, recall, F1 score, and the confusion matrix on the test data.
clf_dict: This specifies a number of classifiers used for evaluation - Random Forest, XGBoost, Logistic Regression, and Light GBM.
Such functions provide an extensive evaluation of different models.
def cal_score(y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred)
    prec_scr_macro = precision_score(y_test, y_pred, average='macro')*100
    prec_scr_micro = precision_score(y_test, y_pred, average='micro')*100
    rec_scr_macro = recall_score(y_test, y_pred, average='macro')*100
    rec_scr_micro = recall_score(y_test, y_pred, average='micro')*100
    f1_scr_macro = f1_score(y_test, y_pred, average='macro')*100
    f1_scr_micro = f1_score(y_test, y_pred, average='micro')*100
    return prec_scr_macro, prec_scr_micro, rec_scr_macro, rec_scr_micro, f1_scr_macro, f1_scr_micro, cm
def metrics_calculation(classifier, training_data, testing_data, training_label, testing_label):
    result = []
    cols = ['Mean Accuracy', 'Accuracy deviation', 'Precision Macro', 'Precision Micro', 'Recall Macro', 'Recall Micro', 'F1 Score Macro', 'F1 Score Micro', 'Confusion Matrix']
    crs_val = cross_val_score(classifier, training_data, training_label, cv=5)
    mean_acc = round(np.mean(crs_val), 3)
    std_acc = round(np.std(crs_val), 3)
    classifier.fit(training_data, training_label)
    predictions = classifier.predict(testing_data)
    prec_scr_macro, prec_scr_micro, rec_scr_macro, rec_scr_micro, f1_scr_macro, f1_scr_micro, cm = cal_score(testing_label, predictions)
    result.extend([mean_acc, std_acc, prec_scr_macro, prec_scr_micro, rec_scr_macro, rec_scr_micro, f1_scr_macro, f1_scr_micro, cm])
    series_result = pd.Series(data=result, index=cols)
    return series_result
clf_dict = {
'Random Forest': RandomForestClassifier(random_state=42),
'XGBoost': XGBClassifier(random_state=42),
'Logistic Regression' : LogisticRegression(random_state=42),
'Light GBM' : LGBMClassifier(random_state=42)
}
Evaluating Multiple Classifiers and Storing Results
The function metrics_calculation is applied to the Random Forest, XGBoost, Logistic Regression, and LightGBM classifiers in clf_dict, using the SMOTE-resampled, Box-Cox-transformed training data (os_data_X_tranformed and os_data_y) and the transformed test data (df_test_transformed and y_test). The results are first collected in a dictionary (frame) and then converted into a DataFrame (box_cox_smote_df) for easier comparison of model performance across metrics.
frame = {}
for key in clf_dict:
    classifier_result = metrics_calculation(clf_dict[key], os_data_X_tranformed, df_test_transformed, os_data_y, y_test)
    frame[key] = classifier_result
box_cox_smote_df = pd.DataFrame(frame)
box_cox_smote_df
Assessing Classifiers on Normalized Data
This code calls the metrics_calculation function to evaluate each classifier in clf_dict (Random Forest, XGBoost, Logistic Regression, and LightGBM) on the standardized data X_train_scaled and X_test_scaled, using the resampled target labels os_data_y and the test labels y_test. The findings are stored in a dictionary (frame_std), which is then converted into a DataFrame (standardized_smote_df) to facilitate performance comparison of the models on the standardized, SMOTE-resampled data.
frame_std = {}
for key in clf_dict:
    classifier_result_std = metrics_calculation(clf_dict[key], X_train_scaled, X_test_scaled, os_data_y, y_test)
    frame_std[key] = classifier_result_std
standardized_smote_df = pd.DataFrame(frame_std)
standardized_smote_df
Defining Classifiers with Class Balance
In this code, a new set of classifiers (clf_dict_balanced) is introduced to mitigate class imbalance by increasing the weight of the minority class (class 1). Random Forest and Logistic Regression take a class_weight dictionary, while XGBoost and LightGBM use the scale_pos_weight parameter, making the models more sensitive to the rare class (defaulters).
clf_dict_balanced = {
    'Random Forest': RandomForestClassifier(random_state=42, class_weight={0: 1, 1: 10}),
    'XGBoost': XGBClassifier(random_state=42, scale_pos_weight=10),
    'Logistic Regression': LogisticRegression(random_state=42, class_weight={0: 1, 1: 10}),
    'Light GBM': LGBMClassifier(random_state=42, scale_pos_weight=10)
}
Assessing Classifier Performance with Different Class Weights
The procedure evaluates every classifier in clf_dict_balanced (Random Forest, XGBoost, Logistic Regression, LightGBM) using the metrics_calculation function on the original training data (df_train) and test data (df_test), with the class weights adjusted to deal with class imbalance. The outputs are kept in a dictionary (frame_balanced), which is then converted into a DataFrame (balanced_df) to compare model performance when class-balancing techniques are applied.
frame_balanced = {}
for key in clf_dict_balanced:
    classifier_result_balanced = metrics_calculation(clf_dict_balanced[key], df_train, df_test, y_train, y_test)
    frame_balanced[key] = classifier_result_balanced
balanced_df = pd.DataFrame(frame_balanced)
balanced_df
Evaluating Classifiers on Transformed Data with Class Weighting
The code evaluates each classifier in clf_dict_balanced (Random Forest, XGBoost, Logistic Regression, LightGBM) using the metrics_calculation function on the transformed training (df_train_transformed) and test (df_test_transformed) datasets, employing class weighting to mitigate class imbalance. The outcomes are kept in a dictionary (frame_balanced_scaled) and then converted into a DataFrame (balanced_df_scaled) to assess the performance of models trained on the transformed data with class balancing.
frame_balanced_scaled = {}
for key in clf_dict_balanced:
    classifier_result_balanced_scaled = metrics_calculation(clf_dict_balanced[key], df_train_transformed, df_test_transformed, y_train, y_test)
    frame_balanced_scaled[key] = classifier_result_balanced_scaled
balanced_df_scaled = pd.DataFrame(frame_balanced_scaled)
balanced_df_scaled
Drawing ROC Curve for Various Models
The plot_multiplt_rocauc function fits and evaluates several classifiers (Random Forest, XGBoost, Logistic Regression, and LightGBM) on the transformed training set (os_data_X_tranformed and os_data_y) and test set (df_test_transformed and y_test). The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are computed for each model. The ROC curves of all the models are drawn on a single plot, allowing a visual comparison of the trade-off between sensitivity (True Positive Rate) and 1-specificity (False Positive Rate) for each model.
models = [
    {
        'label': 'Random Forest',
        'model': RandomForestClassifier(random_state=42)
    },
    {
        'label': 'XGBoost',
        'model': XGBClassifier(random_state=42)
    },
    {
        'label': 'Logistic Regression',
        'model': LogisticRegression(random_state=42)
    },
    {
        'label': 'Light GBM',
        'model': LGBMClassifier(random_state=42)
    }
]
def plot_multiplt_rocauc(models, train_X, train_y, dev_X, dev_y):
    for m in models:
        model = m['model']
        model.fit(train_X, train_y)
        y_pred = model.predict(dev_X)
        pred = model.predict_proba(dev_X)
        pred_new = [i[1] for i in pred]
        fpr, tpr, thresholds = roc_curve(dev_y, pred_new)
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % (m['label'], roc_auc))
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('1-Specificity (False Positive Rate)')
    plt.ylabel('Sensitivity (True Positive Rate)')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()  # Display the combined ROC plot
    return
plot_multiplt_rocauc(models,os_data_X_tranformed,os_data_y, df_test_transformed, y_test)
Assessing Several Models
The compare_models function trains and compares four machine learning models (Logistic Regression, Random Forest, XGBoost, and LightGBM) on the given training and test data. It applies several evaluation criteria to each model (accuracy, precision, recall, and F1 score) and saves the results in a dictionary, which is then printed as a DataFrame for side-by-side comparison. The function also plots the ROC curve for each model, showing the True Positive Rate against the False Positive Rate, with each model's AUC (Area Under the Curve) shown in the legend.
# Comparison of 4 models
from sklearn.metrics import roc_curve, auc, confusion_matrix, precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

def compare_models(models, X_train, y_train, X_test, y_test):
    """
    Compares the performance of multiple machine learning models.

    Args:
        models: A dictionary of models where keys are model names and values are model instances.
        X_train: Training data features.
        y_train: Training data labels.
        X_test: Test data features.
        y_test: Test data labels.
    """
    results = {}
    for model_name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        results[model_name] = {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1': f1
        }
    # Create a DataFrame to display the results
    df_results = pd.DataFrame(results).transpose()
    print(df_results)
    # Plot ROC AUC for each model
    plt.figure(figsize=(10, 6))
    for model_name, model in models.items():
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve Comparison')
    plt.legend(loc='lower right')
    plt.show()
# Example usage (assuming you have your models, X_train, y_train, X_test, y_test defined)
models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(),
    'XGBoost': XGBClassifier(),
    'LightGBM': LGBMClassifier()
}
compare_models(models, X_train_scaled, os_data_y, X_test_scaled, y_test)
Model Comparison using Hyperparameters and Class Weights
This code defines four classifiers (Random Forest, XGBoost, Logistic Regression, and LightGBM) with specific hyperparameter settings and class-weighting parameters to mitigate class imbalance. The plot_multiplt_rocauc function is then used to draw the ROC curves for these models and measure their performance in terms of AUC (Area Under the Curve) on the training and test datasets.
models_balanced = [
    {
        'label': 'Random Forest',
        'model': RandomForestClassifier(max_depth=15, min_samples_leaf=8, n_estimators=200, random_state=42, class_weight={0: 1, 1: 10})
    },
    {
        'label': 'XGBoost',
        'model': XGBClassifier(gamma=1, max_depth=8, n_estimators=200, random_state=42, reg_alpha=0.5, reg_lambda=1.15, scale_pos_weight=10)
    },
    {
        'label': 'Logistic Regression',
        'model': LogisticRegression(random_state=42, class_weight={0: 1, 1: 10})
    },
    {
        'label': 'Light GBM',
        'model': LGBMClassifier(colsample_bytree=0.65, max_depth=4, min_data_in_leaf=400, min_split_gain=0.25, num_leaves=70, random_state=42, reg_lambda=5, subsample=0.65, scale_pos_weight=10)
    }
]
plot_multiplt_rocauc(models_balanced, df_train, y_train, df_test, y_test)
Training LightGBM Model with Custom Hyperparameters
This code trains a LightGBM classifier (model_lgb) with custom hyperparameters such as max_depth, num_leaves, subsample, reg_lambda, and scale_pos_weight to compensate for class imbalance. The model is fit to the df_train data with the corresponding y_train labels. The verbose=-1 setting mutes the training output for tidier logs.
from lightgbm import LGBMClassifier
model_lgb = LGBMClassifier(
    colsample_bytree=0.65,
    max_depth=4,
    min_data_in_leaf=400,
    min_split_gain=0.25,
    num_leaves=70,
    random_state=42,
    reg_lambda=5,
    subsample=0.65,
    scale_pos_weight=10,
    verbose=-1  # Suppress training output
)
model_lgb.fit(df_train, y_train)
Ensuring Consistency Between Train and Test Data Columns
The purpose of the code is to ensure that the train and test datasets share the same column structure: any extra columns present only in the test data are dropped, and any columns present in the train data but missing from the test data are added and filled with 0. The trained LightGBM model is then used to predict on the aligned test data, and the results are stored in a new 'predictions' column of df_test.
train_cols = df_train.columns.tolist()
# Create a list of test data columns
test_cols = df_test.columns.tolist()
# Find columns that are missing from train data but present in test data
missing_cols_in_train = [col for col in test_cols if col not in train_cols]
# If there are missing columns, drop them from the test data
df_test = df_test.drop(columns=missing_cols_in_train)
# Add columns that are in train data but not in test data
missing_cols_in_test = [col for col in train_cols if col not in test_cols]
for col in missing_cols_in_test:
    df_test[col] = 0  # Fill with 0 or any suitable value based on your dataset
y_pred = model_lgb.predict(df_test)
df_test['predictions'] = y_pred
The following code defines a lambda function called predict_model_lgb, which accepts data x as an argument and returns the class probabilities predicted by the LightGBM model (model_lgb) as floats. This allows downstream tools to work with probability predictions rather than hard class labels.
predict_model_lgb = lambda x: model_lgb.predict_proba(x).astype(float)
Explaining Model Predictions with SHAP
The following code uses SHAP to interpret the predictions produced by the LightGBM model. It computes SHAP values for the test data and displays a summary plot showing feature importance for class 1 in the binary case, or for all classes when a list of SHAP arrays is returned.
X_importance = df_test.drop(columns=['predictions'])  # Drop the 'predictions' column
# Explain model predictions using the shap library:
explainer = shap.TreeExplainer(model_lgb)
shap_values = explainer.shap_values(X_importance)
# Some SHAP versions return a list of arrays (one per class) for classifiers
if isinstance(shap_values, list):
    # Plot the summary for class 1 (use shap_values[0] for class 0)
    shap.summary_plot(shap_values[1], X_importance)
else:
    # A single array is returned; plot it directly
    shap.summary_plot(shap_values, X_importance)
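For a closer look at how a single feature drives the predictions, a SHAP dependence plot can also be drawn. The sketch below reuses X_importance and shap_values from above and simply picks the first feature column; adjust the index to the feature you care about.
# Dependence plot for the first feature (handles both list and array SHAP outputs)
vals = shap_values[1] if isinstance(shap_values, list) else shap_values
shap.dependence_plot(0, vals, X_importance)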
The following code uses LIME to explain individual forecasts made by the LightGBM model. It first label-encodes any categorical variables in the test data, then builds a LIME tabular explainer and produces an explanation for a single instance (row 45). The explanation indicates which features had an impact on the prediction, which improves the interpretability of the model.
import pandas as pd
import shap
import lime
import lime.lime_tabular
from sklearn.preprocessing import LabelEncoder
# Assuming 'df_test' is your DataFrame and 'model_lgb' is your trained model
# 1. Preprocessing: Handle categorical features if any
# Convert categorical features to numerical using Label Encoding
categorical_features = df_test.select_dtypes(include=['object']).columns # Identify categorical columns
for feature in categorical_features:
    le = LabelEncoder()
    df_test[feature] = le.fit_transform(df_test[feature])  # Encode categorical features
# 2. Create the explainer with the correct feature information
predict_model_lgb = lambda x: model_lgb.predict_proba(x).astype(float)
explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=df_test.drop(columns=['predictions']).values,  # Use data without the 'predictions' column
    feature_names=df_test.drop(columns=['predictions']).columns.tolist(),  # Provide feature names
    class_names=[0, 1],  # Assuming binary classification
    mode="classification",  # Specify classification mode
    categorical_features=[df_test.drop(columns=['predictions']).columns.get_loc(c) for c in categorical_features],  # Column indices of the categorical features
    discretize_continuous=True  # Discretize continuous features if needed
)
# 3. Explain the instance
i = 45
X_observation = df_test.iloc[[i], :].drop(columns=['predictions']) # Drop 'predictions' from observation
explanation = explainer.explain_instance(
    data_row=X_observation.values[0],
    predict_fn=predict_model_lgb,
    num_features=X_observation.shape[1]  # Use the correct number of features
)
explanation.show_in_notebook(show_table=True, show_all=False)
print(explanation.score)
i = 25
X_observation = df_test.iloc[[i], :].drop(columns=['predictions']) # Drop 'predictions' from the observation
# explanation for another instance using the LightGBM model
explanation = explainer.explain_instance(X_observation.values[0], predict_model_lgb)
explanation.show_in_notebook(show_table=True, show_all=False)
print(explanation.score)
Conclusion
This project successfully demonstrates the process of building a machine learning model for predicting customer defaults. By applying various data preprocessing techniques, including handling missing values, feature engineering, and addressing class imbalance using SMOTE and downsampling, the dataset was prepared for model training. We evaluated multiple machine learning models such as Random Forest, XGBoost, Logistic Regression, and LightGBM using performance metrics like accuracy, precision, recall, and AUC-ROC. Additionally, we used LIME and SHAP to explain and interpret the model predictions, providing valuable insights into how specific features influence the outcomes. Ultimately, this approach not only ensures robust predictions but also enhances the interpretability and transparency of the machine learning models, making them more practical for real-world applications in financial decision-making.
Challenges New Coders Might Face
Challenge: Handling Missing Data
Solution: Use imputation methods such as replacing missing values with the mean or median, or more advanced approaches such as KNN (K-nearest neighbors) imputation.
Challenge: Class Imbalance
Solution: Leverage SMOTE (Synthetic Minority Over-sampling Technique) to create synthetic samples of the minority class and balance the dataset. This ensures the model learns from both classes rather than only the majority class.
Challenge: Model Interpretability
Solution: Many machine learning algorithms, for instance XGBoost and Random Forest, are often referred to as 'black boxes', making it challenging to interpret how conclusions are drawn. Employ LIME and SHAP for model interpretability, which helps in recognizing significant features and rationalizing model predictions, enhancing transparency and confidence in the model.
Challenge: Feature Selection
Solution: Perform feature selection using techniques like RFE (Recursive Feature Elimination) or feature importance from models like Random Forest to keep only the most significant features.
Challenge: Hyperparameter Tuning
Solution: Use Grid Search or Random Search for hyperparameter tuning to systematically find the optimal settings. These techniques carry out the tuning process automatically, which tends to enhance model performance with minimal effort; a sketch is shown after this list.
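As a rough illustration of the hyperparameter tuning step, the sketch below runs a small GridSearchCV over a LightGBM classifier. The parameter grid values are assumptions chosen for brevity, and the df_train / y_train names follow the variables used earlier in this project; adapt both to your own data.
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier

# Hypothetical, deliberately small grid; widen it as compute allows
param_grid = {
    'max_depth': [4, 8],
    'num_leaves': [31, 70],
    'scale_pos_weight': [1, 10]
}
grid_search = GridSearchCV(
    estimator=LGBMClassifier(random_state=42, verbose=-1),
    param_grid=param_grid,
    scoring='f1',  # Optimize for F1 given the imbalanced classes
    cv=3,
    n_jobs=-1
)
grid_search.fit(df_train, y_train)  # Assumes df_train / y_train from earlier steps
print(grid_search.best_params_, grid_search.best_score_)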
Frequently Asked Questions (FAQs)
Question 1: What are the common challenges in building a customer default prediction model?
Answer: Common challenges include class imbalance, missing data, overfitting, feature selection, and model interpretability. These arise from the nature of the data and the algorithms employed, and may warrant techniques such as SMOTE, cross-validation, and LIME/SHAP, among others.
Question 2: How can I handle class imbalance when predicting customer defaults?
Answer: Apply SMOTE to oversample the minority class, or use class weights in Logistic Regression and LightGBM to penalize errors on the minority class (defaulters) more heavily and improve predictive accuracy on that class. A minimal sketch follows.
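Below is a minimal sketch of both options, assuming the feature matrix and labels are named X_train and y_train (adapt the names to your own variables).
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Option 1: oversample the minority class with SMOTE
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

# Option 2: keep the original data but weight the minority class more heavily
clf = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000, random_state=42)
clf.fit(X_train, y_train)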
Question 3: How can I enhance the interpretability of a customer default prediction model?
Answer: Use SHAP and LIME to explain the model output by showing how each feature contributes to the prediction for a given class, which is critical in financial use cases.
Question 4: Why is feature engineering crucial for customer default prediction?
Answer: Feature engineering creates new features by transforming or combining existing ones, which can improve model performance. In this project, for example, the delinquency-related columns are merged into a single feature and financial ratios such as the Debt Ratio are computed; see the sketch after this answer.
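As an illustration only, the snippet below computes two such engineered features on a toy DataFrame; the column names (past_due_*, monthly_debt, monthly_income) are hypothetical placeholders and should be replaced with the actual delinquency and income/debt columns in your dataset.
import pandas as pd

# Hypothetical column names; replace them with the ones in your dataset
df = pd.DataFrame({
    'past_due_30_59': [0, 1, 2],
    'past_due_60_89': [0, 0, 1],
    'past_due_90_plus': [1, 0, 0],
    'monthly_debt': [500.0, 1200.0, 300.0],
    'monthly_income': [4000.0, 3000.0, 2500.0]
})

# Merge the delinquency columns into a single total-delinquency feature
df['total_delinquencies'] = df[['past_due_30_59', 'past_due_60_89', 'past_due_90_plus']].sum(axis=1)

# Compute a simple Debt Ratio feature
df['debt_ratio'] = df['monthly_debt'] / df['monthly_income']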
Question 5: How do I handle imbalanced datasets in machine learning for default prediction?
Answer: Use SMOTE (Synthetic Minority Over-sampling Technique) or similar oversampling methods to generate more samples of the minority class, and/or downsample the majority class, so that the ratio of defaulters to non-defaulters lets the model make accurate predictions for both groups.