
Credit Card Default Prediction Using Machine Learning Techniques
This project develops and evaluates machine learning models for predicting customer defaults, helping businesses assess the risk that a customer will fail to repay a loan or credit balance. The project covers data preparation, handling the imbalanced classes with techniques such as SMOTE over-sampling and down-sampling of the majority class, and feature engineering steps such as Box-Cox transformations to reduce skewness. To make the models more interpretable, we employ SHAP and LIME to explain how the models arrive at their predictions. Together, these steps are intended to improve both the performance and the transparency of the default prediction models.
Project Overview
The project begins with data preprocessing, which involves handling missing values, converting categorical variables to numerical form with Label Encoding, and applying the Box-Cox transformation to skewed features where applicable. Next is feature engineering, where additional features are derived from relationships among the existing ones; for example, the delinquency columns are merged and financial ratios are computed. A range of classification models is then implemented, including Random Forest, XGBoost, Logistic Regression, and LightGBM. Hyperparameter tuning is combined with class-balancing techniques such as SMOTE and class-weight adjustment to mitigate the effects of the imbalanced dataset. The models are assessed on key metrics including accuracy, precision, recall, F1 score, and AUC-ROC to determine how effective they are. To make the predictions easier to interpret, LIME and SHAP are used to show how individual features contribute to each prediction. Finally, the project presents a case study demonstrating how multiple algorithms and preprocessing techniques can be combined to predict customer defaults for the benefit of financial institutions.
Prerequisites
- Python Programming: Ability to implement algorithms and manipulate data with basic Python.
- Machine Learning Basics: Understanding of classification algorithms, evaluation metrics, and how to prevent overfitting.
- Pandas and NumPy: Skills in data manipulation and basic numerical computation.
- Scikit-learn: Familiarity with this widely used machine learning library for data preprocessing, model training, and evaluation.
- SMOTE and Under-sampling Techniques: Familiarity with approaches for handling imbalanced classes to improve model performance.
- SHAP and LIME: Familiarity with tools for explaining predictions and quantifying feature contributions.
- LightGBM, XGBoost, Random Forest, Logistic Regression: Familiarity with these widely used classification algorithms.
Approach
The procedure starts with data preprocessing, where the dataset is cleaned and prepared: missing values are handled, categorical variables are encoded with Label Encoding, and skewed numerical features are corrected with the Box-Cox transformation so that the data is ready for machine learning algorithms. Feature engineering then creates additional features from interactions among the most important existing ones, such as merging the delinquency columns and computing financial metrics like Debt Ratio and Revolving Utilization.
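As a rough illustration of these feature engineering and Box-Cox steps, the sketch below works on a small toy DataFrame; the column names (past_due_30_59, monthly_income, total_debt, and so on) are hypothetical placeholders rather than the project's actual columns.
import pandas as pd
from scipy import stats

# Toy data with hypothetical column names standing in for the real dataset
df = pd.DataFrame({
    "past_due_30_59": [0, 1, 2, 0, 1],
    "past_due_60_89": [0, 0, 1, 0, 2],
    "past_due_90":    [0, 0, 3, 1, 0],
    "monthly_income": [4500.0, 6200.0, 3100.0, 8000.0, 2700.0],
    "total_debt":     [1200.0, 4100.0, 2900.0, 1500.0, 3300.0],
})

# Merge the delinquency columns into a single count of past-due events
df["total_past_due"] = df[["past_due_30_59", "past_due_60_89", "past_due_90"]].sum(axis=1)

# Compute a simple Debt Ratio from existing columns
df["debt_ratio"] = df["total_debt"] / df["monthly_income"]

# Box-Cox requires strictly positive values, so shift by 1 before transforming
df["income_boxcox"], fitted_lambda = stats.boxcox(df["monthly_income"] + 1)
print(df.head())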
To resolve the class imbalance in the dataset, we over-sample the minority class with SMOTE (Synthetic Minority Over-sampling Technique) and reduce the size of the majority class through down-sampling. Once the data is prepared, we select several machine learning algorithms, including Random Forest, XGBoost, Logistic Regression, and LightGBM, and train them on the processed data.
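The sketch below shows one way to combine SMOTE over-sampling with random under-sampling through an imbalanced-learn pipeline; the synthetic data and the sampling ratios are illustrative assumptions rather than the project's actual settings.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Imbalanced toy data standing in for the credit dataset
X, y = make_classification(n_samples=5000, weights=[0.93, 0.07], random_state=42)
print("Before:", Counter(y))

# Over-sample the minority class to 30% of the majority, then
# under-sample the majority class down to a 2:1 ratio
resampler = Pipeline(steps=[
    ("smote", SMOTE(sampling_strategy=0.3, random_state=42)),
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
])
X_res, y_res = resampler.fit_resample(X, y)
print("After:", Counter(y_res))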
Subsequently, the trained models are assessed against a number of performance measures, such as accuracy, precision, recall, F1 score, and AUC-ROC, to establish how well they predict the likelihood of a customer defaulting. The performance of the strongest models is further improved through hyperparameter tuning, and the best model is chosen based on these results.
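As a minimal illustration of this evaluation step, the sketch below trains a Random Forest on synthetic data and reports the metrics listed above; the data and hyperparameters are placeholders, not the project's tuned configuration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Illustrative imbalanced data in place of the processed credit dataset
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("AUC-ROC  :", roc_auc_score(y_test, y_prob))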
To ensure the models are transparent and interpretable, we use SHAP and LIME for model explainability, which helps clarify how different features affect predictions. Finally, the results are assessed and conclusions are drawn that provide practical recommendations for mitigating the risk of customer default.
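The sketch below outlines how SHAP and LIME can be applied to a trained model; it uses an XGBoost classifier on synthetic data, and the feature and class names are illustrative assumptions.
import shap
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from lime.lime_tabular import LimeTabularExplainer

# Synthetic data and placeholder feature names for illustration only
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier().fit(X_train, y_train)

# SHAP: global view of how each feature pushes predictions up or down
tree_explainer = shap.TreeExplainer(model)
shap_values = tree_explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

# LIME: local explanation for a single customer's prediction
lime_explainer = LimeTabularExplainer(
    X_train, feature_names=feature_names,
    class_names=["no default", "default"], mode="classification")
explanation = lime_explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
print(explanation.as_list())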
Workflow and Methodology
- Data Collection and Preprocessing:
- The project begins by gathering and preparing the dataset for analysis.
- Missing values are handled appropriately, categorical variables are encoded using Label Encoding, and skewed features are transformed using the Box-Cox technique.
- Feature Engineering:
- New features are created to capture meaningful interactions among the existing ones, such as combining the delinquency columns and deriving metrics like Debt Ratio and Revolving Utilization.
- Class imbalance is then moderated with SMOTE (to over-sample the under-represented class) and random down-sampling (of the over-represented class).
- Model Selection and Training:
- Several classifiers, including Random Forest, XGBoost, Logistic Regression, and LightGBM, are selected for the task.
- Each model is trained on the preprocessed data with hyperparameters tuned for the best fit, while class-balancing methods such as class weights or SMOTE are applied to improve predictions.
- Model Evaluation:
- Each fitted model is then evaluated on standard metrics, namely accuracy, precision, recall, F1 score, and AUC-ROC, to gauge its effectiveness.
- Cross-validation is also performed to check how robust the models are; the sketch after this list illustrates class weighting together with stratified cross-validation.
- Model Explainability:
- The predictions and the contribution of each feature are then explained using SHAP and LIME, which show how different features influence the predictions and make the models more transparent.
- Performance Evaluation:
- The models are compared on the evaluation metrics, and the best-performing ones are selected for deployment or further improvement.
- Conclusion and Recommendations:
- The analysis gives a picture of which models and strategies seem to yield the best results for predicting customer defaults.
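As referenced in the workflow above, the sketch below illustrates class weighting combined with stratified cross-validation on synthetic data; the model and its settings are illustrative, not the project's final configuration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative imbalanced data standing in for the credit dataset
X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=1)

# class_weight='balanced' re-weights errors on the rare default class
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Stratified folds keep the default rate consistent across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("AUC-ROC per fold:", scores.round(3))
print("Mean AUC-ROC:", scores.mean().round(3))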
Data Collection and Preparation
Data Collection:
In this project, we collected the dataset from a public repository. If you are looking to work on a real-world problem, you can get these kinds of datasets from publicly available repositories such as Kaggle, UCI Machine Learning Repository, or company-specific data. We will provide the dataset in this project so that you can work on the same dataset.
Data Preparation Workflow:
- Loading the Dataset: Load the dataset and examine its contents to understand the data types and descriptive statistics.
- Handling Missing Values: Assess the extent of missing data and impute or drop values as appropriate, depending on the nature and distribution of the data.
- Handling Categorical Variables: Apply Label Encoding or One-Hot Encoding to represent categorical variables as numerical variables.
- Feature Engineering: Derive additional variables from the existing ones, such as financial ratios that are considered important.
- Addressing Class Imbalance: Apply SMOTE, down-sampling, or class-weight adjustments to remedy the disproportionate classes.
- Data Transformation: Adjust skewed distributions by normalizing or standardizing the data or applying the Box-Cox transformation.
- Dataset Partitioning: Use either train_test_split or cross-validation to partition the available data into training and test sets, as shown in the sketch after this list.
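The following sketch illustrates the encoding and partitioning steps from this list; the toy DataFrame and its column names are hypothetical placeholders for the project's dataset.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Toy data with hypothetical columns standing in for the real dataset
df = pd.DataFrame({
    "employment_type": ["salaried", "self-employed", "salaried", "unemployed"] * 25,
    "monthly_income":  [4500, 6200, 3100, 0] * 25,
    "default":         [0, 0, 1, 1] * 25,
})

# Label-encode the categorical column in place
encoder = LabelEncoder()
df["employment_type"] = encoder.fit_transform(df["employment_type"])

# Stratified split preserves the default/non-default ratio in both sets
X = df.drop(columns="default")
y = df["default"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)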
Code Explanation
STEP 1:
Mounting Google Drive
First, mount Google Drive to access the dataset that is stored in the cloud.
from google.colab import drive
drive.mount('/content/drive')
Library Installation
This piece of code installs several libraries: SHAP for explaining the model, LIME for explaining individual predictions, Keras for building deep learning models, XGBoost and LightGBM for gradient boosting, and imbalanced-learn for handling imbalanced datasets.
!pip install shap
!pip install lime
!pip install keras
!pip install xgboost
!pip install lightgbm
!pip install imblearn
Ignore Warning
The filterwarnings('ignore') function prevents any warnings from being shown during the execution of the program. This can be useful when you don't want the warnings to clutter the output, but keep in mind that ignoring warnings can sometimes hide important information about potential issues in your code.
# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
Importing necessary Libraries
This code imports the libraries needed for the rest of the program: SHAP and LIME for explainability, math and SciPy for statistical calculations, Keras and TensorFlow for building deep learning models, NumPy and Pandas for data handling, XGBoost and LightGBM for gradient boosting, scikit-learn and imbalanced-learn for preprocessing, modeling, resampling, and evaluation, and Matplotlib and Seaborn for plotting.
import shap
import math
import keras
import numpy as np
import pandas as pd
import xgboost as xgb
import seaborn as sns
import tensorflow as tf
import keras.backend as K
import matplotlib.pyplot as plt
from keras import models
from keras import layers
from sklearn.svm import SVC
from scipy.stats import skew
from matplotlib import pyplot
from collections import Counter
from scipy.stats import kurtosis
from scipy import stats, special
from xgboost import XGBClassifier
from sklearn.utils import resample
from lightgbm import LGBMClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from lime.lime_tabular import LimeTabularExplainer
from sklearn.metrics import precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import KFold,StratifiedKFold
from keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import mean_squared_error, accuracy_score, confusion_matrix, roc_curve, auc, classification_report, recall_score, precision_score, f1_score, roc_auc_score
STEP 2:
Loading Data and Checking Dimensions:
This code loads the CSV file and then prints the dataset's shape to check the number of rows and columns.
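A minimal sketch of this step, assuming a hypothetical file path for wherever the dataset is stored in Google Drive:
import pandas as pd

# Hypothetical path; replace with the dataset's actual location in Drive
df = pd.read_csv('/content/drive/MyDrive/credit_default.csv')
print(df.shape)  # (number of rows, number of columns)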