Loan Eligibility Prediction using Gradient Boosting Classifier
Loan eligibility prediction is an essential tool for banks and financial organizations, helping them decide whether an applicant is worthy of being issued a loan. With this kind of prediction system, lenders can minimize risk while making sound decisions much more quickly. Given the ever-increasing demand for loans, institutions need a better way to predict applicant eligibility. The benefit goes to applicants just as much as lenders: the process becomes easier, faster, and less complicated.
Project Overview:
This project guides the learner through creating a loan eligibility prediction model. The model uses income, credit score, loan amount, and applicant background as its data points. Machine learning is applied so that the model learns patterns from historical data to correctly forecast future eligibility. The work involves data processing with well-known libraries such as Pandas and model building with Scikit-learn. Common steps at each stage include data cleaning and feature selection, as well as model training and assessment. This approach brings order, reliability, and efficiency to the lending process, benefiting both lenders and borrowers.
This guide is your one-stop source for loan eligibility prediction, explained simply and in a manner you can easily follow.
Prerequisites
Before starting with loan eligibility prediction, you should have the following knowledge and tools:
- Knowledge of basic Python programming, including its core data structures.
- Understanding of model training and of assessing performance with different metrics.
- Familiarity with cleaning, filtering, and reshaping data with Pandas.
- Statistical knowledge such as averages, variance, and related measures.
- A Jupyter Notebook or Google Colab environment where coding and visualization will be done.
- All required packages installed, such as Pandas, NumPy, and Scikit-learn (see the install command after this list).
- Familiarity with Matplotlib or Seaborn to visualize noticeable trends in the data.
- A basic understanding of how models make predictions from data.
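If any of these packages are missing from your Colab or Jupyter environment, they can typically be installed with pip (package names assumed as published on PyPI):
!pip install pandas numpy scikit-learn matplotlib seaborn xgboost imbalanced-learn joblib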
Approach
When building a loan eligibility prediction system, following a careful, step-wise process helps avoid biases and inaccuracies. First, data is collected and examined: looking for patterns and checking for missing values and outliers. The data is then cleaned by addressing null values and encoding categorical data as numerical representations. Once cleaning is done, we carry out feature selection, keeping only significant predictors such as income, credit score, and loan amount that determine eligibility.
Next, we split the data into training and testing sets, train the models on the training set, and evaluate them on the held-out test set. This improves the model's performance as well as its ability to generalize to new data. Finally, deploying the model enables fast decision-making for lenders, offering both the institution and the applicant a smooth experience.
Workflow and Methodology
Here's a step-by-step workflow you'll follow to build a successful loan eligibility prediction model:
- Data Collection and Loading: Begin by downloading the loan eligibility dataset from Kaggle and loading it into a Pandas DataFrame.
- Data Cleaning: To improve data quality, detect and handle missing values, ensure correct data types, and deal with outliers.
- Exploratory Data Analysis (EDA): By applying EDA, you can understand the distribution of data and its prominent features.
- Data Preprocessing: You have to scale the numerical data and convert the categorical data into numeric data for better model training.
- Model Selection: Use classification models as this is a classification task.
- Model Training: Train all models with the cleaned and prepared data.
- Model Evaluation: Compare the models using metrics such as precision, recall, F1-score, and the ROC curve with its AUC.
- Hyperparameter Tuning: Optimize model parameters to improve prediction accuracy.
Data Collection and Preparation
Data collection
The Loan Eligibility dataset is available on Kaggle. You can conveniently and securely access a Kaggle dataset from within Google Colab after configuring your Kaggle credentials, so that sensitive information is not exposed. The idea is to collect the Kaggle API key and username securely and assign them as environment variables. This enables the Kaggle CLI, which authenticates the user and downloads the dataset straight into Colab.
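A minimal sketch of that credential setup, assuming a Colab notebook; the dataset slug below is a placeholder rather than the actual Kaggle path used here (in the following steps the CSV is simply read from Google Drive):
import os
from getpass import getpass
# Collect the Kaggle credentials securely and expose them as environment variables.
os.environ['KAGGLE_USERNAME'] = getpass('Kaggle username: ')
os.environ['KAGGLE_KEY'] = getpass('Kaggle API key: ')
# Download and unzip the dataset into the current working directory (placeholder slug).
!kaggle datasets download -d owner/loan-eligibility-dataset --unzip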
Data Preparation
Data pre-processing refers to cleaning and formatting raw data to prepare it for analysis and model development. This stage deals with missing values, encodes categorical features, and scales numerical features to make the dataset ready for modeling.
Data preparation workflow
- Data Cleaning: Handle missing values with the median or mode, then convert data types into the correct formats.
- Outlier Management: Detect and treat outliers using statistical methods like the IQR rule for better model performance.
- Feature Engineering: Transform categorical variables with label encoding or one-hot encoding. Create additional features if they can improve model performance.
- Scaling and Normalization: Use StandardScaler to normalize numeric columns.
- Data Splitting: Split data into training and testing sets to prepare for model training.
Code Explanation
STEP 1:
Mounting Google Drive
First, mount Google Drive to access the dataset that is stored in the cloud.
from google.colab import drive
drive.mount('/content/drive')
This code block first suppresses warning messages with warnings.filterwarnings("ignore") to keep the output as clean as possible. It then imports libraries and modules for data manipulation (pandas, numpy), data visualization (matplotlib, seaborn), and machine learning (sklearn, xgboost, imblearn). It also patches sys.modules with six to resolve compatibility issues with older sklearn import paths. The %matplotlib inline command plots figures within the Colab notebook rather than in a separate window.
import warnings
warnings.filterwarnings("ignore")
import six
import sys
sys.modules['sklearn.externals.six'] = six
import os
import joblib
import operator
import statistics
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import metrics
from matplotlib import pyplot
import sklearn.neighbors._base
from scipy.stats import boxcox
import matplotlib.pyplot as plt
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier
from sklearn import preprocessing
from xgboost import plot_importance
from sklearn.metrics import roc_curve
from sklearn.utils import _safe_indexing
from imblearn.over_sampling import SMOTE
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelBinarizer,StandardScaler,OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split,GridSearchCV,cross_val_score
from sklearn.linear_model import LogisticRegression,RidgeClassifier, PassiveAggressiveClassifier
%matplotlib inline
This ensures the smooth execution of code that relies on the older sklearn structure without modifying the source code.
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
sys.modules['sklearn.utils.safe_indexing'] = sklearn.utils._safe_indexing
STEP 2:
Load the dataset
This line of code loads data from Google Drive into the pandas DataFrame.
#Importing the datasets
data =pd.read_csv("/content/drive/MyDrive/Aionlinecourse_badhon/Project/Loan Eligibility Prediction using Gradient Boosting Classifier/LoansTrainingSetV2.csv")
This code displays the dataset's structure (via info()) and its first few rows (via head()), giving you a quick overview of the data.
data_info = data.info()
data_head = data.head()
data_info, data_head
This code shows statistical descriptions such as count, mean, standard deviation, and percentiles for the numerical columns, helping you understand their ranges and distributions.
data.describe()
The code provides a summary of the dataset by calculating some statistics for numerical columns and providing counts for the categorical ones. This helps in understanding the distribution and compositions of the numerical and qualitative data attributes.
# Numerical columns
numerical_summary = data.describe()
# Categorical columns
categorical_columns = data.select_dtypes(include=['object']).columns
categorical_summary = {col: data[col].value_counts() for col in categorical_columns}
numerical_summary, categorical_summary
This line removes the Loan ID and Customer ID columns from the dataset, because these identifier columns carry no predictive value for analysis or model training.
# Drop unnecessary columns (e.g., IDs if not useful for analysis)
data.drop(columns=['Loan ID', 'Customer ID'], inplace=True)
This line of code checks if there are any null values present in each feature.
# Data Cleaning
data.isnull().sum()
The code block fills missing values in a dataset with representative values, using median values for numerical columns and the most frequent value for categorical data, ensuring data integrity.
# Handling missing values
data['Credit Score'].fillna(data['Credit Score'].median(), inplace=True)
data['Annual Income'].fillna(data['Annual Income'].median(), inplace=True)
data['Bankruptcies'].fillna(data['Bankruptcies'].median(), inplace=True)
data['Months since last delinquent'].fillna(data['Months since last delinquent'].median(), inplace=True)
data['Years in current job'].fillna(data['Years in current job'].mode()[0], inplace=True)
data['Tax Liens'].fillna(data['Tax Liens'].median(), inplace=True)
The code ensures data consistency by converting the columns for Maximum Open Credit and Monthly Debt to numeric data types, making it easier to use them as numerical features in computations and analysis.
data['Monthly Debt'] = pd.to_numeric(data['Monthly Debt'], errors='coerce')
data['Maximum Open Credit'] = pd.to_numeric(data['Maximum Open Credit'], errors='coerce')
STEP 3:
Univariate Column Analysis
Current Loan Amount
This line of code shows the statistical overview of the Current Loan Amount column.
data['Current Loan Amount'].describe()
The code creates a histogram with a KDE overlay to show the Current Loan Amount feature. It helps in spotting patterns, spreads, and shapes, and it points out outliers for more analysis or to enhance model performance.
# Distribution of Current Loan Amount
plt.figure(figsize=(10, 6))
sns.histplot(data['Current Loan Amount'], bins=30, kde=True)
plt.title('Distribution of Current Loan Amount')
plt.xlabel('Current Loan Amount')
plt.ylabel('Frequency')
plt.show()
The code generates a box plot of the Current Loan Amount to determine the outliers and illustrates the distribution and the median. This helps in finding out extreme values that could be causing some impact that should be reduced before training the models.
# Box plot to check for outliers
plt.figure(figsize=(10, 6))
sns.boxplot(x=data['Current Loan Amount'])
plt.title('Box Plot of Current Loan Amount')
plt.show()
The code shows the use of the Interquartile Range(IQR) for outlier detection in the Current Loan Amount feature. It calculates the 25th and the 75th percentiles of the data (Q1 and Q3) and creates a band around the middle 50% of values. Using the IQR rule, it defines limits beyond which values are considered outliers. Any values that fall outside of these limits are considered extreme values. This technique is an effective method for identifying and addressing such values.
Q1 = data['Current Loan Amount'].quantile(0.25)
Q3 = data['Current Loan Amount'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
The next line quickly locates extreme values in the dataset that fall outside the upper and lower limits. The tail() call then shows the last few of these outliers, giving a picture of the extreme values for further analysis or possible removal.
data["Current Loan Amount"][((data["Current Loan Amount"] < (Q1 - 1.5 * IQR)) |(data["Current Loan Amount"] > (Q3 + 1.5 * IQR)))].tail()
This code addresses an extreme placeholder value: 99999999 is replaced with NaN, and the resulting missing values are then filled with the median of Current Loan Amount.
# Replace with NaN for imputation
data["Current Loan Amount"].replace(99999999, np.nan, inplace=True)
# Impute with median
data["Current Loan Amount"].fillna(data["Current Loan Amount"].median(), inplace=True)
The code creates a histogram with a KDE overlay to show the Current Loan Amount feature. It helps in spotting patterns, spreads, and shapes, and it points out outliers for more analysis or to enhance model performance after the preprocessing.
# Distribution of Current Loan Amount
plt.figure(figsize=(10, 6))
sns.histplot(data['Current Loan Amount'], bins=30, kde=True)
plt.title('Distribution of Current Loan Amount')
plt.xlabel('Current Loan Amount')
plt.ylabel('Frequency')
plt.show()
Credit Score
This line of code shows the statistical overview of the Credit Score column.
data['Credit Score'].describe()
The code creates a histogram with a KDE overlay to show the Credit Score feature. It helps in spotting patterns, spreads, and shapes, and it points out outliers for more analysis or to enhance model performance.
# Distribution of Credit Score
plt.figure(figsize=(10, 6))
sns.histplot(data['Credit Score'], bins=30, kde=True)
plt.xlabel('Credit Score')
plt.ylabel('Frequency')
plt.show()
The code creates a distribution plot with a KDE overlay to show the Credit Score feature. It allows for an assessment of the data's spread, central tendency, and overall shape. This plot helps identify patterns, skewness, and any potential outliers in the Credit Score data.
sns.distplot(data["Credit Score"])
This code caps extreme values in the Credit Score column by setting limits at the 5th and 95th percentiles. Values above or below these thresholds are adjusted to the nearest bound, minimizing the impact of outliers. It is then checked using a box plot that extreme values have been minimized.
# Define the upper and lower bounds for capping
upper_bound = data['Credit Score'].quantile(0.95)
lower_bound = data['Credit Score'].quantile(0.05)
# Cap the values in the 'Credit Score' column
data['Credit Score'] = data['Credit Score'].apply(lambda x: min(x, upper_bound))
data['Credit Score'] = data['Credit Score'].apply(lambda x: max(x, lower_bound))
# Verify the capping by checking the distribution again
plt.figure(figsize=(8, 6))
sns.boxplot(x=data['Credit Score'])
plt.title('Box Plot of Credit Score (After Capping)')
plt.show()
Years in Current Job
This line displays all unique “Years in current job” column values. This helps analyze the dataset's job tenure values and discover inconsistencies or categories that may need further processing or encoding.
data['Years in current job'].unique()
This code converts the “Years in current job” column from text to numbers, assigning integers to represent employment durations. This change helps the model measure job tenure through numerical analysis.
data['Years in current job'] = data['Years in current job'].replace({'< 1 year': 0, '1 year': 1, '2 years': 2,
'3 years': 3, '4 years': 4, '5 years': 5,
'6 years': 6, '7 years': 7, '8 years': 8,
'9 years': 9, '10+ years': 10}) \
.astype(int)
The code creates a histogram showing the "Years in current job" feature. It helps in spotting patterns, spreads, and shapes, and it points out outliers for more analysis or to enhance model performance.
plt.figure(figsize=(10, 6))
sns.histplot(data['Years in current job'], bins=11, kde=False, color='skyblue')
plt.title('Distribution of Years in Current Job')
plt.xlabel('Years in Current Job')
plt.ylabel('Frequency')
plt.show()
Annual Income
This line of code shows the statistical overview of the Annual Income column.
data['Annual Income'].describe()
This code block provides a detailed visualization of the Annual Income feature to analyze its distribution and detect outliers.
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.distplot(data['Annual Income'], color='skyblue')
plt.title('Distribution of Annual Income')
plt.xlabel('Annual Income')
plt.ylabel('Frequency')
plt.grid(True)
plt.subplot(1, 2, 2)
sns.boxplot(y=data['Annual Income'], color='lightcoral')
plt.title('Box Plot of Annual Income')
plt.ylabel('Annual Income')
plt.grid(True)
plt.tight_layout()
plt.show()
It finds the quantiles for the Annual Income column, showing how income is spread out, how it clusters, and what extreme values exist. This helps with spotting skewness and making decisions about what to do with outliers, or even normalizing the data.
data['Annual Income'].quantile([.2,0.5,0.75,0.90,.95,0.99,.999])
This code limits the values in the Annual Income column to the 95th percentile, so any income that exceeds this limit gets replaced with the maximum value. This helps lessen the effect of high-income outliers, leading to a more even distribution that enhances model performance.
upper_cap = data['Annual Income'].quantile(0.95)
data['Annual Income'] =data['Annual Income'].apply(lambda x: min(x, upper_cap))
The code creates a distribution plot with a KDE overlay to show the Annual Income feature. It helps in spotting patterns, spreads, and shapes, and it points out outliers for more analysis or to enhance model performance.
plt.figure(figsize=(12, 6))
sns.distplot(data['Annual Income'], color='skyblue')
plt.title('Distribution of Annual Income')
plt.xlabel('Annual Income')
plt.ylabel('Frequency')
plt.grid(True)
Monthly Debt
This line of code shows the statistical overview of the Monthly Debt column.
data['Monthly Debt'].describe()
It finds the quantiles for the Monthly Debt column, showing how the debt values are spread out, how they cluster, and what extreme values exist. This helps with spotting skewness and making decisions about what to do with outliers, or even normalizing the data.
data['Monthly Debt'].quantile([0.25,0.5,0.75,0.90,.95,0.97,0.98,0.99,.999])
This code creates a boxplot to check any outliers to understand the feature.
plt.figure(figsize=(10, 6))
sns.boxplot(x=data['Monthly Debt'])
plt.title('Box Plot of Monthly Debt')
plt.show()
This code limits the values in the Monthly Debt column to the 99th percentile, so any debt that exceeds this limit gets replaced with the maximum value. This helps lessen the effect of high-debt outliers, leading to a more even distribution that enhances model performance.
upper_cap = data['Monthly Debt'].quantile(0.99)
data['Monthly Debt'] = data['Monthly Debt'].apply(lambda x: min(x, upper_cap))
The code creates a distribution plot with a KDE overlay to show the Monthly Debt feature. It checks if the outliers are handled properly.
plt.figure(figsize=(12, 6))
sns.distplot(data['Monthly Debt'], color='skyblue')
plt.title('Distribution of Monthly Debt')
plt.xlabel('Monthly Debt')
plt.ylabel('Frequency')
plt.grid(True)
This code takes care of any unrecorded values in the Monthly Debt field by substituting them with the column's median. The median provides a robust central estimate that is less affected by outliers, which keeps the information intact and suitable for further analysis and model training.
data['Monthly Debt'].fillna(data['Monthly Debt'].median(), inplace=True)
Years of Credit History
This line of code shows the statistical overview of the Years of Credit History column.
data['Years of Credit History'].describe()
The code creates a distribution plot with a KDE overlay to show the Years of Credit History feature. It helps in spotting patterns, spreads, and shapes, and it points out outliers for more analysis or to enhance model performance.
plt.figure(figsize=(10, 6))
sns.distplot(data['Years of Credit History'], color='skyblue', hist_kws=dict(edgecolor="k", linewidth=2))
plt.title('Distribution of Years of Credit History')
plt.xlabel('Years of Credit History')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
Number of Open Accounts
This line of code shows the statistical overview of the Number of Open Accounts column.
data['Number of Open Accounts'].describe()
The code creates a distribution plot with a KDE overlay to show the “Number of Open Accounts” feature. It helps in spotting patterns, spreads, and shapes, and it points out outliers for more analysis or to enhance model performance.
plt.figure(figsize=(10, 6))
sns.distplot(data['Number of Open Accounts'], color='skyblue', hist_kws=dict(edgecolor="k", linewidth=2))
plt.title('Distribution of Number of Open Accounts')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
It finds the quantiles for the “Number of Open Accounts” column, showing how numbers are spread out, how they cluster, and what extreme values exist. This helps with spotting skewness and making decisions about what to do with outliers, or even normalizing the data.
data['Number of Open Accounts'].quantile([0.25,0.5,0.75,0.999,1])
This code caps the Number of Open Accounts column at 36, so values over 36 are replaced with 36. This minimizes the impact of extreme values. A KDE-overlaid histogram then shows how often distinct open account counts appear in the dataset and how they are spread.
data.loc[data['Number of Open Accounts'] > 36, 'Number of Open Accounts'] = 36
plt.figure(figsize=(10, 6))
sns.distplot(data['Number of Open Accounts'], color='skyblue', hist_kws=dict(edgecolor="k", linewidth=2))
plt.title('Distribution of Number of Open Accounts')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
plt.figure(figsize=(10, 6))
sns.histplot(data['Years of Credit History'], bins=20, kde=True)
plt.title('Distribution of Years of Credit History')
plt.xlabel('Years of Credit History')
plt.ylabel('Frequency')
plt.show()
Maximum Open Credit
This line of code shows the statistical overview of the Maximum Open Credit column.
data['Maximum Open Credit'].describe()
This code checks if there are missing values in the Maximum Open Credit column
data['Maximum Open Credit'].isnull().sum()
It finds the quantiles for the Maximum Open Credit column, showing how numbers are spread out, how they cluster, and what extreme values exist. This helps with spotting skewness and making decisions about what to do with outliers, or even normalizing the data.
data['Maximum Open Credit'].quantile([.25,.50,.75,.90,.95,.99,.999])
This code creates a boxplot to check any outliers to understand the feature.
plt.figure(figsize=(12, 8))
sns.boxplot(x=data['Maximum Open Credit'], palette="Set3") # Use a colorful palette
plt.title('Box Plot of Maximum Open Credit', fontsize=16)
plt.xlabel('Maximum Open Credit', fontsize=14)
plt.ylabel('')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
The code below handles outliers in the Maximum Open Credit column by setting all values outside the computed bounds to NaN (note that the upper quartile here is taken at the 95th percentile rather than the usual 75th). The NaN values are then replaced by the median, making the distribution more stable, and the box plot confirms that the extreme values have been handled.
Q1 = data['Maximum Open Credit'].quantile(0.25)
Q3 = data['Maximum Open Credit'].quantile(0.95)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Replace outliers with NaN for imputation
data.loc[(data['Maximum Open Credit'] < lower_bound) | (data['Maximum Open Credit'] > upper_bound), 'Maximum Open Credit'] = np.nan
# Impute the missing values (NaNs) using the median
data['Maximum Open Credit'].fillna(data['Maximum Open Credit'].median(), inplace=True)
plt.figure(figsize=(12, 8))
sns.boxplot(x=data['Maximum Open Credit'], palette="Set3")
plt.title('Box Plot of Maximum Open Credit (After Handling Outliers)', fontsize=16)
plt.xlabel('Maximum Open Credit', fontsize=14)
plt.ylabel('')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
The code creates a distribution plot with a KDE overlay to show the Maximum Open Credit feature. It checks if the outliers are handled properly.
plt.figure(figsize=(10, 6))
sns.histplot(data['Maximum Open Credit'], bins=20, kde=True)
plt.title('Distribution of Maximum Open Credit')
plt.xlabel('Maximum Open Credit')
plt.ylabel('Frequency')
plt.show()
Bankruptcies
The code creates a distribution plot with a KDE overlay to show the “Bankruptcies” feature. It helps in spotting patterns, spreads, and shapes, and it points out outliers for more analysis or to enhance model performance.
plt.figure(figsize=(10, 6))
sns.distplot(data['Bankruptcies'], color='skyblue', kde=True)
plt.title('Distribution of Bankruptcies')
plt.xlabel('Bankruptcies')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
This code caps the values in the Bankruptcies column at the 99th percentile, so any value that exceeds this limit gets replaced with the cap. This lessens the effect of extreme values, leading to a more even distribution that enhances model performance.
# Calculate the 99th percentile
upper_cap = data['Bankruptcies'].quantile(0.99)
data['Bankruptcies'] = data['Bankruptcies'].apply(lambda x: min(x, upper_cap))
Number of Credit Problems
This line of code shows the statistical overview of the Number of Credit Problems column.
data['Number of Credit Problems'].describe()
Current Credit Balance
The code creates a distribution plot with a KDE overlay to show the Current Credit Balance feature. It helps in spotting patterns, spreads, and shapes, and it points out outliers for more analysis or to enhance model performance.
plt.figure(figsize=(12, 8))
sns.displot(data['Current Credit Balance'], kde=True, bins=30, color='skyblue')
plt.title('Distribution of Current Credit Balance', fontsize=16)
plt.xlabel('Current Credit Balance', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
It finds the quantiles for the Current Credit Balance column, showing how numbers are spread out, how they cluster, and what extreme values exist. This helps with spotting skewness and making decisions about what to do with outliers, or even normalizing the data.
data['Current Credit Balance'].quantile([0.55,0.76,0.87,0.98,0.99,1])
This code caps Current Credit Balance values at 81,007, replacing anything above this limit to reduce outliers. A histogram with a KDE overlay of the square-root-transformed values then shows the spread and frequency of credit balances after capping.
data.loc[data['Current Credit Balance'] > 81007, 'Current Credit Balance'] = 81007
plt.figure(figsize=(12, 8))
sns.displot(data['Current Credit Balance']**(1/2), kde=True, bins=30, color='skyblue')
plt.title('Distribution of Current Credit Balance (After Handling Outliers)', fontsize=14)
plt.xlabel('Current Credit Balance', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
This creates histograms with KDE overlays for each numeric feature in the dataset, arranged in a tidy grid. The visualization makes it possible to view the spread, skewness, and outliers of all numerical features at a glance.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.figure(figsize=(18, 16))
numerical_features = data.select_dtypes(include=np.number).columns.tolist()
num_features = len(numerical_features)
num_cols = 3
num_rows = int(np.ceil(num_features / num_cols))
# Create subplots for each numerical feature
for i, feature in enumerate(numerical_features, 1):
plt.subplot(num_rows, num_cols, i) # Dynamic subplot grid
sns.histplot(data[feature], bins=30, kde=True, color='skyblue')
plt.title(f'Distribution of {feature}')
plt.xlabel(feature)
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
STEP 4:
Categorical Value Analysis
Home Ownership
For each categorical feature in the dataset, this code prints the column name, the frequency of each category, and the unique values. This overview helps in understanding the distribution and variety within categorical features and supports decisions about encoding or data cleaning.
cat=data.select_dtypes(include=['object']).columns.tolist()
for c in cat:
print(c)
print(f'{c} : {data[c].value_counts()}')
print(data[c].unique())
print()
First, the frequency of each category in the Home Ownership column is counted, and then the category HaveMortgage is standardized by replacing it with Home Mortgage. This change simplifies analysis, reduces inconsistency, and keeps the Home Ownership categories uniform.
data['Home Ownership'].value_counts()
data['Home Ownership'] = data['Home Ownership'].replace({'HaveMortgage': 'Home Mortgage'})
This code makes a pie chart of the percentage distribution of each category in the Home Ownership column, giving a visual summary of the proportion of different home-ownership types in the dataset.
home_ownership_counts = data['Home Ownership'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(home_ownership_counts, labels=home_ownership_counts.index, autopct='%1.1f%%', startangle=90, colors=['skyblue', 'lightcoral', 'lightgreen'])
plt.title('Home Ownership Distribution')
plt.axis('equal')
plt.show()
This code makes the text in the Purpose column uniform by changing 'other' to 'Other'. It then shows the unique values in the Purpose feature, making it easy to check that the text is formatted consistently.
data['Purpose']=data['Purpose'].str.replace('other', 'Other', regex=True)
data['Purpose'].unique()
The following code creates a count plot on the Purpose column plotting the frequency of each loan purpose category. It allows us to easily find out the most and least common loan purposes in the dataset.
plt.figure(figsize=(10, 6))
sns.countplot(x='Purpose', data=data, palette='viridis')
plt.title('Distribution of Loan Purposes')
plt.xlabel('Loan Purpose')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.show()
This visualization provides a comprehensive overview of the distribution across multiple categorical features, helping to identify category prevalence and patterns in the dataset.
# Set up the matplotlib figure for categorical features
plt.figure(figsize=(15, 12))
# List of categorical features to visualize
categorical_features = ['Loan Status', 'Term', 'Home Ownership', 'Purpose']
# Create subplots for each categorical feature
for i, feature in enumerate(categorical_features, 1):
plt.subplot(2, 2, i)
sns.countplot(data=data, x=feature, order=data[feature].value_counts().index, palette='pastel')
plt.title(f'Distribution of {feature}')
plt.xlabel(feature)
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
These visualizations reveal potential relationships between loan status and both income and credit score, possibly revealing eligibility patterns.
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
sns.boxplot(x='Loan Status', y='Credit Score', data=data, palette='Set2')
plt.title('Loan Status vs. Credit Score')
plt.xlabel('Loan Status')
plt.ylabel('Credit Score')
plt.xticks(rotation=45)
# Second subplot: Loan Status vs. Annual Income
plt.subplot(1, 2, 2)
sns.boxplot(x='Loan Status', y='Annual Income', data=data, palette='Set1')
plt.title('Loan Status vs. Annual Income')
plt.xlabel('Loan Status')
plt.ylabel('Annual Income')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
This matrix reveals strong positive or negative relationships between features, giving an idea of which features may influence model performance or guide feature selection.
# Calculate the correlation matrix for numeric columns related to loan factors
loan_factors = data.select_dtypes(include=['float64', 'int64'])  # Numeric columns only
correlation_matrix = loan_factors.corr()
# Plotting the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Correlation Matrix of Loan Factors")
plt.show()
This plot provides us with a visual of any patterns or separation between loan statuses by showing us how income and credit balance might affect loan eligibility.
# Scatter plot for Annual Income vs. Current Credit Balance, colored by Loan Status
plt.figure(figsize=(12, 8))
sns.scatterplot(x='Annual Income', y='Current Credit Balance', hue='Loan Status', data=data, palette='Set1', alpha=0.6)
plt.title('Annual Income vs Current Credit Balance by Loan Status')
plt.xlabel('Annual Income')
plt.ylabel('Current Credit Balance')
plt.legend(title='Loan Status')
plt.show()
STEP 5:
This code takes care of encoding categorical features and the target variable, scaling numerical columns, and splitting the data into features (X) and target (y). Label encoding turns categorical data into numbers, and scaling helps to standardize numerical features, which gets the dataset ready for effective and precise model training.
# Encode categorical features and the target variable
categorical_cols = data.select_dtypes(include='object').columns.tolist()
numeric_cols = data.select_dtypes(include='number').columns.tolist()
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
label_encoder = LabelEncoder()
for col in categorical_cols:
data[col] = label_encoder.fit_transform(data[col])
# Encode target variable
data['Loan Status'] = label_encoder.fit_transform(data['Loan Status'])
# Scale the data
scaler = StandardScaler()
data[numeric_cols] = scaler.fit_transform(data[numeric_cols])
# Separate features and target
X = data.drop('Loan Status', axis=1)
y = data['Loan Status']
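Pipeline and ColumnTransformer are imported above but not used; as an optional, minimal sketch (assuming the same categorical_cols and numeric_cols lists, and using OrdinalEncoder in place of the per-column LabelEncoder), the same preprocessing could be bundled like this:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
# Alternative sketch: bundle scaling and encoding instead of applying them column by column.
feature_cats = [c for c in categorical_cols if c != 'Loan Status']  # exclude the target
preprocess = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_cols),
    ('cat', OrdinalEncoder(), feature_cats)
])
preprocess_pipeline = Pipeline(steps=[('preprocess', preprocess)])
# X_alt = preprocess_pipeline.fit_transform(data.drop('Loan Status', axis=1))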
The purpose of this code is to divide the dataset into two parts: 80% of the data for training the model, while 20% is held out to test the model after training, avoiding any data overlap.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Display preprocessed data overview
X_train.head(), y_train.head()
STEP 6:
Model Training
This code initializes a dictionary named models containing five different classification models: Logistic Regression, Ridge Classifier, Decision Tree, Gradient Boosting, and Random Forest. Each model name is the key, and an instance of the corresponding classification algorithm is the value.
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Ridge Classifier': RidgeClassifier(),
    'Decision Tree Classifier': DecisionTreeClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(),
    'Random Forest': RandomForestClassifier()
}
The evaluate_classifier function trains a model, computes performance metrics (accuracy, confusion matrix, classification report, AUC), and visualizes them. It displays a confusion matrix, the classification report, and an ROC curve (when probability estimates are available) side by side in a three-panel figure, providing a thorough assessment of performance.
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay
def evaluate_classifier(model, X_train, y_train, X_test, y_test, model_name="Classifier"):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None
    accuracy = accuracy_score(y_test, y_pred)
    confusion = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba) if y_proba is not None else None
    fpr, tpr, _ = roc_curve(y_test, y_proba) if auc else (None, None, None)
    # Display results in a row
    fig, axs = plt.subplots(1, 3, figsize=(18, 5))
    # Confusion Matrix
    ConfusionMatrixDisplay(confusion_matrix=confusion).plot(ax=axs[0], cmap="Blues", values_format='d')
    axs[0].set_title(f"{model_name} - Confusion Matrix")
    # Classification report shown as text
    axs[1].axis('off')  # Turn off axis
    report_text = f"Classification Report\n\n{class_report}\nAccuracy: {accuracy:.2f}"
    axs[1].text(0.5, 0.5, report_text, ha='center', va='center', fontsize=12, wrap=True)
    axs[1].set_title(f"{model_name} - Classification Report")
    # ROC curve (if probability estimates are available)
    if auc:
        axs[2].plot(fpr, tpr, label=f'AUC = {auc:.2f}')
        axs[2].plot([0, 1], [0, 1], 'k--')
        axs[2].set_xlabel("False Positive Rate")
        axs[2].set_ylabel("True Positive Rate")
        axs[2].set_title(f"{model_name} - ROC Curve")
        axs[2].legend(loc="lower right")
    else:
        axs[2].text(0.5, 0.5, "No AUC available", ha='center', va='center')
    plt.tight_layout()
    plt.show()
This code checks each model in the models dictionary by using evaluate_classifier, which trains and evaluates the model, and then shows important metrics visually. This organized loop makes it simple to compare different models, which helps in figuring out which one works best.
for model_name, model in models.items():
    print(f"Evaluating {model_name}...")
    evaluate_classifier(model, X_train, y_train, X_test, y_test, model_name=model_name)
This code builds a summary DataFrame of the key metrics (accuracy, AUC, confusion matrix) for each model from the evaluation_metrics dictionary (assembled in the following block), making it easy to compare model performance and choose the best model from a clear table.
model_summaries = {
model_name: {
"Accuracy": metrics.get("Accuracy"),
"AUC": metrics.get("AUC"),
"Confusion Matrix": metrics.get("Confusion Matrix")
}
for model_name, metrics in evaluation_metrics.items()
}
summary_df = pd.DataFrame({
"Model": model_summaries.keys(),
"Accuracy": [metrics["Accuracy"] for metrics in model_summaries.values()],
"AUC": [metrics["AUC"] for metrics in model_summaries.values()],
"Confusion Matrix": [metrics["Confusion Matrix"] for metrics in model_summaries.values()]
})
print("Model Evaluation Summary:")
print(summary_df)
This code trains all the models, calculates the accuracy, confusion matrix, and AUC, and stores them in a dictionary. It then plots the ROC curves with their AUC scores for all models on one figure, making performance comparison across models more visual.
evaluation_metrics = {}
for model_name, model in models.items():
print(f"Training and evaluating {model_name}...")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]) if hasattr(model, "predict_proba") else None
fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1]) if auc else (None, None, None)
# Store metrics
evaluation_metrics[model_name] = {
"Accuracy": accuracy,
"Confusion Matrix": confusion,
"AUC": auc,
"FPR": fpr,
"TPR": tpr
}
plt.figure(figsize=(10, 8))
for model_name, metrics in evaluation_metrics.items():
if metrics["AUC"] is not None:
plt.plot(metrics["FPR"], metrics["TPR"], label=f'{model_name} (AUC = {metrics["AUC"]:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC_Curves_for_All_Models")
plt.legend(loc="lower right")
plt.show()
STEP 7:
Handling Class Imbalance and Hyperparameter Tuning
The command below installs the imbalanced-learn library, which provides tools for handling imbalanced datasets.
!pip install imbalanced-learn
SMOTE is applied to the training data to balance the classes by generating synthetic samples for the minority class. After resampling, the class distribution is checked to confirm that it is now balanced, so that model performance is less affected by the original class imbalance.
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
y_resampled.value_counts()
This code splits the balanced dataset into 80% training and 20% testing sets.
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
This code trains and evaluates each model on the balanced dataset.
for model_name, model in models.items():
print(f"Evaluating {model_name}...")
evaluate_classifier(model, X_train, y_train, X_test, y_test, model_name=model_name)
This code trains all models on the balanced dataset, calculates the accuracy, confusion matrix, and AUC, and stores them in a dictionary. It then plots the ROC curves with their AUC scores for all models on one figure, making performance comparison across models more visual.
evaluation_metrics = {}
for model_name, model in models.items():
print(f"Training and evaluating {model_name}...")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]) if hasattr(model, "predict_proba") else None
fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1]) if auc else (None, None, None)
evaluation_metrics[model_name] = {
"Accuracy": accuracy,
"Confusion Matrix": confusion,
"AUC": auc,
"FPR": fpr,
"TPR": tpr
}
plt.figure(figsize=(10, 8))
for model_name, metrics in evaluation_metrics.items():
if metrics["AUC"] is not None:
plt.plot(metrics["FPR"], metrics["TPR"], label=f'{model_name} (AUC = {metrics["AUC"]:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC_Curves_for_All_Models")
plt.legend(loc="lower right")
plt.show()
This code creates a summary DataFrame of the key metrics (accuracy, AUC, confusion matrix) for each model for easy comparison of model performance. The table gives a clear overview for choosing the best model.
model_summaries = {
model_name: {
"Accuracy": metrics.get("Accuracy"),
"AUC": metrics.get("AUC"),
"Confusion Matrix": metrics.get("Confusion Matrix")
}
for model_name, metrics in evaluation_metrics.items()
}
summary_df = pd.DataFrame({
"Model": model_summaries.keys(),
"Accuracy": [metrics["Accuracy"] for metrics in model_summaries.values()],
"AUC": [metrics["AUC"] for metrics in model_summaries.values()],
"Confusion Matrix": [metrics["Confusion Matrix"] for metrics in model_summaries.values()]
})
print("Model Evaluation Summary:")
print(summary_df)
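Since the workflow lists hyperparameter tuning as a step, here is a minimal sketch of tuning the GradientBoostingClassifier with GridSearchCV (already imported above); the parameter grid values are illustrative assumptions rather than settings taken from this project, and cv=3 is chosen only to keep the search quick.
# Hedged sketch: tune Gradient Boosting over a small, assumed parameter grid.
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5]
}
grid_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    scoring='roc_auc',
    cv=3,
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validated AUC:", grid_search.best_score_)
# models['Gradient Boosting'] = grid_search.best_estimator_  # optionally use the tuned model below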
Finally, we evaluate each model and compare the metrics, pick the best one based on AUC (or accuracy when AUC is unavailable), save it to 'best_model.joblib', and compile a summary DataFrame of evaluation metrics for easy comparison.
best_model = None
best_score = 0
best_model_name = ""
evaluation_metrics = {}
for model_name, model in models.items():
print(f"Evaluating {model_name}...")
# Train and evaluate the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None
# Calculate accuracy and AUC if available
accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba) if y_proba is not None else None
# Store evaluation metrics for summary
evaluation_metrics[model_name] = {
"Accuracy": accuracy,
"AUC": auc,
"Confusion Matrix": confusion_matrix(y_test, y_pred)
}
current_score = auc if auc is not None else accuracy
if current_score > best_score:
best_score = current_score
best_model = model
best_model_name = model_name
if best_model is not None:
joblib.dump(best_model, 'best_model.joblib')
print(f"The best model is '{best_model_name}' with a score of {best_score:.2f}. Model saved as 'best_model.joblib'.")
summary_df = pd.DataFrame({
"Model": evaluation_metrics.keys(),
"Accuracy": [metrics["Accuracy"] for metrics in evaluation_metrics.values()],
"AUC": [metrics["AUC"] for metrics in evaluation_metrics.values()],
"Confusion Matrix": [metrics["Confusion Matrix"] for metrics in evaluation_metrics.values()]
})
print("Model Evaluation Summary:")
print(summary_df)
STEP 8:
Prediction
This code loads the best model using joblib
best_model = joblib.load('best_model.joblib')
This code loads a test dataset from a specified path into a DataFrame called test_data. It then displays the first few rows. This allows for a quick preview of the test data structure.
test_data_path = '/content/drive/MyDrive/Aionlinecourse_badhon/Project/Loan Eligibility Prediction using Gradient Boosting Classifier/test_data.csv'
test_data = pd.read_csv(test_data_path)
test_data.head()
This code handles missing values in the "Months since last delinquent" column by replacing NaN values with the median. It then drops the identifier columns, which do not hold much importance for the model.
test_data['Months since last delinquent'].fillna(test_data['Months since last delinquent'].median(), inplace=True)
test_data.drop(columns=['Loan ID', 'Customer ID'], inplace=True)
This code takes care of encoding categorical features and the target variable in test_data, scaling numerical columns, and splitting the data into features (X) and target (y). Label encoding turns categorical data into numbers, and scaling standardizes the numerical features, which gets the test dataset ready for prediction with the saved model.
categorical_cols = test_data.select_dtypes(include='object').columns.tolist()
numeric_cols = test_data.select_dtypes(include='number').columns.tolist()
label_encoder = LabelEncoder()
for col in categorical_cols:
test_data[col] = label_encoder.fit_transform(test_data[col])
test_data['Loan Status'] = label_encoder.fit_transform(test_data['Loan Status'])
scaler = StandardScaler()
test_data[numeric_cols] = scaler.fit_transform(test_data[numeric_cols])
X = test_data.drop('Loan Status', axis=1)
y = test_data['Loan Status']
This line uses the best_model to make predictions on the test dataset X.
predictions = best_model.predict(X)
predictions
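As a quick sanity check, a minimal sketch (assuming the encoded Loan Status column in y uses the same label mapping as the predictions) can compare the predicted classes with the labels in the test file and attach them to the DataFrame:
from sklearn.metrics import accuracy_score
# Compare predictions with the encoded Loan Status labels from the test file.
print("Accuracy on the test file:", accuracy_score(y, predictions))
# Attach the predicted class to the test DataFrame for inspection or export.
test_data['Predicted Loan Status'] = predictions
test_data.head()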
Project Conclusion
In this project, we used robust data pre-processing techniques, handled imbalanced data with SMOTE, and trained multiple machine learning classifiers (Logistic Regression, Ridge Classifier, Decision Tree, Gradient Boosting, and Random Forest) to develop a loan eligibility prediction model. Through feature scaling and label encoding of categorical data, we ensured that model training and testing were fed consistent, high-quality input.
We then evaluated the models on key metrics such as accuracy and AUC, identified the best-performing model, and saved it for future use. Finally, this model was applied to a separate test dataset to produce the final predictions.
This approach not only provides good model accuracy but also scales well, since the balanced dataset and the saved model can adapt more easily to new data in the future. The project shows how a reliable, scalable loan eligibility model can be built so that financial institutions can make loan decisions with greater confidence using data.
Challenges New Coders Might Face
Challenge: Handling Missing Data
Solution: Use imputation methods such as replacing missing values with the mean or median, or more advanced techniques such as K-nearest-neighbor (KNN) imputation (see the sketch after this list).
Challenge: Outliers in Numerical Data
Solution: Identify outliers with statistical methods (for example, the IQR rule) and then transform or remove them. Box plots help recognize outliers early during data cleaning.
Challenge: Dealing with Categorical Variables
Solution: Apply Label Encoding or One-Hot Encoding to the categorical variables. Label encoding is handy for ordinal data, while one-hot encoding is most suitable for categorical features that are not ordinal.
Challenge: Choosing the Right Model
Solution: Start with a simple linear baseline model, then move on to models like Random Forest and Gradient Boosting. Compare the models with accuracy or precision metrics on a validation set and choose the one that gives the best results after training.
Challenge: Hyperparameter Tuning for Optimization
Solution: Use Grid Search or Random Search to systematically find the optimal settings. These techniques automate the tuning process, which tends to enhance model performance with minimal effort.
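A minimal sketch of KNN imputation with scikit-learn's KNNImputer, as an alternative to the median imputation used in this project; df and the column list are hypothetical examples, and n_neighbors=5 is an assumed default:
from sklearn.impute import KNNImputer
import pandas as pd
# Hypothetical example: df is a DataFrame with missing numeric values.
num_cols = ['Credit Score', 'Annual Income', 'Monthly Debt']  # example columns
imputer = KNNImputer(n_neighbors=5)
df[num_cols] = pd.DataFrame(imputer.fit_transform(df[num_cols]),
                            columns=num_cols, index=df.index)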
FAQ
What is the objective of loan eligibility prediction using machine learning?
Machine-learning-based loan eligibility prediction involves training algorithms on historical loan data to decide whether a new applicant should be approved. With this model in place, financial institutions can make data-driven decisions using attributes such as credit score, annual income, and employment history.
Which data pre-processing techniques are important for loan eligibility prediction?
In the preprocessing steps, we handle missing values, remove or cap outliers, scale and engineer features, and encode the categorical variables so the data fits the model better.
Which machine learning algorithms are best for loan eligibility prediction?
For this project, the algorithms are Random Forest, Gradient Boosting, and Logistic Regression. These models are effective at capturing subtle, non-linear patterns in applicant and loan data.
How does feature engineering influence the results of loan eligibility prediction models?
Feature engineering helps the model understand relationships between features better, for example by converting "Years in current job" into a numeric scale and encoding categorical features.
What is SMOTE, and why is it used in loan prediction models?
SMOTE (Synthetic Minority Over-sampling Technique) handles imbalanced datasets by synthetically generating samples of the minority class. In loan prediction, SMOTE balances classes such as approved vs. denied to prevent the model from being biased toward the majority class.