Insurance Pricing Forecast Using XGBoost Regressor
The project Insurance Pricing Forecast Using XGBoost Regressor leverages machine learning to predict healthcare costs for insurance companies. Insurers need accurate forecasts of future expenses to set premiums profitably, and traditional methods often struggle with complex interactions in the data. Machine learning, and XGBoost in particular, offers a practical solution. This project develops a model that helps insurers establish rates from features like age, BMI, and smoking status, ensuring profitability while providing fair coverage.
Project Overview
In this project, we build an XGBoost Regressor that predicts healthcare expenses from factors like age, BMI, smoking status, and region. We also build a linear regression model as a baseline for comparison. By the end of this project, insurance companies will have a reliable tool for setting premiums based on predicted expenses, reducing reliance on manual calculations and improving profitability.
Prerequisites
Before starting this project, you should understand Python, statistics, and machine learning, and be familiar with libraries like NumPy, Pandas, Matplotlib, and Scikit-learn, which handle data manipulation, visualization, and model building. Familiarity with the XGBoost Regressor, linear regression, and regression analysis in general will also help you follow the modeling process.
Approach
We focus on building an XGBoost Regressor to predict healthcare costs from several features, and we compare it against a linear regression model to evaluate its effectiveness. We select XGBoost for its ability to handle non-linear relationships and its high predictive power and efficiency on complex datasets. Other machine learning techniques could be used, but XGBoost stands out for this task.
Workflow and Methodology
The overall workflow of this project includes:
- Problem Definition: Predict healthcare expenses using various features like age and smoking status.
- Data Collection: Gather data from healthcare records, including patient demographics and medical expenses.
- Data Preparation: Clean, transform, and encode the data for modeling.
- Modeling: Build a baseline linear regression model first, then use an XGBoost regressor to achieve better accuracy.
- Evaluation: Assess model performance using Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE).
- Conclusion: Analyze results and finalize the best model for predicting healthcare expenses.
The methodology involves:
- Exploratory Data Analysis (EDA): Understanding feature distributions, correlations, and trends in the data.
- Data Preprocessing: Handling missing values, encoding categorical variables appropriately, and transforming the target variable so the data is suitable for modeling.
- Feature Engineering: Creating or refining features that improve model performance.
- Hyperparameter Tuning: Using Bayesian Optimization to fine-tune the XGBoost Regressor for optimal results.
- Model Comparison: Comparing the linear regression model with the XGBoost Regressor to determine which predicts healthcare costs more accurately.
Data Collection and Preparation
We train the XGBoost Regressor on a dataset of healthcare records whose features include age, BMI, smoking status, region, and charges. This data represents real-world medical expenses across diverse health profiles. Our goal is to identify the features that drive costs and use them to predict future expenses accurately.
Data Preparation Workflow
- Data Cleaning: We start by checking for missing values and outliers, ensuring the data is clean and consistent for modeling.
- Feature Encoding: We one-hot encode categorical variables like 'sex' and 'region', converting them into numerical values.
- Target Transformation: Healthcare costs often have a skewed distribution, so we apply a Yeo-Johnson transformation to make the target variable more normally distributed, which improves model performance.
- Data Splitting: The dataset is split into training and test sets (a 67:33 split in this project), so we can train the model on one portion of the data and evaluate it on the rest.
Code Explanation
STEP 1:
This block of code mounts your Google Drive in a Google Colab notebook, letting you view and modify files saved in Drive from within Colab, so you can analyze your data or train models using those files.
from google.colab import drive
drive.mount('/content/drive')
Import required packages
We import essential libraries such as numpy, pandas, and matplotlib. We also include seaborn, plotly, and xgboost. These libraries help with data manipulation, visualization, and building machine learning models.
!pip install numpy
!pip install pandas
!pip install plotly
!pip install scikit-learn
!pip install scikit-optimize
!pip install statsmodels
!pip install category_encoders
!pip install xgboost
!pip install nbformat
!pip install matplotlib
Import libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px
import sys
from sklearn.model_selection import train_test_split
from category_encoders import OneHotEncoder
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import math
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.feature_selection import RFE
STEP 2:
Exploratory Data Analysis (EDA)
EDA stands for Exploratory Data Analysis: examining data with statistical and visual techniques to identify trends and patterns, find outliers, and test assumptions. Its main goal is to let you explore and understand the data before developing any theories or hypotheses about it.
When creating a machine learning model, EDA is a crucial step. It helps us understand how variables are distributed and how they relate to each other, and it identifies which features are crucial for making predictions.
Firstly, let's read the data, which is stored in the 'data' folder as 'insurance_dataset.csv'.
Load Dataset
data = pd.read_csv('/content/drive/MyDrive/Aionlinecourse/data/insurance_dataset.csv')
data.head()
data.info()
We have three numeric features: Age, BMI, and Children. Additionally, we have three categorical features: Sex, Smoker, and Region.
NOTE: there are no null values in any of the columns, so we won't need to impute values in the data preprocessing step. However, this is usually a step you'll need to consider when building a machine learning model; a quick sketch of that check follows.
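A minimal sketch of that check, together with the imputation you would add if any nulls were found (the SimpleImputer call is illustrative only, since this dataset has none):
from sklearn.impute import SimpleImputer

# Count missing values per column (all zeros for this dataset)
print(data.isnull().sum())

# Hypothetical: if numeric columns had gaps, fill them with the median
# imputer = SimpleImputer(strategy='median')
# data[['Age', 'Bmi']] = imputer.fit_transform(data[['Age', 'Bmi']])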
The target variable, which we want to predict, is the `Charges` column. Now, let's split the dataset into features (X) and the target (y):
target = 'Charges'
X = data.drop(target, axis=1)
y = data[target]
X.shape, y.shape
Distributions
Let's examine the distribution of each feature by plotting a histogram for each. Additionally, here are key points to note about the distribution of each feature:
- Age - Approximately uniformly distributed.
- Sex - Approximately equal volume in each category.
- Bmi - Approximately normally distributed.
- Children - Right skewed (i.e. higher volume in the lower range).
- Smoker - Significantly more volume in the 'no' category than the 'yes' category.
- Region - Approximately equal volume in each category.
The target (Charges) is right skewed (i.e. higher volume in the lower range):
fig = px.histogram(data, x=target, nbins=50, title="Distribution of Charges")
fig.show()
Univariate analysis (with respect to the target)
The next step is univariate analysis with respect to the target: we look at each feature and examine how it relates to the target.
How we do this depends on whether the feature is numeric or categorical. We'll use a scatterplot for numeric features and a boxplot for categorical features.
Numeric features
Points to note regarding each feature:
- Age - As Age increases, Charges also tend to increase (although there is a large variance in Charges for a given Age).
- BMI - There is no clear relationship, although a group of individuals with BMI > 30 tends to have Charges above 30k. This group may become more apparent when we carry out our bivariate analysis later.
- Children - No clear relationship (although Charges seem to decrease as Children increase). Since there are only 6 unique values, we will treat this feature as categorical for univariate analysis.
numeric_features = X.select_dtypes(include=[np.number])
numeric_features.columns
# Correlation heatmap of the numeric features
plt.figure(figsize=(6, 4))
sns.heatmap(numeric_features.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
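The notes above refer to scatterplots of each numeric feature against the target; here is a minimal sketch to produce them, reusing the numeric_features, target, and data objects defined earlier:
# Scatterplot of each numeric feature against the target
for col in numeric_features.columns:
    plt.figure(figsize=(6, 4))
    sns.scatterplot(x=col, y=target, data=data)
    plt.title(f"{target} vs. {col}")
    plt.show()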
STEP 3:
Categorical features
categorical_features = X.select_dtypes(include=["object"]).columns
categorical_features
Things to keep in mind about each feature:
- Sex - No significant differences in Charges between the categories.
- Smoker - Charges for Smoker == 'yes' are generally much higher than when Smoker == 'no'.
- Region - No significant differences in Charges between the categories.
- Children - No significant differences in Charges appear between most categories. The 4+ children categories skew towards lower Charges, but this is likely due to low volumes in those categories (refer to the distributions section for details).
# Create paired boxplots for each categorical feature against the target variable
for col in categorical_features:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=col, y=target, data=data)
    plt.title(f"Distribution of {target} by {col}")
    plt.show()
Numeric Feature Analysis
# Create paired scatterplots for each pair of numeric features
for i, col1 in enumerate(numeric_features.columns):
    for j, col2 in enumerate(numeric_features.columns):
        if i < j:
            plt.figure(figsize=(6, 4))
            sns.scatterplot(x=col1, y=col2, data=data)
            plt.title(f"Scatterplot of {col1} vs. {col2}")
            plt.show()
STEP 4:
Create a scatter matrix and run a Chi-squared test.
sns.pairplot(data)
# Display the scatter matrix
plt.show()
px.imshow(X.select_dtypes(include=np.number).corr())
Chi-Squared Test for Categorical Features
The Chi-squared test examines if two categorical variables are significantly associated. It does this by comparing observed and expected frequencies in a contingency table. A p-value less than 0.05 indicates a significant association. This suggests that the variables are related rather than independent.
# Chi-squared test
import scipy.stats as stats

# Loop through each categorical feature
for col in categorical_features:
    # Create a contingency table
    crosstab = pd.crosstab(data[col], data[target])
    # Perform the chi-squared test
    chi2, p, dof, expected = stats.chi2_contingency(crosstab)
    # Print the results
    print(f"Chi-squared test for {col} and {target}:")
    print(f"- Chi-squared statistic: {chi2:.4f}")
    print(f"- P-value: {p:.4f}")
    print(f"- Degrees of freedom: {dof}")
    print(f"- Expected frequencies:\n{expected}")
    # Interpret the results
    if p < 0.05:
        print(f"There is a statistically significant association between {col} and {target}.")
    else:
        print(f"There is no statistically significant association between {col} and {target}.")
Numeric-categorical feature pairs
ANOVA
Firstly, we will use an ANOVA test for pairings of numeric and categorical features. ANOVA (analysis of variance) tests whether the means of two or more independent groups differ, and whether those differences are statistically significant. To run ANOVA, you need at least one continuous variable and one categorical (grouping) variable.
Despite its name, ANOVA compares group means: it works by comparing the variance between the group means to the variance within the groups themselves, which indicates whether the groups plausibly come from the same population. One-way ANOVA is the simplest form and generalizes the t-test to more than two groups. The null hypothesis is that all group means are equal; the alternative hypothesis is that at least one group has a different mean.
!pip install pingouin
import pingouin as pg
X.head()
print(X.columns)
# ANOVA needs a numeric dependent variable and a categorical grouping variable
X_anova = pg.anova(dv='Bmi', between='Smoker', data=X)
X_anova
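As a complementary check, a one-way ANOVA for a specific numeric-categorical pair can also be run directly with scipy; this sketch tests whether mean Charges differ across the Smoker groups (column names as in this dataset):
import scipy.stats as stats

# One-way ANOVA: do mean Charges differ between smokers and non-smokers?
groups = [grp[target].values for _, grp in data.groupby('Smoker')]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F-statistic: {f_stat:.2f}, p-value: {p_value:.4g}")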
STEP 5:
Choosing an ML Model
For this project, we selected the XGBoost Regressor as the primary model. XGBoost is known for strong performance on complex datasets and handles non-linear relationships effectively: it builds decision trees sequentially, with each new tree correcting the errors of the previous one. This makes it ideal for predicting healthcare costs, where the data often involves complex interactions.
We also implemented a Linear Regression model as a baseline. Linear regression is a simpler model that helps establish an initial understanding of how the features relate to healthcare costs. However, its limitations in handling non-linear patterns make XGBoost the better option for this project.
Why Linear Regression and XGBoost model?
1. Linear Regression Model
- Simplicity and Interpretability: Linear Regression assumes a linear relationship between features such as age, BMI, and smoking status and the target, which makes the results easy to interpret.
- Benchmark for Comparison: It sets a performance baseline against which more advanced models can be compared.
- Computational Efficiency: Linear Regression is fast and lightweight, making it suitable for quickly assessing how well the data fits a linear model.
2. XGBoost Regressor
- Handling Non-Linear Relationships: XGBoost models complex, non-linear interactions between features, offering more accurate predictions of healthcare costs.
- Boosting Algorithm for Improved Performance: XGBoost uses gradient boosting, in which each new model corrects the errors of the previous ones, enhancing accuracy.
- Regularization for Overfitting Prevention: It includes built-in L1 and L2 regularization, which helps prevent overfitting and ensures good generalization.
- Feature Importance: XGBoost highlights which features are most influential in predicting costs, providing valuable insights (see the sketch below).
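As a quick illustration of that last point, a fitted XGBRegressor exposes a feature_importances_ attribute. This is only a sketch: the fitted model `model` and the encoded feature frame `X_train` are assumed, since nothing has been trained at this point in the project.
# Rank features by the importance scores a fitted XGBRegressor assigns them
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))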
Linear Regression Model Building
We start by creating a Linear Regression model. This model acts as our baseline. It helps us grasp the basic relationships between features and the target variable. However, Linear Regression may not fully capture the complexities of the dataset. Therefore, we use more advanced models for better accuracy.
Data preprocessing
Splitting the dataset into training data and test data.
Encoding (One-Hot Encoding)
Machine learning algorithms often struggle with categorical data, which must therefore be converted to numerical form. One-hot encoding addresses this by creating a boolean column for each category of each categorical feature, assigning 1 where the category applies and 0 otherwise, so the model interprets the data correctly. Unlike label encoding, which assigns arbitrary integers and can introduce a spurious ordering (the model might consider 'blue' superior to 'red'), one-hot encoding avoids that bias.
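A minimal illustration of the idea, using the same OneHotEncoder as later in this project (the toy 'Region' values are made up for the example):
import pandas as pd
from category_encoders import OneHotEncoder

# Toy example: each category becomes its own 0/1 column
demo = pd.DataFrame({'Region': ['southwest', 'southeast', 'northwest']})
print(OneHotEncoder(use_cat_names=True).fit_transform(demo))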
Transformation of the target
The target's non-normal distribution can produce residuals whose variance differs across values, violating the homoscedasticity assumption of the linear regression model. Power transforms, a family of parametric, monotonic transformations, can make data more Gaussian-like, which is useful for modeling issues with heteroscedasticity or normality. Scikit-learn's PowerTransformer supports both the Box-Cox and Yeo-Johnson transforms; we will use Yeo-Johnson, which (unlike Box-Cox) also handles zero and negative values.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    random_state=42
)
cols_to_drop = [
    'Children',
    'Region',
    'Sex'
]
X_train.drop(cols_to_drop, axis=1, inplace=True)
X_test.drop(cols_to_drop, axis=1, inplace=True)
ohe = OneHotEncoder(use_cat_names=True)
X_train = ohe.fit_transform(X_train)
X_test = ohe.transform(X_test)
cols_to_drop = ['Smoker_no']
X_train.drop(cols_to_drop, axis=1, inplace=True)
X_test.drop(cols_to_drop, axis=1, inplace=True)
pt = PowerTransformer(method='yeo-johnson')
y_train_t = pt.fit_transform(y_train.values.reshape(-1, 1))[:, 0]
y_test_t = pt.transform(y_test.values.reshape(-1, 1))[:, 0]
pd.Series(y_train_t).hist(figsize=(5, 3))
pd.Series(y_test_t).hist(figsize=(5, 3))
Linear Regression Model training
When the model is trained, observations with higher Charges receive more weight and observations with lower Charges receive less, so residuals on larger Charges are penalized more than residuals on smaller ones.
We derive the sample weights from the target column by dividing each value by the minimum of the Charges column, so the lowest sample weight is exactly 1.
sample_weight = y_train / y_train.min()
lr = LinearRegression()
lr.fit(
X_train,
y_train_t,
sample_weight=sample_weight
)
Evaluating the Model
Now that the model is trained, we can use it to make predictions on both the training and test sets. The code below computes MSE, RMSE, MAPE, and R² to assess the linear regression model's accuracy on the training data, then prints the results.
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error, mean_absolute_error
import math
# Train evaluation
y_pred_train_t = lr.predict(X_train)
y_pred_train = pt.inverse_transform(y_pred_train_t.reshape(-1, 1))[:, 0]
y_train = pt.inverse_transform(y_train_t.reshape(-1, 1))[:, 0]
mse_train = mean_squared_error(y_train, y_pred_train)
rmse_train = math.sqrt(mse_train)
mae_train = mean_absolute_error(y_train, y_pred_train)
mape_train = mean_absolute_percentage_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)
print("Train Evaluation:")
print(f"MSE: {mse_train:.4f}")
print(f"RMSE: {rmse_train:.4f}")
print(f"MAPE: {mape_train:.4f}")
print(f"R^2: {r2_train:.4f}")
This code evaluates the linear regression model on the test data: it calculates the mean squared error (MSE), root mean squared error (RMSE), mean absolute percentage error (MAPE), and R-squared (R²) score, then prints the results.
# Test evaluation
y_pred_test_t = lr.predict(X_test)
y_pred_test = pt.inverse_transform(y_pred_test_t.reshape(-1, 1))[:, 0]
y_test = pt.inverse_transform(y_test_t.reshape(-1, 1))[:, 0]
mse_test = mean_squared_error(y_test, y_pred_test)
rmse_test = math.sqrt(mse_test)
mae_test = mean_absolute_error(y_test, y_pred_test)
mape_test = mean_absolute_percentage_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)
print("Test Evaluation:")
print(f"MSE: {mse_test:.4f}")
print(f"RMSE: {rmse_test:.4f}")
print(f"MAPE: {mape_test:.4f}")
print(f"R^2: {r2_test:.4f}")
This code creates a table of the evaluation metrics (MSE, RMSE, MAPE, and R²) for both the training and test datasets, and plots predicted values against actual values to help visualize model performance.
# Create a table with the evaluation metrics
evaluation_metrics = pd.DataFrame({
"Metric": ["MSE", "RMSE", "MAPE", "R^2"],
"Train": [mse_train, rmse_train, mape_train, r2_train],
"Test": [mse_test, rmse_test, mape_test, r2_test]
})
print(evaluation_metrics.to_string())
# Plot the predicted values against the actual values
plt.figure(figsize=(6, 4))
plt.scatter(y_test, y_pred_test)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "k--", lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted Values")
plt.show()
Check normality of residuals
A QQ (quantile-quantile) plot helps us check if the residuals are normally distributed. This plot shows the difference between each real quantile and the theoretical quantile. The real quantiles come from the data. The theoretical quantiles are based on a normal distribution. Therefore, if the data were perfectly normally distributed, we would expect a straight line. The data points would align along this line.
We will also use a histogram to make the residuals easier to understand.
residuals_train = y_train - y_pred_train
residuals_test = y_test - y_pred_test
fig = sm.qqplot(
residuals_train,
fit=True,
line='45'
)
fig = sm.qqplot(
residuals_test,
fit=True,
line='45'
)
# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
# Plot the residuals for the train set
sns.histplot(residuals_train, ax=ax1, kde=True, color='blue')
ax1.set_title('Train Residuals')
ax1.set_xlabel('Residuals')
ax1.set_ylabel('Frequency')
# Plot the residuals for the test set
sns.histplot(residuals_test, ax=ax2, kde=True, color='orange')
ax2.set_title('Test Residuals')
ax2.set_xlabel('Residuals')
ax2.set_ylabel('Frequency')
# Show the plot
plt.show()
Check homoscedasticity
We can check for homoscedasticity using a scatter plot. In this plot, the target variable is on the x-axis, while the residuals are on the y-axis. Ideally, we expect the data points to be evenly distributed as the target value increases.
px.scatter(x=y_train, y=residuals_train)
px.scatter(x=y_test, y=residuals_test)
STEP 6:
XGBoost Regressor Model Building
Having built the linear regression baseline, we now turn to the XGBoost Regressor. Linear regression may not capture the complexities of the dataset, so we use XGBoost for improved accuracy.
Improve on the baseline linear model
Now, let's try to improve on our linear regression model by training a non-linear model, i.e. one that can learn complex, non-linear relationships from the data. Before moving on to the XGBoost Regressor, let's understand decision trees.
Decision Tree
Decision trees are supervised machine learning methods that can solve both regression and classification problems. Also called CART (Classification and Regression Trees), a decision tree is represented by a tree-like flowchart, starting with a root node at the top and ending in leaf nodes at the bottom. The tree produces predictions through feature-based splits, making it a clear and visual tool for decision-making.
Decision Tree and Overfitting
Decision trees split the feature space into homogeneous regions, using a cost function to find the best split points and conditions. Because the algorithm tries to fit every sample in the training dataset, numerous splits occur; as the tree grows, it effectively memorizes the feature values of individual observations, which leads to errors on new data. Decision trees are therefore especially prone to overfitting, as the sketch below illustrates.
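To see this concretely, here is a minimal sketch (assuming the X_train/y_train split from the linear regression section above) contrasting an unconstrained tree with a depth-limited one; the gap between train and test R² typically shrinks once the tree is constrained:
from sklearn.tree import DecisionTreeRegressor

# An unconstrained tree can memorise the training data, while limiting
# depth and leaf size forces it to generalise
for name, tree in [
    ('unconstrained', DecisionTreeRegressor(random_state=42)),
    ('depth-limited', DecisionTreeRegressor(max_depth=3, min_samples_leaf=20, random_state=42)),
]:
    tree.fit(X_train, y_train)
    print(f"{name}: train R^2 = {tree.score(X_train, y_train):.3f}, "
          f"test R^2 = {tree.score(X_test, y_test):.3f}")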
Bagging and Boosting
Bagging and boosting are ensemble learning methods that lower an estimator's variance by combining the outputs of many learners into an averaged result.
Boosting
Boosting trains models sequentially: each model fixes the mistakes of the ones that came before it. This process continues until the data can be forecast accurately, or until the maximum number of models is reached.
Boosting is not one specific method; it can improve the performance of many weak models. It gives more weight to misclassified data, which helps the algorithm focus on tough cases and improves the model's accuracy.
Boosting is not the same as bagging: bagging works best for models with high variance and low bias, while boosting works best for models with low variance and high bias. However, because boosting aims to minimize mistakes, it can lead to overfitting.
Gradient Boosting
Gradient boosting creates models sequentially: each model corrects the mistakes of the previous one. Its three main components, which work together, are an additive model, a loss function, and a weak learner.
This method views boosting as numerical optimization using gradient descent. We use the Gradient Boosting Classifier for classification tasks and the Gradient Boosting Regressor for regression; what differentiates them is the loss function, typically Mean Squared Error (MSE) for regression and log-likelihood for classification. The goal is to lower the loss function by repeatedly adding weak learners, as the sketch below illustrates.
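To make the mechanics concrete, here is a minimal from-scratch sketch of gradient boosting for squared-error loss. The function names are illustrative, not from any library:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=2):
    # Start from the mean, then repeatedly fit a small tree to the residuals
    # (the negative gradient of the squared-error loss)
    base_pred = float(np.mean(y))
    pred = np.full(len(y), base_pred)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                           # what the next tree must fix
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)  # shrunken additive update
        trees.append(tree)
    return base_pred, trees

def gradient_boost_predict(X, base_pred, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], base_pred)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred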
Introduction to a non-linear model - XGBoost
XGBoost stands for eXtreme Gradient Boosting. It is a powerful decision-tree-based method, often used in machine learning competitions. For regression, this is roughly how XGBoost works:
- Initial Guess: Begin by making an initial prediction, like 0.5.
- Compute Residuals: Find the difference between the predicted values and the real values.
- Fit Regression Tree: Build a decision tree to predict the residuals: compute how similar the observations in each leaf are, and split nodes to get the most gain.
- Prune Tree: Use a threshold parameter (gamma) to remove splits that have low gain.
- Update Prediction: Combine the initial forecast with the tree's prediction, scaled by the learning rate.
- Repeat: Repeat the residual-fitting steps until the residuals are as small as possible, or until the maximum number of trees is reached.
Some important XGBoost settings are `n_estimators`, `gamma`, `reg_lambda`, `learning_rate`, and `min_child_weight`. Tuned well, these settings yield a highly accurate model that improves its predictions iteration by iteration, reducing mistakes further each time (illustrated below).
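As an illustration of where those settings live on an XGBRegressor, here is a sketch; the values below are placeholders, not tuned ones (the tuned values come from BayesSearchCV later):
# Illustrative settings only; tuned values come from the search below
xgb_example = XGBRegressor(
    n_estimators=200,      # number of boosting rounds (trees)
    learning_rate=0.1,     # shrinks each tree's contribution
    max_depth=4,           # maximum depth of each tree
    gamma=10,              # minimum loss reduction required to make a split
    reg_lambda=10,         # L2 regularisation term on leaf weights
    min_child_weight=5,    # minimum sum of instance weight needed in a child
)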
Data preprocessing
We need to create a new training set and test set, because the previous ones were altered (columns were dropped and the target transformed) during the linear regression workflow.
Note: We use the same `random_state` value for both splits. This keeps the training and test sets consistent with those used for the baseline linear model.
Train/test split
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.33,
random_state=42
)
ohe = OneHotEncoder(use_cat_names=True)
X_train = ohe.fit_transform(X_train)
X_test = ohe.transform(X_test)
rfe = RFE(estimator=XGBRegressor())
xgb = XGBRegressor()
steps = [
('rfe', rfe),
('xgb', xgb)
]
pipe = Pipeline(steps)
num_features = X_train.shape[1]
search_spaces = {
'rfe__n_features_to_select': Integer(1, num_features), # Num features returned by RFE
'xgb__n_estimators': Integer(1, 500), # Num trees built by XGBoost
'xgb__max_depth': Integer(2, 8), # Max depth of trees built by XGBoost
'xgb__reg_lambda': Integer(1, 200), # Regularisation term (lambda) used in XGBoost
'xgb__learning_rate': Real(0, 1), # Learning rate used in XGBoost
'xgb__gamma': Real(0, 2000) # Gamma used in XGBoost
}
xgb_bs_cv = BayesSearchCV(
estimator=pipe, # Pipeline
search_spaces=search_spaces, # Search spaces
scoring='neg_root_mean_squared_error', # BayesSearchCV tries to maximise scoring metric, so negative RMSE used
n_iter=100, # Num of optimisation iterations
cv=3, # Number of folds
n_jobs=-1, # Uses all available cores to compute
verbose=1, # Show progress
random_state=0 # Ensures reproducible results
)
xgb_bs_cv.fit(
X_train,
y_train,
)
Model evaluation
Firstly, let's examine how each set of parameters performed across the folds. Each record in cv_results represents one parameter set that was tested; sorting by rank_test_score puts the best set at the top.
cv_results = pd.DataFrame(xgb_bs_cv.cv_results_).sort_values('rank_test_score')
cv_results
Let's use the model trained with our best parameters to make predictions on both our training and test sets:
y_pred_train_xgb = xgb_bs_cv.predict(X_train)
y_pred_test_xgb = xgb_bs_cv.predict(X_test)
# Train evaluation
mse_train_xgb = mean_squared_error(y_train, y_pred_train_xgb)
rmse_train_xgb = math.sqrt(mse_train_xgb)
mae_train_xgb = mean_absolute_error(y_train, y_pred_train_xgb)
mape_train_xgb = mean_absolute_percentage_error(y_train, y_pred_train_xgb)
r2_train_xgb = r2_score(y_train, y_pred_train_xgb)
print("Train Evaluation:")
print(f"Root Mean Squared Error: {rmse_train_xgb:.4f}")
print(f"Mean Squared Error: {mse_train_xgb:.4f}")
print(f"Mean Absolute Error: {mae_train_xgb:.4f}")
print(f"Mean Absolute Percentage Error: {mape_train_xgb:.4f}")
print(f"R Squared: {r2_train_xgb:.4f}")
# Test evaluation
mse_test_xgb = mean_squared_error(y_test, y_pred_test_xgb)
rmse_test_xgb = math.sqrt(mse_test_xgb)
mae_test_xgb = mean_absolute_error(y_test, y_pred_test_xgb)
mape_test_xgb = mean_absolute_percentage_error(y_test, y_pred_test_xgb)
r2_test_xgb = r2_score(y_test, y_pred_test_xgb)
print("Test Evaluation:")
print(f"Root Mean Squared Error: {rmse_test_xgb:.4f}")
print(f"Mean Squared Error: {mse_test_xgb:.4f}")
print(f"Mean Absolute Error: {mae_test_xgb:.4f}")
print(f"Mean Absolute Percentage Error: {mape_test_xgb:.4f}")
print(f"R Squared: {r2_test_xgb:.4f}")
# Plot the predicted values against the actual values
plt.figure(figsize=(6, 4))
plt.scatter(y_test, y_pred_test_xgb)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "k--", lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted Values (XGBoost)")
plt.show()
STEP 7:
Comparison to the baseline model
# Create a table with the (train-set) evaluation metrics for both models
evaluation_metrics = pd.DataFrame({
    "Metric": ["Root Mean Squared Error", "Mean Squared Error", "Mean Absolute Error", "Mean Absolute Percentage Error", "R Squared"],
    "Linear Regression": [rmse_train, mse_train, mae_train, mape_train, r2_train],
    "XGBoost": [rmse_train_xgb, mse_train_xgb, mae_train_xgb, mape_train_xgb, r2_train_xgb],
})
# Print the table
print(evaluation_metrics.to_string())
# Compare the evaluation metrics for both models
print("Comparison of Evaluation Metrics:")
print("- Root Mean Squared Error:")
print(f" Linear Regression: {rmse_train:.4f}")
print(f" XGBoost: {rmse_train_xgb:.4f}")
print("- Mean Squared Error:")
print(f" Linear Regression: {mse_train:.4f}")
print(f" XGBoost: {mse_train_xgb:.4f}")
print("- Mean Absolute Error:")
print(f" Linear Regression: {rmse_train:.4f}")
print(f" XGBoost: {rmse_train_xgb:.4f}")
print("- Mean Absolute Percentage Error:")
print(f" Linear Regression: {mape_train:.4f}")
print(f" XGBoost: {mape_train_xgb:.4f}")
print("- R Squared:")
print(f" Linear Regression: {r2_train:.4f}")
print(f" XGBoost: {r2_train_xgb:.4f}")
# Based on the comparison, choose the model with the better performance
# Plots comparing the evaluation metrics for both models
# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
# Plot the RMSE for both models
ax1.bar(['Linear Regression', 'XGBoost'], [rmse_train, rmse_train_xgb])
ax1.set_title('Root Mean Squared Error')
ax1.set_xlabel('Model')
ax1.set_ylabel('RMSE')
# Plot the R^2 for both models
ax2.bar(['Linear Regression', 'XGBoost'], [r2_train, r2_train_xgb])
ax2.set_title('R Squared')
ax2.set_xlabel('Model')
ax2.set_ylabel('R^2')
# Show the plot
plt.show()
Communicating results to non-technical stakeholders
Data scientists often have to explain how well a model works to people who aren't specialists in the field, and metrics like root mean squared error (RMSE) aren't very useful there because of their complexity.
Instead, let's show the percentage of our model's predictions that land close to the real Charges value. For example, the portion of our model's test-set predictions that are within $10,000 of the actual Charges value is:
# The percentage of our model's predictions (on the test set) that are within $10,000 of the actual charges value is:
# Calculate the number of predictions within $10,000 of the actual charges value
num_within_10000 = len(np.where((abs(y_test - y_pred_test_xgb) <= 10000))[0])
# Calculate the percentage of predictions within $10,000 of the actual charges value
percentage_within_10000 = (num_within_10000 / len(y_test)) * 100
# Print the percentage
print(f"The percentage of predictions within $10000 of the actual charges value is: {percentage_within_10000:.2f}%")
We can also report how many predictions fall within a given relative distance of the actual charges, by measuring the gap between each prediction and the real value. For example, the percentage of predictions within 30% of the actual Charges value gives another view of the model's accuracy:
# The percentage our model's predictions (on the test set) that are within 30% of the actual charges value is:
# Calculate the number of predictions within 30% of the actual charges value
num_within_30_percent = len(np.where((abs(y_test - y_pred_test_xgb) / y_test) <= 0.30)[0])
# Calculate the percentage of predictions within 30% of the actual charges value
percentage_within_30_percent = (num_within_30_percent / len(y_test)) * 100
# Print the percentage
print(f"The percentage of predictions within 30% of the actual charges value is: {percentage_within_30_percent:.2f}%")
Project Conclusion
The Insurance Pricing Forecast Using XGBoost Regressor project demonstrates how machine learning can predict healthcare costs. The XGBoost model outperformed the linear regression baseline, offering more accurate predictions and handling non-linear relationships effectively. Insurance companies can use such a model to set premiums more reliably, reducing operational costs and boosting profitability.
Challenges and Troubleshooting
Handling Skewed Data
- The dataset had a skewed distribution of healthcare costs, which affected predictions.
- We applied the Yeo-Johnson transformation to address this skewness.
- The transformation normalized the healthcare cost data, making it more suitable for machine learning models and leading to more accurate predictions.
XGBoost Regressor Hyperparameter Tuning
- Tuning the hyperparameters of the XGBoost Regressor was essential for success.
- We utilized Bayesian Optimization, which explores a wide range of hyperparameter values efficiently.
- The optimization process significantly improved model accuracy and helped prevent overfitting, ensuring reliable results.
XGBoost Regressor Model Evaluation
- Selecting the right evaluation metrics was crucial for assessing model performance.
- We used RMSE, MAPE, and R-squared for evaluation.
- These metrics offered valuable insight into the accuracy of the model's predictions and demonstrated how effectively it predicted healthcare costs.
FAQ
- What is the purpose of this project?
  Answer: The goal is to build a machine learning model that predicts healthcare costs for insurance companies, helping them set premiums more accurately.
- Why did we choose XGBoost for this project?
  Answer: XGBoost handles complex datasets with non-linear relationships, which makes it ideal for predicting healthcare costs.
- What are the key features used in this model?
  Answer: Key features include age, BMI, smoking status, and region; each significantly impacts healthcare expenses.
- How do you evaluate model performance?
  Answer: We use metrics like RMSE, MAPE, and R-squared to evaluate model accuracy and effectiveness.
- What were the main challenges?
  Answer: The main challenges included handling skewed data, tuning the XGBoost hyperparameters, and selecting the right evaluation metrics.