## Insurance Pricing Forecast Using XGBoost Regressor

Our project's goal is to build a machine learning model that predicts healthcare costs so that insurance rates can be set appropriately. Two models will be built and compared: an XGBoost Regressor and a Linear Regression model. The study's overarching goal is to help insurance companies set profitable rates by estimating healthcare expenses from features such as age, BMI, and smoking status. The process covers data preparation, feature engineering, model training, evaluation, and performance comparison, with a particular focus on communicating the results to all stakeholders, technical and non-technical alike.

# Explanation of the Code

### Step 1:

This piece of code mounts your Google Drive in a Google Colab notebook. This makes it simple to access files saved in Google Drive from the Colab environment so that data can be loaded, analyzed, and models trained.
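The mount itself is a single call. It only works inside a Colab runtime, so the sketch below guards the import for anyone following along locally:

```python
# Mount Google Drive when running inside Colab; skip gracefully elsewhere.
try:
    from google.colab import drive  # available only inside a Colab runtime
    drive.mount("/content/drive")
    in_colab = True
except ImportError:
    in_colab = False  # running locally; read files from a local path instead

print("Running in Colab:", in_colab)
```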

Import required packages

### Step 2:

Import libraries
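A typical import cell for this kind of analysis might look like the sketch below; the exact set depends on your notebook (xgboost is commented out here so the snippet runs even where it is not installed):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# from xgboost import XGBRegressor  # needed later for the non-linear model

print("pandas", pd.__version__)
```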

**Exploratory Data Analysis (EDA)**

EDA stands for "Exploratory Data Analysis". It is an approach to analyzing data using statistical summaries and visualizations, looking for patterns in the data with statistical and visual methods.
People use it to identify trends, find outliers, test assumptions, and so on. The main goal of EDA is to help us understand the data before forming hypotheses about it.
When building a machine learning model, EDA is an essential step. It helps us understand how the variables are distributed, how they relate to one another, and which features are likely to be good predictors.
First, let's read the data, which lives in the folder called "input" and is named "insurance.csv".
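A minimal sketch of loading the file, assuming it sits at `input/insurance.csv` (the fallback rows below are illustrative stand-ins with the same columns, so the snippet runs anywhere):

```python
import pandas as pd
from pathlib import Path

DATA_PATH = Path("input/insurance.csv")  # adjust to your own path or Drive mount

if DATA_PATH.exists():
    df = pd.read_csv(DATA_PATH)
else:
    # illustrative stand-in rows matching the dataset's columns
    df = pd.DataFrame({
        "age": [19, 33, 62], "sex": ["female", "male", "female"],
        "bmi": [27.9, 22.7, 26.3], "children": [0, 1, 0],
        "smoker": ["yes", "no", "yes"],
        "region": ["southwest", "southeast", "southeast"],
        "charges": [16884.92, 1725.55, 27808.73],
    })

print(df.head())
df.info()  # dtypes and null counts
```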

So we have three numeric features (Age, Bmi and Children) and three categorical features (Sex, Smoker and Region).

NOTE: there are no null values in any of the columns, which means we won't need to impute values in the data preprocessing step. However, this is usually a step you'll need to consider when building a machine learning model.

The target (i.e. the variable that we want to predict) is the charges column, so let's split the dataset into features (X) and the target (y):
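The split itself is one line each way; a sketch assuming `df` holds the loaded dataset (stand-in rows are used here so the snippet is self-contained):

```python
import pandas as pd

# stand-in rows; in the notebook, df comes from pd.read_csv("input/insurance.csv")
df = pd.DataFrame({
    "age": [19, 33], "sex": ["female", "male"], "bmi": [27.9, 22.7],
    "children": [0, 1], "smoker": ["yes", "no"],
    "region": ["southwest", "southeast"], "charges": [16884.92, 1725.55],
})

X = df.drop(columns="charges")  # features
y = df["charges"]               # target
```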

**Distributions**

Let's now look at the distribution of each feature by plotting a histogram for each one. Points to note regarding the distribution of each feature:

- Age - Approximately uniformly distributed.
- Sex - Approximately equal volume in each category.
- Bmi - Approximately normally distributed.
- Children - Right skewed (i.e. higher volume in lower range).
- Smoker - Significantly more volume in the no category vs the yes category.
- Region - Approximately equal volume in each category.
- Charges - The distribution is right skewed (i.e. higher volume in the lower range).
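One way to produce these plots is sketched below: histograms for numeric columns and bar charts of category counts for the rest (stand-in rows are used so the snippet runs on its own):

```python
import pandas as pd
import matplotlib.pyplot as plt

# stand-in rows with the same columns as insurance.csv
df = pd.DataFrame({
    "age": [19, 33, 62, 45], "sex": ["female", "male", "female", "male"],
    "bmi": [27.9, 22.7, 26.3, 30.1], "children": [0, 1, 0, 3],
    "smoker": ["yes", "no", "yes", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
    "charges": [16884.92, 1725.55, 27808.73, 8000.0],
})

fig, axes = plt.subplots(2, 4, figsize=(16, 6))
for ax, col in zip(axes.ravel(), df.columns):
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col].plot.hist(ax=ax, bins=20)       # histogram for numeric features
    else:
        df[col].value_counts().plot.bar(ax=ax)  # bar chart for categorical features
    ax.set_title(col)
plt.tight_layout()
```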

**Univariate analysis (with respect to the target)**

The next step is univariate analysis with respect to the target. In other words, we look at each feature in turn and examine how it relates to the target.

How we do this depends on whether the feature is numeric or categorical. A scatterplot will be used for numeric features and a boxplot for categorical features.

**Numeric features**

Points to note regarding each feature:

- Age - As Age increases, Charges also tends to increase (although there is a large variance in Charges for a given Age).
- Bmi - There is no clear relationship, although there seems to be a group of individuals with Bmi > 30 that have Charges > 30k. This group may become more apparent when we carry out our bivariate analysis later.
- Children - No clear relationship (although Charges seems to decrease as Children increases). Since there are only 6 unique values for this feature, let's try treating it as a categorical feature for the purposes of univariate analysis.
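The two plot types above can be sketched side by side, assuming `df` holds the dataset (stand-in rows keep the snippet self-contained):

```python
import pandas as pd
import matplotlib.pyplot as plt

# stand-in rows; in the notebook, df is the loaded insurance data
df = pd.DataFrame({
    "age": [19, 33, 62, 45], "smoker": ["yes", "no", "yes", "no"],
    "charges": [16884.92, 1725.55, 27808.73, 8000.0],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# numeric feature vs target: scatterplot
ax1.scatter(df["age"], df["charges"])
ax1.set_xlabel("age")
ax1.set_ylabel("charges")

# categorical feature vs target: one boxplot of Charges per category
df.boxplot(column="charges", by="smoker", ax=ax2)
```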

### Step 3:

**Categorical features**

**Things to keep in mind about each feature:**

- Sex - No significant differences in Charges between the categories.
- Smoker - Charges for Smoker == 'yes' are generally much higher than when Smoker == 'no'.
- Region - No significant differences in Charges between the categories.
- Children - No significant differences in Charges between the categories (Children >= 4 are skewed towards lower Charges, but this is likely due to the low volumes in those categories - see the Distributions section).

### Step 4:

**Create a scatter matrix and run a Chi-squared test.**

The Chi-squared test checks if there is a significant association between two categorical variables by comparing observed and expected frequencies in a contingency table. If the p-value is less than 0.05, it indicates a significant association, suggesting that the variables are related rather than independent.
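A sketch of the test using scipy, on stand-in categorical data (in the notebook the inputs would be pairs of columns such as `sex` and `smoker`):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# stand-in categorical data; perfectly balanced, so no association is expected
df = pd.DataFrame({
    "sex": ["male", "male", "female", "female"] * 25,
    "smoker": ["yes", "no", "yes", "no"] * 25,
})

table = pd.crosstab(df["sex"], df["smoker"])  # observed frequencies
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")

if p_value < 0.05:
    print("Significant association between the two variables")
else:
    print("No evidence of association")
```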

**Numeric-categorical feature pairs**

**ANOVA**
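One-way ANOVA tests whether the mean of a numeric feature differs significantly across the categories of a categorical feature. A sketch using scipy on synthetic stand-in groups (in the notebook the groups would be, e.g., Charges split by Region):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# three stand-in groups; the third has a shifted mean
group_a = rng.normal(30, 6, 100)
group_b = rng.normal(30, 6, 100)
group_c = rng.normal(35, 6, 100)

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F={f_stat:.2f}, p={p_value:.4f}")  # a small p suggests the group means differ
```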

### Step 5:

**Data preprocessing**

**Model training**

Sample weights can make the residuals more homoscedastic. During training, observations with higher Charges are given more weight than observations with lower Charges, so residuals are penalized more heavily for observations with larger Charges than for observations with smaller Charges.
We derive the sample weight from the target column and scale it by the lowest value in the Charges column, so that the smallest sample weight is exactly 1.
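The weighting scheme described above can be sketched in a couple of lines (the `y_train` values are stand-ins; in the notebook this is the training-set Charges column):

```python
import numpy as np

# stand-in target values; in the notebook this is y_train ("Charges")
y_train = np.array([1121.87, 4500.00, 12000.00, 63770.43])

# dividing by the minimum makes the smallest weight exactly 1
sample_weight = y_train / y_train.min()
print(sample_weight)

# the weights are then passed to training, e.g.:
# model.fit(X_train, y_train, sample_weight=sample_weight)
```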

### Step 6:

**Evaluating the Model**

Now that the model is trained, we can use it to make predictions on both the training and test sets. The code computes metrics such as MSE, RMSE, MAE, and R^2 to assess how well the linear regression model performs on the training data, then prints the results.

This code evaluates the linear regression model's performance on the test data, calculating metrics such as mean squared error (MSE), root mean squared error (RMSE), mean absolute percentage error (MAPE), and the R-squared (R^2) score, then prints the results.
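A sketch of the metric calculations using sklearn, on stand-in predictions (in the notebook `y_true`/`y_pred` come from the test set and `model.predict(...)`):

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_percentage_error, r2_score,
)

# stand-in values; replace with the real test-set targets and predictions
y_true = np.array([3000.0, 12000.0, 8000.0, 30000.0])
y_pred = np.array([3500.0, 11000.0, 9000.0, 27000.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                                 # RMSE is the square root of MSE
mape = mean_absolute_percentage_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"MSE={mse:.0f}  RMSE={rmse:.0f}  MAPE={mape:.1%}  R^2={r2:.3f}")
```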

**Check normality of residuals**

A QQ (quantile-quantile) plot can be used to check whether the residuals are normally distributed. It plots each empirical quantile (from the data) against the corresponding theoretical quantile (from a normal distribution); if the data were perfectly normally distributed, the points would fall on a straight line.
We will also plot a histogram to make the residuals easier to interpret.
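Both plots can be produced with scipy and matplotlib; a sketch using synthetic stand-in residuals (in the notebook the residuals are `y_true - y_pred`):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
residuals = rng.normal(0, 1, 500)  # stand-in residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# QQ plot of residuals against a theoretical normal distribution
stats.probplot(residuals, dist="norm", plot=ax1)

# histogram of the same residuals
ax2.hist(residuals, bins=30)
ax2.set_title("Residuals")
```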

**Check homoscedasticity**
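A common way to eyeball homoscedasticity is a scatterplot of residuals against predicted values: a constant spread around zero across the range suggests homoscedastic residuals. A sketch on synthetic stand-in values:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_pred = rng.uniform(2000, 40000, 300)   # stand-in predictions
residuals = rng.normal(0, 3000, 300)     # stand-in residuals with constant variance

fig, ax = plt.subplots()
ax.scatter(y_pred, residuals, alpha=0.5)
ax.axhline(0, color="red")               # reference line at zero
ax.set_xlabel("Predicted charges")
ax.set_ylabel("Residual")
```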

### Step 7:

**Improve on the baseline linear model**

Let's train a non-linear model to see if it can outperform our linear regression model. Here, "non-linear" means that the model can learn relationships in the data that are not linear.

**Data preprocessing**

We need to create a new training set and test set because the ones used for the baseline linear model have been transformed.
Note: we use the same "random_seed" value for the split so that the observations match those in the baseline linear model's training and test sets:

**Train/test split**

**Model evaluation**

First, let's look at how each set of parameters performed on each fold. Each record in the results is one parameter combination that was tried. We sort by rank_test_score so that the best parameter set appears at the top:
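The pattern is sketched below on a small synthetic problem. The notebook tunes an XGBRegressor with Bayesian optimization; sklearn's GridSearchCV and GradientBoostingRegressor stand in here only so the sketch runs without xgboost installed, since any sklearn-style search exposes the same `cv_results_`/`rank_test_score` interface:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# small synthetic regression problem standing in for the insurance data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"max_depth": [2, 3], "n_estimators": [50, 100]},
    cv=3,
)
search.fit(X, y)

# one row per parameter combination; best set first
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "rank_test_score"]])
```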

**Let's use the model trained with our best parameters to make predictions on both our training and test sets:**

**Comparison to the baseline model**

**Presenting the results to non-technical stakeholders**

Instead, let's report what percentage of the time our model's predictions fall close to the actual charges value. For example, the proportion of our model's test-set predictions that are within $10,000 of the actual charges value is:

We can also show what percentage of our model's predictions are within a certain percentage of the actual charges value.

For example, the percentage of our model's predictions (on the test set) that are within 30% of the actual charges value is:
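Both of these stakeholder-friendly numbers are one-liners over the prediction errors; a sketch on stand-in values (in the notebook these are the test-set targets and predictions):

```python
import numpy as np

# stand-in values; replace with the real test-set targets and predictions
y_true = np.array([3000.0, 12000.0, 8000.0, 30000.0, 45000.0])
y_pred = np.array([3500.0, 11000.0, 9500.0, 18000.0, 44000.0])

# share of predictions within an absolute tolerance of the actual value
within_10k = np.mean(np.abs(y_true - y_pred) <= 10_000)

# share of predictions within a relative tolerance of the actual value
within_30pct = np.mean(np.abs(y_true - y_pred) / y_true <= 0.30)

print(f"{within_10k:.0%} of predictions within $10,000 of actual charges")
print(f"{within_30pct:.0%} of predictions within 30% of actual charges")
```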

**Conclusion**

We saw the assumptions Linear Regression is based on and how to apply and test the model against those assumptions. We learned about the XGBoost algorithm, how to use it with sklearn's pipeline object, and how to tune it with Bayesian optimization. We also learned how to communicate model performance to non-technical users.

Next, we'll look at some real-life uses of machine learning in the insurance industry.

**Machine Learning in Insurance Industry**

- Machine learning helps insurers identify probable fraudulent claims more quickly and accurately and flag them for investigation. Detecting insurance fraud is a difficult challenge, given the variety of fraud types and the relatively low ratio of known frauds in typical samples. By improving forecasting accuracy, machine learning techniques enable loss control units to achieve wider coverage with lower false-positive rates.
- Consumers want personalized solutions, which machine learning algorithms make feasible by analyzing customer profiles and offering tailored products. A growing number of insurers are using chatbots on messaging apps to answer simple questions and handle claims.
- The insurance industry is increasingly using machine learning to reach important goals such as lowering costs, improving underwriting, and detecting fraud.