Learn to Build a Polynomial Regression Model from Scratch
Ready to explore polynomial regression? Imagine digging into your data and uncovering hidden structure, a pattern that isn't a simple straight line. Polynomial regression extends the basic linear model so it can capture complex curves and trends, which improves prediction accuracy and makes the model far more useful for real-world tasks.
Project Overview
In this project, you will learn how to create a polynomial regression model from scratch and see how a basic linear model can be reshaped to your advantage. First, we collect the data and clean it. After that, we discuss how polynomial regression differs from simple linear regression. Finally, we code the model step by step, explaining each part in detail. By the end of this project, you will have a working model that handles non-linear trends well. Ready to start? Let's dive in!
Prerequisites
- Basic Python Skills: Be able to write loops and functions and work with variables.
- Understanding of Linear Regression: Be familiar with fitting a straight line to data.
- Basic Math Knowledge: Comfort with simple algebra, powers, and exponents.
- Libraries (NumPy and Matplotlib): Familiarity with numerical calculations, data manipulation, and data visualization.
With these basics, you’re ready to start the project!
Approach
Our approach starts by setting up the dataset we will use to train and test the model. We begin with a quick review of linear regression to see where polynomial regression can add value. Next, we transform the data by adding polynomial features, which lets the model capture curves instead of just straight lines. Once the data is prepared, we code the polynomial regression model step by step, using Python and its NumPy library for the numerical calculations and transformations. After training, we plot the results with Matplotlib to assess how well the model fits the data.
In the end, we compare the two models and run a more advanced analysis to show the clear benefit of polynomial regression in practical applications that involve non-linear data.
Workflow
- Data Collection: First, we gather a dataset that shows a non-linear pattern - here, an NBA dataset containing team training statistics and points scored.
- Data Preprocessing: Handle missing values, scale the features, and split the data into training and validation sets.
- Feature Engineering: Add polynomial features to the dataset, which allows the model to fit curves.
- Model Building: Implement polynomial regression from scratch using Python and NumPy to handle the math.
- Training the Model: Train the model on the training set so it learns the underlying patterns.
- Evaluation: Test the model and assess its performance.
- Comparison with Linear Regression: Compare the results with a basic linear regression model to highlight the power of polynomial regression.
Methodology
To construct a polynomial regression model, we first expand the feature space by adding polynomial terms of the original features, such as their squares and cubes. This enables the model to fit higher-order trends. We then train the model with least squares estimation, which chooses the coefficients that minimize the squared differences between the predictions and the actual values. By walking through each step, from feature engineering to training and evaluation, we see not only how the model works but also why it suits non-linear data. Visualizing and comparing the effects of polynomial regression makes the methodology practical and useful; a minimal NumPy sketch of the core idea follows.
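To make the least squares idea concrete before we touch the real dataset, here is a minimal, self-contained sketch (on synthetic data, not the project data) that fits a quadratic with plain NumPy; the coefficient values and noise level are made up for illustration.
import numpy as np

# Synthetic data: a noisy quadratic, so the true coefficients are known in advance
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.5 * x**2 - 2.0 * x + 0.5 + rng.normal(0, 1, x.size)

# Build a design matrix with a bias column plus the polynomial terms of x
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares: find w minimizing ||Xw - y||^2
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Estimated coefficients (intercept, x, x^2):", w)
The rest of the project delegates this bookkeeping to scikit-learn's PolynomialFeatures and LinearRegression, but the underlying computation is the same.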
Data Collection
First, we collected a public dataset that shows a non-linear pattern. This will allow us to see the benefits of polynomial regression. You can also use real-world data like housing prices or stock trends. In this dataset, values fluctuate in complex ways.
If you're a beginner, try creating artificial data with a curved pattern, as in the short sketch below. That way you control the data, which makes it easier to tell whether the model is behaving as expected.
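For instance, a tiny sketch like the following (the numbers are arbitrary) produces a noisy quadratic whose true shape you know in advance:
import numpy as np
import matplotlib.pyplot as plt

# Arbitrary curved pattern: a noisy quadratic whose shape you control
x = np.linspace(0, 10, 100)
y = 0.5 * x**2 - 3 * x + np.random.normal(0, 2, 100)

plt.scatter(x, y, s=10)
plt.title("Synthetic non-linear data")
plt.show()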
Data Preparation
Once the dataset has been collected, it is time to prepare it. We start by handling missing values so they don't corrupt the dataset. Next, we normalize the features so they all sit in a comparable range, which is especially important in polynomial regression. Finally, we transform the original features (squaring, cubing, or raising them to higher powers) to create new features. This transformation lets the model fit curves rather than just straight lines, so it can capture richer relationships; a short sketch of the idea follows below.
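As a rough illustration of the transformation step (the column name here is a placeholder, not one of the project's features), squaring and cubing a column in pandas looks like this:
import pandas as pd

# Placeholder column 'x'; in the project the same idea is applied via PolynomialFeatures
df_example = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
df_example["x_squared"] = df_example["x"] ** 2  # quadratic term
df_example["x_cubed"] = df_example["x"] ** 3    # cubic term
print(df_example)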
Data Preparation Workflow
- Handle Missing Values: Impute missing values so that incomplete rows don't shrink the training data.
- Feature Scaling: Bring all features onto a common scale for more stable, accurate training.
- Generate Polynomial Features: Add polynomial features to incorporate non-linear relationships.
- Split the Data: Divide the available data into training and validation sets.
- Final Check: Confirm that no outliers are distorting the results and that no further scale transformations are needed.
STEP 1:
Code explanation
Here’s what is happening under the hood. Let’s go through it step by step:
Mount Google Drive
Mount your Google Drive to access and save datasets, models, and other resources.
from google.colab import drive
drive.mount('/content/drive')
Suppress Warnings
It excludes non-critical warnings from the output, producing a cleaner view of the results.
import warnings
warnings.filterwarnings('ignore')
Install Required Libraries
This installs the LightGBM and Scikit-Learn libraries. LightGBM is often used for gradient boosting in machine learning, while Scikit-Learn provides essential tools for building and evaluating models.
!pip install lightgbm
!pip install scikit-learn
Import Libraries
This section imports the necessary libraries for data manipulation (pandas, numpy), plotting (seaborn, plotly, matplotlib), statistical operations (scipy), and machine learning models (sklearn).
import sys
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn import linear_model
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
STEP 2:
Load the Dataset
The code uses the pandas library to read the dataset from Google Drive. The head() method shows the first few records to give a quick overview of the data.
data = pd.read_csv("/content/drive/MyDrive/New 90Projects/Project_1/Final_NBA_Dataset.csv")
data.head()
Check the Dataset Dimensions
This statement displays the dimension of the dataset in terms of how many rows and columns it has. For example, if it says (317, 7), this means that the dataset has 317 rows and 7 columns, thereby helping to verify that the correct amount of data was indeed loaded.
print("Dimension of the dataset is= ", data.shape)
Display Column Names
The code is designed to print the names of the dataset's columns. This helps to quickly understand the features present in the data to be analyzed.
data.columns
Dataset Overview
Before starting the data cleaning or the analysis process, it is wise to understand the structure of the data and look for missing values. This command is very useful in undertaking this task, as it lets the user easily perceive the characteristics of the data.
data.info()
Renamed Columns for Easy Analysis
The code renames the columns of the data DataFrame to simpler names and produces a new DataFrame df with the shortened column names. It then uses df.head(10) to display the first 10 rows and check the changes.
df=data.rename(columns={'Points_Scored':'Points','Weightlifting_Sessions_Average':'WL','Yoga_Sessions_Average':'Yoga',
'Laps_Run_Per_Practice_Average':'Laps','Water_Intake':'WI',
'Players_Absent_For_Sessions':'PAFS'})
df.head(10)
STEP 3:
Visualizing Data Distribution and Outliers for Deeper Insights into Points
This code creates three visualizations for the Points variable: a density plot, a density plot of the transformed values (the square root of Points), and a boxplot. Each plot highlights a different aspect of the distribution, such as spread, skewness, and outliers. The layout is adjusted for readability, and the plots are shown side by side for easy comparison.
fig, axs = plt.subplots(1, 3, figsize=(20, 6), dpi=80)
# Distribution Plot for Points
sns.distplot(df.Points, ax=axs[0])
axs[0].set_xlabel("Points")
axs[0].set_ylabel("Density")
axs[0].set_title("Distribution Plot for Points")
# Distribution Plot for Square Root of Points
sns.distplot(np.sqrt(df.Points), ax=axs[1])
axs[1].set_xlabel("Square Root of Points")
axs[1].set_ylabel("Density")
axs[1].set_title("Distribution Plot for Square Root of Points")
# Boxplot for Points
sns.boxplot(x=df.Points, ax=axs[2])
axs[2].set_xlabel("Points")
axs[2].set_title("Boxplot for Points")
# Adjust layout to avoid overlap
plt.tight_layout()
plt.show()
Viewing the Last 100 Rows of the Dataset
The code displays the last 100 rows of the df DataFrame.
df.tail(100)
Creating Violin and Box Plots for Each Variable by Team
The function plotting_box_violin_plots() is constructed to create violin and box plots for any given variable against the ‘Team’ column to better visualize the spread and the central tendency of the data. This function is subsequently called in a loop to draw these plots for each variable (‘WL’, ‘Yoga’, ‘Laps’, ‘WI’, ‘PAFS’) and allow for comparison for each of them against the ‘Team’ label.
def plotting_box_violin_plots(df, x, y):
    fig, axes = plt.subplots(1, 2, figsize=(26, 8))
    fig.suptitle("Violin and Box plots for variable: {}".format(x))
    # Violin plot with Team on the y-axis and the variable on the x-axis, colored by Team
    violin = sns.violinplot(ax=axes[0], x=x, y=y, data=df, hue=y, palette="Set2", split=True)
    axes[0].set_title("Violin plot for variable: {}".format(x))
    # Box plot with Team on the y-axis and the variable on the x-axis, colored by Team
    box = sns.boxplot(ax=axes[1], x=x, y=y, data=df, hue=y, palette="Set2")
    axes[1].set_title("Box plot for variable: {}".format(x))
    # Setting the labels for the x-axis and y-axis
    axes[0].set_xlabel(x)
    axes[0].set_ylabel("Team")
    axes[1].set_xlabel(x)
    axes[1].set_ylabel("Team")
    # Add legends only if there are any labeled elements
    if violin.get_legend_handles_labels()[1]:
        axes[0].legend(loc='upper right')
    if box.get_legend_handles_labels()[1]:
        axes[1].legend(loc='upper right')

# Looping through the variables and plotting
for x in ['WL', 'Yoga', 'Laps', 'WI', 'PAFS']:
    plotting_box_violin_plots(df, x, "Team")
STEP 4:
Identifying Outliers in Selected Columns
The function find_outliers() identifies outliers in a given column by computing that column's Interquartile Range (IQR). The function is then applied in a loop to each of the columns ('WL', 'Yoga', 'Laps', 'WI', 'PAFS'), and any values that fall outside the computed range are printed, making it easy to spot data points that need additional attention.
def find_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    Upper_End = Q3 + 1.5 * IQR
    Lower_End = Q1 - 1.5 * IQR
    outlier = df[column][(df[column] > Upper_End) | (df[column] < Lower_End)]
    return outlier

# Check each column for outliers
for column in ['WL', 'Yoga', 'Laps', 'WI', 'PAFS']:
    print('\n Outliers in column "%s"' % column)
    outlier = find_outliers(df, column)
    print(outlier)
Eliminating Certain Rows for Data Cleaning Purposes
The code constructs a revised DataFrame df_clean such that the rows with index numbers 142, 143, and 144 in df are excluded. In addition, the command df_clean.shape is employed to provide information regarding the size of the cleaned DataFrame.
df_clean=df.drop([142,143,144])
df_clean.shape
Replacing Invalid Values with NaN in the Cleaned Data
This code replaces every occurrence of the value 1111111.0 in the 'WL' column of df_clean with NaN, marking it as missing data. Displaying df_clean['WL'] then shows the modified column and makes it easy to spot the records that will need further treatment during the analysis.
df_clean.loc[df_clean['WL'] == 1111111.0, 'WL'] = np.nan
df_clean['WL']
STEP 5:
Calculating and Displaying Missing Data Proportion
The DataFrame ncounts shows the proportion of missing data in each column of df_clean. It uses df_clean.isna().mean() to compute the fraction of NaN values per column; the result is then transposed and the column is renamed data_missing for ease of reference.
ncounts=pd.DataFrame([df_clean.isna().mean()]).T
ncounts=ncounts.rename(columns={0:'data_missing'})
ncounts
Visualizing Missing Data Proportion
This code generates a horizontal bar chart showing the proportion of missing values for each column in ncounts. The call ncounts.plot(kind='barh', title='% of missing values across each column') gives an intuitive picture of where data is missing and which columns may need cleaning, and plt.show() renders the chart.
ncounts.plot(kind='barh', title='% of missing values across each column')
plt.show()
Comparing Data Shapes Before and After Dropping Missing Values
This comparison helps you understand the impact of dropping rows or columns with missing data on the dataset’s size.
df_clean.shape, df_clean.dropna(axis=0).shape, df_clean.dropna(axis=1).shape
Getting an Overview of the Cleaned Data
This summary provides insight into the arrangement of the prepared dataset before carrying out any evaluation.
df_clean.info()
Filling Missing Values in the 'WL' Column
The code returns the 'WL' column with all NaN values replaced by -1. Note that the result is not assigned back, so this serves only as a quick preview of the fill.
df_clean['WL'].fillna(-1)
STEP 6:
Visualizing the Effect of Filling Missing Values with Mean and Median
The presented code illustrates the effect on the distribution of the WL column when its missing values are replaced with the mean or the median. The placing of these two plots next to each other facilitates the comparison between the two imputation approaches and gives an idea of which method is better at retaining the original characteristics of the data.
fig, axes = plt.subplots(1, 2, figsize=(16, 6), dpi=80)
# Visualizing after filling missing values with mean
sns.distplot(df_clean['WL'].fillna(df_clean['WL'].mean()), ax=axes[0])
axes[0].set_xlabel("WL")
axes[0].set_ylabel("Density")
axes[0].set_title("Distribution Plot for WL (Filled with Mean)")
# Visualizing after filling missing values with median
sns.distplot(df_clean['WL'].fillna(df_clean['WL'].median()), ax=axes[1])
axes[1].set_xlabel("WL")
axes[1].set_ylabel("Density")
axes[1].set_title("Distribution Plot for WL (Filled with Median)")
# Adjust layout to avoid overlap
plt.tight_layout()
plt.show()
Calculating Mean 'WL' per Team
This piece of code computes the average of the WL column for every distinct Team in df_clean, utilizing groupby(), and outputs the result in the form of a dictionary.
mean_WL=df_clean.groupby("Team")['WL'].mean().to_dict()
mean_WL
Replacing NaN Values in the 'WL' Column on a Team-Wise Basis
This code iterates over every row of df_clean. Whenever the WL value is missing for a row, the gap is filled with the mean WL value for that row's team, looked up in the mean_WL dictionary.
for index, row in df_clean.iterrows():
    team = row['Team']
    if pd.isna(row['WL']):
        mean_value = mean_WL.get(team)
        df_clean.at[index, 'WL'] = mean_value
Depicting the Distribution of 'WL' Once the NAs Are Filled
This code creates a distribution plot for the WL column of the cleaned DataFrame after the missing values have been filled with the mean for each team.
plt.figure(figsize=(8, 6), dpi=80)
sns.distplot(df_clean['WL'].replace(mean_WL))
plt.xlabel("WL")
plt.ylabel("Density")
plt.title("Distribution Plot for WL")
# Show only the plot
plt.show()
df_clean['WL'].replace(mean_WL)
df_clean['WL']=df_clean.groupby('Team')['WL'].transform(lambda x:x.fillna(x.mean()))
Using Simple Imputer to Handle Missing Values
This snippet fills up the missing values in the defined columns of the data frame by employing a SimpleImputer using the mean strategy. A new data frame, named si_impt_df is created, which consists of all the values present in the original data frame without any missing values, as each column has been filled with the mean of that particular column.
# 1 Simple Imputer
Features=['WL','Yoga','Laps','WI','PAFS']
impt=SimpleImputer(strategy='mean')
#Fit & Transform
si_impt=impt.fit_transform(df_clean[Features])
si_impt_df=pd.DataFrame(si_impt,columns=Features)
si_impt_df
Applying Iterative Imputer in Handling Missing Data
This script runs an IterativeImputer with max_iter=10 to handle missing values in the specified columns. It predicts each missing value from the other features and stores the imputed data in a new DataFrame called ITI_impt_df. This approach is more sophisticated than simply filling values with the mean.
ITI=IterativeImputer(max_iter=10)
#Fit & Transform
ITI_impt=ITI.fit_transform(df_clean[Features])
ITI_impt_df=pd.DataFrame(ITI_impt,columns=Features)
ITI_impt_df
Using KNN Imputer for Missing Value Imputation
This code uses a KNNImputer with n_neighbors=3, which fills each missing value with the average of the three closest neighbors. The imputed data is returned in a new DataFrame, KNN_impt_df, providing an instance-based way of filling the gaps in the dataset.
# KNN Imputer
KNN=KNNImputer(n_neighbors=3)
#Fit & Transform
KNN_impt=KNN.fit_transform(df_clean[Features])
KNN_impt_df=pd.DataFrame(KNN_impt,columns=Features)
KNN_impt_df
STEP 7:
Handle Missing Value Imputation through Iterative Imputer with LightGBM
This code defines an IterativeImputer whose estimator is an LGBMRegressor, and uses it to impute missing values in the given columns. The imputed dataset is stored in a new DataFrame called lgbm_impt_df, taking advantage of LightGBM's predictive power for more accurate imputations.
import lightgbm as lgb

# Define the LightGBM estimator with verbose=-1 to suppress warnings
lgbm_estimator = lgb.LGBMRegressor(verbose=-1)
# Create an IterativeImputer using LightGBM as the estimator
imp_lgbm = IterativeImputer(estimator=lgbm_estimator, max_iter=100, random_state=0)
# Fit and transform the data to impute missing values
lgbm_impt = imp_lgbm.fit_transform(df_clean[Features])
# Create a DataFrame from the imputed data
lgbm_impt_df = pd.DataFrame(lgbm_impt, columns=Features)
lgbm_impt_df
lgbm_impt_df
Inspecting and Renaming Columns in the Imputed Data
This section assigns the LightGBM-imputed DataFrame lgbm_impt_df to a new variable, df_new, and inspects its column names.
lgbm_impt_df
df_new=lgbm_impt_df
df_new.columns
Visualizing Distributions and Boxplots of Transformed Features
This code builds a 2x3 grid of plots to examine the distributions of transformed features in df_new: density plots of sqrt(WL), sqrt(Yoga), and cbrt(PAFS), a boxplot of WL, and histograms of sqrt(WL) and Yoga. Together they give a clear picture of how the data is spread out and where possible outliers sit.
# Create a figure with 2 rows and 3 columns
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
# Distribution of the square root of 'WL'
sns.distplot(np.sqrt(df_new["WL"]), ax=axes[0, 0], color="blue")
axes[0, 0].set_title('DistPlot - sqrt(WL)')
# Distribution of the square root of 'Yoga'
sns.distplot(np.sqrt(df_new["Yoga"]), ax=axes[0, 1], color="orange")
axes[0, 1].set_title('DistPlot - sqrt(Yoga)')
# Distribution of the cube root of 'PAFS'
sns.distplot(np.cbrt(df_new["PAFS"]), ax=axes[0, 2], color="cyan")
axes[0, 2].set_title('DistPlot - cbrt(PAFS)')
# Boxplot of 'WL'
sns.boxplot(df_new["WL"], ax=axes[1, 0], color="green")
axes[1, 0].set_title('BoxPlot - WL')
# Histogram of the square root of 'WL'
np.sqrt(df_new["WL"]).hist(ax=axes[1, 1], color="red")
axes[1, 1].set_title('Histogram - sqrt(WL)')
# Histogram of 'Yoga'
df_new["Yoga"].hist(ax=axes[1, 2], color="purple")
axes[1, 2].set_title('Histogram - Yoga')
# Adjust the layout to prevent overlap
plt.tight_layout()
# Show the plots
plt.show()
df_clean.shape,df_new.shape
df_clean.columns
Adding 'Points' and 'Team' Columns to the Imputed Data
This code copies the 'Points' and 'Team' columns from df_clean into the df_new DataFrame.
# Use .values so the rows align by position (df_new has a fresh 0-based index after imputation)
df_new["Points"]=df_clean['Points'].values
df_new["Team"]=df_clean['Team'].values
Calculating the Correlation Matrix for Numerical Columns
This piece of code selects the numerical columns from df and computes the correlation matrix with numerical_df.corr(). The resulting correlation_matrix shows how the numerical attributes relate to one another, helping to spot strong positive or negative relationships for further investigation.
# Assuming 'df' is your DataFrame
# Select only numerical columns for correlation calculation
numerical_df = df.select_dtypes(include=['number'])
# Calculate the correlation matrix
correlation_matrix = numerical_df.corr()
numerical_df.corr()
Using a Heatmap to Illustrate the Correlation Matrix
The code provided here uses sns.heatmap() to construct a correlation matrix as a heatmap and adds the values of correlation on the map (annot=True). This is utilized to show the existence of positive and negative correlations, making strong relationships between variables apparent.
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()
Creating a Pair Plot to Visualize Relationships Between Features
This code creates scatter plots of all feature pairs in df_new, with points colored by the Team column. This helps to visually explore relationships and patterns between variables across different teams.
sns.pairplot(df_new,kind='scatter',hue='Team')
plt.show()
Creating a Pair Plot Without Team Coloring
This code creates scatter plots of all feature pairs in df_new without the team coloring, to explore the overall relationships and patterns between features.
sns.pairplot(df_new,kind='scatter')
plt.show()
Calculating Custom Correlation Using Ranked Data
The function aionlinecourse_corr() calculates a custom correlation coefficient between two variables, x, and y, in the DataFrame df.
def aionlinecourse_corr(df, x, y):
    N = df.shape[0]
    df_rank = df
    df_rank['rank'] = df_rank[y].rank()
    #print(df_rank['rank'])
    df_rank['rank_x'] = df_rank[x].rank()
    df_rank = df_rank.sort_values(by='rank_x')
    # Correlation = 1 - 3 * (sum of absolute consecutive rank differences) / (N^2 - 1)
    aionlinecourse_corr = 1 - (3 * df_rank['rank'].diff().abs().sum()) / (pow(N, 2) - 1)
    return aionlinecourse_corr
aionlinecourse_corr(df_new,'WL','Points')
Calculating Custom Correlation Between Variables
This piece of code implements a function called aionlinecourse_corr(), which computes a rank-based correlation coefficient. It is then used to measure how strongly Points correlates with each of the remaining variables (WI, Laps, PAFS, Yoga), giving a sense of how strong those associations are.
def aionlinecourse_corr(df, x_col, y_col):
    # Rank the values of Y (the second column in comparison)
    df['rank_y'] = df[y_col].rank(method='first')
    # Rank the values of X (the first column in comparison)
    df['rank_x'] = df[x_col].rank(method='first')
    # Sort by the rank of X
    df = df.sort_values('rank_x')
    # Calculate absolute differences between consecutive ranks of Y
    abs_diff_sum = df['rank_y'].diff().abs().sum()
    # Calculate the aionlinecourse correlation coefficient
    N = len(df)
    aionlinecourse_corr_coef = 1 - (3 * abs_diff_sum) / (N**2 - 1)
    return aionlinecourse_corr_coef

# Assuming df_new is your DataFrame
# Calculate aionlinecourse correlation for the specified columns
wi_points_corr = aionlinecourse_corr(df_new, 'WI', 'Points')
laps_points_corr = aionlinecourse_corr(df_new, 'Laps', 'Points')
pafs_points_corr = aionlinecourse_corr(df_new, 'PAFS', 'Points')
yoga_points_corr = aionlinecourse_corr(df_new, 'Yoga', 'Points')
# Print the results
print(f"aionlinecourse correlation between WI and Points: {wi_points_corr}")
print(f"aionlinecourse correlation between Laps and Points: {laps_points_corr}")
print(f"aionlinecourse correlation between PAFS and Points: {pafs_points_corr}")
print(f"aionlinecourse correlation between Yoga and Points: {yoga_points_corr}")
STEP 8:
Creating a Sample DataFrame and Grouping by Team
This snippet draws a random sample of 50 rows from df_new, keeping only the 'Team' and 'Points' columns. The sample is then grouped by 'Team' to count how many times each team appears, summarizing how the teams are spread across the sample.
import random
nba_id=list(df_new.index.unique())
random.seed(13)
sample_match_id=random.sample(nba_id,50)
sample_df=df_new[df_new.index.isin(sample_match_id)].reset_index(drop=True)
sample_df=sample_df[['Team','Points']]
groups=sample_df.groupby('Team').count().reset_index()
groups
Creating Probability Plots per Each Distinct Team
This script generates probability plots for the distribution of Points for each distinct team contained in sample_df and arranges them in a grid.
# Get unique teams from the dataset
unique_teams = sample_df['Team'].dropna().unique()
# Calculate the number of rows and columns needed for the plots
n_teams = len(unique_teams)
n_cols = 3
n_rows = (n_teams + n_cols - 1) // n_cols
# Create a figure with calculated rows and 3 columns
fig, axes = plt.subplots(n_rows, n_cols, figsize=(14, 14))
# Flatten the axes array for easy iteration
axes = axes.flatten()
# Loop through the unique teams and plot
for idx, team in enumerate(unique_teams):
    stats.probplot(sample_df[sample_df['Team'] == team]['Points'], dist='norm', plot=axes[idx])
    axes[idx].set_title(f'Probability plot for {team}')
# Hide any unused subplots
for i in range(idx + 1, len(axes)):
    fig.delaxes(axes[i])
# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()
Calculating the Ratio of Maximum to Minimum Standard Deviation Across Teams
This piece of code computes the standard deviation of Points for each team in sample_df and then divides the largest team standard deviation by the smallest. The ratio summarizes how much the variability of Points differs between teams, i.e., how tightly or loosely each team's scores cluster.
ratio=sample_df.groupby('Team').std().max()/sample_df.groupby('Team').std().min()
ratio
Creating an ANOVA Table for an Analysis of Variance for Groups
The code creates an empty ANOVA table called anova_table and has specific sections for different sources of variation along with their corresponding Sum of Squares (SS), degrees of freedom (df), Mean Square (MS), F value, P value and F critical values for between-groups, within groups and total.
The Sum of Squares Between Groups (SSTR) is obtained by weighting the squared difference between each team's mean Points and the overall mean (x_bar) by that team's sample size, capturing the extent of variation among team means. The calculated SSTR is then written into the "SS" column of the "Between Groups" row of the ANOVA table. This quantity is central to deciding whether there are significant differences between teams.
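For reference, the standard one-way ANOVA quantities computed over the next few cells are:
$$\text{SSTR} = \sum_{j} n_j\,(\bar{x}_j - \bar{x})^2, \qquad \text{SSE} = \sum_{j} (n_j - 1)\,s_j^2, \qquad \text{SST} = \text{SSTR} + \text{SSE}$$
$$MS = \frac{SS}{df}, \qquad F = \frac{MS_{\text{between}}}{MS_{\text{within}}}$$
where $n_j$, $\bar{x}_j$, and $s_j$ are the sample size, mean, and standard deviation of team $j$, and $\bar{x}$ is the overall mean of Points.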
data=[['Between Groups','','','','','',''],['Within Groups','','','','','',''],['Total','','','','','','']]
anova_table=pd.DataFrame(data,columns=['Variation','SS','df','MS','F value','P value','F critical'])
anova_table.set_index('Variation',inplace=True)
x_bar=sample_df['Points'].mean()
SSTR=sample_df.groupby('Team').count()*(sample_df.groupby('Team').mean()-x_bar)**2
anova_table['SS']['Between Groups']=SSTR['Points'].sum()
anova_table
Computing the Within-Groups Sum of Squares (SSE)
SSE is computed by taking the variance within each team and multiplying it by that team's sample size minus one. This measure captures the variability of Points within each team, showing how much individual scores deviate from the team's average.
SSE=(sample_df.groupby('Team').count()-1)*sample_df.groupby('Team').std()**2
SSE
Updating ANOVA Table with Sum of Squares Within Groups (SSE)
The code adds the total SSE for Points to the "SS" column, which is listed as "Within Groups" in the anova_table. This update helps to finalize the accurate sum of squares and prepares the ANOVA table for subsequent operations.
anova_table['SS']['Within Groups']=SSE['Points'].sum()
anova_table
Calculating and Adding the Total Sum of Squares to ANOVA Table
This section of code calculates the Total Sum of Squares (SST) by adding the two components, SSE and SSTR, for Points, capturing the overall variation present in the dataset. The total is then inserted into the 'Total' row of the 'SS' column of anova_table, completing the sum of squares section of the ANOVA table.
total=SSE['Points'].sum()+SSTR['Points'].sum()
anova_table['SS']['Total']=total
anova_table
Calculating and Adding Degrees of Freedom to the ANOVA Table
This code calculates and adds the degrees of freedom for "Between Groups," "Within Groups," and "Total" to the anova_table.
anova_table['df']['Between Groups']=sample_df['Team'].nunique()-1
anova_table['df']['Within Groups']=sample_df.shape[0]-sample_df['Team'].nunique()
anova_table['df']['Total']=sample_df.shape[0]-1
anova_table
Computation of Mean Squares, F Value, and P Value for ANOVA Table
This code calculates the Mean Square (MS) for each source of variation by dividing its Sum of Squares (SS) by its degrees of freedom (df). The F value is the ratio of the 'Between Groups' MS to the 'Within Groups' MS. The P value for 'Between Groups' is then computed from the cumulative distribution function of the F-distribution. These values are added to anova_table, enabling an assessment of the statistical significance of the differences between groups.
anova_table['MS']=anova_table['SS']/anova_table['df']
anova_table['F value']['Between Groups']=anova_table['MS']['Between Groups']/anova_table['MS']['Within Groups']
anova_table['P value']['Between Groups']=1-stats.f.cdf(anova_table['F value']['Between Groups'],
anova_table['df']['Between Groups'],
anova_table['df']['Within Groups'])
anova_table
Calculating F Critical Value for Hypothesis Testing
This code calculates the F critical value for 'Between Groups', halving the significance level alpha for a two-tailed test and looking up the corresponding quantile of the F-distribution. The value is added to anova_table for the final hypothesis test.
alpha=0.05
hypothesis_type="two tailed"
if hypothesis_type=="two tailed":
    alpha=alpha/2
anova_table['F critical']['Between Groups']=stats.f.ppf(1-alpha,
anova_table['df']['Between Groups'],
anova_table['df']['Within Groups'])
anova_table
STEP 9:
Interpreting the P-Value and Drawing a Conclusion
This snippet interprets the ANOVA output by comparing the 'Between Groups' P value with alpha. If the P value is less than alpha, the null hypothesis is rejected, indicating a significant difference between the groups. Otherwise, it reports that we fail to reject the null hypothesis. It then prints the F value, the P value, and the conclusion.
print("Approach for P value ")
conclusion="Failed to reject null hypothesis"
if anova_table['P value']['Between Groups'] < alpha:
    conclusion="Null hypothesis is rejected"
print("F value for the table is ", anova_table['F value']['Between Groups'],"and p value is ",anova_table['P value']['Between Groups'] )
print(conclusion)
Interpreting F Critical Value for Hypothesis Testing
This code compares the F value to the F critical value. If the F value exceeds the F critical value, it concludes "Null hypothesis is rejected"; otherwise, it states "Failed to reject the null hypothesis".
print("Approach for F critical ")
conclusion="Failed to reject null hypothesis"
if anova_table['F value']['Between Groups']>anova_table['F critical']['Between Groups']:
    conclusion="Null hypothesis is rejected"
print("F value for the table is ", anova_table['F value']['Between Groups'],"and F critical value is ",anova_table['F critical']['Between Groups'] )
print(conclusion)
Applying One-Hot Encoding to Categorical Column
The following code applies one-hot encoding to the Team column in df_clean and builds a new DataFrame one_hot_df, in which each team is represented by a binary indicator column (one team is dropped as the baseline because drop_first=True).
#one hot encoding
one_hot_df=pd.get_dummies(df_clean,columns=['Team'],drop_first=True)
STEP 10:
Training Data Preparation and Train/Test Split
The code builds the feature matrix X from selected columns of one_hot_df, including the one-hot-encoded team columns, and sets y to the dependent variable (Points). The data is then split into training and test sets in an 80/20 ratio using train_test_split, and the dimensions of each split are printed for verification. This prepares the data for training and evaluating the model.
# Update X with the actual columns present in one_hot_df
X=one_hot_df[['Points', 'WL', 'Yoga', 'Laps', 'WI', 'PAFS',
'Team_ Portland Trail Blazers', 'Team_Golden State Warriors',
'Team_Houston Rockets', 'Team_Los Angeles Clippers',
'Team_Los Angeles Lakers', 'Team_Memphis Grizzlies',
'Team_Oklahoma City Thunder', 'Team_Orlando Magic', 'Team_Porcupines',
'Team_Washington Wizards']]
y=one_hot_df['Points']
#train test split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
print("X train dimension is ",X_train.shape)
print("y train dimension is ",y_train.shape)
print("X test dimension is ",X_test.shape)
print("y test dimension is ",y_test.shape)
Imputing Missing Values in Training Data Using LightGBM
This code uses IterativeImputer with an LGBMRegressor estimator to treat missing values in X_train. Missing values are predicted from the other features iteratively, for up to max_iter=15 rounds. The filled matrix is then converted back into a DataFrame, X_train_clean, with the original column names, giving a complete training set for the next modeling step.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import lightgbm as lgb
import pandas as pd
# Initialize an IterativeImputer with LightGBM as the estimator, with verbose set to 0 to suppress output
imputer = IterativeImputer(estimator=lgb.LGBMRegressor(), max_iter=15, verbose=0)
# Fit the imputer on X_train and transform the missing data
X_train_full = imputer.fit_transform(X_train)
# Convert the result to a DataFrame with original column names
X_train_clean = pd.DataFrame(X_train_full, columns=X_train.columns)
# Display the cleaned training data
X_train_clean.head()
Fitting an Ordinary Least Squares (OLS) Model on Cleaned Training Data
In this code, an OLS regression model is fitted using statsmodels to study the relationship between the target variable y_train and the features in X_train_clean. The indices of y_train and X_train_clean are reset first to avoid misalignment during modeling. The model statistics returned by result.summary(), including coefficients, p-values, and goodness-of-fit measures, show which features matter and how well the model fits.
import statsmodels.api as sm
# Resetting the index of both y_train and X_train_clean
# This ensures both DataFrames have a common index
y_train = y_train.reset_index(drop=True)
X_train_clean = X_train_clean.reset_index(drop=True)
# Fitting the OLS model after aligning the indices
result = sm.OLS(y_train, X_train_clean).fit()
print(result.summary())
Adding a Constant Term and Fitting an OLS Model
This code adds a constant (intercept) term to X_train_clean using sm.add_constant(), creating X_const. The OLS model is then re-estimated on X_const and y_train, and result.summary() again reports coefficients, p-values, and fit statistics for a detailed assessment of the model and its features.
X_const=sm.add_constant(X_train_clean)
result=sm.OLS(y_train,X_const).fit()
print(result.summary())
Imputing Missing Values in the Test Set
This code applies the previously trained imputer to fill in missing values in X_test, creating a new DataFrame test, with the imputed values.
test=pd.DataFrame(imputer.transform(X_test))
Preparing Test Data for Prediction
This piece of code assigns the training column names from X_train_clean to the test set so that predictions can be made without any mismatch in feature names. A constant term is then added to the test data, and the fitted result model produces predictions for the selected features, stored in res. This completes the evaluation of the model on the test set.
# Get column names from X_train_clean (or where the model was trained)
original_cols = X_train_clean.columns
# Assign these column names to the test DataFrame
test.columns = original_cols
# Now, proceed with the prediction
res = result.predict(sm.add_constant(test[['Points', 'WL', 'Yoga', 'Laps', 'WI', 'PAFS',
'Team_ Portland Trail Blazers', 'Team_Golden State Warriors',
'Team_Houston Rockets', 'Team_Los Angeles Clippers',
'Team_Los Angeles Lakers', 'Team_Memphis Grizzlies',
'Team_Oklahoma City Thunder', 'Team_Orlando Magic', 'Team_Porcupines',
'Team_Washington Wizards']]))
Evaluating Model Performance Using Regression Metrics
The code assesses the model's predictions with four metrics: MSE (mean squared error), RMSE (its square root, expressed in the units of the target), R-squared (the share of variance explained by the regression), and MAE (mean absolute error). Together these metrics give an overall picture of how well the model performs on the test data.
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Assuming 'y_test' and 'res' are your true and predicted values
mse = mean_squared_error(y_test, res)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, res)
mae = mean_absolute_error(y_test, res)
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R2): {r2}")
print(f"Mean Absolute Error (MAE): {mae}")
STEP 11:
Visualizing Polynomial Relationships with Quadratic and Cubic Data
This code generates two sets of data points, a quadratic (y = x^2) and a cubic (y_cubic = x^3 + 2x^2), each with added noise to mimic real-world data. It then plots both curves with a legend, axis labels, and a title so the polynomial relationships are easy to read, making it a good warm-up for polynomial regression.
x = np.linspace(-5, 5, 100)
y = x**2 + np.random.normal(0, 1, 100)
y_cubic = x**3 + 2*x**2 + np.random.normal(0, 1, 100)
# Plotting the data
plt.plot(x, y, label='Quadratic')
plt.plot(x, y_cubic, label='Cubic')
plt.legend(loc='best')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Polynomial Regression Example')
Creating a Plot to Compare Linear, Quadratic, and Cubic Polynomial Fits
The function visualizes polynomial fits by overlaying linear, quadratic, and cubic models on the scattered data points. For each degree it fits a polynomial with np.polyfit, draws the fitted curve over the data, and labels the plot so you can see how well each model follows the trend in the data.
def create_polynomial_plot(feature, label):
    # Convert the label to a 1-D array
    x_coordinates = feature
    y_coordinates = np.squeeze(label)
    linear_poly = np.poly1d(np.polyfit(x_coordinates, y_coordinates, 1))
    quadratic_poly = np.poly1d(np.polyfit(x_coordinates, y_coordinates, 2))
    cubic_poly = np.poly1d(np.polyfit(x_coordinates, y_coordinates, 3))
    values = np.linspace(x_coordinates.min(), x_coordinates.max(), len(x_coordinates))
    plt.scatter(x_coordinates, y_coordinates, color='blue')
    plt.plot(values, linear_poly(values), color='cyan', label='Linear Model')
    plt.plot(values, quadratic_poly(values), color='red', label='Quadratic Model')
    plt.plot(values, cubic_poly(values), color='yellow', label='Cubic Model')
    plt.xlabel("%s from data" % (feature.name))
    plt.ylabel("Points")
    plt.rcParams["figure.figsize"] = (12, 6)
    plt.legend()
    plt.title("Linear vs Quadratic")
    plt.show()
This code call uses the create_polynomial_plot() function to compare polynomial fits of the WI feature from X_train_clean against the target variable y_train.
create_polynomial_plot(X_train_clean.WI,y_train)
Using Polynomial Feature Transformation and Applying to the Training Set
This code generates second-degree polynomial features for the training matrix X_train_clean using the PolynomialFeatures class. The fit_transform call expands X_train_clean with squared terms and interaction terms, and the result is saved as X_poly. With these expanded features, a linear model can capture the non-linearities present in the data and make better predictions.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_train_clean)
poly.fit(X_poly, y_train)
Training a Linear Regression Model on Polynomial Features
This code initializes a linear regression model (lm) and fits it to the polynomial-transformed training data (X_poly) and the target variable (y_train).
lm=linear_model.LinearRegression()
lm.fit(X_poly,y_train)
Evaluating the Polynomial Regression Model with R² Score
This code generates predictions on the test data (test) by transforming it with the same polynomial feature expansion as used in training. The R² score is then calculated to assess how well the model's predictions match the actual values in y_test.
from sklearn import metrics
predictions=lm.predict(poly.transform(test))
print("R2 score for test is",metrics.r2_score(y_test,predictions))
Calculating Root Mean Squared Error (RMSE) for Model Evaluation
This code computes the Root Mean Squared Error (RMSE) between the actual values (y_test) and the model’s predictions.
print("RMSE of the model is",np.sqrt(mean_squared_error(y_test,predictions)))
Turning Predictions into Binarized Classes and Computing Classification Metrics
The following code applies a threshold of 0.5 to turn the continuous predictions into binary classes: predictions above 0.5 become 1 and the rest become 0. The resulting predictions_classes array is passed to classification_report along with y_test. The report lists precision, recall, and F1-score, which describe how the model performs when treated as a binary classifier.
import numpy as np
from sklearn.metrics import classification_report
threshold = 0.5
predictions_classes = np.where(predictions > threshold, 1, 0)
# Now use predictions_classes in the classification_report
print(classification_report(y_test, predictions_classes))
Conclusion
In this project, we used polynomial regression to model and predict non-linear relationships in a dataset. Adding polynomial features allowed the model to capture quadratic and other curved patterns that simple linear models miss. To handle missing data we relied on LightGBM-based imputation, which left us with a complete dataset for training. Metrics such as the Root Mean Squared Error (RMSE) and a classification report were then used to judge the accuracy and generalization ability of the model.
We presented polynomial regression as a way to improve predictive accuracy on non-linear data and discussed its applications in industries where accurate predictions matter, including finance, healthcare, and engineering. The flexibility of the approach for analyzing structured data makes it a natural extension of the basic principles of regression analysis. In this way, polynomial regression becomes an effective tool for data scientists who need accurate and interpretable models for complex data.
Challenges and Solutions
Problem: Missing Data Handling
- Solution: Implement advanced imputation techniques like Iterative Imputer with LightGBM to fill in missing values. This maintains data integrity and improves model performance.
Problem: Overfitting with High-Degree Polynomials
- Solution: When the degree is high, the model tends to fit noise rather than the underlying pattern and therefore overfits. Experiment with different polynomial degrees and use cross-validation to choose the degree that generalizes best.
Problem: Unbalanced Data for Classification
- Solution: If there are more instances of one class than the other, it can skew model performance. Apply resampling techniques or use evaluation metrics like precision, recall, and F1-score to assess model performance fairly.
Problem: Large Prediction Errors on New Data
- Solution: If the model is too specialized to the training data, it may not generalize well to unseen data. Track the Root Mean Squared Error (RMSE) on the test data and consider Ridge or Lasso regression to regularize the fit and reduce prediction errors (a short sketch follows this list).
Problem: Polynomial Features Transformation of Complex Data
- Solution: Creating polynomial features can be computationally expensive and can produce a huge feature set that slows training. Limit the polynomial expansion to the most relevant features, or use dimensionality reduction techniques such as Principal Component Analysis (PCA) to shrink the feature set and keep the model efficient.
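Here is the Ridge sketch referenced above; it is illustrative only, using synthetic data and placeholder parameter values rather than anything from the project:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Illustrative degree-3 polynomial model with L2 regularization (alpha is a tunable strength)
x = np.linspace(-3, 3, 60).reshape(-1, 1)
y = x.ravel() ** 3 - 2 * x.ravel() + np.random.normal(0, 3, 60)

model = make_pipeline(PolynomialFeatures(degree=3), StandardScaler(), Ridge(alpha=1.0))
model.fit(x, y)
print("Training R^2 with Ridge:", model.score(x, y))
Increasing alpha shrinks the polynomial coefficients more aggressively, trading a little bias for lower variance.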
FAQ
1. What is polynomial regression and why do we use it?
Polynomial regression is a regression technique in which the regression equation is an nth-degree polynomial in the input variables. It handles non-linear data, so it applies whenever a straight line cannot fit the data well, for example in complicated financial patterns or biological growth curves.
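In equation form, a degree-n polynomial regression on a single input x is
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n + \varepsilon$$
which is still linear in the coefficients, so it can be fitted with ordinary least squares.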
2. What do I do when I have missing data in a polynomial regression project?
Use imputation. In this project, an IterativeImputer with a LightGBM estimator predicts and fills the missing values, preserving data quality and improving model accuracy by exploiting patterns already present in the data.
3. How to determine the ideal degree for polynomial regression?
The right degree depends on the data and on model performance. Higher degrees can capture more complex patterns but are more likely to overfit. Test different polynomial degrees with cross-validation, as in the sketch below, and select the degree that balances accuracy and generalization.
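A minimal cross-validation sketch on synthetic data (the degrees tried and the scoring choice are illustrative, not the project's final settings):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Score each candidate degree with 5-fold cross-validation and compare mean R^2
x = np.linspace(-3, 3, 80).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + np.random.normal(0, 1, 80)

for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=5, scoring="r2")
    print(f"degree={degree}: mean CV R^2 = {scores.mean():.3f}")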
4. How do I evaluate the performance of a polynomial regression model?
Evaluation metrics for the polynomial regression model include R² score (explained variance), Root Mean Squared Error (RMSE) (average prediction error), and Mean Absolute Error (MAE). This helps to know how good the model is in fitting the data and also how well the model predicts new data points.
5. Can polynomial regression be used to solve classification problems?
Yes, in a limited way. Apply a threshold to the continuous predictions to turn them into binary classes, for example classifying predictions above 0.5 as 1 and the rest as 0. You can then evaluate the result with classification metrics such as precision, recall, and F1-score.