BigMart Sales Prediction ML Project in Python
Do you want to learn predictive modeling and turn retail data into meaningful business insights? The Big Mart Sales Prediction project gives you practical experience with data science methods on actual retail sales data. It is intended for learners who want to strengthen their machine learning skills while understanding retail business dynamics.
Project Overview
This Big Mart sales prediction project is a good example of how data science methods can be applied to real retail sales data. You will use a dataset from Kaggle containing features such as product type, item visibility, outlet size, and outlet location to build a solid sales prediction model.
The project begins with data cleaning and preprocessing, where you'll handle missing values and scale features for model training. You will then move on to feature engineering and exploratory data analysis (EDA) to uncover patterns and trends in sales and product performance.
When you progress to building the regression models, you'll cover core concepts such as scaling the data, selecting features, and optimizing the model. Techniques like linear regression, random forest regression, and hyperparameter tuning come together to produce the sales prediction model for Big Mart products.
This project belongs in your portfolio because it gives you practical experience in both predictive modeling and retail analytics. Whether you are aiming for a role as a data scientist, e-commerce analyst, or business analyst, it will help you build skill and confidence.
Prerequisites
You should build a few skills before undertaking this Big Mart Sales Prediction project. Here's what you should ideally know:
- Basic knowledge of Python for data analysis and manipulation
- Familiarity with libraries such as Pandas, NumPy, and Matplotlib for data manipulation and visualization
- Understanding of data preprocessing steps such as handling missing values, normalization, and scaling
- Familiarity with exploratory data analysis (EDA) for uncovering patterns and trends in datasets
- Basic understanding of regression models and how predictive modeling works
- Experience with machine learning libraries such as Scikit-learn for building, training, and evaluating models
Approach
First, we begin with data loading and cleaning to ensure high-quality data. Then, EDA reveals useful insights and key patterns in sales.
Next, feature engineering creates impactful variables. After preprocessing with scaling and encoding, you'll select and train regression models. Hyperparameter tuning then optimizes model accuracy, evaluated with metrics like MAE and RMSE. Finally, the model generates sales predictions and insights, improving the understanding of retail sales trends for better decision-making.
Workflow and Methodology
Here's a step-by-step workflow you'll follow to build a successful sales prediction model:
- Data Collection and Loading: Download the Big Mart sales dataset from Kaggle and load it into Pandas DataFrames.
- Data Cleaning: Detect and handle missing values, ensure correct data types, and manage outliers to improve data quality.
- Exploratory Data Analysis (EDA): Use EDA to understand the data distribution, identify prominent features, and analyze sales patterns and trends.
- Feature Engineering: Create new columns and transform existing ones (e.g., encoding categorical variables) to improve model results.
- Data Preprocessing: Scale the numerical data and convert categorical data into numeric form for better model training.
- Model Selection: Choose regression models, since predicting sales is a regression task.
- Model Training: Train the candidate models on the cleaned and prepared data.
- Model Evaluation: Compare the models using metrics such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
- Prediction and Insights: Use the final model to predict sales and generate insights that help improve Big Mart's sales strategies.
Data Collection and Preparation
Data Collection
The Big Mart Sales dataset is available on Kaggle. You can access it conveniently and securely from within Google Colab by configuring your Kaggle credentials without exposing sensitive information: the notebook prompts for the Kaggle username and API key and stores them as environment variables. This lets the Kaggle CLI command (!kaggle datasets download -d brijbhushannanda1979/bigmart-sales-data) authenticate and download the dataset straight into Colab.
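A minimal sketch of this credential setup, assuming you run it in a Colab notebook and already have a Kaggle API key:
import os
from getpass import getpass

# Prompt for credentials so they are never hard-coded in the notebook
os.environ['KAGGLE_USERNAME'] = getpass('Kaggle username: ')
os.environ['KAGGLE_KEY'] = getpass('Kaggle API key: ')

# Download and unzip the dataset into the Colab working directory
!kaggle datasets download -d brijbhushannanda1979/bigmart-sales-data
!unzip -o bigmart-sales-data.zip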
Data Preparation
Data preparation workflow
- Data Cleaning: Handle missing values with the median or mode, then convert columns to the correct data types.
- Outlier Management: Detect outliers using statistical methods such as the IQR to improve model performance (see the sketch after this list).
- Feature Engineering: Transform categorical variables with label encoding or one-hot encoding, and create additional features if they can improve model performance.
- Scaling and Normalization: Use StandardScaler to standardize the numeric columns.
- Data Splitting: Split the data into training and testing sets to prepare for model training.
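As a reference for the outlier step, here is a minimal sketch of IQR-based detection. It assumes train_df has already been loaded and uses Item_Visibility purely as an illustrative column:
def iqr_outliers(series):
    # Flag values outside 1.5 * IQR beyond the quartiles
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

outliers = iqr_outliers(train_df['Item_Visibility'])
print(f"{len(outliers)} potential outliers detected")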
Code Explanation
STEP 1:
Mounting Google Drive
First, mount Google Drive to access the dataset that is stored in the cloud.
from google.colab import drive
drive.mount('/content/drive')
Importing Libraries
This code block imports all the libraries required to create, train, and evaluate the models in this project. It also imports the visualization libraries Matplotlib and Seaborn, along with evaluation metrics such as mean squared error and the R² score.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import GradientBoostingRegressor
STEP 2:
Load Dataset
The load_data function takes two parameters: the path of the train dataset and the path of the test dataset. If both files are found and loaded successfully, it prints "Train and Test Data Loaded Successfully!".
If an error occurs, it prints "File not found" along with the error details.
def load_data(train_path, test_path):
    try:
        train_data = pd.read_csv(train_path)
        test_data = pd.read_csv(test_path)
        print("Train and Test Data Loaded Successfully!")
        return train_data, test_data
    except FileNotFoundError as e:
        print(f"File not found: {e}")
        return None, None
train_df, test_df = load_data('/content/Train.csv', '/content/Test.csv')
Dataset Overview
The explore_data function takes a dataset and prints a quick overview: the number of rows and columns, the count of missing values per column, the data types, and descriptive statistics such as count, mean, median (50th percentile), and other percentiles. This helps you understand the dataset's structure.
def explore_data(data, data_name="Dataset"):
    print(f"{data_name} Shape:", data.shape)
    print("\nMissing Values:\n", data.isnull().sum())
    print("\nData Types:\n", data.dtypes)
    print("\nDescriptive Statistics:\n", data.describe())
# Explore train and test data
explore_data(train_df, "Train Data")
explore_data(test_df, "Test Data")
This displays the first five rows of the train and test datasets for a quick look at the data.
train_df.head(), test_df.head()
STEP 3:
Handling Missing Values
This displays the columns of the training dataset that contain missing values, along with the percentage of values missing in each.
missing_values= train_df.isnull().sum().sort_values(ascending=False)
missing_percentage = (missing_values / len(train_df)) * 100
missing_data = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
missing_data = missing_data[missing_data['Missing Values'] > 0]
missing_data
This displays the columns of the test dataset that contain missing values, along with the percentage of values missing in each.
missing_values= test_df.isnull().sum().sort_values(ascending=False)
missing_percentage = (missing_values / len(test_df)) * 100
# Display columns with missing values and their percentages
missing_data = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
missing_data = missing_data[missing_data['Missing Values'] > 0]
missing_data
Here, the Outlet_Size column is categorical, which is why we fill its missing values with the mode in both train_df and test_df.
train_df['Outlet_Size']=train_df['Outlet_Size'].fillna(
train_df['Outlet_Size'].mode().values[0])
# test_df outlet size filling missing values
test_df['Outlet_Size']=test_df['Outlet_Size'].fillna(
test_df['Outlet_Size'].mode().values[0])
For the numerical column Item_Weight, we check whether there are any outliers in train_df. This helps us decide which method to use for imputing its missing values.
plt.figure(figsize=(10,5))
sns.boxplot(data=train_df['Item_Weight'],orient="v", color = 'c')
plt.title("Item_Weight Boxplot")
The boxplot shows no outliers in the Item_Weight column, so to preserve the overall distribution, the missing values are imputed with the mean.
train_df['Item_Weight'] = train_df['Item_Weight'].fillna(train_df['Item_Weight'].mean())
test_df['Item_Weight'] = test_df['Item_Weight'].fillna(test_df['Item_Weight'].mean())
missing_train = train_df['Item_Weight'].isnull().sum()
missing_test = test_df['Item_Weight'].isnull().sum()
print(f'Missing values in train_df Item_Weight: {missing_train}')
print(f'Missing values in test_df Item_Weight: {missing_test}')
STEP 4:
This line displays all the column names in train_df and test_df, so you can quickly check and compare the available features in each dataset.
train_df.columns, test_df.columns
This code separates the columns by data type for both the training data (train_df) and the test data (test_df). For each dataset it builds a list of numeric columns (num) and a list of categorical columns (cat), then prints the value counts of every categorical column except the first (the item identifier). This gives insight into how values are distributed across each category, which helps when preprocessing and preparing the data for modeling.
#list of all the numeric columns
num = train_df.select_dtypes('number').columns.to_list()
#list of all the categoric columns
cat = train_df.select_dtypes('object').columns.to_list()
[train_df[category].value_counts() for category in cat[1:]]
#list of all the numeric columns
num = test_df.select_dtypes('number').columns.to_list()
#list of all the categoric columns
cat = test_df.select_dtypes('object').columns.to_list()
[test_df[category].value_counts() for category in cat[1:]]
This code replaces 'LF' and 'low fat' with 'Low Fat', and 'reg' with 'Regular', in both datasets, leaving two consistent labels: 'Low Fat' and 'Regular'. It then prints the value counts of Item_Fat_Content in both DataFrames to confirm the replacement.
#train
train_df['Item_Fat_Content']=train_df['Item_Fat_Content'].replace(['LF', 'low fat', 'reg'],
['Low Fat','Low Fat','Regular'])
#test
test_df['Item_Fat_Content']=test_df['Item_Fat_Content'].replace(['LF', 'low fat', 'reg'],
['Low Fat','Low Fat','Regular'])
#check result
train_df.Item_Fat_Content.value_counts()
test_df.Item_Fat_Content.value_counts()
Feature Engineering
This code creates a new feature, Outlet_Age, by subtracting Outlet_Establishment_Year from the current year (2024). This feature can enhance model performance by providing insights into the impact of outlet age on sales.
# Create new feature: Outlet Age
train_df['Outlet_Age'] = 2024 - train_df['Outlet_Establishment_Year']
test_df['Outlet_Age'] = 2024 - test_df['Outlet_Establishment_Year']
This code generates boxplots for each numerical column in train_df to get an overview of their distributions and outliers. The plots are arranged in a grid for easy comparison, with spacing adjusted for readability.
import matplotlib.pyplot as plt
import seaborn as sns
# Define a list of numerical columns
numerical_cols = train_df.select_dtypes(include='number').columns.tolist()
num_cols = 2
num_rows = (len(numerical_cols) + num_cols - 1) // num_cols
# Create boxplots for each numerical column
plt.figure(figsize=(12, 6 * num_rows))
for i, col in enumerate(numerical_cols):
    plt.subplot(num_rows, num_cols, i + 1)
    sns.boxplot(y=train_df[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()
This code sets up a plot to visualize the distribution of Item_Outlet_Sales across different Outlet_Type categories in train_df.
# Set up the plot area
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
sns.boxplot(x='Outlet_Type', y='Item_Outlet_Sales', data=train_df, palette='viridis')
plt.title('Sales Distribution by Outlet Type')
STEP 5:
Visualization
The code calculates total sales for each outlet by grouping train_df by Outlet_Identifier and summing Item_Outlet_Sales. It creates a bar plot, "Total Sales by Outlet," for comparison.
total_sales_by_outlet = train_df.groupby('Outlet_Identifier')['Item_Outlet_Sales'].sum().reset_index()
plt.figure(figsize=(10, 6))
sns.barplot(x='Outlet_Identifier', y='Item_Outlet_Sales', data=total_sales_by_outlet, palette='viridis')
plt.title('Total Sales by Outlet')
plt.xlabel('Outlet Identifier')
plt.ylabel('Total Sales')
plt.show()
The code calculates total sales for each Outlet_Location_Type by grouping train_df and summing Item_Outlet_Sales. It presents the results in a bar plot, using a viridis color palette and rotated labels for easy comparison.
total_sales_by_location = train_df.groupby('Outlet_Location_Type')['Item_Outlet_Sales'].sum().reset_index()
plt.figure(figsize=(10, 6))
sns.barplot(x='Outlet_Location_Type', y='Item_Outlet_Sales', data=total_sales_by_location, palette='viridis')
plt.title('Total Sales by Outlet Location Type')
plt.xlabel('Outlet Location Type')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.show()
The code calculates total sales for each Item_Type by grouping train_df and summing Item_Outlet_Sales, then creates a pie chart with percentage labels. This plot helps in understanding how sales are distributed across item types.
total_sales_by_item_type = train_df.groupby('Item_Type')['Item_Outlet_Sales'].sum()
plt.figure(figsize=(10, 8))
plt.pie(total_sales_by_item_type, labels=total_sales_by_item_type.index, autopct='%1.1f%%', startangle=140)
plt.title('Proportion of Total Sales by Item Type')
plt.axis('equal')
plt.show()
The code creates a 2x2 grid of four subplots, combining histograms with density curves and count plots, to display the distributions of key features in train_df. It uses plt.tight_layout() for clear spacing and easy reading.
# Set up the plot area
plt.figure(figsize=(15, 10))
# Histogram for Item_Weight
plt.subplot(2, 2, 1)
sns.histplot(train_df['Item_Weight'], bins=30, kde=True, color='skyblue')
plt.title('Item Weight Distribution')
# Histogram for Item_Outlet_Sales
plt.subplot(2, 2, 2)
sns.histplot(train_df['Item_Outlet_Sales'], bins=30, kde=True, color='salmon')
plt.title('Item Outlet Sales Distribution')
# Countplot for Item_Fat_Content
plt.subplot(2, 2, 3)
sns.countplot(x='Item_Fat_Content', data=train_df, palette='pastel')
plt.title('Item Fat Content Distribution')
# Countplot for Outlet_Size
plt.subplot(2, 2, 4)
sns.countplot(x='Outlet_Size', data=train_df, palette='pastel')
plt.title('Outlet Size Distribution')
plt.tight_layout()
plt.show()
The code uses pair plots to analyze the correlation between variables Item_Visibility and Item_Outlet_Sales, revealing their relationship across outlet types.
pairplot = sns.pairplot(train_df, hue='Outlet_Type', vars=['Item_Visibility', 'Item_Outlet_Sales'])
plt.suptitle('Pairplot of Item Visibility and Outlet Type vs Item Outlet Sales', y=1.02) # Adjust y for vertical positioning
plt.subplots_adjust(top=0.9)
plt.show()
This code creates a correlation heatmap for the numerical columns in train_df by computing the correlation matrix and plotting it as an annotated heatmap.
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
corr_matrix = train_df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap for Numerical Columns")
plt.show()
STEP 6:
Data preprocessing
The code creates a LabelEncoder for each categorical column in train_df and applies it to convert the column's categories into numeric codes, since the machine learning models cannot use categorical data directly.
label_encoders = {}
categorical_columns = train_df.select_dtypes(include='object').columns
for column in categorical_columns:
    label_encoders[column] = LabelEncoder()
    train_df[column] = label_encoders[column].fit_transform(train_df[column])
The same is done for test_df: a LabelEncoder is created for each categorical column and applied to convert its categories into numeric codes.
label_encoders = {}
categorical_columns = test_df.select_dtypes(include='object').columns
for column in categorical_columns:
    label_encoders[column] = LabelEncoder()
    test_df[column] = label_encoders[column].fit_transform(test_df[column])
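Note that fitting separate encoders on train_df and test_df can assign different codes to the same category. A minimal sketch of an alternative that keeps the encodings consistent (the helper name encode_consistently is only illustrative, and it would replace the two loops above, running before the columns are encoded):
def encode_consistently(train, test, columns):
    encoders = {}
    for col in columns:
        le = LabelEncoder()
        train[col] = le.fit_transform(train[col])
        # Reuse the training encoder on the test set; unseen categories become -1
        mapping = {cls: idx for idx, cls in enumerate(le.classes_)}
        test[col] = test[col].map(mapping).fillna(-1).astype(int)
        encoders[col] = le
    return encoders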
The objective of this code is to prepare the dataset for modeling by standardizing the numeric features. First, it separates the target variable Item_Outlet_Sales from the training set and copies the test set into X_test. Next, it identifies the numeric columns in X_train and applies a StandardScaler to them.
The scaler is fit on X_train and, for consistency, the same fitted transformation is applied to X_test. This guarantees that all numeric features across both datasets are on the same standardized scale.
from sklearn.preprocessing import StandardScaler
X_train = train_df.drop('Item_Outlet_Sales', axis=1)
y_train = train_df['Item_Outlet_Sales']
X_test = test_df.copy()
numerical_features = X_train.select_dtypes(include=['number']).columns
scaler = StandardScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])
This code splits the standardized training data into X_train, X_val, y_train, and y_val, allocating 80% for training (X_train, y_train) and 20% for validation (X_val, y_val).
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
STEP 7:
Model Building
This code initializes a dictionary named models containing five different regression models: Linear Regression, Decision Tree Regressor, Random Forest Regressor, Gradient Boosting Regressor, and Support Vector Regressor. Each model is represented by its name as the key and an instance of the corresponding regression algorithm as the value.
models = {
"Linear Regression": LinearRegression(),
"Decision Tree Regressor": DecisionTreeRegressor(),
"Random Forest Regressor": RandomForestRegressor(),
"Gradient Boosting Regressor": GradientBoostingRegressor(),
'Support Vector Regressor': SVR()
}
This code trains and evaluates every regression model in the models dictionary. For each model, it fits the model to the training data (X_train, y_train) and predicts values for the validation data (X_val).
The Mean Squared Error (MSE) and R² score are then calculated for those predictions to gauge each model's accuracy. The scores are stored in the results dictionary and printed, giving a brief performance report that makes the models easy to compare.
results = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_val_pred = model.predict(X_val)
    # Calculate MSE and R2 score on the validation set
    mse = mean_squared_error(y_val, y_val_pred)
    r2 = r2_score(y_val, y_val_pred)
    # Store the results
    results[model_name] = {'MSE': mse, 'R2 Score': r2}
    print(f"{model_name} - MSE: {mse:.2f}, R2 Score: {r2:.2f}")
This code creates a DataFrame called comparison_df from the results dictionary to compare model performance on the validation set. The DataFrame is transposed so that each row holds a model's MSE and R² score, and the resulting table is printed for an at-a-glance comparison.
comparison_df = pd.DataFrame(results).T
print("\nComparison of Model Performance on Validation Set:")
print(comparison_df)
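The workflow also mentions MAE and RMSE; here is a minimal sketch of how those metrics could be added to the comparison, assuming the fitted models and validation split from the steps above:
from sklearn.metrics import mean_absolute_error
import numpy as np

for model_name, model in models.items():
    y_val_pred = model.predict(X_val)  # models were already fitted above
    mae = mean_absolute_error(y_val, y_val_pred)
    rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
    print(f"{model_name} - MAE: {mae:.2f}, RMSE: {rmse:.2f}")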
STEP 8:
Prediction
The Random Forest Regressor is selected as the best model and trained on the training split (X_train, y_train). Predictions are then made on the preprocessed test features (X_test) using the trained model. The predictions are stored in a DataFrame named output under the column 'Prediction', which is written to a file named 'test_predictions.csv'. Finally, a message is printed confirming that the test predictions were saved.
best_model = RandomForestRegressor()
best_model.fit(X_train, y_train)
# Predict on the preprocessed (encoded and scaled) test features
test_predictions = best_model.predict(X_test)
# Save predictions for test data
output = pd.DataFrame({'Prediction': test_predictions})
output.to_csv('test_predictions.csv', index=False)
print("\nTest predictions have been saved to 'test_predictions.csv'")
The saved test predictions are read back into a DataFrame with Pandas, and the first 10 rows are displayed.
result= pd.read_csv('/content/test_predictions.csv')
result.head(10)
For comparison, the Gradient Boosting Regressor is also trained on the training split (X_train, y_train) and used to predict on the preprocessed test features (X_test). These predictions are stored in a DataFrame named output under the column 'Prediction', which is written to 'test_predictions_GradientBoostingRegressor.csv'. Finally, a message is printed confirming that the predictions were saved.
best_model = GradientBoostingRegressor()
best_model.fit(X_train, y_train)
# Predict on the preprocessed (encoded and scaled) test features
test_predictions = best_model.predict(X_test)
# Save predictions for test data
output = pd.DataFrame({'Prediction': test_predictions})
output.to_csv('test_predictions_GradientBoostingRegressor.csv', index=False)
print("\nTest predictions have been saved to 'test_predictions_GradientBoostingRegressor.csv'")
The saved test predictions are read back into a DataFrame with Pandas, and the first 10 rows are displayed.
result= pd.read_csv('/content/test_predictions_GradientBoostingRegressor.csv')
result.head(10)
Project Conclusion
“The Big Mart Sales Prediction” project shows the power of machine learning in retail business analytics. By using various regression models we successfully predicted sales based on key features like product characteristics, outlet type, and location. Through data preprocessing and feature engineering, we enhanced the model’s accuracy, providing actionable insights into the sales drivers across Big Mart’s network.
This project strengthens data science capabilities and provides hands-on practice in predictive analytics for the retail sector. Drawing on exploratory data analysis and feature engineering, we built and optimized models that perform well on unseen data. In the end, this Big Mart Sales Prediction model helps businesses anticipate and understand sales and make better decisions based on the available information.
This project is perfect for anyone wishing to acquire skills in retail data analysis, sales forecasting, and predictive analytics while positioning themselves advantageously in data science and machine learning.
Challenges New Coders Might Face
Challenge: Handling Missing Data
Solution: Use imputation methods such as replacing missing values with the mean or median, or more advanced approaches such as K-nearest neighbors (KNN) imputation.
Challenge: Outliers in Numerical Data
Solution: Identify outliers with statistical methods (for example, the IQR), then transform or remove them. Boxplots help in recognizing outliers early during data cleaning.
Challenge: Dealing with Categorical Variables
Solution: Apply Label Encoding or One-Hot Encoding to the categorical variables. Label encoding is handy for ordinal data, while one-hot encoding suits categorical features without a natural order.
Challenge: Choosing the Right Model
Solution: Start with a linear regression baseline, then try models such as random forest and gradient boosting. Compare them with MSE or R² score on the validation set and choose the one that performs best.
Challenge: Hyperparameter Tuning for Optimization
Solution: Use Grid Search or Random Search to systematically find the optimal settings. These techniques automate the tuning process and tend to improve model performance with minimal effort (see the sketch below).
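A minimal sketch of what Grid Search tuning could look like for the Random Forest model used above; the parameter grid is only illustrative:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring='neg_mean_squared_error',  # lower MSE is better
    cv=3,
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best CV MSE:", -grid_search.best_score_)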
FAQ
What is the objective of the Big Mart Sales Prediction project?
The aim is to develop a model that predicts sales for Big Mart's retail outlets from product and outlet characteristics. It helps companies understand their sales patterns so they can make better decisions and track sales trends.
Which data preprocessing techniques are important for Big Mart Sales Prediction?
The preprocessing steps include handling missing values, managing outliers, scaling, feature engineering, and encoding the categorical variables so the data fits the models better.
Which machine learning algorithms are best for sales prediction?
In this project the main algorithms are the Random Forest Regressor, Gradient Boosting Regressor, and Linear Regression. These models are effective at capturing subtle patterns in real retail sales data.
How does feature engineering influence the results of sales prediction models?
Feature engineering helps the model capture relationships between features, for example by creating a new feature such as Outlet_Age and encoding the categorical columns.
Why is scaling necessary in sales prediction techniques?
Scaling normalizes the numerical features so they are on a comparable range, which improves the performance of many machine learning models.
Explore Practical Machine Learning Projects to Boost Your Skills
Ready to engage in practical learning? Immerse yourself in machine learning projects such as Big Mart Sales Prediction and build the data science skills industries are looking for. Check out our website to learn more!