
Build Regression Models (Linear, Ridge, Lasso) in Python with NumPy

This project introduces Linear, Ridge, and Lasso Regression using Python and NumPy. We will also see how these models forecast outcomes and quantify the relationships between variables. Whatever your experience with machine learning, this project breaks the concepts down so they are easy to understand.

Project Overview

We’ll explore three key regression techniques: Linear Regression, Ridge Regression, and Lasso Regression. These models predict continuous values from input data. Ridge and Lasso are regularized variants of linear regression: they capture the same linear relationships between variables but are more robust to noise in the data. Using Python and the NumPy library, we’ll work through data preprocessing, model building, model validation, and optimization techniques. By the end of the course, you’ll have a solid grasp of how to apply these regression models to real data and strengthen your ML projects.
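For reference, these are the objective functions the three models minimize, written following scikit-learn's conventions, where w holds the coefficients, alpha sets the regularization strength, and n is the number of samples:

\text{Linear (OLS):}\quad \min_{w}\ \lVert y - Xw \rVert_2^2

\text{Ridge:}\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2

\text{Lasso:}\quad \min_{w}\ \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1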

Prerequisites

Learners should have a few skills in place before undertaking this project. Here’s what you should ideally know:

  • Basic knowledge of Python for data analysis and manipulation.
  • Knowledge of libraries such as Pandas, NumPy, and Matplotlib for data manipulation and visualization.
  • Understanding of data preprocessing steps such as handling missing values, normalization, and scaling.
  • Familiarity with exploratory data analysis (EDA) for finding patterns and trends in datasets.
  • Basic understanding of regression models and how predictive modeling works.
  • Experience with machine learning frameworks such as Scikit-Learn for building, training, and evaluating models.

Approach:

In this project, we predict laptop prices using multiple regression models. We first load the dataset and clean it, handling missing values and selecting features. OneHotEncoder encodes the categorical variables, while StandardScaler standardizes the numerical features. The data is then split into training and testing sets. Three models (Linear Regression, Lasso Regression, and Ridge Regression) are trained on the training set and evaluated on the test set with metrics such as MAE, MSE, R2, and RMSE. The performance of the models is compared, and a classification report is generated by converting the predicted prices into binary labels. The results are shown in a comparison table and in bar plots that compare the models.

Workflow and Methodology

Workflow:

  • Data Collection: Collect the dataset from a public dataset repository and load it into a Pandas DataFrame for further analysis.
  • Data Cleaning: Handle missing values, check that the correct data types are used, and make sure the data is ready for modeling.
  • Feature Engineering: Transform existing features, encoding the categorical variables with OneHotEncoder, to improve model results.
  • Data Scaling: Scale the numerical data with StandardScaler so the models perform at their best.
  • Train-Test Split: Split the dataset into training and testing sets to evaluate model performance.
  • Model Building: Train regression models (Linear, Lasso, Ridge) using the prepared data.
  • Model Evaluation: Evaluate the models using metrics like MAE, MSE, R2, and RMSE.
  • Model Comparison: Compare model performance by analyzing evaluation metrics for each model.

Methodology:

  1. Data Preprocessing: Categorical features are one-hot encoded and numerical features are scaled with StandardScaler to ensure uniform input for the models.
  2. Model Selection: On the preprocessed data, we choose and train Linear Regression, Lasso Regression, and Ridge Regression models.
  3. Model Evaluation: An evaluation function measures each model’s performance on the test set using MAE, MSE, R2, and RMSE.
  4. Classification Report: Convert the regression output into binary labels and generate a classification report for the resulting binary classification task.
  5. Model Comparison: Compare the models using a comparison table and visualizations (such as a bar plot) built from the evaluation metrics.

Data Collection and Preparation

Data Collection:
In this project, we collected the dataset from a public repository. If you want to work on a real-world problem, you can find similar datasets in publicly available repositories such as Kaggle and the UCI Machine Learning Repository, or in company-specific data. We provide the dataset with this project so that you can work on the same data.

Data Preparation:
The dataset is loaded into a Pandas DataFrame for easy preparation and analysis. We identify and handle missing values by removing rows in which every value is missing. Features are selected so that only relevant ones reach the regression models, skipping unnecessary or redundant ones. OneHotEncoder converts the categorical variables into a format the machine learning models can work with, and StandardScaler then standardizes the numerical features so that all features sit on a similar scale. Finally, train_test_split() divides the data into training and testing sets so the models can be evaluated on unseen data.
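For orientation before the step-by-step walkthrough, here is a minimal sketch of the same preparation flow expressed as a single scikit-learn Pipeline. The column lists are derived by dtype rather than hard-coded, and the short file name is a placeholder for the project path used below; the code later in the project performs these same operations individually.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("laptop_eda.csv")          # placeholder path
df.dropna(axis=0, how='all', inplace=True)  # drop rows where every value is missing

X = df.drop(columns=['Price'])
y = np.log(df['Price'])                     # the project models the log of the price

cat_cols = X.select_dtypes(include='object').columns.tolist()
num_cols = X.select_dtypes(include=np.number).columns.tolist()

# Encode categoricals and scale numericals in one transformer;
# handle_unknown='ignore' guards against unseen categories in the test split
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols),
])

pipe = Pipeline([('prep', preprocess), ('model', LinearRegression())])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=5)
pipe.fit(X_train, y_train)
print("Test R2:", pipe.score(X_test, y_test))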

Code Explanation

STEP 1:

Mounting Google Drive

First, mount Google Drive to access the dataset that is stored in the cloud.

from google.colab import drive
drive.mount('/content/drive')

Importing Library

This segment of code imports the libraries required for data handling, model building, and plotting. Data operations are carried out with NumPy and Pandas, while plotting is done with Seaborn and Matplotlib.

The code also imports the scikit-learn machine-learning models Linear Regression, Ridge, and Lasso, along with preprocessing components such as OneHotEncoder and StandardScaler and the metrics module for evaluating model performance.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LassoCV,RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.preprocessing import OneHotEncoder,StandardScaler

Loading Data and Checking Dimensions:

This code loads the CSV file and then prints the dataset’s shape to check the number of rows and columns. The %time magic command in the notebook reports how long the statement on its line takes to run.

%time Aionlinecourse = pd.read_csv("/content/drive/MyDrive/New 90 Projects/Project_2/laptop_eda.csv")
print(Aionlinecourse.shape)

Previewing Data

This block of code displays the first few rows of the dataset to have a quick overview of the structure of the dataset.

Aionlinecourse.head()

This block of code displays the last few rows of the dataset.

Aionlinecourse.tail()

Summary Statistics

The code produces a transposed table containing summary statistics, including mean, standard deviation, minimum, and maximum values, for every numerical column of the DataFrame.

Aionlinecourse.describe().T

This line of code checks if there are any null values present in each feature.

Aionlinecourse.isnull().sum()

Eliminating Missing Values

The code removes the rows where all the values are missing and modifies the DataFrame in place. It then prints the new shape to check the changes.

Aionlinecourse.dropna(axis=0, how='all', inplace=True)  # drop rows where every value is missing
print(Aionlinecourse.shape)

STEP 2:

Data Visualization

The code constructs a 2x2 grid of plots to visualize various attributes of the dataset. In detail, it comprises:

  • A histogram exhibiting the distribution of laptop prices with an overlaid kernel density estimation plot.
  • A scatter plot that depicts the relationship between the RAM of laptops and their price.
  • A box plot illustrating the price distribution of laptops based on the operating system.
  • A heatmap of a correlation matrix that showcases the relationship amongst the numeric features.
# Set up the figure with 2 rows and 2 columns
fig, axs = plt.subplots(2, 2, figsize=(16, 12))
# Histogram of a numerical column
sns.histplot(Aionlinecourse['Price'], bins=20, kde=True, ax=axs[0, 0])
axs[0, 0].set_title('Distribution of Laptop Prices')
axs[0, 0].set_xlabel('Price')
axs[0, 0].set_ylabel('Frequency')
# Scatter plot between two numerical columns
sns.scatterplot(x='Ram', y='Price', data=Aionlinecourse, ax=axs[0, 1])
axs[0, 1].set_title('Relationship between RAM and Price')
axs[0, 1].set_xlabel('RAM')
axs[0, 1].set_ylabel('Price')
# Box plot to compare a numerical column across different categories
sns.boxplot(x='OpSys', y='Price', data=Aionlinecourse,ax=axs[1, 0])
axs[1, 0].set_title('Laptop Prices by Operating System')
axs[1, 0].set_xlabel('Operating System')
axs[1, 0].set_ylabel('Price')
# Correlation matrix heatmap
numerical_features = Aionlinecourse.select_dtypes(include=np.number).columns
correlation_matrix = Aionlinecourse[numerical_features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f",ax=axs[1,1])
axs[1, 1].set_title('Correlation Matrix')
plt.tight_layout()
plt.show()

Detecting Features with High Correlations

The code identifies pairs of features whose correlation exceeds a specified threshold (0.7) and then prints those highly correlated pairs.

# 'correlation_matrix' was already computed in the visualization step above.
# Find features with correlation greater than a threshold (e.g., 0.7)
threshold = 0.7
highly_correlated_features = set()
for i in range(len(correlation_matrix.columns)):
  for j in range(i):
    if abs(correlation_matrix.iloc[i, j]) > threshold:
      colname_i = correlation_matrix.columns[i]
      colname_j = correlation_matrix.columns[j]
      highly_correlated_features.add((colname_i, colname_j))
print("Highly correlated features:")
for feature_pair in highly_correlated_features:
  print(feature_pair)

Pairwise Plot

The code creates a pairwise plot of all numerical features in the dataset, showing a scatter plot for each pair of variables and a histogram for each variable on the diagonal.

sns.pairplot(Aionlinecourse)
plt.show()

STEP 3:

Separating Numerical and Categorical Data

The provided code splits the dataset’s columns into two groups: numerical (int and float types) and categorical (object types), which simplifies the further analysis of each kind of data.

numeric = Aionlinecourse.select_dtypes(include=[np.int64,np.float64])
categorical = Aionlinecourse.select_dtypes(include=[np.object_])

This line of code lists all the categorical columns.

categorical.columns

Companies Count Graphically Represented

This code creates a bar chart of the frequency of each distinct value in the “Company” column, illustrating how many laptops of each brand are contained in the dataset.

plt.figure(figsize=(20,5))
company_counts = categorical.Company.value_counts()
plt.bar(company_counts.index,company_counts.values)

Counts of Unique Values of Company Column

This code displays the count of each unique value in the ‘Company’ column, showing how many laptops in the dataset belong to each company.

company_counts

Pie Chart of Company Distribution

The code creates a pie chart depicting the share of laptops belonging to each company, with the respective percentage displayed on each slice.

plt.figure(figsize=(30,5))
plt.pie(company_counts,labels=company_counts.index,
   autopct = '%0.1f%%')
plt.show()

STEP 4:

Data Preparation and Model Training for Linear Regression

The categorical features are first encoded with OneHotEncoder, the dataset is split into training and testing portions, and the features are then scaled with StandardScaler, which is fitted on the training set only. Finally, a Linear Regression model is fitted on the scaled training data to estimate the log of the price.

x = Aionlinecourse.iloc[:,:-1]
y = pd.DataFrame(Aionlinecourse['Price'])
y = np.log(y)
# sparse_output=False returns a dense array (the argument was renamed from 'sparse' in scikit-learn 1.2)
ct = ColumnTransformer(transformers=[('clm_tns', OneHotEncoder(sparse_output=False, drop='first'), list(range(5)))],  # the first five columns hold the categorical features
                  remainder='passthrough')
x = ct.fit_transform(x)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.15,random_state=5)
std = StandardScaler()
std.fit(x_train)
std_x_train = std.transform(x_train)
std_x_test = std.transform(x_test)
Model_1 = LinearRegression()
Model_1.fit(std_x_train,y_train)

Model Evaluation Function

This function assesses the performance of a given model on the provided test data by computing four metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), the R-squared (R2) score, and Root Mean Squared Error (RMSE).

def evaluation(x_test, y_test, model):
    y_pred = model.predict(x_test)
    mae = metrics.mean_absolute_error(y_test, y_pred)
    msqe = metrics.mean_squared_error(y_test, y_pred)
    r2_score = metrics.r2_score(y_test, y_pred)
    rmse = np.sqrt(msqe)
    return {'mae': mae, 'msqe': msqe, 'r2_score': r2_score, 'rmse': rmse}

Storing and Presenting Assessment Outcomes

In this code, the evaluation function is invoked for Model_1, the trained Linear Regression model, and the results are stored in a DataFrame. The evaluation metrics (MAE, MSE, R2, RMSE) of the Linear Regression model are then presented.

ev = pd.DataFrame(evaluation(std_x_test,y_test,Model_1),index=['Linear',]).T
ev

Classification Report for Linear Regression

The regression price predictions are converted into a binary classification using a threshold: the mean of the predicted values for the predictions, and the mean of the actual values for the test labels. The code then produces and displays a classification report for the task of classifying each price as above or below average, including precision, recall, F1-score, and support.

from sklearn.metrics import classification_report
# Assuming you have y_pred from your model:
y_pred = Model_1.predict(std_x_test)
# Convert regression predictions to binary classification (e.g., above/below average price)
y_pred_binary = (y_pred > y_pred.mean()).astype(int)
y_test_binary = (y_test > y_test.mean()).astype(int)
# Generate classification report
print(classification_report(y_test_binary, y_pred_binary))

Accuracy Calculation for Linear Regression

The code calculates the accuracy of the binary classification by comparing the predicted binary labels to the actual ones. The accuracy is converted to a percentage and printed, showing how well the Linear Regression model performs on the classification task.

accuracy = metrics.accuracy_score(y_test_binary, y_pred_binary)
linear_accuracy = "{:.2f}".format(accuracy * 100)  # stored separately so Model_1 still holds the trained model
print("Accuracy of the linear regression model:", linear_accuracy)

Data Preparation and Model Training for Lasso Regression

The categorical features are first encoded with OneHotEncoder, the dataset is split into training and testing portions, and the features are then scaled with StandardScaler, which is fitted on the training set only. Finally, a Lasso Regression model is fitted on the scaled training data to estimate the log of the price.

x = Aionlinecourse.iloc[:,:-1]
y = pd.DataFrame(Aionlinecourse['Price'])
y = np.log(y)
# sparse_output=False returns a dense array (the argument was renamed from 'sparse' in scikit-learn 1.2)
ct = ColumnTransformer(transformers=[('clm_tns', OneHotEncoder(sparse_output=False, drop='first'), list(range(5)))],  # the first five columns hold the categorical features
                  remainder='passthrough')
x = ct.fit_transform(x)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.15,random_state=5)
std = StandardScaler()
std.fit(x_train)
std_x_train = std.transform(x_train)
std_x_test = std.transform(x_test)
Model_2 = Lasso()
Model_2.fit(std_x_train,y_train)

Evaluating and Storing Lasso Model Results

The Lasso Regression model (Model_2) is evaluated using the evaluation function. The evaluation metrics (MAE, MSE, R2, RMSE) are stored in the 'lasso' column of the ev DataFrame, and the updated results are then displayed.

a = evaluation(std_x_test,y_test,Model_2)
ev['lasso'] = [*a.values()]
ev

Constructing the Classification Report for the Lasso Model

A classification report for the Lasso Regression model is created by comparing the binary test labels with the predicted binary labels, derived from the model’s predictions in the same way as for the Linear model. The classification_report function is called with the zero_division=0 argument to suppress division-by-zero warnings in metrics such as precision, recall, and F1-score.

# Get predictions from the Lasso model
y_pred = Model_2.predict(std_x_test)
# Convert regression predictions to binary labels (above/below the average log price)
y_pred_binary = (y_pred > y_pred.mean()).astype(int)
y_test_binary = (y_test > y_test.mean()).astype(int)
# Generate classification report without division-by-zero warnings
classification_report_output = classification_report(y_test_binary, y_pred_binary, zero_division=0)
print(classification_report_output)

Accuracy Calculation for the Lasso Regression

The code calculates the accuracy of the binary classification by comparing the predicted binary labels to the actual ones. The accuracy is converted to a percentage and printed, showing how well the Lasso Regression model performs on the classification task.

accuracy = metrics.accuracy_score(y_test_binary, y_pred_binary)
lasso_accuracy = "{:.2f}".format(accuracy * 100)  # stored separately so Model_2 still holds the trained model
print("Accuracy of the Lasso model:", lasso_accuracy)

Data Preparation and Model Training for Ridge Regression

The categorical features are first encoded with OneHotEncoder, the dataset is split into training and testing portions, and the features are then scaled with StandardScaler, which is fitted on the training set only. Finally, a Ridge Regression model is fitted on the scaled training data to estimate the log of the price.

x = Aionlinecourse.iloc[:,:-1]
y = pd.DataFrame(Aionlinecourse['Price'])
y = np.log(y)
# sparse_output=False returns a dense array (the argument was renamed from 'sparse' in scikit-learn 1.2)
ct = ColumnTransformer(transformers=[('clm_tns', OneHotEncoder(sparse_output=False, drop='first'), list(range(5)))],  # the first five columns hold the categorical features
                  remainder='passthrough')
x = ct.fit_transform(x)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.15,random_state=5)
std = StandardScaler()
std.fit(x_train)
std_x_train = std.transform(x_train)
std_x_test = std.transform(x_test)
Model_3 = Ridge()
Model_3.fit(std_x_train,y_train)

Evaluating and Storing Ridge Model Results

In this code, the evaluation function is invoked for Model_3, the trained Ridge Regression model, and the results are added to the ev DataFrame under the 'Ridge' column. The evaluation metrics (MAE, MSE, R2, RMSE) of the Ridge Regression model are then presented.

a = evaluation(std_x_test,y_test,Model_3)
ev['Ridge'] = [*a.values()]
ev

Constructing the Classification Report for the Ridge Model

A classification report for the Ridge Regression model is created by comparing the binary test labels with the predicted binary labels, derived from the model’s predictions in the same way as for the Linear model. The classification_report function is called with the zero_division=0 argument to suppress division-by-zero warnings in metrics such as precision, recall, and F1-score.

# Get predictions from the Ridge model
y_pred = Model_3.predict(std_x_test)
# Convert regression predictions to binary labels (above/below the average log price)
y_pred_binary = (y_pred > y_pred.mean()).astype(int)
y_test_binary = (y_test > y_test.mean()).astype(int)
# Generate classification report without division-by-zero warnings
classification_report_output = classification_report(y_test_binary, y_pred_binary, zero_division=0)
print(classification_report_output)

Accuracy Calculation for the Ridge Regression

The code calculates the accuracy of the binary classification by comparing the predicted binary labels to the actual ones. The accuracy is converted to a percentage and printed, showing how well the Ridge Regression model performs on the classification task.

accuracy = metrics.accuracy_score(y_test_binary, y_pred_binary)
ridge_accuracy = "{:.2f}".format(accuracy * 100)  # stored separately so Model_3 still holds the trained model
print("Accuracy of the Ridge model:", ridge_accuracy)

Model Comparison

The evaluation metrics and classification accuracies of the three models are gathered into a single comparison table, and the MAE values are visualized with a bar plot.

model_comparison = pd.DataFrame({
    'Model': ['Linear Regression', 'Lasso Regression', 'Ridge Regression'],
    'MAE': [ev['Linear']['mae'], ev['lasso']['mae'], ev['Ridge']['mae']],
    'MSE': [ev['Linear']['msqe'], ev['lasso']['msqe'], ev['Ridge']['msqe']],
    'R2 Score': [ev['Linear']['r2_score'], ev['lasso']['r2_score'], ev['Ridge']['r2_score']],
    'RMSE': [ev['Linear']['rmse'], ev['lasso']['rmse'], ev['Ridge']['rmse']],
    'Accuracy': [linear_accuracy, lasso_accuracy, ridge_accuracy]
})
# Display the comparison table
model_comparison
plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='MAE', data=model_comparison)
plt.title('Comparison of Model Performance (MAE)')
plt.show()

Conclusion

Finally, we have completed the project. Three regression models were applied and compared: Linear Regression, Lasso Regression, and Ridge Regression, with the goal of predicting laptop prices from several features. Following data preprocessing, such as handling missing values and scaling the continuous variables, the models were fitted, and performance was assessed using evaluation metrics such as Mean Absolute Error, Mean Squared Error, R-squared, and Root Mean Squared Error. In addition, the models were assessed through a classification report, which required converting the predictions into class labels. The results emphasized that Ridge and Lasso Regression are effective models because their regularization reduces overfitting. This project illustrates the importance of model selection and performance assessment in predictive analytics, especially for price prediction and regression modeling with real data sets.

Challenges New Coders Might Face

  • Challenge: Handling Missing Data
    Solution: Use imputation methods such as replacing missing values with the mean or median, or more advanced approaches such as K-nearest neighbors (KNN) imputation.

  • Challenge: Outliers in Numerical Data
    Solution: Identify outliers with statistical methods (for example, the IQR rule) and then transform or remove them. Boxplots help recognize outliers early during data cleaning.

  • Challenge: Dealing with Categorical Variables
    Solution: Apply Label Encoding or One-Hot Encoding to the categorical variables. Label encoding comes in handy for ordinal data, while one-hot encoding is most suitable for categorical features that have no inherent order.

  • Challenge: Choosing the Right Model
    Solution: Start with a linear regression baseline, then add regularized models such as Lasso and Ridge Regression. Compare the models with metrics such as MSE or the R² score on a validation set and choose the one that performs best.

  • Challenge: Hyperparameter Tuning for Optimization
    Solution: Use Grid Search or Random Search to systematically find the best hyperparameter settings. These techniques automate the tuning process and tend to improve model performance with minimal effort; a short sketch follows this list.
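As a sketch of this tuning step applied to this project's models: the script already imports RidgeCV and LassoCV, which cross-validate alpha automatically, but a plain GridSearchCV works too. The alpha grid below is an arbitrary example, and std_x_train / y_train are the scaled training arrays prepared earlier in the project.

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Search over the regularization strength with 5-fold cross-validation
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5,
                      scoring='neg_mean_squared_error')
search.fit(std_x_train, y_train)

print("Best alpha:", search.best_params_['alpha'])
print("Cross-validated MSE:", -search.best_score_)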

Frequently Asked Questions (FAQs):

Question 1: What do you mean by regression analysis and why is it used in forecasting laptop prices?

Answer: Regression analysis helps predict continuous values, such as the price of a laptop. In this project, we used Linear Regression, Lasso Regression, and Ridge Regression to forecast prices from various parameters.

Question 2: What is the basic difference between Lasso and Ridge regression and Linear regression?

Answer: Linear regression estimates the price without any regularization. Lasso and Ridge Regression, on the other hand, add regularization to avoid overfitting: Lasso can shrink some coefficients exactly to zero thanks to its L1 penalty, while Ridge discourages very large coefficients through its L2 penalty.
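To make the difference concrete, here is a small self-contained sketch on synthetic data (the alpha values are arbitrary): Lasso drives the coefficients of irrelevant features exactly to zero, while Ridge only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0] + [0.0] * 8)       # only two informative features
y = X @ true_w + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0))  # typically 2
print("Non-zero Ridge coefficients:", np.sum(ridge.coef_ != 0))  # all 10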

Question 3: What are the primary metrics used to assess regression models?

Answer: For this project, we assessed the performance of the laptop price prediction models using Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared (R2), and Root Mean Squared Error (RMSE).
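For reference, with y_i the true values, ŷ_i the predictions, and ȳ the mean of the true values over n test samples, these metrics are defined as:

\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2

\text{RMSE} = \sqrt{\text{MSE}}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}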

Question 4: What role does data scaling play in the regression model?

Answer: Data scaling puts all features on a similar scale and often improves the performance of Lasso and Ridge regression models, since their penalties are sensitive to the scale of the features.
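As a minimal sketch of what StandardScaler does (the RAM values here are hypothetical): each feature is transformed to z = (x - mean) / std, so that penalty terms like alpha * |w| treat all features comparably.

import numpy as np
from sklearn.preprocessing import StandardScaler

ram_gb = np.array([[4.0], [8.0], [16.0], [32.0]])   # hypothetical RAM values in GB
scaled = StandardScaler().fit_transform(ram_gb)
print(scaled.ravel())                               # zero mean, unit variance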

Question 5: Can I use this approach to predict other product prices?

Answer: Of course! This method is suitable for forecasting the prices of other products, including but not limited to smartphones, vehicles, and housing, provided relevant data and features exist.
