Multiple Linear Regression | Machine Learning


In this tutorial, we are going to understand the Multiple Linear Regression Algorithm and implement the algorithm with Python.

Multiple Linear Regression: Multiple linear regression is closely related to simple linear regression; the difference lies in the number of independent variables. Whereas simple linear regression predicts the value of a dependent variable from a single independent variable, multiple linear regression predicts it from two or more independent variables. The concept can be understood from the following formula:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn

In the equation, y is the single dependent variable, whose value depends on more than one independent variable (x1, x2, ..., xn). b0 is the intercept, and b1, ..., bn are the coefficients of the independent variables.
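
To make the formula concrete, here is a minimal sketch that evaluates it with NumPy for one observation. The coefficient and input values are made up purely for illustration; in practice the coefficients are estimated from data.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
b0 = 2.0                          # intercept
b = np.array([0.5, -1.0, 3.0])    # b1, b2, b3

# One observation with three independent variables x1, x2, x3
x = np.array([4.0, 1.0, 2.0])

# y = b0 + b1*x1 + b2*x2 + b3*x3
y = b0 + np.dot(b, x)
print(y)
```

Each coefficient tells you how much y changes when the matching x increases by one unit while the others stay fixed.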

For example, you can predict the performance of students in an exam based on their revision time, class attendance, previous results, test anxiety, and gender. Here the dependent variable (exam performance) is calculated from more than one independent variable, so this is the kind of task where you can use a Multiple Linear Regression model.

Now, let's do it together. We have a dataset (50_Startups.csv) that contains the profits earned by 50 startups along with several of their expenditure values. Let's have a glimpse of some of the values in that dataset:

                                                 


Note: this is not the whole dataset. You can download the dataset from here.

From this dataset, we are required to build a model that predicts the Profit earned by a startup from its various expenditures, such as R & D Spend, Administration Spend, and Marketing Spend. Clearly, this is a multiple linear regression problem, as there is more than one independent variable.

Let's take Profit as the dependent variable y and put the other attributes into the equation as the independent variables:

Profit = b0 + b1*(R & D Spend) + b2*(Administration) + b3*(Marketing Spend)

Hopefully this equation makes the regression process a bit clearer.

Now let's build the model, starting with the data preprocessing step. Here we store Profit in the dependent variable vector y and the other variables in the feature matrix X.


# Multiple Linear Regression
# Importing the essential libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, [0,1,2,3]].values
y = dataset.iloc[:, 4].values


The dataset contains one categorical variable. So we need to encode or make dummy variables for that.

#Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 3; the dummy columns come first, followed by the remaining columns
ct = ColumnTransformer([('encoder', OneHotEncoder(), [3])], remainder = 'passthrough', sparse_threshold = 0)
X = ct.fit_transform(X).astype(float)
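
To see what one-hot encoding produces, here is a small self-contained sketch. The category labels "A", "B" and "C" are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category values, used only to show the encoding
states = np.array([["A"], ["B"], ["A"], ["C"]])

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default, so convert it to a dense array
dummies = encoder.fit_transform(states).toarray()

print(encoder.categories_)   # the categories found, in column order
print(dummies)               # one column per category, a single 1 in each row
```

Each row contains exactly one 1, in the column of the category that row belongs to.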


Dummy Variable Trap: The code above creates one dummy variable for each category of the categorical variable, and our linear equation would use all of them. This causes a problem. The dummy variables always add up to 1 in every row, so the value of any one of them can be predicted exactly from the others. This is a case of multicollinearity, a phenomenon where an independent variable can be predicted from one or more of the other independent variables. When multicollinearity exists, the model cannot separate the effects of the individual variables, so the estimated coefficients become unreliable. This problem is known as the Dummy Variable Trap.

To solve this problem, always include all dummy variables except one from each dummy variable set.

#Avoiding the Dummy Variable Trap 
X = X[:, 1:]
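
The redundancy behind the Dummy Variable Trap can be checked numerically. The sketch below uses a hypothetical set of dummy columns (three categories, five rows) and compares the rank of the design matrix with and without the first dummy column. The column of ones stands for the intercept b0.

```python
import numpy as np

# Hypothetical dummy columns for a categorical variable with three categories
dummies = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [1, 0, 0],
                    [0, 1, 0]], dtype=float)
ones = np.ones((len(dummies), 1))   # the intercept column (b0)

# With all dummies present, they add up to the intercept column,
# so the matrix has fewer independent columns than it has columns
full = np.hstack([ones, dummies])
print(np.linalg.matrix_rank(full), "independent columns out of", full.shape[1])

# Dropping the first dummy removes the redundancy
reduced = np.hstack([ones, dummies[:, 1:]])
print(np.linalg.matrix_rank(reduced), "independent columns out of", reduced.shape[1])
```

When the rank equals the number of columns, every coefficient can be determined uniquely.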

Now split the dataset into a training set and a test set.

#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 0)
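
As a quick sanity check of what an 80/20 split gives for 50 rows, here is a sketch on placeholder arrays. X_demo and y_demo are invented stand-ins, not the startup data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 50 rows, standing in for the feature matrix and target
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)   # rows are shuffled, then split 80/20
```

Fixing random_state makes the shuffle reproducible, so the same rows land in the test set on every run.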

It's time to fit Multiple Linear Regression to the training set.

# Fitting Multiple Linear Regression to the Training set 
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
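
After fitting, the estimated b0 and b1...bn are available on the model as intercept_ and coef_. The sketch below shows this on synthetic, noise-free data generated from known coefficients, so the estimates can be compared with the values used to build the data. The data and coefficient values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data built from known coefficients (not the startup dataset)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(40, 3))
y_demo = 5.0 + X_demo @ np.array([2.0, -1.0, 0.5])   # noise-free target

model = LinearRegression().fit(X_demo, y_demo)

print(model.intercept_)   # estimate of b0
print(model.coef_)        # estimates of b1, b2, b3
```

On real data the target contains noise, so the estimates are approximations and not exact recoveries.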

Let's evaluate how well our model predicts the outcome for the test data.

#Predicting the Test set result
y_pred = regressor.predict(X_test)

Let's compare the test set values with the predicted values.

[Figure: test set values alongside the predicted values]
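
One way to make such a comparison in code is to put the actual and predicted values side by side in a DataFrame and compute a summary score. The sketch below does this on synthetic stand-in data, since the dataset file is not bundled here. The use of r2_score is a suggestion, not something the tutorial itself prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some noise (not the startup dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 3))
y_demo = 20.0 + X_demo @ np.array([0.8, 0.1, 0.05]) + rng.normal(0, 5, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)

# Side-by-side table of actual values, predictions, and their difference
comparison = pd.DataFrame({"Actual": y_te, "Predicted": pred, "Error": y_te - pred})
print(comparison.round(2))

# R^2 on the test set: 1.0 would mean a perfect fit
print(r2_score(y_te, pred))
```

With the tutorial's own variables, the same table can be built from y_test and y_pred.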

Here you can see that our model has made some close predictions and some poor ones. You can often improve the quality of the predictions by choosing which independent variables to keep, using model-building techniques such as Backward Elimination and Forward Selection, which we will discuss in other tutorials.