Random Forest Regression | Machine Learning

Random Forest Regression: The basic idea behind Random Forest is that it combines multiple decision trees to determine the final output. That is, it builds multiple decision trees and merges their predictions to get a more accurate and stable prediction.

How Does It Work?

Random Forest is a supervised learning algorithm. It uses the ensemble learning technique (ensemble learning means combining multiple algorithms, or the same algorithm multiple times, to build a more powerful model) to build several decision trees on random samples of the data. Their predictions are then averaged. Taking the average of the predictions made by several decision trees is usually better than the prediction of a single decision tree. Look at the following illustration of two trees (in practice, the number of trees is much larger!).
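We can check the claim above numerically. The following is a minimal sketch on synthetic data (a noisy sine curve, made up purely for illustration, not related to the salary dataset used later): a single, fully grown decision tree tends to fit the noise, while averaging many trees smooths it out.

```python
# Sketch: compare a single decision tree with a forest on held-out data.
# The data here is synthetic and purely illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X.ravel()) * 10 + rng.randn(300)   # noisy sine target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tree_mse = mean_squared_error(y_te, tree.predict(X_te))
forest_mse = mean_squared_error(y_te, forest.predict(X_te))
print(tree_mse, forest_mse)   # the forest's error is typically the lower one
```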


Difference between Decision Trees and Random Forests: Random Forest is a collection of Decision Trees, but there are some differences. If you input a training dataset with features and labels into a single decision tree, it formulates a set of rules, which it then uses to make predictions. A Random Forest, by contrast, builds many such trees from random samples and averages their results.

Another difference is that decision trees might suffer from overfitting. Random Forest prevents overfitting most of the time by creating random subsets of the features and building smaller trees using these subsets. Afterwards, it combines the subtrees. Note that this doesn't work every time and that it also makes the computation slower, depending on how many trees your random forest builds.
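In scikit-learn, the random feature subsets mentioned above are controlled by the `max_features` parameter. Here is a hedged sketch on synthetic multi-feature data (our salary dataset later has only one feature, so this is for illustration only): each split in each tree considers only 2 of the 5 features.

```python
# Sketch: limiting the features each split may consider (max_features)
# decorrelates the trees, which is how Random Forest curbs overfitting.
# The dataset below is synthetic and purely illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                    # 5 features
y = X[:, 0] * 10 + rng.randn(200)       # only feature 0 actually matters

# Each split considers a random subset of 2 of the 5 features
forest = RandomForestRegressor(n_estimators=50, max_features=2, random_state=0)
forest.fit(X, y)
print(forest.score(X, y))               # R^2 on the training data
```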

Random Forest is a flexible, easy-to-use machine learning algorithm that produces a great result most of the time, even without hyper-parameter tuning. It is also one of the most widely used algorithms because of its simplicity and the fact that it can be used for both classification and regression tasks. In this post, you are going to learn how the random forest algorithm works and several other important things about it.

The Steps Required to Perform Random Forest Regression:

Step 1: Pick K data points at random from the training set.

Step 2: Build a decision tree associated with these K data points.

Step 3: Choose the number Ntree of trees you want to build and repeat Steps 1 & 2.

Step 4: For a new data point, make each one of your Ntree trees predict the value of Y for the data point in question, and assign the new data point the average of all the predicted Y values.
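The four steps above can be sketched directly in a few lines. The salary values below are made up for illustration (they are not from Position_Salaries.csv):

```python
# A minimal, step-by-step sketch of Random Forest Regression (toy data)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.arange(1, 11).reshape(-1, 1)   # toy feature: levels 1..10
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

n_trees, k = 10, len(X)               # Step 3: choose Ntree
forest = []
for _ in range(n_trees):
    idx = rng.randint(0, len(X), k)   # Step 1: pick K points at random
    forest.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # Step 2

# Step 4: average the Ntree predictions for a new data point
new_point = [[6.5]]
y_hat = np.mean([tree.predict(new_point)[0] for tree in forest])
print(y_hat)
```

This is essentially what `RandomForestRegressor` does under the hood (with extra refinements such as random feature subsets at each split).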

Now let's do these steps in Python.

Random Forest Regression in Python:

In this tutorial, we will implement Random Forest Regression in Python. We will work on a dataset (Position_Salaries.csv) that contains the salaries of some employees according to their Position. Our task is to predict the salary of an employee at an unknown level. So we will make a regression model using the Random Forest technique for this task.

You can download the dataset from here.

First of all, we will import some essential libraries

# Importing the Essential Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Then we will import the dataset.

# Importing the Dataset 
dataset = pd.read_csv("Position_Salaries.csv")

Now our dataset has been imported into our program. Let's check how it looks:


Now, we need to determine the dependent and independent variables. Here, Level is the independent variable while Salary is the dependent (target) variable, since we want to find out the salary of an employee according to his or her Level. So our feature matrix X will contain the Level column, and the Salary values go into the dependent variable vector y.

# Creating Feature Matrix and Dependent Variable Vector
X = dataset.iloc[:, 1:2].values  # 1:2 keeps X two-dimensional, as scikit-learn expects
y = dataset.iloc[:, 2].values

Well, we have come to the main part of the regression. To implement Random Forest Regression, we need the RandomForestRegressor class from the Scikit-Learn library. Then we fit an object of that class to our dataset.

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X, y)

Note: Here, n_estimators is the parameter that sets the number of decision trees in the forest (it defaults to 10 in older versions of scikit-learn and to 100 since version 0.22; you can use more trees). random_state = 0 is used so that your code produces the same output as ours.
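To get a feel for n_estimators, the sketch below compares predictions for a few tree counts. The salary values are made up for illustration (they are not from Position_Salaries.csv), so the exact numbers will differ from our model's output:

```python
# Sketch: how the number of trees (n_estimators) affects the prediction.
# Toy data, illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

preds = {}
for n in (10, 100, 300):
    model = RandomForestRegressor(n_estimators=n, random_state=0).fit(X, y)
    preds[n] = model.predict([[6.5]])[0]
    print(n, preds[n])
```

More trees generally make the averaged prediction more stable, at the cost of longer training time; beyond a few hundred trees the improvement usually levels off.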

Our model is ready! Now, we will use it to predict y for a new value of X (a position level of 6.5).

# Predicting a New Value 
y_pred = regressor.predict([[6.5]])

Our model predicted a value of about $167k. Let's compare this value with the actual value. For this, we will visualize the result on the training set.

# Visualizing the Training Set
X_grid = np.arange(min(X), max(X), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (Random Forest Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()


After executing the code, we see a graph that is pretty similar to that of Decision Tree Regression. The main difference is that the steps are more numerous and finer than in Decision Tree Regression. This is because Random Forest averages the predictions of a number of decision trees for each data point.

From the graph, we see that the prediction for level 6.5 is pretty close to the actual value (around $160k).