Random Forest Classification | Machine Learning


Random Forest is an ensemble learning technique. It builds a number of decision trees, each on a randomly selected sample of the data. It then collects a prediction from each tree and returns the class that receives the majority of the votes.



How does the Algorithm work?

As said earlier, this algorithm is based on decision trees. For a given dataset, it draws a random sample of data points and builds a decision tree on it, then repeats this for as many trees as the program specifies. Finally, it combines the predictions of all the trees and returns the class predicted by the largest number of them.


The steps of the Random Forest algorithm are as follows:


STEP 1: Pick at random K data points from the Training set. 

STEP 2: Build the Decision Tree associated with these K data points. 

STEP 3: Choose the number Ntree of trees you want to build and repeat STEPS 1 & 2 until you have Ntree trees.

STEP 4: For a new data point, make each one of your Ntree trees predict the category to which the data point belongs, and assign the new data point to the category that wins the majority vote.


This simple ensemble technique often gives noticeably more accurate predictions than a single tree. Any one decision tree may make wrong predictions, but when many trees are trained on different random samples their errors tend to differ, so aggregating their votes cancels out many of the individual mistakes.
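
The four steps above can be sketched by hand before turning to the library implementation. This is a minimal illustration under stated assumptions, not how scikit-learn builds its forests internally: the data is synthetic (make_classification), K is set to the size of the training set with sampling done with replacement, and predict_majority is a hypothetical helper written only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, used only as a stand-in for a real training set.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

rng = np.random.default_rng(0)
n_trees = 10   # Ntree in the steps above (STEP 3)
k = len(X)     # K data points drawn for each tree (an assumption)

trees = []
for i in range(n_trees):
    # STEP 1: pick K data points at random, with replacement.
    idx = rng.integers(0, len(X), size=k)
    # STEP 2: build a decision tree on those points.
    # max_features="sqrt" also randomises the features tried at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))


def predict_majority(trees, X_new):
    """STEP 4: let every tree vote and return the most common class per row."""
    votes = np.stack([t.predict(X_new) for t in trees]).astype(int)  # (n_trees, n_rows)
    return np.array([np.bincount(col).argmax() for col in votes.T])


y_hat = predict_majority(trees, X[:5])
print(y_hat)
```

Each tree sees a slightly different sample, so the trees disagree on some points, and the vote in predict_majority settles those disagreements.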


Now, we will implement this algorithm in Python.


Random Forest Classification in Python:

For this task, we will use the dataset named Social_Network_Ads.csv. It records the age, estimated salary, and purchase decision for a specific product of a number of customers reached through social network ads. Our task is to predict the purchase decision of a future customer from these features.

You can download the whole dataset from here.


First of all, we will import the essential libraries to our program.

# Importing the libraries 
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd

Then we will import the dataset into our program and split its columns into a feature matrix and a dependent variable vector. Age and EstimatedSalary (columns 2 and 3) are the independent attributes, so they go into the feature matrix X, and the Purchased column (column 4) goes into the dependent variable vector y.


# Importing the dataset 
dataset = pd.read_csv('Social_Network_Ads.csv') 
X = dataset.iloc[:, [2, 3]].values 
y = dataset.iloc[:, 4].values

Now, we will split the dataset into training and test sets.

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
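
To make the two arguments of the split concrete, here is a small sketch on made-up arrays. The 400-row size and the names X_demo and y_demo are assumptions for illustration only and are not taken from the dataset. With test_size = 0.25, a quarter of the rows go to the test set, and a fixed random_state makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 400 rows, 2 features, alternating 0/1 labels.
X_demo = np.arange(800).reshape(400, 2)
y_demo = np.array([0, 1] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

print(X_tr.shape, X_te.shape)  # (300, 2) (100, 2)

# The same random_state reproduces exactly the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)
print(np.array_equal(X_te, X_te2))  # True
```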

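Next, we scale the features. Tree-based models such as Random Forest do not need scaling to predict well, because each split depends only on the ordering of a feature's values. We scale here mainly so that the grid used for the plots below, which has a step of 0.01, stays a manageable size.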

# Feature Scaling 
from sklearn.preprocessing import StandardScaler 
sc = StandardScaler() 
X_train = sc.fit_transform(X_train) 
X_test = sc.transform(X_test)

It's time to fit the Random Forest classifier to our training set.

# Fitting Random Forest Classification to the Training set 
from sklearn.ensemble import RandomForestClassifier 
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0) 
classifier.fit(X_train, y_train)

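Note: Here n_estimators defines the number of decision trees we want in our Random Forest, criterion = 'entropy' makes each tree choose its splits by information gain, and random_state fixes the random seed so the results are reproducible.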
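
To see what n_estimators produces, a fitted forest can be inspected through its estimators_ attribute. The sketch below uses synthetic stand-in data (X_demo, y_demo and forest are names assumed for this example, not part of the tutorial's program). It also shows a detail of scikit-learn's implementation: rather than counting hard votes, RandomForestClassifier averages the class probabilities of its trees and predicts the class with the highest mean.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the tutorial itself uses Social_Network_Ads.csv.
X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
forest.fit(X_demo, y_demo)

# One fitted decision tree per requested estimator.
n_fitted = len(forest.estimators_)
print(n_fitted)  # 10

# Average the per-tree class probabilities by hand ...
mean_proba = np.mean([tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0)
# ... and compare with what the forest itself reports.
same = np.allclose(mean_proba, forest.predict_proba(X_demo))
print(same)  # True
```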

We have built our model. Now, let's see how it predicts on the test set.

# Predicting the Test set results 
y_pred = classifier.predict(X_test)

We can build the confusion matrix to see how many test points our model classified correctly and incorrectly.

# Making the Confusion Matrix 
from sklearn.metrics import confusion_matrix 
cm = confusion_matrix(y_test, y_pred)
print(cm)
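
As a hedged illustration of how to read such a matrix, the sketch below builds one from made-up labels. The arrays y_true_demo and y_pred_demo are invented for this example and are not results from the model above. Rows correspond to true classes and columns to predicted classes, so correct predictions sit on the diagonal.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels purely for illustration; they are not results from the tutorial.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)
# [[5 1]
#  [1 3]]

# For two classes the layout is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm_demo.ravel()

# Accuracy is the share of predictions that fall on the diagonal.
accuracy = np.trace(cm_demo) / cm_demo.sum()
print(accuracy)  # 0.8
```

The same number is returned by accuracy_score(y_true_demo, y_pred_demo), which is convenient when only the accuracy is needed.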

                                                      


We have come to the most exciting and fun part. Let's visualize the predictions of our model for the training and test sets.

# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
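# Plot the points of each class; color= avoids matplotlib's warning about a single RGBA tuple passed to c
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)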
plt.title('Random Forest Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()


                                                                        

# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
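# Plot the points of each class; color= avoids matplotlib's warning about a single RGBA tuple passed to c
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)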
plt.title('Random Forest Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()


                                                                    

If you compare the results with a single Decision Tree classifier, you will usually find that Random Forest provides more accurate predictions on the test set, although this is not guaranteed for every dataset.
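
One way to run such a comparison is sketched below. It is a minimal example on synthetic data (make_classification with some label noise), not on Social_Network_Ads.csv, and the names X_demo, y_demo, models and scores are assumptions made for this illustration. It uses 5-fold cross-validation so that the comparison does not depend on a single train/test split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% of the labels flipped at random.
X_demo, y_demo = make_classification(n_samples=400, n_features=2, n_informative=2,
                                     n_redundant=0, flip_y=0.1, random_state=0)

models = {
    'Decision Tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=10, criterion='entropy',
                                            random_state=0),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds.
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(f'{name}: mean accuracy = {scores[name]:.3f}')
```

The exact numbers depend on the data and the random seed, so treat the output as an indication rather than a proof.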