K-Nearest Neighbor | Machine Learning


In this tutorial, I am going to explain the K-Nearest Neighbor (KNN) algorithm and how to implement it in Python.

K-Nearest Neighbor Intuition: K-nearest neighbor is a non-parametric, lazy learning algorithm used for both classification and regression. KNN stores all available cases and classifies new cases based on a similarity measure. The algorithm assumes that similar things exist in close proximity; in other words, similar things are near each other. When a new case arrives, it scans through all stored cases and looks up the k closest ones. Those cases (or data points) are what we call the k nearest neighbors.

How Does This Algorithm Work?
Let's take an example where we have some data points of two different classes. Our task is to decide whether a new data point falls into the red category or the green category.

[Figure: data points of two classes, with a new point to be classified]



This is where the KNN algorithm comes into action. Let's walk through the whole algorithm step by step:

1. Choose the number of neighbors, k (k = 5 is a common default).
2. Compute the distance between the new data point and every point in the training data.
3. Select the k training points closest to the new point.
4. Count how many of these k neighbors fall into each category.
5. Assign the new point to the category with the most neighbors (majority vote).



Euclidean distance: The Euclidean distance is the ordinary straight-line distance between two points, as defined in geometry. For two points p = (p1, p2) and q = (q1, q2), it is computed as

d(p, q) = √((q1 − p1)² + (q2 − p2)²)

More generally, for points with n coordinates, the differences along every coordinate are squared, summed, and square-rooted.

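To make the idea concrete, here is a minimal from-scratch sketch of KNN classification (not part of the original tutorial): it uses the Euclidean distance above and a majority vote among the k nearest neighbors. The helper name knn_predict and the toy points are illustrative assumptions.

import numpy as np
from collections import Counter

def knn_predict(X_points, y_labels, x_new, k=5):
    # Euclidean distance from x_new to every stored point
    distances = np.sqrt(((X_points - x_new) ** 2).sum(axis=1))
    # Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of the k nearest neighbors
    return Counter(y_labels[nearest]).most_common(1)[0][0]

# Toy data: class 0 ("red") clustered near (1, 1), class 1 ("green") near (6, 6)
X_demo = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([2, 2]), k=3))  # -> 0 (red)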





K-NN in Python:

Now we will implement the KNN algorithm in Python, using the dataset Social_Network_Ads.csv.

You can download the dataset from here.

First of all, we will import all the essential libraries.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Now, let's import our dataset.

dataset = pd.read_csv('Social_Network_Ads.csv')

Let's have a look at our dataset.

[Figure: first rows of Social_Network_Ads.csv, showing the Age, EstimatedSalary and Purchased columns]

Here you can see that the Age and EstimatedSalary columns are the independent variables and the Purchased column is the dependent variable. So we will put Age and EstimatedSalary into the matrix of independent variables X and Purchased into the dependent variable vector y, as shown below.
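The original text does not show this extraction step, but the later code relies on X and y. Assuming the standard column order of Social_Network_Ads.csv (User ID, Gender, Age, EstimatedSalary, Purchased), a sketch of it looks like this:

X = dataset.iloc[:, [2, 3]].values  # Age and EstimatedSalary
y = dataset.iloc[:, 4].values       # Purchased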

Now, we will split our dataset into train and test sets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

Note: Here the random_state parameter is set to zero so that your results and our results remain the same.
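As a quick check (an addition to the tutorial, assuming the standard 400-row version of the dataset), you can verify the 75/25 split:

print(X_train.shape, X_test.shape)  # (300, 2) (100, 2)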

We need to scale our training and test sets so that both features are on a comparable scale; otherwise EstimatedSalary, with its much larger range, would dominate the distance computation.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training set, then transform it
X_test = sc.transform(X_test)        # transform the test set with the same parameters
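To see the effect (an addition to the tutorial), each feature of the scaled training set now has roughly zero mean and unit variance:

print(X_train.mean(axis=0).round(2))  # approximately [0. 0.]
print(X_train.std(axis=0).round(2))   # approximately [1. 1.]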

It's time to build the KNN classifier itself. Note that metric = 'minkowski' with p = 2 is exactly the Euclidean distance we discussed above.

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
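Once fitted, the classifier can predict a single new observation; remember to scale it with the same scaler first. The age and salary values below are hypothetical:

new_point = sc.transform([[30, 87000]])  # hypothetical 30-year-old earning 87,000
print(classifier.predict(new_point))     # 0 or 1: will this user purchase?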

We have come to the final part of our program. Let's predict the outcomes for the test set:

y_pred = classifier.predict(X_test)

Now, we use the confusion matrix to count the correct and incorrect predictions.

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class
print(cm)

Executing the code prints the confusion matrix:

[[64  4]
 [ 3 29]]

We get 64 + 29 = 93 correct predictions and 4 + 3 = 7 incorrect predictions. With 100 test observations, that is an accuracy of 93%, quite impressive!
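scikit-learn can also compute the accuracy directly, which is a convenient cross-check (this snippet is an addition to the tutorial):

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))  # 0.93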

Now it's time to see whether the decision boundary is linear or non-linear: a linear classifier would separate the classes with a straight line, while a non-linear one produces a curved boundary. Let's visualize the outcome.

First, we will visualize our training set.

# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
# Build a dense grid covering the (scaled) feature space
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
# Colour every grid point by the class the classifier predicts for it
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the actual training points, coloured by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K-NN (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

[Figure: K-NN decision regions on the training set]

From the above graph, we can see that KNN is a non-linear classifier: the boundary between the red and green regions is irregular rather than a straight line.

Now, let's visualize the result for the test set.

# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K-NN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

[Figure: K-NN decision regions on the test set]

After executing this code, we can see that the decision regions carry over well to the test set: almost every test observation lies in the region of its true class, with only the handful of misclassified points we already counted in the confusion matrix.
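A natural follow-up experiment (not in the original tutorial) is to vary k and watch how the accuracy changes; a small k tends to overfit the training data, while a very large k oversmooths the boundary:

from sklearn.metrics import accuracy_score
for k in (1, 5, 15, 50):
    clf = KNeighborsClassifier(n_neighbors = k, metric = 'minkowski', p = 2)
    clf.fit(X_train, y_train)
    print(k, accuracy_score(y_test, clf.predict(X_test)))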