K-means Clustering | Machine Learning


Clustering: Clustering is an unsupervised learning algorithm. A cluster refers to groups of aggregated data points because of certain similarities among them. Clustering algorithms group the data points without referring to known or labelled outcomes.

There are commonly two types of clustering algorithms, namely K-means Clustering and Hierarchical Clustering. In this tutorial, we are going to understand and implement the simplest one of them- the K-means clustering.

K-means Clustering: This algorithm categorizes data points into a predefined number of groups K, where each data point belongs to the group or cluster with the nearest mean. Data points are clustered based on similarities in their features. The algorithm works iteratively to assign each data point to one of the K groups in such a way that the distance(i.e. Euclidian or Manhattan) between the centroid(i.e. the centre of the cluster) of that group and the data point be small. The algorithm produces exactly K different clusters of the greatest possible distinction.

                                             23_1_K_means_clustering

How The Algorithm Works:

K-means Clustering takes an iterative approach to perform the clustering task. The working steps of this algorithm are as follows-

Step 1: Choose the number K of clusters.

Step 2: Select at random K points, the centroids(not necessarily from our dataset).

Step 3: Assign each data point to the closest centroid based on euclidian or manhattan distance. That forms K clusters.

Step 4: Compute and place the new centroid of each cluster.

Step 5. Reassign each data point to the new closest centroid. If any reassignment took place, go to step 4.

It stops creating or optimizing clusters if either the centroids have stabilized meaning no new reassignment of data points takes place or the algorithm reaches the defined number of iterations.

Choosing The Optimal Number of Clusters:

The value of k is very crucial for optimal outcomes from the algorithm. There are several techniques to choose the optimal value for k including cross-validation, the silhouette method, G-means algorithm and the elbow method. Here we will implement the elbow method to find the optimal value for k.

As the K-means algorithm works by taking the distance between the centroid and data points, we can intuitively understand that the higher number of clusters will reduce the distances among the points. For that, we plot the number of clusters k and the Within Cluster Sum of Squares(WCSS). The plot would look like an elbow, as with an increasing number of clusters after a certain point, the WCSS starts to stabilize and tends to go parallel with the horizontal axis. So we will take that point after which the plot tends to become similar.

                 23_3_K_means_clustering

The Within Cluster Sum of Squares(WCSS) is used to calculate the variance in the data points. The goal of the algorithm is to minimize this value.

23_2_k_means_clustering

K-means clustering in python:

First of all, we set up the working directory. For this task, e will use the Mall_Customers.csv dataset. Lets have a glimpse of that dataset

In the dataset, we can see that it contains information about some customers in a supermall categorized by various features. We don't know about any certain outcome or labelled output. So, we will implement K-means clustering to find out some important class about the data points.

You can download the whole dataset form here.

First of all, we will import essential libraries.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Now, we will import the dataset to our program. Then we will take those features that will effective for our clustering. Here we are intereted in Annual Income and Spending Score. So we will take them to our program.

dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values


We need to find the number of cluseter or k. For this we will plot a elbow method graph to find the optimal value for k.

# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Note: Here, the parameter init is set to k-means++ to avoid the random initialization trap.

Let's see the graph

                                               

From the above graph, we can see that the optimal value for k is 5(the point where the clustering going parallel with the increasing number of clusters). 

# Fitting K-Means to the dataset
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)


Well, we will come to the coolest part of our algorithm. Now we will visualize the outcomes to see what it has done.

                                                                 

We can see that the data points are grouped in 5 clusters as we defined in the algorithm. This algorithm is one of the simplest clustering algorithms and has various business uses.