K-fold Cross Validation | Machine Learning


In this tutorial, we are going to learn the K-fold cross-validation technique and implement it in Python.

While building machine learning models, we randomly split the dataset into training and test sets, with most of the data going into the training set. Even though the test set is small, there is still a chance that it contains important data that could have improved the model. There is also a problem of high variance: the performance estimate depends heavily on which points happen to land in the test set. To solve these problems, we use the idea of K-fold cross-validation.

Cross-validation is a resampling technique used to evaluate machine learning models: instead of relying on a single train/test split, the model is trained and evaluated several times on different portions of the data, giving a more reliable estimate of its performance.

In K-fold cross-validation, the training set is randomly split into K (usually between 5 and 10) subsets known as folds. In each round, K-1 folds are used to train the model and the remaining fold is used to test it. This technique reduces the high-variance problem, because every data point appears in a test fold exactly once instead of the evaluation depending on a single random split.

[Figure: illustration of K-fold cross-validation]


The steps required to perform K-fold cross-validation are given below (a short code sketch of these steps follows the list):
Step 1: Split the entire data randomly into k folds (usually between 5 and 10). A higher number of folds generally leads to a less biased estimate, at the cost of more computation.
Step 2: Fit the model on k-1 folds and test it on the remaining kth fold. Record the performance metric.
Step 3: Repeat step 2 until every fold has served as the test set.
Step 4: Take the average of all the recorded scores. This will serve as the final performance metric of your model.
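
These steps map directly onto scikit-learn's KFold splitter. The following is a minimal sketch, assuming X and y are already loaded as NumPy arrays; the cross_val_score helper used later in this tutorial wraps essentially this same loop.

# A minimal sketch of the four steps above, assuming X (features) and
# y (labels) are NumPy arrays; any scikit-learn estimator would work here.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

kf = KFold(n_splits = 10, shuffle = True, random_state = 0)   # Step 1: split into k folds
scores = []
for train_index, test_index in kf.split(X):
    model = SVC(kernel = 'rbf', random_state = 0)
    model.fit(X[train_index], y[train_index])                 # Step 2: fit on k-1 folds
    scores.append(model.score(X[test_index], y[test_index]))  # ...and test on the kth fold
print(np.mean(scores))                                        # Step 4: average the scores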

K-fold cross-validation in Python: Now, we will implement this technique to validate our machine learning model. For this task, we will use the "Social_Network_Ads.csv" dataset and apply K-fold cross-validation to evaluate our Kernel SVM classification model.


First of all, we need to import some essential libraries.

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Now, we will import the dataset and make the feature matrix X and the dependent variable vector y.

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
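
The column indices above assume the usual layout of this dataset, where columns 2 and 3 hold the numeric features (Age and EstimatedSalary) and column 4 holds the target (Purchased). A quick sanity check confirms what was loaded:

# Quick sanity check (the column names mentioned above are an assumption
# about the common version of Social_Network_Ads.csv)
print(dataset.head())
print(X.shape, y.shape)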

Now, we will split the dataset into training and test sets.

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)


We need to feature-scale our training and test sets for an improved result.

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
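
Note that the scaler is fitted on the training set only and merely applied to the test set; this prevents information from the test set leaking into training. As a quick illustrative check (not part of the original pipeline), the scaled training features should have approximately zero mean and unit variance:

# The training set is standardized exactly; the test set is transformed
# with the training-set statistics, so its mean/std will only be close to 0/1.
print(X_train.mean(axis = 0), X_train.std(axis = 0))
print(X_test.mean(axis = 0), X_test.std(axis = 0))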

Now, we will fit the Kernel SVM classifier to our training set and see how it performs on the test set.

# Fitting Kernel SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)


To calculate the accuracy of our Kernel SVM model, we will build the confusion matrix.

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Let's see how accurate our model is.

[Output: confusion matrix of the Kernel SVM model on the test set]

From the above matrix, we can see that the accuracy of our Kernel SVM model is 93%.
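
If you prefer to compute this figure rather than read it off the matrix, accuracy is the sum of the diagonal entries (correct predictions) divided by the total number of predictions; scikit-learn's accuracy_score gives the same number directly. A small sketch:

# Accuracy = correct predictions / all predictions
from sklearn.metrics import accuracy_score
print(cm.trace() / cm.sum())            # from the confusion matrix
print(accuracy_score(y_test, y_pred))   # equivalently, straight from the labels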

Now, let's see how K-fold cross-validation with k = 10 folds gives a more reliable estimate of our model's performance than the single train/test split above.

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print(accuracies.mean())
print(accuracies.std())


Let's see the accuracies for all the folds.


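Since cross_val_score returns a NumPy array with one accuracy per fold, printing the array is enough to see all ten scores at once:

# One accuracy value per fold, in the order the folds were evaluated
print(accuracies)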

The mean of the fold accuracies is 90% with a standard deviation of 6%. That means the model's accuracy across folds typically lies between 84% and 96%.