K-Fold Cross Validation | Machine Learning


K-Fold Cross Validation: Cross-validation, sometimes called rotation estimation or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set.

In this tutorial, we are going to talk about how to fix the variance problem. We keep the same dataset and split it into a training set and a test set. Then we split the training set into ten folds: we train our model on nine folds and test it on the last remaining fold, and we repeat this over all ten combinations of training and test folds. This tells us in which of the four bias-variance categories we are: if we get good accuracy and small variance, we are in the lower left; if we get good accuracy and high variance, we are in the lower right; and if we get poor accuracy and low variance, we are in the upper left. So k-fold cross-validation is very useful. Here we are going to use the kernel SVM model we built in Part 3 (Classification), which, as you may remember, we used to predict whether a customer is going to click on the ad. So this model is already built and we are ready to go. After the confusion matrix, we add a new section called "Applying k-Fold Cross Validation". Here we use the model_selection library and import cross_val_score. Next, we create a vector called accuracies by calling cross_val_score with a few parameters: estimator equal to the classifier, X equal to X_train, y equal to y_train, and cv equal to 10. Now we execute this code and obtain the accuracies vector, which will tell us whether there is high or low variance. Finally, we use the std function, which gives us the standard deviation of this accuracies vector; we get about a 6 percent standard deviation.
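
Before running cross_val_score in the script below, here is a minimal sketch of what the ten-fold procedure described above does under the hood. It is illustrative only: it assumes the scaled X_train and y_train arrays created later in this script, and it uses a plain KFold split, whereas cross_val_score actually handles the splitting itself (stratified by class for classifiers).


# Illustrative sketch of 10-fold cross-validation (not part of the course script)

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

kf = KFold(n_splits = 10, shuffle = True, random_state = 0)
fold_scores = []
for train_index, val_index in kf.split(X_train):
    # Train the kernel SVM on nine folds...
    model = SVC(kernel = 'rbf', random_state = 0)
    model.fit(X_train[train_index], y_train[train_index])
    # ...and test it on the last remaining fold
    fold_scores.append(model.score(X_train[val_index], y_train[val_index]))
np.mean(fold_scores)   # same idea as accuracies.mean() below
np.std(fold_scores)    # same idea as accuracies.std() below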

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


# Importing the dataset

dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values


# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)


# Feature Scaling

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


# Fitting Kernel SVM to the Training set

from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)


# Predicting the Test set results

y_pred = classifier.predict(X_test)


# Making the Confusion Matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)


# Applying k-Fold Cross Validation

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()
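
As a quick sanity check (not part of the original course script), you can print the two numbers above as percentages, which makes the roughly 6 percent standard deviation mentioned in the text easy to read:

# Optional: display the k-fold results (illustrative addition)
print("Mean accuracy: {:.2f} %".format(accuracies.mean() * 100))
print("Standard deviation: {:.2f} %".format(accuracies.std() * 100))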


# Visualising the Test set results

from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                    np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
            alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
   plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
               c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Kernel SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()



Grid Search in Python - Step 1: We are going to learn a technique for improving model performance. There are two types of parameters: the first type consists of the parameters that are learned by the machine learning algorithm itself, while the second type consists of the hyperparameters that we choose ourselves, and grid search finds the optimal values of these hyperparameters. Our problem here is a classification problem, because the dependent variable is categorical. We keep the same code and apply grid search to find the best model and the best parameters. To implement grid search, we go to the model_selection module and import the GridSearchCV class. We will not try to optimize the degree parameter. Instead, we will improve the model by finding the best value of the penalty parameter C, which helps prevent overfitting, and by optimizing the gamma value of the kernel. We specify the different values of C to try in square brackets: 1, 10, 100 and 1000; if C is too large the model can overfit the training data, and if it is too small the model no longer fits the dataset well. The first option grid search will investigate is a classic linear model, so we start with the linear kernel. Then we add a second option using the rbf kernel, where we also try different values of the gamma parameter, from 0.1 to 0.9, in case our dataset needs a smaller value of gamma than the default. If that turns out to be the case, we should take the RBF kernel and therefore a nonlinear model. These parameter options are going to be the input of the grid search function.
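
To get a sense of the scale of this search, here is a small sketch (not part of the course script) that simply counts the model configurations implied by the parameter grid described above; the numbers mirror the parameters variable defined further down in this section and the ten-fold cross-validation used by the grid search.

# Illustrative: how many configurations the grid search below will evaluate
linear_options = 4                                     # C in [1, 10, 100, 1000] with the linear kernel
rbf_options = 4 * 9                                    # C in [1, 10, 100, 1000] combined with gamma in [0.1, ..., 0.9]
total_configurations = linear_options + rbf_options    # 40 configurations
total_model_fits = total_configurations * 10           # times 10 folds of cross-validation = 400 model fits
print(total_configurations, total_model_fits)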


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


# Importing the dataset

dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values


# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)


# Feature Scaling

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


# Fitting Kernel SVM to the Training set

from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)


Grid Search in Python - Step 2: In this tutorial, we move on to the grid search implementation. We create a grid_search object by calling the GridSearchCV class with some parameters. The first parameter is estimator, which is equal to the classifier. Then scoring is equal to 'accuracy', cv is equal to 10 so that ten-fold cross-validation is applied through the grid search, and n_jobs is equal to -1 so that all available CPU cores are used. Then we apply the fit method, as usual, to fit the grid_search object to the training set. Now we execute this code line by line. To get the best accuracy, we define a new variable called best_accuracy, which calls grid_search and its best_score_ attribute. After executing, we get about 90% best accuracy. Then we execute the next line and get the best parameters: C equal to 1, gamma equal to 0.7, and kernel equal to 'rbf'.



# Predicting the Test set results

y_pred = classifier.predict(X_test)


# Making the Confusion Matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)


# Applying k-Fold Cross Validation

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()


# Applying Grid Search to find the best model and the best parameters

from sklearn.model_selection import GridSearchCV
parameters = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
             {'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]
grid_search = GridSearchCV(estimator = classifier,
                          param_grid = parameters,
                          scoring = 'accuracy',
                          cv = 10,
                          n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
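
A natural follow-up, not shown in the course script, is to print what grid search found and to reuse the tuned model. With the default refit behaviour, grid_search.best_estimator_ is already refitted on the whole training set, so it can be used directly; the names best_classifier and y_pred_best below are just illustrative.

# Optional: inspect and reuse the tuned model (illustrative addition)
print("Best accuracy:", best_accuracy)          # about 0.90 in the tutorial
print("Best parameters:", best_parameters)      # e.g. {'C': 1, 'gamma': 0.7, 'kernel': 'rbf'}
best_classifier = grid_search.best_estimator_   # refitted on the full training set
y_pred_best = best_classifier.predict(X_test)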


# Visualising the Training set results

from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                    np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
            alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
   plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
               c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Kernel SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()


# Visualising the Test set results

from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                    np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
            alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
   plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
               c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Kernel SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()