Cross Validation In Machine Learning

Cross validation is an important concept in machine learning and deep learning. Every model predicts results, and we need to verify how reliable those predictions are. Cross validation is used to verify the predictive ability of machine learning and deep learning models. In cross validation, the data is split into several subsets of training and testing samples, the model is trained and evaluated on these different splits, and the average of the resulting scores is used to judge the model’s performance.

Let’s see more details about cross validation and examples.

Workflow of Cross Validation

The workflow of cross validation consists of three steps. The first step is to divide the whole dataset into small subsets. Suppose we divide the dataset into k subsets; then we use k-1 subsets for training and the remaining one for testing. The second step is repetition: we repeat step 1 k times, so that each subset serves as the test set exactly once. The third step is to take the average of the results across all subsets as the final result. Based on this final result, we can judge the model’s capabilities, as the sketch below shows.
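
As a minimal sketch of this workflow (reusing scikit-learn’s bundled iris dataset purely for illustration), the loop below splits the data into k subsets, trains on k-1 of them, tests on the held-out one, and averages the scores:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)

scores = []
for train_index, test_index in kf.split(X):
    # Step 1: train on k-1 subsets, test on the remaining one
    # (max_iter raised so the solver converges on this data)
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_index], y[train_index])
    scores.append(model.score(X[test_index], y[test_index]))

# Step 3: average the k results to get the final estimate
print("Average accuracy:", np.mean(scores))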

Types of Cross Validation Techniques

There are various cross validation techniques used in machine learning. The most common ones are K-fold cross validation, leave-one-out cross validation, and stratified cross validation. Let’s see all these techniques in detail.

K-fold Cross Validation

The k-fold cross validation technique is used to determine the performance of a model and to see how well it can work on new data. In k-fold cross validation, the entire dataset is divided into k parts, also known as folds. A typical value of k ranges from 5 to 10, though it may differ for large datasets. We need to follow a few steps to implement k-fold cross validation in machine learning.

The first step is to shuffle all the data in the dataset so that it is evenly distributed. Then the model is trained and evaluated k times, each time with a different fold held out for testing, and the results are recorded each time. The last step is to calculate the average of all the results, which serves as the final score of the k-fold cross validation.

Let’s see one simple example to understand k-fold cross validation in Python. In this example, we are going to use the cross_val_score function from the sklearn library to evaluate the model.

Sklearn Library to Perform Cross Validation in Python

The sklearn library is one of the most popular Python libraries for machine learning, and its built-in cross validation utilities are a big part of that popularity. With its diverse range of functions, it aids not only in model training and evaluation but also in feature selection, feature extraction, and data preprocessing. For the cross validation techniques in this article, we need to import classes and functions such as LogisticRegression, KFold, and cross_val_score. Let’s explore how some of these sklearn functions are used in cross validation.

cross_val_score Function in Cross Validation

In Python, cross validation can be implemented using the cross_val_score function from the sklearn library. This function evaluates a model’s performance and is commonly used with the k-fold cross validation technique. Its main parameters are the model to evaluate, X and y (the input data and target values, respectively), and cv, which specifies the cross validation strategy, such as the number of folds or a splitter object.

Let’s try to implement K-fold cross validation using the sklearn library and cross_val_score function in Python.

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
import warnings
warnings.filterwarnings("ignore")

# Load the iris dataset: X holds the features, y the target labels
iris = load_iris()
X = iris.data
y = iris.target

model = LogisticRegression()
k = 7

# Shuffle the data and split it into k folds
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Train and score the model once per fold
scores = cross_val_score(model, X, y, cv=kf)

for fold, score in enumerate(scores):
    print(f"Fold {fold+1}: {score}")

print(f"Mean accuracy: {scores.mean()}")
print(f"Standard deviation: {scores.std()}")

In this example, all the required modules from the sklearn library are imported. The warnings module is also imported to suppress warnings in the output. Then the iris dataset is loaded, which provides two attributes, data and target. A logistic regression model is created, and the number of folds k is set to 7. The per-fold accuracies returned by cross_val_score are stored in the scores variable. In the end, we print each fold’s score along with the mean and standard deviation of the results.

[Output screenshot: K-Fold Cross Validation]

In the results, we can see the overall accuracy of this model is about 96%. In this way, we can calculate the accuracy of any model using the k-fold cross validation technique.

Leave-one-out Cross Validation Technique

Leave-one-out is a cross validation technique that helps to evaluate machine learning models. The technique is very simple: remove one sample from the dataset and treat it as the test case, while the remaining samples form the training set. This process is repeated for every sample in the dataset, so each sample is used as the test sample exactly once. That is why this technique is called leave-one-out cross validation.

Let’s try to implement one example to understand the working of this technique.

from sklearn.model_selection import LeaveOneOut
from sklearn import datasets
from sklearn import svm
import warnings
warnings.filterwarnings("ignore")

# Load the iris dataset: X holds the features, y the target labels
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Linear support vector classifier used to evaluate each split
clf = svm.SVC(kernel='linear', C=1)
LOO = LeaveOneOut()

scores = []
for train_index, test_index in LOO.split(X):
    # Hold out a single sample for testing; train on all the others
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    scores.append(score)

# Average the per-sample scores to get the final accuracy estimate
Result = sum(scores) / len(scores)

print("Mean Accuracy:", Result)

Leave-one-out cross validation is very simple to implement. First, we import the required modules, such as LeaveOneOut, datasets, and svm. Then we assign X and y as the input and output data. An SVM classifier is used to fit the data. In each iteration, the data is divided into training and testing parts, the model is fitted, and its score is recorded. After the process has been repeated for every sample, the mean of all the results is calculated. Now, let’s see the results.

[Output screenshot: Leave-One-Out Cross Validation]

Here, we can see the result is printed successfully. The mean accuracy of this model is about 98%. In this way, we can test a model’s efficiency with the leave-one-out technique.

Stratified Cross Validation Technique

The stratified cross validation technique is used to evaluate the efficiency of a model when the class distribution matters. This technique keeps the proportion of samples from each class roughly equal in every fold. It is very useful for datasets where the samples are unevenly distributed among the classes. As with k-fold cross validation, we can choose the number of folds. The sklearn library in Python is used to implement the stratified cross validation technique. Let’s see an example to understand it.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
import warnings
warnings.filterwarnings("ignore")

# Load the iris dataset: X holds the features, y the target labels
iris = load_iris()
X = iris.data
y = iris.target

# Each of the 5 folds keeps the same class proportions as the full dataset
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in stratified_kfold.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Fit a fresh model on each fold and report its test accuracy
    model = LogisticRegression()
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    print("Accuracy:", accuracy)

In this example of the stratified cross validation technique, we are using the sklearn library. The iris dataset from the sklearn library is used for training and testing. The X and y variables are the input and output variables. The StratifiedKFold class is then used to implement stratified cross validation in Python; here, we have used 5 folds. Each fold is trained and tested on samples drawn in proportion to the class distribution, and the accuracy of each fold is printed. Let’s see the result.

[Output screenshot: Stratified Cross Validation]

In this example, we can see the accuracy ranges from 96% to 100% across the folds. We can then take the mean to calculate the average accuracy of the model. This helps to find the overall efficiency of a machine learning model on unevenly distributed datasets.

Comparison Between K-fold, Leave-one-out, and Stratified Cross Validation

All three cross validation techniques, i.e., k-fold, leave-one-out, and stratified cross validation, are used to measure model efficiency. All three work differently, but the purpose is the same. We need to understand the differences between them to decide which technique is best in which situation.

In k-fold cross validation, the dataset is divided into k folds/subsets. In the leave-one-out technique, only one sample from the dataset is used as the test sample at a time, and all other samples are used for training. In stratified cross validation, every fold contains samples from all the classes present in the dataset, in proportion to their overall frequency. All three techniques produce comparable scores, but they are implemented differently, with different sizes of training and testing sets.

K-fold cross validation is flexible and easy to use; we can apply it to most datasets to estimate the accuracy and efficiency of a model. The leave-one-out technique, on the other hand, is computationally expensive, since it requires one model fit per sample; it is better suited to small datasets where we need a nearly unbiased estimate. The last technique, stratified cross validation, is normally used for datasets where the samples are not distributed evenly among the classes. The sketch below puts the three side by side.
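
As a rough sketch of the comparison (reusing the iris dataset and a logistic regression model from the earlier examples), all three strategies can be passed to cross_val_score through its cv parameter, which makes a side-by-side comparison straightforward:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
# max_iter raised so the solver converges on this data
model = LogisticRegression(max_iter=1000)

strategies = {
    "K-fold": KFold(n_splits=5, shuffle=True, random_state=42),
    "Leave-one-out": LeaveOneOut(),
    "Stratified": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
}

for name, cv in strategies.items():
    # Each strategy splits the data differently, but all report per-fold scores
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.3f} over {len(scores)} folds")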

How to Use the train_test_split Function?

There is one more way to evaluate a model in Python: the train_test_split method. This is a simpler alternative to cross validation, where we split the whole dataset into just two parts, training and testing, and evaluate the model on that single split. A common split is 20% for testing and 80% for training. Let’s evaluate the model with a 30%-70% split.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the iris dataset: X holds the features, y the target labels
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the samples for testing; train on the remaining 70%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
final_accuracy = model.score(X_test, y_test)
print("Accuracy:", final_accuracy)

In this technique, we have used the train_test_split function from the sklearn library. The dataset is divided into two parts, i.e., training and testing. In this example, we are using logistic regression for model evaluation. Let’s see the result.

[Output screenshot: train_test_split]

Here, the accuracy on this particular split is 100%. This is an easy way to estimate accuracy, but keep in mind that a single split can give an optimistic or pessimistic result depending on which samples land in the test set; cross validation averages over many splits and is therefore more reliable.

Applications of Cross Validation

Cross validation is used to solve different problems in machine learning. Let’s see these applications one by one.

Evaluation of Machine Learning Models

By testing the model on numerous data subsets, cross validation offers a more accurate assessment of performance than a single train-test split. This approach gives a more comprehensive insight into the model’s capacity to generalize. Different cross validation techniques are used here, like k-fold, stratified, and leave-one-out cross validation, as the sketch below illustrates.
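
As an illustrative sketch of this point (again on the iris dataset), the snippet below shows how a single split’s score varies with the random seed, while the cross-validated mean is a steadier summary:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
# max_iter raised so the solver converges on this data
model = LogisticRegression(max_iter=1000)

# A single train-test split: the score depends on which rows land in the test set
for seed in (0, 1, 2):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    print("Split", seed, "accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))

# Cross validation averages over several splits, giving a steadier estimate
scores = cross_val_score(model, X, y, cv=5)
print("5-fold mean accuracy:", scores.mean())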

Tuning the Hyperparameters

The learning process of a model is heavily influenced by hyperparameters, which are adjustable settings. It is crucial to find the combination of hyperparameter values that optimizes model performance. Cross validation can assist with this task by evaluating the model under varying hyperparameter values, as the sketch below shows.
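
For example, sklearn’s GridSearchCV combines a parameter grid with k-fold cross validation. The sketch below (the grid values are illustrative, not tuned) searches over an SVM’s C and kernel settings:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to evaluate; this grid is just an example
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# GridSearchCV fits the model once per parameter combination per fold (cv=5)
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)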

Feature Selection Using Cross Validation

Cross validation also enhances the evaluation of feature selection techniques. By scoring candidate feature subsets across the cross validation folds, we can determine which subset of features is most suitable for the dataset, rather than trusting a single split. A sketch follows.
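
One concrete option is sklearn’s RFECV, which wraps recursive feature elimination in a cross validation loop. A minimal sketch on the iris data:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Recursive feature elimination, scored with 5-fold cross validation;
# max_iter raised so the solver converges on this data
selector = RFECV(LogisticRegression(max_iter=1000), cv=5)
selector.fit(X, y)

print("Optimal number of features:", selector.n_features_)
print("Selected feature mask:", selector.support_)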

Comparison of Models on the Basis of Accuracy

Different models can be evaluated through cross validation to determine which one has the best average performance over the folds. This reduces the risk that the selected model overfits and cannot accurately predict unseen data. By testing multiple models through cross validation, the best-performing model can be identified and chosen, as sketched below.
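
A minimal sketch of such a comparison (the two model choices here are illustrative) scores each classifier on the same folds and reports the mean and spread:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

candidates = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=42),
}

# Score each candidate with the same 5-fold splits so the comparison is fair
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")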

Summary

In this article, cross validation in machine learning is explained in detail. The basics of cross validation and the different techniques, functions, and models used for it are explained with the help of examples. The comparison and applications of different cross validation techniques are also covered briefly. We hope you enjoyed this article.

References

Do read the official documentation for the sklearn library: https://scikit-learn.org/stable/