CIFAR 10 Dataset: Everything You Need To Know


Imagine you want to research state-of-the-art machine learning models and their applications, and then implement one of those applications yourself from scratch. You know which model you would use and what you want to do once it is built and trained. But what is the first, and most important, step you need to get right?

Gathering or curating a dataset, of course! Finding a suitable dataset can be difficult for some tasks, but it is the most crucial step, because without data there is nothing to train your model on.

Popular public sources such as the UCI Machine Learning Repository, Kaggle, and government open-data portals are useful for collecting large amounts of data. These repositories make many widely used datasets available to everyone.

The CIFAR 10 dataset, a benchmark in image classification, features 60,000 small 32×32 color images across 10 classes. Used extensively in machine learning for training and evaluating models, it’s a labeled subset of the 80 Million Tiny Images dataset and covers categories such as animals and vehicles. Because it is accessible through platforms like TensorFlow and Keras, it’s a common starting point for beginners and researchers in computer vision.

In this post, we are going to talk about one such popular dataset: CIFAR 10.


Origins of CIFAR 10 Dataset

CIFAR is an abbreviation for the Canadian Institute For Advanced Research, and the 10 refers to the number of classes. The dataset was introduced by Alex Krizhevsky et al. and is described in the 2009 technical report Learning Multiple Layers of Features from Tiny Images. It is a labeled subset of the 80 Million Tiny Images dataset.

By modern standards, this dataset is relatively small. It contains exactly 60,000 color images, each of size 32×32, split into 50,000 training images and 10,000 test images. Every image has the shape (32, 32, 3), where 3 is the number of color channels: red, green, and blue.

As its name suggests, the dataset has all the images labeled under 10 classes, where each image only belongs to a single class or label.
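To put the size of the dataset in perspective, here is a small back-of-the-envelope calculation. It assumes each pixel value is stored as a single byte (8-bit color) and that the classes are equally sized; the constant names are only illustrative.

```python
# Rough size arithmetic for CIFAR 10 (assumes one byte per pixel value).
HEIGHT, WIDTH, CHANNELS = 32, 32, 3
NUM_IMAGES = 60_000
NUM_CLASSES = 10

values_per_image = HEIGHT * WIDTH * CHANNELS   # numbers needed to describe one image
raw_bytes = NUM_IMAGES * values_per_image      # uncompressed size of all images
images_per_class = NUM_IMAGES // NUM_CLASSES   # assumes equally sized classes

print(f"values per image: {values_per_image}")
print(f"raw size: {raw_bytes / 1e6:.2f} MB")
print(f"images per class: {images_per_class}")
```

Under these assumptions the whole dataset is under 200 MB of raw pixel data, which is why it fits comfortably in memory on most machines.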

Class Categories in CIFAR 10

Et al. is an abbreviation of a Latin phrase meaning ‘and others’ and is used in academic and research citations to refer to multiple authors.

Accessing the CIFAR 10 Dataset

This dataset is hosted on the website of the University of Toronto (Department of Computer Science), and we can download the Python and Matlab versions from there. The site also describes the file layout and includes a short routine for unpickling the batch files.
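If you download the Python version, each batch file is a pickled dictionary. The sketch below shows one way to read a batch and turn it into image arrays. It assumes the layout described on the dataset's site: a `data` entry holding one row of 3072 values per image (1024 red, then 1024 green, then 1024 blue) and a `labels` entry holding the class numbers. The function name `load_cifar_batch` is hypothetical, and NumPy is assumed to be installed.

```python
import pickle

import numpy as np


def load_cifar_batch(path):
    """Read one batch file and return (images, labels).

    Assumes the batch is a pickled dict with b"data" (N rows of 3072 values)
    and b"labels" (N class numbers), as described on the dataset's site.
    """
    with open(path, "rb") as f:
        # encoding="bytes" is needed because the files were pickled with Python 2.
        batch = pickle.load(f, encoding="bytes")

    data = np.asarray(batch[b"data"], dtype=np.uint8)      # shape (N, 3072)
    labels = np.asarray(batch[b"labels"], dtype=np.int64)  # shape (N,)

    # Each row is channel-first (3, 32, 32); move channels last for plotting.
    images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, labels
```

After extracting the archive, a call such as `load_cifar_batch("cifar-10-batches-py/data_batch_1")` should return the images with shape (10000, 32, 32, 3); the exact path depends on where you unpacked the download.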

If you prefer a ready-to-use version of the original dataset that is already split and pre-processed, machine learning frameworks and libraries such as TensorFlow, Keras, Hugging Face, and PyTorch not only provide the dataset but also offer utilities for loading and working with it.

Additionally, Kaggle, the machine learning and data science community, also hosts a downloadable copy of this benchmark dataset.


Significance of CIFAR 10 in Machine Learning

  • This dataset is widely treated as a benchmark for image classification and is especially popular among beginners
  • Many papers and research articles have used it as the primary data source for their image classification models
  • It ships with separate training and test splits, so the same dataset can be used both to train a model and to evaluate its performance
  • It is often used as teaching material for understanding labeled image data
  • Many researchers have used it to build and compare computer vision models such as Convolutional Neural Networks (CNNs)

Visualizing the CIFAR 10 Dataset

In this section, we are going to see how to display the first 25 images from the dataset along with their class labels. We are going to use the Keras API to load the data.

import tensorflow as tf
from tensorflow.keras.datasets import cifar10
import matplotlib.pyplot as plt

We have imported the tensorflow library and the dataset from the datasets module of Keras. The matplotlib library is used for visualization of the images.

(x_train, y_train), (_, _) = cifar10.load_data()

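We unpack the training images (x_train) and their labels (y_train) from the dataset and discard the test split. Here, x_train has the shape (50000, 32, 32, 3) and y_train has the shape (50000, 1).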
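Before plotting, it can help to run a quick sanity check on the unpacked arrays. The sketch below is one possible way to do that; the helper name `summarize` is hypothetical, and the block runs on random stand-in data with the same layout as x_train and y_train so that it does not need to download anything.

```python
import numpy as np


def summarize(images, labels, num_classes=10):
    """Return basic facts about an image array and its label array."""
    labels = np.asarray(labels).ravel()  # Keras returns labels with shape (N, 1)
    return {
        "image_shape": images.shape,
        "dtype": str(images.dtype),
        "pixel_range": (int(images.min()), int(images.max())),
        "per_class": np.bincount(labels, minlength=num_classes).tolist(),
    }


# Stand-in data with the same layout as x_train / y_train (not real CIFAR 10 images).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(20, 32, 32, 3), dtype=np.uint8)
fake_labels = (np.arange(20) % 10).reshape(-1, 1)

summary = summarize(fake_images, fake_labels)
print(summary["image_shape"], summary["per_class"])
```

Calling `summarize(x_train, y_train)` on the real training split should report an image shape of (50000, 32, 32, 3), a dtype of uint8, and 5,000 images for each class.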

# Display images with class labels(as numbers)
plt.figure(figsize=(10, 10))
for i in range(25): 
    plt.subplot(5, 5, i + 1)
    plt.imshow(x_train[i])
    plt.title(f"Class {y_train[i][0]}")
    plt.axis('off')
plt.show()

In the above code snippet, we create a 10×10 inch figure and place the 25 images on a 5×5 grid of subplots. Each image is drawn with imshow, and its class number is shown using the title method.

First 25 images of the dataset

Notice that the class labels are numbers. If you want to display the class names (e.g., airplane instead of class 0), you can add a list of the class names, in the order specified on the dataset's website, and index into it with the label.

(x_train, y_train), (_, _) = cifar10.load_data()
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

# Display images with class names
plt.figure(figsize=(10, 10))
for i in range(25):  
    plt.subplot(5, 5, i + 1)
    plt.imshow(x_train[i])
    plt.title(class_names[y_train[i][0]])
    plt.axis('off')
plt.show()

First 25 images with class names
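The first 25 images do not necessarily cover every class evenly. If you would rather show one example per class, a small helper can look up the first occurrence of each label. This is a minimal sketch; the function name `first_index_per_class` is hypothetical, and it assumes the labels are integers from 0 to num_classes - 1.

```python
import numpy as np


def first_index_per_class(labels, num_classes=10):
    """Return, for each class, the index of its first image (or None if absent)."""
    labels = np.asarray(labels).ravel()  # accept the (N, 1) shape Keras returns
    indices = []
    for class_id in range(num_classes):
        matches = np.flatnonzero(labels == class_id)
        indices.append(int(matches[0]) if matches.size else None)
    return indices


# Tiny hand-made example with 4 classes; class 2 never appears.
print(first_index_per_class([[3], [0], [3], [1], [0]], num_classes=4))
# prints [1, 3, None, 0]
```

With the real data, you could loop over `first_index_per_class(y_train)` and pass `x_train[i]` to imshow on a 2×5 grid, using class_names for the titles.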

The images look blurry because each one is only 32×32 pixels and is enlarged for display. Some objects may be hard to recognize at this size, but processing techniques such as interpolation at display time can make them easier to view.

Exploring CIFAR 100, An Extended Version of CIFAR 10

CIFAR 100 is an extension of CIFAR 10 in which the images are classified into 100 classes. It is more fine-grained than CIFAR 10 and has 600 images under each label.

Now that you have understood the basics of the CIFAR 10 dataset, please refer to this article on building a CNN using the dataset.

Summary

The CIFAR 10 dataset is a standard benchmark for image classification and is widely used for building and evaluating machine learning and computer vision models.

I hope this post helped you understand the basics of this dataset so that you can proceed with using it to build your first-ever image classification model or use it to create a fun project. Happy coding!

Useful Sites

CIFAR 10 – University of Toronto

TensorFlow

Hugging Face

Keras