Multiclass Classification – An Ultimate Guide for Beginners


Machine Learning is one of the most sought-after technologies among students and organizations alike, thanks to its widespread applications and flexibility. Pick almost any field of work, and you can usually find a way to make things easier in it using machine learning and the other sub-branches of AI.

Machine Learning tasks are mainly categorized as Regression and Classification. While regression tasks deal with determining the relationship between independent features and a dependent one, classification is used to group objects into predefined categories, much like how life on Earth is divided into two primary categories: vertebrates and invertebrates.

The example above can be treated as a binary classification problem because there are just two classes to categorize living organisms into, and every organism belongs to exactly one of them: vertebrates or invertebrates.

There are other classification problems where there are more than two classes into which we can group the objects. Such problems are called multiclass classification problems.

Multiclass classification simply means categorizing objects into more than two classes. Let us consider the classification of life again. Vertebrates can be further classified as Fish, Amphibians, Reptiles, Mammals, and Birds, which makes this a multiclass classification problem. We will get to know more about multiclass classification problems here.

Before that, make sure to go through this article on Regression vs. Classification.

Binary Classification vs Multiclass Classification

You might have gotten the general idea behind binary and multiclass classification from the examples discussed above. To distinguish between them, let us look at proper definitions.

While you are at it, also refer to this article on the different classification algorithms.

Binary classification categorizes sample records into one of exactly two classes.

Multiclass classification, as the name suggests, goes along the same lines, except that there are more than two classes into which the objects can be classified or grouped.

Binary Classification vs Multiclass Classification

Consider an apartment-buying example: based on the price of the apartment, the number of rooms in it, and how far it is from the shopping center, we can determine whether a customer is interested in buying the apartment or not. The classes in this problem are Yes and No, so this is another example of binary classification.

Now suppose the buyer is unsure and may be looking for equivalent properties at lower prices. We can then add another class called Neutral, so the buyer's answer can be any one of Yes, No, and Neutral. This is an example of multiclass classification.
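As a minimal sketch of how such answers become class labels (using made-up buyer responses, not data from this article), scikit-learn's LabelEncoder can convert the categorical answers into the integer labels that classifiers expect:

from sklearn.preprocessing import LabelEncoder

# Hypothetical buyer responses for a handful of apartments
responses = ["Yes", "No", "Neutral", "Yes", "Neutral"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(responses)

print(encoder.classes_)  # ['Neutral' 'No' 'Yes'] -- classes are sorted alphabetically
print(encoded)           # [2 1 0 2 0]

With only two distinct labels, the same code would describe a binary problem; with three or more, a multiclass one.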

Multiclass Classification Example

Let us take the Iris dataset for this example. If you don’t already know it, the Iris dataset is a collection of physical measurements of flowers – the petal width and length and the sepal width and length. Based on these four features, 150 flowers are classified into one of three classes – Setosa, Versicolor, and Virginica.

We can use any classification model for a multiclass classification problem. In this case, we are using the Support Vector Classifier (SVC).

Importing the Libraries

import matplotlib.pyplot as plt
import sklearn
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

We import all the necessary libraries: matplotlib for visualization, sklearn for the data and model building, and pandas and numpy for displaying and manipulating the data.

From the sklearn library, we also import the Iris dataset, the Support Vector Classifier, the train_test_split method, and performance metrics such as accuracy and the confusion matrix.

Loading the Data Set

Since the data set is already available in the sklearn library, we just need to import it and load it in our environment.

iris = load_iris()
iris.keys()

The data set is loaded with load_iris and stored in a variable called iris. Calling iris.keys() prints the keys of the dataset object, such as the data, the target, and the feature names.

Information about the iris dataset
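If you want to inspect the dataset a little further, the same iris object also exposes the feature names, the class names, and the shape of the data (an optional sketch):

print(iris['feature_names'])   # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris['target_names'])    # ['setosa' 'versicolor' 'virginica']
print(iris['data'].shape)      # (150, 4) -- 150 flowers, 4 features each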

Creating the Data Frame

We are going to create a data frame to display the data.

df = pd.concat([
    (pd.DataFrame(data=iris['data'], columns=iris['feature_names'])),
     (pd.DataFrame(data=iris['target'], columns=['target']))],
               axis=1)
df.replace({'target':{0:'setosa', 1:'versicolor', 2:'virginica'}}, inplace = True)
df

To avoid any complexities, we create two data frames, one containing the flowers' feature values and the other containing the target values, and merge them horizontally. The target values are initially numerical (0, 1, 2); the replace call in the last line maps them to their categorical names.

The resulting data frame looks something like this:

Iris Data Frame
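As a side note, recent versions of scikit-learn (0.23 and later) can build this data frame for you: passing as_frame=True to load_iris returns the same data as a pandas DataFrame. A minimal sketch producing an equivalent frame:

# Load the dataset directly as a DataFrame (features plus a numeric 'target' column)
iris_frame = load_iris(as_frame=True)
df_alt = iris_frame.frame
df_alt['target'] = df_alt['target'].replace({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
df_alt.head()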

Defining the Dependent and Independent Variables

We are going to use two variables, X and y, to hold the independent and dependent variables, respectively.

X,y = df.drop('target', axis=1), df.target.replace({'setosa':0,'versicolor':1, 'virginica':2})
print(X.shape)
print(y.shape)

Since the independent variables are everything other than the target column, we simply drop the target column to obtain X (which stores all the independent features), while y is the dependent (target) variable, mapped back to its numerical labels.

The shapes of the independent and dependent variables are printed in the next two lines.

Dependent and Independent variable sizes
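Equivalently, if you do not need the intermediate data frame at all, load_iris can return the arrays directly (an alternative sketch; the rest of this tutorial keeps using the data frame version):

# Returns the features and the numeric target (0, 1, 2) as NumPy arrays
X_arr, y_arr = load_iris(return_X_y=True)
print(X_arr.shape, y_arr.shape)    # (150, 4) (150,)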

Splitting the Dataset and Model Building

Now, we need to split the dataset into training and testing records for the model to evaluate.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=8021)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

The independent and dependent variables are split into training and testing sets in a 70:30 ratio, and the shapes of the resulting splits are printed.

Splitting the dataset
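One optional refinement: passing stratify=y to train_test_split keeps the three classes in roughly the same proportions in both splits, which matters more on imbalanced datasets. A quick sketch:

# Stratified split: each class keeps roughly the same share in train and test
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=8021, stratify=y)
print(y_train_s.value_counts())    # each class should appear roughly 35 times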

Now we need to build the model.

model = SVC(kernel='linear', random_state=0)
model.fit(X_train, y_train)

We initialize an SVC model with a linear kernel and fit it on the training set.
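It is worth noting that scikit-learn's SVC handles the multiclass case internally using a one-vs-one scheme, so nothing extra is needed on our part. If you want to control the strategy explicitly, you can wrap the classifier in a meta-estimator such as OneVsRestClassifier; a minimal sketch:

from sklearn.multiclass import OneVsRestClassifier

# Trains one binary SVC per class (one-vs-rest) instead of SVC's internal one-vs-one
ovr_model = OneVsRestClassifier(SVC(kernel='linear', random_state=0))
ovr_model.fit(X_train, y_train)
print(ovr_model.score(X_test, y_test))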

Accuracy

After the model is fitted, we need to check how well it performs. We start by measuring its accuracy on the test set.

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

The predict method generates the model's predictions for the test records, which are stored in a variable called y_pred.

The accuracy of these predictions is computed with accuracy_score, stored in a variable called accuracy, and printed in the next line.

Accuracy

Our model is 93% accurate in predicting the test records. It is now time to test the model on new data.
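Before we do that, it is worth looking at the confusion matrix as well, since accuracy alone does not tell us which classes are being confused with one another. A quick sketch using the confusion_matrix we already imported (the plotting helper requires scikit-learn 1.0 or later):

from sklearn.metrics import ConfusionMatrixDisplay

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Optionally plot the matrix with matplotlib
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()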

Testing the Model

We are going to provide a new test input with four comma-separated values: the sepal length, sepal width, petal length, and petal width, in the same order as the dataset's feature columns. These values are then used to predict the class of the flower.

# Read four comma-separated measurements from the user
input_str = input()
test = [float(x) for x in input_str.split(',')]

# Predict the class of the new flower and map the label to its name
pred = model.predict([test])
if pred[0] == 0:
    print("Setosa")
elif pred[0] == 1:
    print("Versicolor")
else:
    print("Virginica")

In the first line, we take input from the user and store it in a variable called input_str. The comma-separated values are then converted into a list of floats.

The model is then used to predict the user input.

If the prediction is 0, the flower belongs to the class Setosa; if it is 1, the flower belongs to Versicolor; otherwise, it belongs to Virginica.

New Data Prediction
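If you prefer to avoid the if/elif chain, the same mapping can also be done by indexing into the dataset's target_names array (a compact alternative sketch):

# Look up the predicted class name directly
pred = model.predict([test])
print(iris['target_names'][pred[0]])   # e.g. 'versicolor'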

Summary

To summarize, we started off by understanding the classification problem and then distinguished between binary and multiclass classification problems with the help of a few examples and illustrations.

Furthermore, we worked through an example of a multiclass classification problem using a Support Vector Classifier (SVC).

References

Find more about multiclass classification models here