Iris Dataset Classification with Multiple ML Algorithms


Hello there! Today we are going to learn about a classic dataset: the Iris dataset. It records four measurements (sepal length, sepal width, petal length and petal width) for 150 iris flowers, and the task is to classify each flower into one of three species based on those measurements.

1. Importing Modules

The first step in any project is to import the basic modules, which here are numpy, pandas and matplotlib.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

2. Loading and Preparing the Iris Dataset

To load the data we will download the dataset from Kaggle. You can download the dataset here, but make sure that the file is saved as Iris.csv in the same directory as the code file.

We will also separate the data points from the labels by slicing the DataFrame. In the Kaggle file, column 0 is a row Id that we skip, columns 1 to 4 hold the four measurements, and column 5 holds the species label.

data = pd.read_csv('Iris.csv')
data_points = data.iloc[:, 1:5]
labels = data.iloc[:, 5]
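To make the slicing concrete without needing the CSV on disk, here is a minimal sketch on a three-row stand-in DataFrame. The column names (Id, SepalLengthCm, ..., Species) are an assumption about the layout of the Kaggle file, and `feature_columns` and `by_name` are names made up for this illustration.

```python
import pandas as pd

# Tiny stand-in for Iris.csv. The column names are an ASSUMPTION about the
# Kaggle file layout; check data.columns on the real file.
data = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.2, 3.3],
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Positional slicing, as in the tutorial: the end index 5 is exclusive.
data_points = data.iloc[:, 1:5]   # the four measurement columns (Id skipped)
labels = data.iloc[:, 5]          # the species column

# Selecting by name gives the same result and does not depend on column order.
feature_columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
by_name = data[feature_columns]

print(list(data_points.columns))
print(by_name.equals(data_points))
```

Selecting by name is a little more verbose, but it keeps working if the file gains or loses a column.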

3. Split Data Into Testing and Training Data

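Before training any kind of ML model, we first need to split the data into training and testing sets using the train_test_split function from sklearn. With test_size=0.2, 20% of the rows are held out for testing. Since no random_state is set, the split (and therefore every accuracy reported below) changes from run to run.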

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_points, labels, test_size=0.2)
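If you want a split that is repeatable and keeps the three species balanced, one option is to pass `random_state` and `stratify`. The sketch below is a variation on the tutorial's call, not the original method. It uses the copy of the dataset bundled with sklearn (`load_iris`) so that it runs without the CSV, and the seed value 42 is arbitrary.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled copy of the Iris data: 150 rows, 4 features, 3 classes of 50 rows each.
data_points, labels = load_iris(return_X_y=True)

x_train, x_test, y_train, y_test = train_test_split(
    data_points,
    labels,
    test_size=0.2,      # hold out 20% of the rows for testing
    random_state=42,    # arbitrary seed so the split is the same on every run
    stratify=labels,    # keep the class proportions equal in both parts
)

print(x_train.shape, x_test.shape)
print(Counter(y_test.tolist()))  # number of test rows per class
```

With only 30 test rows, a single misclassified flower moves the test accuracy by more than 3 percentage points, so a fixed seed makes results much easier to compare.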

4. Normalization/Standardization of Data

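Before we move on to the ML models, we need to standardize the data so that every feature has zero mean and unit variance. The scaler is fitted on the training data only and then applied to both the training and testing data, so no information from the test set leaks into training. The code for this is shown below.

from sklearn.preprocessing import StandardScaler
# cross_val_score is not used below; it is handy for cross-validated scores
from sklearn.model_selection import cross_val_score
Standard_obj = StandardScaler()
# learn the per-feature mean and standard deviation from the training data only
Standard_obj.fit(x_train)
# apply the same training statistics to both sets
x_train_std = Standard_obj.transform(x_train)
x_test_std = Standard_obj.transform(x_test)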
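As a quick check of what the scaler does, the sketch below reproduces `transform` by hand on a few made-up rows: subtract the training mean and divide by the training standard deviation. The arrays and the name `manual` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up measurements, two features per row, just for illustration.
x_train = np.array([[5.1, 3.5], [7.0, 3.2], [6.3, 3.3], [4.9, 3.0]])
x_test = np.array([[5.8, 2.7]])

scaler = StandardScaler().fit(x_train)   # statistics come from x_train only
x_train_std = scaler.transform(x_train)
x_test_std = scaler.transform(x_test)

# Same computation by hand. StandardScaler uses the population standard
# deviation (ddof=0), which is also NumPy's default for std().
manual = (x_test - x_train.mean(axis=0)) / x_train.std(axis=0)

print(np.allclose(x_test_std, manual))
print(x_train_std.mean(axis=0))  # approximately zero for each feature
```

Note that the test rows are scaled with the training mean and deviation, so their standardized values are generally not centered at exactly zero.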

5. Applying Classification ML Models

Now that our data is prepared and ready to go, we will train several classification models and compare their accuracy.

5.1 SVM (Support Vector Machine)

The first model we are going to test is the SVM classifier. The code for the same is shown below.

from sklearn.svm import SVC
svm = SVC(kernel='rbf', random_state=0, gamma=.10, C=1.0)
svm.fit(x_train_std, y_train)
print('Training data accuracy {:.2f}'.format(svm.score(x_train_std, y_train)*100))
print('Testing data accuracy {:.2f}'.format(svm.score(x_test_std, y_test)*100))

On successful execution, the classifier gave a training accuracy of about 97% and a testing accuracy of about 93%, which is pretty decent. Your exact numbers will differ slightly because the train/test split is random.
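Because a single 30-row test set gives a noisy estimate, it can help to average over several splits. The following is a minimal sketch using `cross_val_score` with the same SVM settings. It is an addition to the tutorial's approach rather than part of it, and it loads the data with `load_iris` so that it runs on its own. Wrapping the scaler and the classifier in a pipeline ensures that the scaler is refitted on the training part of every fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scaling happens inside the pipeline, so each fold is standardized
# using only its own training portion.
model = make_pipeline(
    StandardScaler(),
    SVC(kernel='rbf', gamma=0.10, C=1.0),
)

# 5-fold cross-validation: five accuracy values, one per held-out fold.
scores = cross_val_score(model, X, y, cv=5)

print('Fold accuracies:', scores.round(2))
print('Mean accuracy {:.2f}'.format(scores.mean() * 100))
```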

5.2 KNN (K-Nearest Neighbors)

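The KNN algorithm is one of the simplest and most beginner-friendly classification models in ML: it labels a new point by a majority vote among its k closest training points. Here we use k = 7 with the Euclidean distance (the Minkowski metric with p = 2). The code is shown below.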

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 7, p = 2, metric='minkowski')
knn.fit(x_train_std,y_train)
print('Training data accuracy {:.2f}'.format(knn.score(x_train_std, y_train)*100))
print('Testing data accuracy {:.2f}'.format(knn.score(x_test_std, y_test)*100))

The testing accuracy in this run is just about 80%, which is lower than that of the other models. This is understandable, as the model is very basic and has several limitations, and the result also depends on the random split and on the choice of k.
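Since the result depends on the number of neighbors, one way to explore it is to try a few values of k and compare cross-validated accuracy. This is only a sketch: the candidate values and the names `results` and `best_k` are chosen for illustration, and the data comes from `load_iris` so that the snippet runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

results = {}
for k in (1, 3, 5, 7, 9, 11):  # illustrative candidates; 7 is the tutorial's value
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=k, p=2, metric='minkowski'),
    )
    # mean accuracy over 5 cross-validation folds
    results[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(results, key=results.get)

for k, score in results.items():
    print('k = {:2d}  mean accuracy {:.2f}'.format(k, score * 100))
print('Best k in this sweep:', best_k)
```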

5.3 Decision Tree

Next, we will implement the Decision Tree model, which is simple to understand yet capable of learning fairly complex decision rules. The code for the same is shown below.

from sklearn import tree
decision_tree = tree.DecisionTreeClassifier(criterion='gini')
decision_tree.fit(x_train_std, y_train)
print('Training data accuracy {:.2f}'.format(decision_tree.score(x_train_std, y_train)*100))
print('Testing data accuracy {:.2f}'.format(decision_tree.score(x_test_std, y_test)*100))

The testing accuracy of this model is also around 80%, so SVM still gives the best results so far.
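One advantage of a decision tree is that you can read the rules it has learned. The sketch below prints them with `export_text`. It trains on the unscaled `load_iris` data so that the thresholds are in centimeters (tree splits do not depend on feature scaling), and it limits the depth to 2 and fixes `random_state=0` purely to keep the output short and repeatable. Those settings are assumptions for this illustration, not the tutorial's configuration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Shallow tree for readability; the tutorial's tree has no depth limit.
decision_tree = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=0)
decision_tree.fit(iris.data, iris.target)

# Text rendering of the learned if/else rules, one line per node.
rules = export_text(decision_tree, feature_names=list(iris.feature_names))
print(rules)
```

An unrestricted tree keeps splitting until it fits the training data almost perfectly, which is one reason training accuracy can be much higher than testing accuracy.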

5.4 Random Forest

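Random Forest is an ensemble of many decision trees, each trained on a random sample of the data, whose predictions are combined by voting. It usually generalizes better than a single decision tree. The implementation of the same is shown below.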

from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier()
random_forest.fit(x_train_std, y_train)
print('Training data accuracy {:.2f}'.format(random_forest.score(x_train_std, y_train)*100))
print('Testing data accuracy {:.2f}'.format(random_forest.score(x_test_std, y_test)*100))

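The accuracy levels are very good here: the training accuracy is 100%, while the testing accuracy is 90%, which is decent as well. Keep in mind that a perfect training score alongside a lower testing score means the model fits the training data more closely than it generalizes.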
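A fitted random forest also reports how much each feature contributed to its splits. The following is a small sketch, again using `load_iris` so that it runs without the CSV. The values `n_estimators=100` and `random_state=0` are set explicitly here only for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

random_forest = RandomForestClassifier(n_estimators=100, random_state=0)
random_forest.fit(iris.data, iris.target)

# Impurity-based importances: one value per feature, summing to 1.
importances = random_forest.feature_importances_

ranked = sorted(zip(iris.feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print('{:<20s} {:.3f}'.format(name, importance))
```

On this dataset the petal measurements typically dominate, which matches the intuition that petals separate the three species better than sepals do.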

Conclusion

Congratulations! In this tutorial we applied several different algorithms to the same dataset and obtained different results for each model, with SVM giving the best testing accuracy in our run. Hope you liked it! Keep reading to learn more!

Thank you for reading!