Support Vector Machines in Python

Machine Learning offers a long list of algorithms to implement, from linear regression to decision trees. When the task is to separate data points along a linear boundary, two techniques are commonly recommended:

  1. K-means clustering.
  2. Support Vector Machines.

As we know, an ML model is of two types (the two are contrasted in the short sketch after this list):

  1. Supervised Learning: learns from input data that the programmer has labeled in advance.
  2. Unsupervised Learning: needs no labeled input data; it is a model that learns structure on its own.
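
A minimal sketch of this difference using scikit-learn (which the rest of this article also relies on): a supervised model such as SVC is fitted on features together with labels, while an unsupervised model such as KMeans sees only the features. The toy arrays here are made up for illustration.

from sklearn.svm import SVC
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [2, 3], [8, 8], [9, 10]])  # feature vectors
y = np.array([0, 0, 1, 1])                       # labels, used only in the supervised case

# Supervised: the model learns from labeled examples (X, y)
SVC(kernel = "linear").fit(X, y)

# Unsupervised: the model sees only X and finds structure by itself
KMeans(n_clusters = 2, n_init = 10).fit(X)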

General Theory

The main aim of this article is to make the reader aware of how the SVM technique works. Data is usually available in raw form; once we structure and visualize it, the result is either a discrete or a continuous distribution. Accordingly, SVMs are used for two purposes:

  1. Classification: for discrete data parameters.
  2. Regression: for continuous data parameters.

This dual use is one of the main reasons why Support Vector Machines are so widely applied for classification and regression purposes. The definition says that: Support Vector Machines are a set of supervised learning algorithms that help us classify and analyze the nature of data.
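
In scikit-learn, these two purposes map onto two separate estimators: SVC for classification and SVR for regression. A minimal sketch, using tiny made-up arrays:

from sklearn.svm import SVC, SVR
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Classification: discrete targets
SVC(kernel = "linear").fit(X, np.array([0, 0, 1, 1]))

# Regression: continuous targets
SVR(kernel = "linear").fit(X, np.array([1.1, 1.9, 3.2, 3.9]))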

Components of SVM

  1. Support vectors: These are the main components. They are the data points that lie closest to the decision boundary, on both sides of the maximum margin, and they alone determine where the boundary sits (see the sketch after this list).
  2. Maximum margin: The widest possible gap between the two classes, within which no training point falls.
  3. Maximum margin hyperplane: The hyperplane that runs midway between the positive and negative hyperplanes.
  4. Positive hyperplane: The boundary of the margin on the positive-class side.
  5. Negative hyperplane: The boundary of the margin on the negative-class side.
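
For a linear SVM, the separating hyperplane is w·x + b = 0, the positive and negative hyperplanes are w·x + b = +1 and w·x + b = -1, and the margin width is 2/||w||. A minimal sketch of how these quantities can be read off a fitted scikit-learn model (the toy data here is made up):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [4, 4], [5, 5]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel = "linear").fit(X, y)

w = clf.coef_[0]       # normal vector of the separating hyperplane
b = clf.intercept_[0]  # offset of the hyperplane
print("Support vectors:\n", clf.support_vectors_)
print("Hyperplane: w =", w, ", b =", b)
print("Margin width:", 2 / np.linalg.norm(w))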

Diagram

SVM Diagram Simplified

In this diagram, we can clearly see that the margin separates the different data points according to their class, with color used to show the difference between them. The main aim of the SVM is to draw the best possible marginal line, the one that classifies each point with the widest separation.

Example and applications

Suppose we have a class: Vehicle. Our task is to fetch the Sports Utility Vehicles (SUVs) from that class, among the various other types it contains. Arranging the vehicles manually would take a lot of time and invite errors. To make the classification more reliable, we can create a Support Vector Machine that classifies all the car models in the parent Vehicle class. It works in the following steps (sketched in code after this list):

  1. The model takes a sample image as input.
  2. It compares the image against the labeled vehicle-type data provided in training.
  3. It then tells us which type of car appears in the input image.

Few algorithms make this kind of classification task simpler than an SVM.
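
A minimal sketch of this pipeline, with small made-up feature vectors standing in for features extracted from vehicle images (the data, labels, and feature values here are all hypothetical):

import numpy as np
from sklearn.svm import SVC

# Hypothetical numeric features extracted from vehicle images
X_train = np.array([[4.8, 2.0], [4.9, 2.1], [3.9, 1.4], [4.0, 1.5]])
y_train = np.array(["SUV", "SUV", "Sedan", "Sedan"])

model = SVC(kernel = "linear").fit(X_train, y_train)

sample = np.array([[4.7, 2.0]])  # features of the new input image
print(model.predict(sample))     # expected: ['SUV']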

Implementing Support Vector Machines

In this section, we shall walk through the full implementation of a Support Vector Machine. So, let’s get started!

Environment details:

  1. Python 3.9.7
  2. IDE: Jupyter Notebooks
  3. Environment: Anaconda 3
  4. Dataset: Cancer dataset (cell_samples.csv)

Importing the necessary libraries for data reading and preprocessing

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from warnings import filterwarnings
filterwarnings("ignore")  # suppress warning messages to keep the output clean

Reading the dataset

cancer_data = pd.read_csv("Datasets/cell_samples.csv", sep = ",")
cancer_data.head()

Output:

Viewing The Dataset

Checking for null values

cancer_data.isna().sum()

Output:

No Null Values

Getting the general info about the dataset

print("The shape of the dataset is: ", cancer_data.shape)
print("The size of the dataset is: ", cancer_data.size, " bytes\n")
print("The count of each attribute of the dataset is: \n")
print(cancer_data.count())
print("\nThe datatype of each attribute is: \n")
print(cancer_data.dtypes)

Output:

The shape of the dataset is:  (699, 11)
The size of the dataset is:  7689  elements

The count of each attribute of the dataset is: 

ID             699
Clump          699
UnifSize       699
UnifShape      699
MargAdh        699
SingEpiSize    699
BareNuc        699
BlandChrom     699
NormNucl       699
Mit            699
Class          699
dtype: int64

The datatype of each attribute is: 

ID              int64
Clump           int64
UnifSize        int64
UnifShape       int64
MargAdh         int64
SingEpiSize     int64
BareNuc        object
BlandChrom      int64
NormNucl        int64
Mit             int64
Class           int64
dtype: object

Converting the BareNuc column into integer type

# to_numeric with errors = "coerce" turns non-numeric entries into NaN;
# notnull() then filters those rows out before the cast to int
cancer_data = cancer_data[pd.to_numeric(cancer_data["BareNuc"], errors = "coerce").notnull()]
cancer_data["BareNuc"] = cancer_data["BareNuc"].astype("int")
cancer_data.dtypes

Output:
ID             int64
Clump          int64
UnifSize       int64
UnifShape      int64
MargAdh        int64
SingEpiSize    int64
BareNuc        int32
BlandChrom     int64
NormNucl       int64
Mit            int64
Class          int64
dtype: object

Separating the two classes from the data frame

For cancer cell classification, we have two types of cells:

  1. Malignant: value = 4 in our dataset
  2. Benign: value = 2 in our dataset

We create two separate data frames with those names and then try to distinguish them using data visualization techniques, taking only the first fifty values from the core dataset to make plotting easier.

malignant = cancer_data[cancer_data["Class"] == 4][0:50]
benign = cancer_data[cancer_data["Class"] == 2][0:50]
plt.figure(figsize = (10, 5))
ax = plt.axes()
ax.set_facecolor("white")
plt.title("Separating the data points - Clump and UniformShape")
plt.scatter(malignant["Clump"], malignant["UnifShape"], color = "red", marker = "*")
plt.scatter(benign["Clump"], benign["UnifShape"], color = "green", marker = "+")
plt.legend(["Malignant cell class", "Benign cell class"])
plt.show()

Output:

Scatter Plots

Creating the independent and dependent data columns and converting them to numpy arrays:

dependent_data = cancer_data[["ID", "Class"]]
independent_data = cancer_data[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize',
       'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]

X_data = np.array(independent_data)
X_data[0:5]

Y_data = np.array(dependent_data["Class"])
Y_data[0:5]

Output:

array([[ 5,  1,  1,  1,  2,  1,  3,  1,  1],
       [ 5,  4,  4,  5,  7, 10,  3,  2,  1],
       [ 3,  1,  1,  1,  2,  2,  3,  1,  1],
       [ 6,  8,  8,  1,  3,  4,  3,  7,  1],
       [ 4,  1,  1,  3,  2,  1,  3,  1,  1]], dtype=int64)

array([2, 2, 2, 2, 2], dtype=int64)

Splitting the data into train and test variables

From sklearn.model_selection, import the train_test_split function. This splits the data into four arrays:

  1. X_train
  2. X_test
  3. y_train
  4. y_test

Of these, the feature arrays (X_train and X_test) are two-dimensional, while the label arrays (y_train and y_test) are one-dimensional. Just remember to set test_size = 0.2, as we need only 20 percent of the total dataset to test our model's accuracy.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_data, Y_data, test_size = 0.2, random_state = 4)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Output:

(546, 9)
(546,)
(137, 9)
(137,)

Importing the SVM from sklearn and creating a classifier instance

First we import the svm module from sklearn and use SVC, the classifier class that separates the data using support vectors. We create an instance called "classify" and give it the kernel value "linear", so that it separates the support vectors linearly. Then we fit the X_train and y_train data to the model using the fit() function. After that, we create "y_predict", which holds all the predictions in a one-dimensional array.

from sklearn import svm
classify = svm.SVC(kernel = "linear")
classify.fit(X_train, y_train)
y_predict = classify.predict(X_test)
y_predict

Output:

array([2, 4, 2, 4, 2, 2, 2, 2, 4, 2, 2, 4, 4, 4, 4, 2, 2, 2, 2, 2, 4, 2,
       4, 4, 4, 4, 2, 2, 4, 4, 4, 2, 4, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 4,
       4, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 4, 4, 2, 4, 4,
       4, 2, 2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 4, 4, 2, 2, 2, 2, 4, 4, 2, 4,
       2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 4, 2, 4, 2, 2, 4,
       2, 2, 4, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 2, 2, 4, 2, 2, 4, 2, 4, 2,
       2, 2, 2, 2, 4], dtype=int64)

So, we have successfully separated the cancerous patients from the noncancerous ones: cells with the value 4 are cancerous and cells with the value 2 are noncancerous. Now that we have the predictions, we can run them against our y_test array to check how accurate the model is. For that, we can prepare a classification report.
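
Before the full report, a quick single-number sanity check can be done with accuracy_score (a small addition, assuming the same y_test and y_predict arrays from above):

from sklearn.metrics import accuracy_score
print("Accuracy: ", accuracy_score(y_test, y_predict))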

Preparing the classification report

For this, we need to import the classification_report function from the sklearn.metrics module. Then we call it inside the print() function, testing the predictions against our y_test array. The results are as follows:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict))

Output:

              precision    recall  f1-score   support

           2       1.00      0.94      0.97        90
           4       0.90      1.00      0.95        47

    accuracy                           0.96       137
   macro avg       0.95      0.97      0.96       137
weighted avg       0.97      0.96      0.96       137

As the report shows, the precision of the model is very good. For the benign class (value = 2) the precision score is 100%, and for the malignant class (value = 4) it is 90%.
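
To see exactly where the misclassifications occur, a confusion matrix can complement the report (a small addition using the same test arrays as above):

from sklearn.metrics import confusion_matrix
# Rows are the true classes (2, 4); columns are the predicted classes
print(confusion_matrix(y_test, y_predict, labels = [2, 4]))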

Conclusion

So, in this way, we have successfully implemented a Support Vector Machine using Python and built a predictive model from the given input data.