Machine Learning algorithms range from linear regression to decision trees, and they come in many types. When the task is to group or separate data points with a linear boundary, two techniques that experts often recommend are:
- K-means clustering.
- Support Vector Machines.
An ML model is usually one of two types:
- Supervised Learning: Learns from labeled examples, where the programmer supplies the expected output for each input. SVM belongs to this type.
- Unsupervised Learning: Needs no labels. The model finds structure in the data on its own. K-means clustering belongs to this type.
General Theory
The main aim of this article is to show the reader how the SVM technique works. Data on the internet is usually available in raw form. When we structure and visualize it, the value we want to predict turns out to be either discrete or continuous. Accordingly, SVMs are used for two purposes:
- Classification: For discrete targets (class labels).
- Regression: For continuous targets (numeric values).
This is one of the main reasons why Support Vector Machines are widely used for classification and regression. By definition, Support Vector Machines are a set of supervised learning algorithms that help us classify data and analyze its nature.
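As a rough illustration of these two uses, the sketch below fits scikit-learn's SVC on a discrete target and SVR on a continuous one. The toy arrays and variable names are assumptions made up for this example and are not part of the article's dataset.

```python
from sklearn.svm import SVC, SVR

# Toy inputs: one feature per sample (illustrative values only)
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]

# Classification: the target is a discrete label (here 2 or 4)
labels = [2, 2, 2, 4, 4, 4]
classifier = SVC(kernel="linear")
classifier.fit(X, labels)
predicted_labels = classifier.predict([[0.5], [4.5]])
print(predicted_labels)   # each prediction is one of the known labels

# Regression: the target is a continuous number
targets = [0.0, 1.1, 1.9, 3.2, 3.9, 5.1]
regressor = SVR(kernel="linear")
regressor.fit(X, targets)
predicted_values = regressor.predict([[0.5], [4.5]])
print(predicted_values)   # each prediction is a real-valued estimate
```

The classifier can only return labels it saw during training, while the regressor returns a number anywhere on the real line.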
Components of SVM
- Support vectors: These are the main components. They are the data points closest to the decision boundary, lying on either side of the maximum margin.
- Maximum margin: The widest possible gap between the two classes, measured between the positive and negative hyperplanes.
- Maximum margin hyperplane: The decision boundary that lies midway between the positive and negative hyperplanes.
- Positive hyperplane: The margin boundary on the side of the positive class (the right side in the diagram).
- Negative hyperplane: The margin boundary on the side of the negative class (the left side in the diagram).
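The components listed above can be made concrete with a little arithmetic. The sketch below assumes a hypothetical two-dimensional hyperplane with weights w and bias b, plus four made-up points. None of these values come from the article's dataset. It computes the margin width as 2 / ||w|| and picks out the points that lie on the positive or negative hyperplane.

```python
import math

# Hypothetical 2-D hyperplane w.x + b = 0 (values chosen for illustration)
w = (1.0, 1.0)
b = -5.0

# Toy points with labels -1 (negative class) and +1 (positive class)
points = [((1.0, 1.0), -1), ((2.0, 2.0), -1), ((3.0, 3.0), +1), ((4.0, 4.0), +1)]

def decision_value(x):
    """Return w.x + b: 0 on the maximum margin hyperplane,
    +1 on the positive hyperplane, -1 on the negative hyperplane."""
    return w[0] * x[0] + w[1] * x[1] + b

# Width of the margin between the positive and negative hyperplanes
margin_width = 2.0 / math.hypot(w[0], w[1])

# Support vectors are the points that sit exactly on a margin hyperplane
support_vectors = [x for x, _ in points if math.isclose(abs(decision_value(x)), 1.0)]

for x, label in points:
    side = "positive" if decision_value(x) > 0 else "negative"
    print(x, "label", label, "decision value", decision_value(x), "->", side, "side")

print("Margin width:", round(margin_width, 4))   # 1.4142
print("Support vectors:", support_vectors)       # [(2.0, 2.0), (3.0, 3.0)]
```

Only the two points nearest the boundary determine the margin. Moving the outer points further away would not change the hyperplane.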
Diagram

In this diagram, we can clearly see that the main margin separates the data points by color. The colors show which class each point belongs to. The main aim of SVM is to find the boundary that separates the classes with the widest possible margin.
Example and applications
Suppose we have a class: Vehicle. Our task is to pick out the Sports Utility Vehicles (SUVs) from that class, which also contains various other types. Sorting them manually takes a lot of time and is prone to errors. To make the classification more reliable, we can create a Support Vector Machine that classifies all the car models in the parent vehicle class. It works in the following steps:
- The model takes a sample image.
- Then it compares the image with the labeled training data of vehicle types provided earlier.
- After that, it tells us which type of car is in the input image.
An SVM is a simple and effective choice for this kind of task.
Implementing Support Vector Machines
In this section, we shall implement a Support Vector Machine step by step. So, let’s get started!
Environment details:
- Python 3.9.7
- IDE: Jupyter Notebooks
- Environment: Anaconda 3
- Dataset: Cancer dataset (cell_samples.csv)
Importing the necessary libraries for data reading and preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from warnings import filterwarnings
filterwarnings("ignore")
Reading the dataset
cancer_data = pd.read_csv("Datasets/cell_samples.csv", sep = ",")
cancer_data.head()
Output:

Checking for null values
cancer_data.isna().sum()

Getting the general info about the dataset
print("The shape of the dataset is: ", cancer_data.shape)
# DataFrame.size is the number of cells (rows x columns), not a size in bytes
print("The size of the dataset is: ", cancer_data.size, " elements\n")
print("The count of each attribute of the dataset is: \n")
print(cancer_data.count())
print("\nThe datatype of each attribute is: \n")
print(cancer_data.dtypes)
Output:
The shape of the dataset is: (699, 11)
The size of the dataset is: 7689 elements
The count of each attribute of the dataset is:
ID 699
Clump 699
UnifSize 699
UnifShape 699
MargAdh 699
SingEpiSize 699
BareNuc 699
BlandChrom 699
NormNucl 699
Mit 699
Class 699
dtype: int64
The datatype of each attribute is:
ID int64
Clump int64
UnifSize int64
UnifShape int64
MargAdh int64
SingEpiSize int64
BareNuc object
BlandChrom int64
NormNucl int64
Mit int64
Class int64
dtype: object
Converting the BareNuc column into integer type
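# Non-numeric BareNuc entries become NaN and those rows are dropped;
# .copy() gives an independent frame so the assignment below is safe
cancer_data = cancer_data[pd.to_numeric(cancer_data["BareNuc"], errors = "coerce").notnull()].copy()
cancer_data["BareNuc"] = cancer_data["BareNuc"].astype("int")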
cancer_data.dtypes
ID int64
Clump int64
UnifSize int64
UnifShape int64
MargAdh int64
SingEpiSize int64
BareNuc int32
BlandChrom int64
NormNucl int64
Mit int64
Class int64
dtype: object
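To see what the conversion does on a small scale, here is a minimal sketch on a made-up column. The toy values, including the "?" entry, are assumptions for illustration and are not taken from cell_samples.csv.

```python
import pandas as pd

# Toy column standing in for BareNuc; "?" is a made-up non-numeric entry
toy = pd.DataFrame({"BareNuc": ["1", "10", "?", "3"]})

# errors="coerce" turns values that cannot be parsed into NaN
parsed = pd.to_numeric(toy["BareNuc"], errors="coerce")
print(parsed.tolist())               # [1.0, 10.0, nan, 3.0]

# Keep only rows that parsed successfully, then cast to integers
cleaned = toy[parsed.notnull()].copy()
cleaned["BareNuc"] = cleaned["BareNuc"].astype("int")
print(cleaned["BareNuc"].tolist())   # [1, 10, 3]
```

Filtering this way removes whole rows, which would explain why the arrays used later have fewer than 699 rows.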
Separating the two classes from the data frame
For cancer cell classification, our dataset has two types of cells:
- Malignant: value = 4 in our dataset
- Benign: value = 2 in our dataset
We create two separate data frames, named malignant and benign, and compare them using data visualization. We take only the first fifty values of each class from the core dataset, which makes plotting easier.
malignant = cancer_data[cancer_data["Class"] == 4][0:50]
benign = cancer_data[cancer_data["Class"] == 2][0:50]
plt.figure(figsize = (10, 5))
ax = plt.axes()
ax.set_facecolor("white")
plt.title("Separating the data points - Clump and UniformShape")
plt.scatter(malignant["Clump"], malignant["UnifShape"] , color = "red", marker = "*")
plt.scatter(benign["Clump"], benign["UnifShape"], color = "green", marker = "+")
plt.legend(["Malignant cell class", "Benign cell class"])
plt.show()

Creating independent and dependent data column lists with their numpy arrays:
dependent_data = cancer_data[["ID", "Class"]]
independent_data = cancer_data[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize',
'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X_data = np.array(independent_data)
X_data[0:5]
Y_data = np.array(dependent_data["Class"])
Y_data[0:5]
Output:
array([[ 5, 1, 1, 1, 2, 1, 3, 1, 1],
[ 5, 4, 4, 5, 7, 10, 3, 2, 1],
[ 3, 1, 1, 1, 2, 2, 3, 1, 1],
[ 6, 8, 8, 1, 3, 4, 3, 7, 1],
[ 4, 1, 1, 3, 2, 1, 3, 1, 1]], dtype=int64)
array([2, 2, 2, 2, 2], dtype=int64)
Splitting the data into train and test variables
From sklearn.model_selection, import the train_test_split function. It splits the data into four arrays:
- X_train
- X_test
- y_train
- y_test
Of these, X_train and X_test are two-dimensional (samples by features), while y_train and y_test are one-dimensional. We set test_size = 0.2 so that 20 percent of the dataset is held out to test the model's accuracy.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_data, Y_data, test_size = 0.2, random_state = 4)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
Output:
(546, 9)
(546,)
(137, 9)
(137,)
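The shapes above follow from simple arithmetic on the 683 rows that remain after cleaning. The sketch below is a hypothetical pure-Python stand-in for a shuffled split, not scikit-learn's implementation. It assumes the test count is rounded up, which reproduces the 546/137 division, and the function name and demo data are made up for illustration.

```python
import math
import random

def simple_train_test_split(X, y, test_size=0.2, seed=4):
    """Shuffle row indices, then hold out ceil(test_size * n) rows for testing.
    Illustrative only: the shuffle differs from scikit-learn's, so the
    selected rows will not match train_test_split for the same seed."""
    n_samples = len(X)
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# 683 stand-in rows with 9 features each, mirroring the shapes printed above
X_demo = [[i] * 9 for i in range(683)]
y_demo = [2 if i % 2 == 0 else 4 for i in range(683)]

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(len(X_tr), len(X_te), len(y_tr), len(y_te))   # 546 137 546 137
```

Fixing the seed, as random_state = 4 does in the article, makes the split repeatable between runs.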
Importing the SVM from sklearn and creating a classifier instance
First we import the svm module from sklearn and use its SVC class, the support vector classifier. Create an instance named “classify” and set the kernel to “linear” so that the model separates the classes with a linear boundary. Then we fit the model on X_train and y_train using the fit() function. After that, we create a variable “y_predict”, which holds all the predictions in a one-dimensional array.
from sklearn import svm
classify = svm.SVC(kernel = "linear")
classify.fit(X_train, y_train)
y_predict = classify.predict(X_test)
# Displayed as the last expression of the notebook cell
y_predict
Output:
array([2, 4, 2, 4, 2, 2, 2, 2, 4, 2, 2, 4, 4, 4, 4, 2, 2, 2, 2, 2, 4, 2,
4, 4, 4, 4, 2, 2, 4, 4, 4, 2, 4, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 4,
4, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 4, 4, 2, 4, 4,
4, 2, 2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 4, 4, 2, 2, 2, 2, 4, 4, 2, 4,
2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 4, 2, 4, 2, 2, 4,
2, 2, 4, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 2, 2, 4, 2, 2, 4, 2, 4, 2,
2, 2, 2, 2, 4], dtype=int64)
So, we have successfully separated the cancerous samples from the noncancerous ones. The cells with value 4 are cancerous and those with value 2 are noncancerous. Now that we have the predictions, we can compare them against our y_test array to check how accurate the model is. For that, we can prepare a classification report.
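To see what a fitted linear classifier contains, the sketch below trains SVC on four made-up two-dimensional points and prints its learned attributes. The toy data and variable names are assumptions for illustration. The attributes support_vectors_, coef_ and intercept_ and the decision_function() method belong to scikit-learn's SVC, and coef_ is only available with a linear kernel.

```python
from sklearn import svm

# Toy 2-D points (made up for illustration), labelled like the dataset: 2 or 4
X_toy = [[1, 1], [2, 2], [4, 4], [5, 5]]
y_toy = [2, 2, 4, 4]

toy_model = svm.SVC(kernel="linear")
toy_model.fit(X_toy, y_toy)

# Points that define the margin
print("Support vectors:", toy_model.support_vectors_.tolist())
# Weights w and bias b of the separating hyperplane w.x + b = 0
print("Weights:", toy_model.coef_[0].round(2).tolist())
print("Bias:", toy_model.intercept_.round(2).tolist())

# decision_function gives w.x + b; its sign selects the class
new_points = [[0, 0], [6, 6]]
print("Decision values:", toy_model.decision_function(new_points).round(2).tolist())
toy_predictions = toy_model.predict(new_points)
print("Predictions:", toy_predictions.tolist())
```

A negative decision value maps to the first class (2) and a positive one to the second class (4), which is how predict() arrives at its labels.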
Preparing the classification report
For this, we need to import the classification_report function from the sklearn.metrics module and call it inside the print() function. We compare y_test with y_predict, and the results are as follows:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict))
Output:
              precision    recall  f1-score   support
           2       1.00      0.94      0.97        90
           4       0.90      1.00      0.95        47
    accuracy                           0.96       137
   macro avg       0.95      0.97      0.96       137
weighted avg       0.97      0.96      0.96       137
As the results show, the precision of the model is very good. For the benign class (value = 2), the precision score is 100%. For the malignant class (value = 4), the precision score is 90% with a recall of 100%. The overall accuracy is 96%.
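The figures in a classification report come from counting correct and incorrect predictions per class. The sketch below computes precision, recall and F1 by hand on a small made-up pair of label lists. The helper name and the demo labels are hypothetical and are not the article's y_test or y_predict.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for one class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Made-up labels: one benign sample (2) is wrongly predicted as malignant (4)
y_true_demo = [2, 2, 2, 2, 2, 4, 4, 4]
y_pred_demo = [2, 2, 2, 2, 4, 4, 4, 4]

for label in (2, 4):
    p, r, f = precision_recall_f1(y_true_demo, y_pred_demo, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# class 2: precision=1.00 recall=0.80 f1=0.89
# class 4: precision=0.75 recall=1.00 f1=0.86
```

The toy result has the same pattern as the report: a benign sample flagged as malignant lowers the recall of class 2 and the precision of class 4, while no malignant sample is missed.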
Conclusion
So, in this way we have successfully implemented a Support Vector Machine classifier in Python and built a predictive model from the given input data.