Cluster Analysis in Python – A Quick Guide

Cluster Analysis

Sometimes we need to cluster or separate data about which we have little prior information, either to visualize it better or to understand it better. Many algorithms can do this, and in this article we are going to learn about three different approaches for clustering data. Let’s get started!

What is the Clustering of Data and Cluster Analysis?

Clustering of data means grouping data points into small clusters based on their attributes or properties. Cluster analysis is used in a variety of applications such as medical imaging and anomaly detection.

Cluster analysis is a type of unsupervised machine learning. It is used for data that does not have any labels, and clustering comes in handy for exactly that kind of data.

Types of Clustering Algorithms

Following are some of the most popular clustering algorithms:

  1. Affinity Propagation
  2. Hierarchical Agglomerative Clustering
  3. BIRCH
  4. DBSCAN
  5. K-Means
  6. Mini-Batch K-Means
  7. Mean Shift
  8. OPTICS
  9. Spectral Clustering
  10. Mixture of Gaussians

In this article, we will discuss the three most popular algorithms among these: K-Means Clustering, DBSCAN, and HAC.

1. K-Means Clustering Algorithm

This algorithm divides or segregates the data into “K disjoint clusters”. You need to choose the number of clusters (K) according to your data. Each cluster is represented by its cluster center, or centroid.

Here is how the algorithm works:

  • Step 1: First of all, choose the number of clusters (K) and initialize the cluster centers.
  • Step 2: Assign each point to its nearest cluster center by calculating the Euclidean distance.
  • Step 3: Update each cluster centroid to the mean of the points assigned to that cluster.
  • Step 4: Once the cluster centroids stop moving, or move only very small distances, we can safely say that K-means has converged.

Let’s see how to implement K-means clustering in Python. We will use the famous Iris dataset to implement our K-Means algorithm.

from copy import deepcopy
import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt

Let’s now import a CSV file and create a dataframe.

df = pd.read_csv("/content/Iris.csv") 
df.drop('Id',axis=1,inplace=True)
df.head()
df["Species"] = pd.Categorical(df["Species"])
df["Species"] = df["Species"].cat.codes

# Changing dataframe to numpy matrix
data = df.values[:, 0:4]
category = df.values[:, 4]
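
If the Iris CSV is not available at that path, one alternative (an assumption, not part of the original walkthrough) is to build the same arrays from scikit-learn’s bundled copy of the dataset:

# Alternative (an assumption, not the original walkthrough):
# load the Iris data from scikit-learn instead of a local CSV.
from sklearn.datasets import load_iris

iris = load_iris()
data = iris.data        # the four numeric features per sample
category = iris.target  # integer species codes (0, 1, 2)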

Time to create the K-Means clusters. To make things easier, we'll start by generating random initial centroids and plotting the data with the matplotlib module.

k = 3

# Number of samples in the data
n = data.shape[0]

# Number of features in the data
c = data.shape[1]

# Generating random centers
mean = np.mean(data, axis = 0)
std = np.std(data, axis = 0)
centers = np.random.randn(k,c)*std + mean

# Plotting data 
colors=['blue', 'yellow', 'green']
for i in range(n):
    plt.scatter(data[i, 0], data[i,1], s=7, color = colors[int(category[i])])
plt.scatter(centers[:,0], centers[:,1], marker='.', c='r', s=150)


[Plot: Iris data points colored by species, with the randomly initialized centroids marked in red]

Although K-means is a robust algorithm, it may converge to a local minimum rather than the global optimum.
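
The snippet above only initializes the centroids and plots the raw data points. Below is a minimal sketch of the assignment-and-update loop described in Steps 2 to 4; it reuses data, centers, n, k and deepcopy from the code above, and the exact convergence test and the final plot call are assumptions, not the article's original listing:

# A sketch of Steps 2-4: assign points to the nearest centroid and
# recompute centroids until they stop moving (assumed convergence test).
centers_old = np.zeros(centers.shape)
centers_new = deepcopy(centers)
clusters = np.zeros(n)

error = np.linalg.norm(centers_new - centers_old)
while error != 0:
    # Step 2: assign every point to its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(data[:, None, :] - centers_new[None, :, :], axis=2)
    clusters = np.argmin(distances, axis=1)

    # Step 3: move each centroid to the mean of the points assigned to it
    centers_old = deepcopy(centers_new)
    for i in range(k):
        if np.any(clusters == i):
            centers_new[i] = data[clusters == i].mean(axis=0)

    # Step 4: stop once the centroids no longer move
    error = np.linalg.norm(centers_new - centers_old)

# Plot the final centroids on top of the data
plt.scatter(centers_new[:, 0], centers_new[:, 1], marker='*', c='g', s=150)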

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

The density-based clustering algorithm is based on the idea that a cluster is a region of high point density separated from other clusters by regions of low point density. This clustering algorithm is ideal for data that has a lot of noise and outliers. The algorithm takes two parameters: minPts, the minimum number of points required to form a dense region, and eps (epsilon), the radius used to define the neighborhood of a point.

Let’s understand how the algorithm works.

  • Step 1: In the first step, the algorithm picks an arbitrary point in the dataset and then visits all the points in the dataset.
  • Step 2: If the algorithm finds at least “minPts” points within a distance of eps (epsilon) from the chosen point, it considers all these points to be part of the same cluster (a small NumPy sketch of this neighborhood test follows these steps).
  • Step 3: The process is then repeated for the neighboring points, and the clusters are expanded accordingly.
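
To make the neighborhood test in Step 2 concrete, here is a tiny standalone sketch. The points, eps, and min_pts values are made up purely for illustration; it simply counts how many points fall within eps of a chosen point and decides whether that point is a core point:

import numpy as np

# Toy data and parameters, chosen only for illustration
points = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [5.0, 5.0]])
eps = 0.5
min_pts = 3

# Distance from the first point to every point (including itself)
distances = np.linalg.norm(points - points[0], axis=1)
neighbours = np.sum(distances <= eps)

# The point is a core point if its eps-neighborhood holds at least min_pts points
print("Core point" if neighbours >= min_pts else "Not a core point")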

Let’s see how we can implement DBSCAN in Python.

First, we will import the necessary libraries.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

Now we will generate some random sample data using make_blobs.

centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)

X = StandardScaler().fit_transform(X)

In the next step, we will perform DBSCAN.

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

Output:

Estimated number of clusters: 3
Estimated number of noise points: 18
Homogeneity: 0.953
Completeness: 0.883
V-measure: 0.917
Adjusted Rand Index: 0.952
Adjusted Mutual Information: 0.916
Silhouette Coefficient: 0.626

Now, let’s plot the results that we saw in our output above.

import matplotlib.pyplot as plt
%matplotlib inline
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

[Plot: the DBSCAN clusters, with core samples as large markers, border points as small markers, and noise points in black]

The average runtime complexity of the DBSCAN algorithm is O(n log n) when a spatial index is used; in the worst case it is O(n²).

Hierarchical Agglomerative Clustering (HAC)

This type of clustering method follows a bottom-up approach. Each object is first treated as a singleton cluster. In each iteration, the two most similar clusters are merged to form a larger cluster. This process is repeated until every point belongs to one big cluster.

Let’s see how the algorithm works.

  • Step 1: In the first step, estimate the degree of similarity (distance) between every two objects in the dataset.
  • Step 2: Now, with the help of a linkage function, start grouping objects into a hierarchical cluster tree based on that distance. Hence, objects that are close to each other are combined, or linked, first.
  • Step 3: Divide the hierarchical tree into flat clusters.

Let’s see how to implement the algorithm in Python. We will generate data points using a numpy array.

import numpy as np

X = np.array([[5,3],
    [10,15],
    [15,12],
    [56,10],
    [30,40],
    [85,70],
    [91,80],
    [50,78],
    [60,55],
    [70,91],])

Now we will plot the data points we have generated. Here we are labeling the data points from 1 to 10.

import matplotlib.pyplot as plt

labels = range(1, 11)
plt.figure(figsize=(10, 7))
plt.subplots_adjust(bottom=0.1)
plt.scatter(X[:,0],X[:,1], label='True Position')

for label, x, y in zip(labels, X[:, 0], X[:, 1]):
    plt.annotate(
        label,
        xy=(x, y), xytext=(-3, 3),
        textcoords='offset points', ha='right', va='bottom')
plt.show()

Output:

[Plot: scatter plot of the ten data points, annotated with their labels 1 to 10]

You might notice that the data points form roughly three groups: one with points 1, 2, and 3, another with points 4 and 5, and another with points 6 to 10. In the case of multi-dimensional data, however, it is very difficult to point out such clusters with the naked eye.

Let’s plot the dendrogram for the data points.

from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

linked = linkage(X, 'single')

labelList = range(1, 11)

plt.figure(figsize=(10, 7))
dendrogram(linked,
            orientation='top',
            labels=labelList,
            distance_sort='descending',
            show_leaf_counts=True)
plt.show()

Output:

[Plot: dendrogram of the ten data points built with single linkage]

The algorithm first finds the points that are closest to one another by calculating the Euclidean distance or Manhattan distance. You can see from the previous plot that points 2 and 3, and points 6 and 7, were closest to each other, and hence they are joined first in the dendrogram.

The vertical height in the dendrogram denotes the Euclidean distance between the clusters being merged. In the next steps, the algorithm joins each cluster to its nearest cluster, and so on. This is repeated until one big cluster is formed and all the points are joined.
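
Step 3 of the HAC procedure, dividing the hierarchical tree into flat clusters, is not shown above. A minimal sketch using SciPy's fcluster on the same linkage matrix could look like this; asking for three clusters is an illustrative assumption, not part of the original walkthrough:

from scipy.cluster.hierarchy import fcluster

# Cut the tree built above ('linked') into at most three flat clusters.
# The choice of three clusters is an illustrative assumption.
flat_clusters = fcluster(linked, t=3, criterion='maxclust')
print(flat_clusters)  # one cluster label per data point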

Conclusion

In summary, we have learned three popular clustering algorithms and how to use them in Python. These three algorithms take very different approaches to clustering, and you have to choose one based on your dataset. We use clustering when we want to group data without any prior information about it, in other words, in an unsupervised way!