Clustering is a type of unsupervised learning which is used to group similar objects in one cluster.

Speaking technically, clustering is the process of grouping data points based on any criteria into different groups such that data points belonging to a group are similar to each other and are dissimilar to the data points in other groups.

One important point to remember is that since clustering is a type of unsupervised learning, the data is unlabeled. So the grouping is done without any knowledge of the labels.

Also read: Supervised vs Unsupervised Learning – Differences to know!

In machine learning and data mining, the criteria to group certain objects into one cluster is the distance between the data points.

The distance between each and every point is calculated, and these points are grouped accordingly.

**Distance Measures**

Most clustering algorithms use distance measures to assess the similarities or differences between a pair of objects.

There are different measures of distance, but Euclidean distance is popularly used.

**Euclidean Distance for Clustering**

The graph for Euclidean distance between two points, P and Q, is shown below.

Euclidean distance is the traditional metric for geometry. It is one of the most used algorithms in cluster analysis. One of the algorithms that use this formula is K-means. The formula for Euclidean distance is given below.

Here, the distance between the two points ( x and y ) is denoted as ‘d’. Inside the root, we have the square of the distance between the ordinate points(x1 and x2) and the square of the distance between abscissa (y1 and y2).

Let us understand with an example.

Consider two points A(3,5) and B(5,2).

The Euclidean distance can be calculated as:

**The Manhattan Distance **

The Manhattan distance is also called Taxicab geometry.

Let us see the working of the Manhattan Distance with the help of an example.

If you observe the above image, let us assume that one of the black dot(at the lower end of the graph) is the source from which the taxi driver should navigate to another black dot (at the higher level). Now, the only possible way for the taxi driver to reach the destination is by traveling through right angles, given that the city is in the form of a grid.

To elaborate, since one point is at the lower end of the graph and one at the higher end, the difference between these two points would be negative. Hence, we use the absolute difference between these two points.

The manhattan distance is defined as the absolute difference between the pair of coordinates measured at right angles.

Suppose we have two points A(x1,y1) and B(x2,y2). The Manhattan distance is calculated as follows.

Let us take two points A(2,4) and B(5,6).

The Manhattan distance is calculated as shown below.

**The Jaccard Distance**

The Jaccard distance is slightly different from the above two measures.

The Jaccard index determines the similarity between two data points by considering the union and intersection of the two points.

The Jaccard index can be defined as the ratio of intersection by the union of the two points.

Suppose there are two points, A and B; the Jaccard index is calculated as:

The above formula can be represented in the form of a representation of intersection and union as given below.

And the Jaccard distance is calculated as the **1-Jaccard Index.**

Consider two sets A{1,2,3,5} and B{3,5,6,7,8}.

First, we need to find the intersection and union of the two sets.

The intersection is defined as the common points in the two sets.

Union is defined as the combination of all points from the two sets.

Here, J(A, B) is the Jaccard Index, and dj is the Jaccard distance.

**The Minkowski Distance**

Minkowski Distance is the generalized version of Euclidean and Manhattan Distance.

In an N-dimensional space, a point is represented as:

If we observe carefully, we can notice two points.

- If p = 1, the Minkowski distance is the same as the Manhattan distance
- If p = 2, the above formula is the same as the Euclidean distance

**Clustering Algorithms**

There are a handful of clustering algorithms that group similar objects using different approaches. In this post, we are going to look at the following clustering algorithms.

- K-means
- Hierarchical clustering

** K-means Clustering Algorithm**

K-means clustering is an unsupervised learning algorithm that groups unlabeled data points into different clusters based on the distance between them.

It is also called t**he flat clustering **algorithm. The K-means is a centroid-based algorithm. It means that every cluster has a centroid associated with it, and the data point which has the minimum distance from the centroid is grouped in that cluster.

We can say that K-means is an iterative algorithm that groups the data points into clusters in such a way that each dataset belongs to only one group with similar trends.

It keeps repeating the same steps until no points are left to perform clustering.

**The ‘K’ in K-means**

The K in K-means typically says how many clusters we need to group all the data points.

Suppose if K=2, there are two clusters involved.

If K=3, there are three clusters in total.

You may confuse K-means with KNN.

While K-means is an unsupervised learning algorithm, KNN comes under supervised machine learning.

Also, the ‘K’ in K-means denotes the number of clusters required to group all the data points. But the ‘K’ in KNN stands for the number of neighbors you need to take into consideration before predicting that a particular point belongs to a certain class.

**Working of K-means**

Let us see the steps to perform K-means clustering.

**Step 1**: The K needs to be predetermined. That means we need to specify the number of clusters that are to be used in this algorithm.

**Step 2**: K data points from the given dataset are selected randomly. These data points become the initial centroids.

Once the centroids are computed, we need to find the distance between each and every data point from these centroids.

**Step 3: **The data points with a minimum distance from a centroid are put in the respective cluster.

**Step 4**: The new centroids are calculated based on the points gathered in step 3.

Repeat step 3.

We keep repeating steps 3 and 4 until two consecutive iterations result in the same clusters.

**Exploring the K-Means Class**

The syntax for the class is given below.

```
class sklearn.cluster.KMeans(n_clusters=8, *, init='k-means++', n_init='warn', max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')
```

The parameters of the syntax are explained below.

**n_clusters: **Gives the number of clusters to be formed. Accepts integer. The default value is 8.

This argument does not take any integer greater than 8.

**init**: It is a method for initializing the algorithm. The type it takes is an array. The default value is `kmeans++`

This method selects initial clusters by a probability distribution which speeds up convergence. There are two variants of this method-“greedy kmeans++” and “vanilla kmeans++.” “greedy kmeans++” is preferred to “vanilla kmeans++” as it makes several trials at each sampling and chooses the best centroid.

**n_init: **Gives the number of times the algorithm runs with different centroids. The type is ‘auto’ or int. The default value is 10.

When `n_init='auto'`

, and if `init=random`

, the number of runs is 10.

When `n_init='auto'`

, and if `init='kmeans++'`

, the number of runs is 1.

**max_iter**: Maximum number of times the algorithm executes in a single run. Accepts only integers. The default value is 300.

If the `max_iter`

is, say 50, the algorithm runs 50 times before giving the final output.

**tol: **It gives the relative tolerance of two consecutive iterations to declare convergence. The K-means algorithm is said to converge if any two consecutive iterations yield the same clusters. The type is float. The default value is 1e-4.

**verbose: **Gives the verbosity mode. Verbosity is defined as showing more information for a specific task. So the higher the verbose, the more information is.

Accepts integers. The default value is 0.

**random_state: **This parameter is used to seed the centroids. If the random_state is None or 0, the centroids change or every run. If the random_state is an integer(e.g., 1,2), the centroids are constant. The type is an integer. The default is None.

**copy_x: **If copy_x is True, the original data is not modified. Else, the original data is modified.

There are a few exceptions in this case.

If the original data is not in C-contiguous form, the data is modified even if the copy_x is False. If the data is sparse but not in a compressed sparse row(CSR) format, a copy will be made even if the copy_x is False.

It accepts the bool data type. The default is True.

**algorithm: **Says which algorithm to use. We can choose from:{“lloyd”, “elkan”, “auto”, “full”}. “lloyd” uses Expectation-Maximization(EM) style.

“elkan” is used on datasets with well-defined clusters.

**Note:** “auto” and “full” are removed in the newer versions as they are aliases of “lloyd”.

**Implementation in Python**

Here is a simple code to show the implementation of K-means.

```
#importing the necessary libraries
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
```

In line 2, we are importing the KMeans class from `sklearn.cluster`

module.

`sklearn`

is an alias name for the `scikit-learn`

library. This library is used for many classification, clustering, and regression tasks.

**import numpy as np**: The NumPy library is imported as ‘np’ as a short form for creating arrays.

**import matplotlib.pyplot as plt**: The matplot library has a module called `pyplot`

that is used for graphical representations.

```
#creating the data points and their respective classes
X=np.array([[1.713,1.586],[0.180,1.786],[0.353,1.24],[0.940,1.566],[1.486,0.759],[1.266,1.106],[1.540,0.419],[0.459,1.799],[0.733,0.186]])
y=np.array([0,1,1,0,1,0,1,1,1])
#using the kmeans
model=KMeans(n_clusters=2,random_state=2401)
model.fit(X,y)
print("Input Data: \n\nVAR1 \tVAR2 \tCLASS\n")
v=0
for i in X:
print(i[0],"\t",i[1],"\t"," ",y[v])
v+=1
print("*"*50)
print("Test Data:")
t=[]
v1=float(input("Enter value of var1: "))
v2=float(input("Enter value of var2: "))
t.append(v1)
t.append(v2)
print("The predicted output is: ",model.predict([t]))
```

So we have a 2D array that is stored in X. This 2D array fills up the values for VAR1 and VAR2, respectively.

Example: If we take [1.713,1.586], we have VAR1=1.713 and VAR2=1.586.

Next, we have the respective classes of these data points stored in a variable called y.

For example, the data point [1.713,1.586] belongs to class 0.

We are creating an instance of the KMeans class, which is stored in the variable called a model.

The parameter `n_clusters`

gives the number of clusters which in this case is 2.

The parameter `random_state`

makes sure that the algorithm will generate the same clusters in every interaction.

The value can be any integer. For example, `random_state`

=1.

If the `random_state=`

0, the output keeps changing.

**model.fit(X,y) : **We are fitting the model to converge on the given data to produce the clusters.

Lines 7 to 12 are written to print the training data we have created in the earlier steps.

In line 14, we are creating an empty list to store the test data.

In the next two lines, we are creating new variables v1 and v2 for storing the test data.

We are using float to convert the data we give into the type of training data.

The test data is appended to the newly created empty list.

For example, if the test data is v1=0.906 and v2=0.606, t=[0.906,0.606].

In the last line, we are predicting the class of the test data using `predict`

.

The output is given below.

Next, we are going to create the clusters and their respective centroids.

As observed from the output, the new test data,[0.906,0.606] belongs to class 1.

The code is given below.

```
for i in range(2):
plt.scatter(X[y==i,0],X[y==i,1],label=("Cluster ",i))
plt.scatter(model.cluster_centers_[:,0],model.cluster_centers_[:,1],label="Centroids",color="black",marker="*",s=100)
plt.scatter(t[0],t[1],color='m',label="test_data")
plt.legend()
plt.show()
```

Since there are two clusters, we iterate the loop two times.

**plt.scatter(X[y==i,0],X[y==i,1],label=(“Cluster”,i): **So this line is used to create clusters and group the elements of the first and second columns into a cluster based on the y class.

We are also providing the label Cluster.

In the next line, we are creating centroids.The attribute cluster_centers_[:,0] and cluster_centers_[:,1] is used for indexing.

The label for the created centroids is named Centroids. The centroids are black color star-shaped whose size is specified to be 100.

The next line is used to plot the test data we gave in the previous section. The color of the test data is given as magenta.

**plt.legend()**: The legend is used to provide the meaning for the elements used in the scatter plot.

**plt.show():** This line is used to show the scatter plot that has been created.

The output is given below.

**Hierarchical Clustering**

Hierarchical clustering is another clustering algorithm that uses a hierarchical tree-like structure called a dendrogram to represent the clusters.

There are two approaches to hierarchical clustering.

Agglomerative(Bottom-up) and Divisive(Top-Down).

We are going to see agglomerative hierarchical clustering.

Hierarchical clustering comes into the picture because, with K-means, we need to decide on the number of clusters even before the algorithm starts. But in Hierarchical clustering, we don’t need to know the number of clusters beforehand.

Also, the K-means always create clusters of the same size. It is not the case with Hierarchical clustering. In hierarchical clustering, the size of the clusters need not be the same.

**Agglomerative Clustering**

Agglomerative clustering uses a bottom-up approach. It means that this algorithm considers each data point as a single cluster and merges these clusters based on the distance between them into one big cluster at the end that contains all the data points.

**Working of Agglomerative Clustering**

Here are the steps we need to follow to implement agglomeration clustering.

**Step 1:** If there are n data points, these points are considered as n different clusters.

**Step 2: **Take any two data points that are closer to each other and merge them into one cluster.

After this step, there would be n-1 clusters.

**Step 3: **Take the next two clusters with minimum distance and merge them.

By now, there would be n-2 clusters left.

Repeat step 3 until there is only one cluster left.

While merging any two clusters, we need to observe a distance measure which is called linkage.

There are three types of linkages:

**Single Linkage**

In a single linkage, while merging two clusters, the data points of each cluster that are closer to each other are considered to be the point of merging.

**Complete Linkage**

In complete linkage, the data points that are farther from each other are considered for merging.

**Average Linkage**

In average linkage, the average of each cluster is calculated, which is determined to be the merging point.

**Exploring** **the AgglomerativeClustering Class**

The syntax is given below.

```
class sklearn.cluster.AgglomerativeClustering(n_clusters=2, *, affinity='deprecated', metric=None, memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', distance_threshold=None, compute_distances=False)
```

**n_clusters***: *Gives the number of clusters to find. It is None if `distance_threshold`

is not None. The type is int or None. The default is 2.

**affinity***: *The affinity is a metric used to calculate the distance between the clusters. If linkage is “ward”, only “euclidean” is accepted.

Type: str or callable. The default is ’euclidean’.

**metric***: *The metric is used to calculate* *the linkage. It can be any of the following: {“euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”}. If the default is set, euclidian is used. If linkage is “ward”, only “euclidean” is accepted. If “precomputed” is used, a distance matrix is needed as input for the fit method

Type: str or callable. The default is None.

**memory:** Used to cache the output of the tree. If None is used, no caching is done. If any path is given, it is taken as the path to a directory.

Type: str. The default is None.

**connectivity: **It is a matrix that stores the neighboring instances for all the samples. It can be a matrix that makes the algorithm structured. If the default(None) is used, the cluster is unstructured.

The type is an array, and the default value is None.

**compute_full_tree:** This argument decides if a complete dendrogram has to be constructed.

If `compute_full_tree`

is set to `True`

, a full dendrogram will be computed. This may be helpful in keeping track of the history of the merges performed.

If `compute_full_tree`

is set to `False`

, only the final cluster is returned. This saves memory and computation time, especially for large datasets.

If `compute_full_tree`

is set to `'auto'`

(the default), it is up to the library to decide whether or not to compute the full dendrogram based on the size of the input data and the `n_clusters`

parameter.

**linkage:** Suggest which linkage to be used. It can be anything from {‘ward’, ‘complete’, ‘average’,’ single’}. The default is ‘ward’.

- ‘ward’ gives the minimum variance of the clusters being merged
- ‘average’ computes the average of the distances of each observation of the two sets and uses it as the merging factor.
- ‘complete’ or ‘maximum’ linkage uses the maximum distances between all observations of the two sets.
- ‘single’ uses the minimum of the distances between all observations of the two sets.

**distance_threshold:** It gives the limit above which the clusters will not be merged.

The data type is float, and the default is None.

**compute_distances:** It gives the distances between clusters even if distance_threshold is not used.

It can be used for dendrogram visualization.

It accepts the bool data type, and the default is False.

**Implementation of Agglomerative Clustering**

Firstly, we need to import the necessary modules and libraries.

```
import numpy as np
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
x = [4, 5, 10, 4, 3, 11, 14 , 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
#plotting the x and y values
plt.scatter(x, y)
plt.show()
```

**import numpy as np: **NumPy is a library that is used for creating and working with arrays. We are using the standard alias name ‘np’.

**from sklearn.cluster import AgglomerativeClustering**: The scikit-learn is a library that has a class called AgglomerativeClustering.

**import matplotlib.pyplot as plt**: The matplot library is imported as plt.

Scipy is a library that is used for technical computations. It has attributes like dendrogram and linkage, which are necessary for Hierarchical clustering.

In the next two lines, we are creating two arrays that are supposed to be merged into one cluster.

**plt.scatter(x,y):** This line is used to create a scatter plot.

**plt,show(): **This function is used to show the scatter plot on the screen.

```
#dendrogram
data = list(zip(x, y))
linkage_data = linkage(data, method='ward', metric='euclidean')
dendrogram(linkage_data)
plt.show()
```

In the second line, we are concatenating the x and y variables and storing it in data.

We can calculate the distance measure and linkage by specifying the method as ward and metric as euclidean.

The ward by default uses a single linkage.

We are passing the data to the dendrogram function, which creates a dendrogram of the data points.

Here, the x and y values may not be the same as the original data points because of the merge and linkage.

```
hc= AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
labels=hc.fit_predict(data)
plt.scatter(x, y, c=labels)
plt.show()
```

In the first line, we are creating an instance of the AgglomerativeClustering that is stored in a variable called hc.

The number of clusters is 2, the distance measure is euclidean and the type of linkage is ward.

In the next line, we are fitting the data to predict the clusters.

The plt,scatter is used to create a scatter plot with x and y variables. The ‘c’ is used to give different colors to the clusters.

The output is given below.

Suppose in line 3, the attribute ‘c’ is not mentioned.

```
plt.scatter(x, y)
plt.show()
```

The output would still be a scatter plot with two clusters but with no color difference.

**Conclusion**

To summarize, we have seen how clustering works with unlabeled data and the various distance measures(Euclidean, Manhattan, Jaccard) and their examples.

We have seen the K-means algorithm, how it works and its implementation in python, the major difference between K-means and Knn algorithms, and also how we can create centroids and clusters in the algorithm.

We have also observed the working of `plt.legend()`

.

We have also seen the Hierarchical clustering, how K-means and Hierarchical clustering are different, and also its implementation.

It is also observed how we can give different colors to differentiate between two clusters.

## References

Refer to the scikit-learn documentation of K-Means class.

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Refer to the scikit-learn documentation of AgglomerativeClustering class.

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

Refer to the stack overflow question on Agglomerative Clustering.

https://stackoverflow.com/questions/32219881/why-cant-i-import-the-agglomerativeclustering-class