A cluster is a group of similar objects: objects with similar properties are collected in one cluster, while objects with dissimilar properties fall into other clusters. Clustering is the process of dividing objects into groups such that the objects in each group are more similar to one another than to those in other groups. Many clustering algorithms are in use, such as k-means and mean-shift clustering. In this article, we discuss CLASSIX, a clustering toolbox that aims to cluster data quickly and accurately and can also explain how the clustering was carried out. The major points to be discussed are listed below.
Table of contents
- What is clustering?
- How does CLASSIX cluster the data?
- Implementing CLASSIX in Python
Let’s first discuss clustering.
What is clustering?
Clustering is the process of grouping items so that members of the same group (cluster) have more in common with one another than with members of other groups. Clustering looks at all of the input data and is commonly used in machine learning (ML).
When machine learning practitioners create a cluster, they examine all of the different data points and group them together based on what features they have in common with other data. The algorithm determines the clustering strategy.
Clustering procedures can involve calculating the average distance between data points in the feature space, counting the number of intervals for each set of data, predicting the number of clusters, or basing the clusters on dense regions of the data. The result is an assignment of each data point to a cluster; some methods, such as CLASSIX, can also explain why a data point belongs to its cluster.
How does CLASSIX cluster the data?
Distance-based clustering algorithms, such as k-means, take into account the pairwise distance between points when deciding whether or not they should be grouped together. DBSCAN and other density-based clustering algorithms adopt a more global approach, assuming that data occurs in continuous zones of high density surrounded by low-density regions.
Many density-based clustering methods have the advantage of handling clusters of any shape without requiring the number of clusters to be defined in advance. On the other hand, they usually require more parameter tuning.
CLASSIX is a method that shares characteristics of both distance-based and density-based methods. The approach is divided into two stages: aggregation and merging. During the aggregation phase, data points are sorted along their first principal component and then grouped using a greedy aggregation technique.
Sorting is essential for traversing the data with almost linear complexity, provided the number of pairwise distance computations stays small. While the initial sorting has an average-case complexity of O(n log n), it is performed only on scalar values, regardless of the dimensionality of the data points. As a result, the cost of this initial sorting is almost insignificant compared to calculations on full-dimensional data.
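To make the aggregation phase more concrete, here is a minimal sketch in plain Python. It is not the CLASSIX library's implementation: the function names (first_pc_scores, aggregate) are hypothetical, the first principal component is estimated with a simple power iteration, and the data scaling that a full implementation would apply is omitted. The early exit in the inner loop is valid because the difference between two scores along a unit direction can never exceed the distance between the two points.

```python
import math

def first_pc_scores(points, iters=100):
    """Centre the points and project them onto an estimate of the first
    principal component (power iteration; hypothetical helper)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    centred = [[p[k] - mean[k] for k in range(d)] for p in points]
    v = [1.0 / math.sqrt(d)] * d  # unit start vector (an arbitrary choice)
    for _ in range(iters):
        proj = [sum(x[k] * v[k] for k in range(d)) for x in centred]
        w = [sum(proj[i] * centred[i][k] for i in range(n)) for k in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:  # degenerate case: keep the current direction
            break
        v = [c / norm for c in w]
    scores = [sum(x[k] * v[k] for k in range(d)) for x in centred]
    return centred, scores

def aggregate(points, radius):
    """Greedy aggregation: walk the points in sorted order and attach every
    unassigned point within `radius` of the current starting point."""
    centred, scores = first_pc_scores(points)
    order = sorted(range(len(points)), key=lambda i: scores[i])
    labels = [-1] * len(points)
    group = 0
    for a, i in enumerate(order):
        if labels[i] != -1:
            continue  # already absorbed by an earlier starting point
        labels[i] = group
        for b in range(a + 1, len(order)):
            j = order[b]
            # Scalar check first: once the sorted scores differ by more than
            # radius, no later point can lie within radius of point i.
            if scores[j] - scores[i] > radius:
                break
            if labels[j] == -1 and math.dist(centred[i], centred[j]) <= radius:
                labels[j] = group
        group += 1
    return labels

# Toy data: two well-separated blobs of three points each.
toy = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
print(aggregate(toy, radius=0.5))
```

Because the inner loop stops as soon as the scalar scores are too far apart, most full-dimensional distance computations are skipped when the data is spread out along the sorting direction.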
Following the aggregation step, overlapping groups are merged into clusters using either a distance or density-based criterion. Although the density-based merging criterion produces marginally better clusters than the distance-based criterion, the latter is significantly faster. CLASSIX is controlled by only two parameters, and its setup is simple.
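The distance-based variant of the merging step can be sketched with a small union-find structure. This is an illustration under stated assumptions rather than the library's code: the function name merge_groups is hypothetical, each group is represented only by its starting point, and two groups are taken to overlap when their starting points lie within scale * radius of each other, with scale=1.5 assumed here as the overlap factor.

```python
import math

def merge_groups(starts, radius, scale=1.5):
    """Merge groups whose starting points are at most scale * radius apart.

    `starts` holds one starting point per group; the return value maps each
    group index to a cluster label. scale=1.5 is an assumed value.
    """
    parent = list(range(len(starts)))

    def find(i):
        # Find the representative of group i, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(starts)):
        for j in range(i + 1, len(starts)):
            if math.dist(starts[i], starts[j]) <= scale * radius:
                parent[find(j)] = find(i)  # union the two groups

    # Relabel the representatives as consecutive cluster ids.
    cluster_of_root = {}
    return [cluster_of_root.setdefault(find(i), len(cluster_of_root))
            for i in range(len(starts))]

# Three chained groups and one isolated group.
print(merge_groups([(0, 0), (0.4, 0), (0.8, 0), (5, 5)], radius=0.3))
```

Note that the first and third groups end up in the same cluster even though their starting points are farther apart than the threshold, because each overlaps the group between them. This chaining is what allows merged clusters to take arbitrary shapes.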
In summary, the radius parameter specifies the tolerance for grouping in the aggregation phase, whereas the minPts parameter determines the smallest permissible cluster size. These parameters resemble those used in DBSCAN; however, thanks to the initial sorting of the data points, CLASSIX does not need to run a spatial range search for each data point.
Implementation of CLASSIX in Python
In this section, we’ll perform clustering on the Iris dataset, removing the target column to make the problem completely unsupervised. As discussed earlier, we will use CLASSIX to cluster the data. Here the radius is set to 0.35, the group merging method to density, and the minimum number of points per cluster to 3.
Let’s first install the library, import the dependencies, and prepare the dataset.
# install library
!pip install ClassixClustering

# imports
import pandas as pd
import matplotlib.pyplot as plt
from classix import CLASSIX

# prepare data
data = pd.read_csv('/content/IRIS.csv')
data.drop(['species'], inplace=True, axis=1)
Now we just need to create the CLASSIX object with the parameters mentioned above and fit it to the data.
# initialize the clustering
clx = CLASSIX(radius=0.35, minPts=3, group_merging='density')

# fit the data
clx.fit(data)
After fitting, the method reports the clustering results, and the cluster label assigned to each data point is available in clx.labels_.
Since we have set minPts to 3, the algorithm merges any cluster with fewer than minPts points into a larger cluster. Now let’s check this visually.
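The effect of minPts can be illustrated with a simplified sketch. The function name absorb_small_clusters is hypothetical, and the reassignment rule used here (each point in a too-small cluster joins the cluster of its nearest point in a sufficiently large cluster) is an assumption chosen for clarity; the library's actual rule may differ in detail.

```python
import math
from collections import Counter

def absorb_small_clusters(points, labels, min_pts):
    """Reassign points from clusters with fewer than min_pts members to the
    cluster of the nearest point in a large-enough cluster (simplified)."""
    sizes = Counter(labels)
    big = [i for i, lab in enumerate(labels) if sizes[lab] >= min_pts]
    if not big:  # no cluster is large enough: leave the labels unchanged
        return list(labels)
    new_labels = list(labels)
    for i, lab in enumerate(labels):
        if sizes[lab] < min_pts:
            nearest = min(big, key=lambda j: math.dist(points[i], points[j]))
            new_labels[i] = labels[nearest]
    return new_labels

# The lone point (1, 1) forms a cluster of size 1 and is absorbed.
pts = [(0, 0), (0.1, 0), (0, 0.1), (1, 1), (5, 5), (5.1, 5), (5, 5.1)]
print(absorb_small_clusters(pts, [0, 0, 0, 1, 2, 2, 2], min_pts=3))
```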
# visualize the clusters
plt.figure(figsize=(5,5))
plt.scatter(data.values[:,0], data.values[:,2], c=clx.labels_)
plt.show()
In addition, CLASSIX can give a brief explanation of how it has clustered the data through its .explain() method.
# explain the clusters
clx.explain()
In this article, we have discussed clustering and then looked at CLASSIX, a fast clustering approach based on sorting data points by their first principal component. The quick aggregation of neighbouring data points into groups is a crucial feature of CLASSIX. Because the aggregation and merging steps are simple, the clustering results can be explained, as we have shown. More experiments on this dataset can be found in the notebook linked in the reference.