How to perform fast and explainable clustering using CLASSIX?

Clustering is the process of grouping items so that members of the same group (cluster) have more in common with one another than with members of other groups.

A cluster is a group of homogeneous objects: objects with similar properties are collected in one cluster, while objects with dissimilar properties fall into different clusters. Clustering thus categorizes objects into groups such that the objects within each group are substantially more similar to one another than to those in other groups. Many clustering algorithms are in common use, such as k-means and mean-shift clustering. In this article, we discuss CLASSIX, a clustering toolbox that aims to cluster quickly and accurately while also explaining how the result was obtained. The major points to be discussed in this post are listed below.

Table of contents

  1. What is clustering?
  2. How does CLASSIX cluster the data?
  3. Implementing CLASSIX in Python

Let’s first discuss clustering.

What is clustering?

Clustering groups items so that members of the same cluster have more in common with one another than with members of other clusters. It looks at all of the input data, without needing labels, and is commonly used in machine learning (ML).

When machine learning practitioners create a cluster, they examine all of the different data points and group them together based on what features they have in common with other data. The algorithm determines the clustering strategy.

Clustering procedures can involve calculating the average distance between data points in a multi-dimensional space, counting the number of intervals for each set of data, predicting the number of clusters, or basing clusters on dense regions of the data. A clustering result makes the links between data points explicit and, ideally, also explains why each data point belongs to its cluster.

How does CLASSIX cluster the data?

Distance-based clustering algorithms, such as k-means, take into account the pairwise distance between points when deciding whether or not they should be grouped together. DBSCAN and other density-based clustering algorithms adopt a more global approach, assuming that data occurs in continuous zones of high density surrounded by low-density regions. 

Many density-based clustering methods have the advantage of being able to handle clusters of any shape without the number of clusters having to be defined in advance. On the other hand, they usually require more parameter tuning.

CLASSIX is a method that shares characteristics of both distance-based and density-based methods. The approach is divided into two stages: aggregation and merging. During the aggregation phase, data points are sorted along their first principal component and then grouped using a greedy aggregation technique.

Sorting is key to traversing the data with almost linear complexity, provided that the number of pairwise distance computations stays small. The initial sort has an average-case complexity of O(n log n), but it is performed only on scalar values, regardless of the dimensionality of the data points. As a result, the cost of this initial sorting is almost negligible compared with calculations on the full-dimensional data.
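To make the aggregation phase more concrete, here is a minimal sketch in plain Python. It is not the CLASSIX implementation: the function names first_component_scores and greedy_aggregate are made up for illustration, the principal direction is approximated with a simple power iteration, and the data scaling that a real implementation would apply is left out. The sketch only shows why sorting helps: once the sorted scores of two points differ by more than the radius, the points themselves must be further apart than the radius, so the inner loop can stop early.

```python
import math


def first_component_scores(points, iterations=50):
    """Centre the points and project them onto an approximate first principal direction."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centred = [[p[j] - mean[j] for j in range(d)] for p in points]
    v = [1.0] * d  # assumed starting vector for the power iteration
    for _ in range(iterations):
        proj = [sum(row[j] * v[j] for j in range(d)) for row in centred]
        w = [sum(proj[i] * centred[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    scores = [sum(row[j] * v[j] for j in range(d)) for row in centred]
    return centred, scores


def greedy_aggregate(points, radius):
    """Return (group label per point, indices of the group starting points)."""
    centred, scores = first_component_scores(points)
    order = sorted(range(len(points)), key=lambda i: scores[i])
    group = [-1] * len(points)
    starts = []
    for pos, i in enumerate(order):
        if group[i] != -1:
            continue  # already absorbed by an earlier group
        group[i] = len(starts)  # the first unassigned point starts a new group
        starts.append(i)
        for k in range(pos + 1, len(order)):
            j = order[k]
            # Early exit: a score gap above the radius implies a distance above the radius.
            if scores[j] - scores[i] > radius:
                break
            if group[j] == -1 and math.dist(centred[i], centred[j]) <= radius:
                group[j] = group[i]
    return group, starts


points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
group, starts = greedy_aggregate(points, radius=0.5)
print(group, starts)
```

On this toy input the two tight blobs each collapse into a single group, and only one full-dimensional distance check crosses from one blob to the other before the early exit triggers.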

Following the aggregation step, overlapping groups are merged into clusters using either a distance or density-based criterion. Although the density-based merging criterion produces marginally better clusters than the distance-based criterion, the latter is significantly faster. CLASSIX is controlled by only two parameters, and its setup is simple. 
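The distance-based merging step can be sketched with a small union-find structure: two groups end up in the same cluster when their starting points are close to each other. The function name merge_groups, the threshold of scale * radius with scale=1.5, and the all-pairs loop are assumptions made for illustration; this is a rough sketch of the idea, not the CLASSIX code.

```python
import math


def merge_groups(start_points, radius, scale=1.5):
    """Merge groups whose starting points lie within scale * radius (assumed threshold)."""
    n = len(start_points)
    parent = list(range(n))

    def find(a):
        # Follow parent links to the root, shortening the path along the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a in range(n):
        for b in range(a + 1, n):
            if math.dist(start_points[a], start_points[b]) <= scale * radius:
                parent[find(a)] = find(b)  # union the two groups

    # Relabel the roots as consecutive cluster ids 0, 1, 2, ...
    roots = {}
    return [roots.setdefault(find(a), len(roots)) for a in range(n)]


starts = [(0, 0), (0.6, 0), (1.2, 0), (5, 5)]
clusters = merge_groups(starts, radius=0.5)
print(clusters)
```

Note that the first and third groups land in the same cluster even though their starting points are more than scale * radius apart, because the second group links them. This chaining is what allows merged groups to form clusters of arbitrary shape.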

In summary, the radius parameter sets the tolerance for grouping in the aggregation phase, whereas the minPts parameter specifies the smallest acceptable cluster size. These parameters resemble those used in DBSCAN; however, thanks to the initial sorting of the data points, CLASSIX does not need to run a spatial range search for each data point.
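The effect of minPts, where clusters that are too small are folded into larger ones, can be illustrated with a short sketch. The function absorb_small_clusters and its nearest-centroid rule are assumptions chosen for simplicity; CLASSIX works on groups rather than on individual points, so treat this only as an illustration of the idea.

```python
import math
from collections import Counter


def absorb_small_clusters(points, labels, min_pts):
    """Reassign points of clusters smaller than min_pts to the nearest large cluster."""
    sizes = Counter(labels)
    large = {c for c, size in sizes.items() if size >= min_pts}
    if not large:
        return list(labels)  # nothing big enough to absorb into

    # Centroid of every large cluster (a simplifying assumption of this sketch).
    centroids = {}
    for c in large:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    new_labels = []
    for p, lab in zip(points, labels):
        if lab in large:
            new_labels.append(lab)
        else:
            nearest = min(large, key=lambda c: math.dist(p, centroids[c]))
            new_labels.append(nearest)
    return new_labels


points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (0.5, 0.5)]
labels = [0, 0, 0, 1, 1, 1, 2]
cleaned = absorb_small_clusters(points, labels, min_pts=3)
print(cleaned)
```

Here the single point in cluster 2 is too small to stand on its own with min_pts=3, so it joins cluster 0, whose centroid is closest.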

Implementing CLASSIX in Python

In this section, we’ll cluster the Iris dataset after removing the target column, which makes the problem completely unsupervised. As discussed earlier, we use CLASSIX to cluster the data, here with the radius set to 0.35, the group merging method set to density, and the minimum number of points per cluster set to 3.

Now let’s first quickly install, import dependencies and prepare the dataset. 

# install library
!pip install ClassixClustering 

# imports
import pandas as pd
import matplotlib.pyplot as plt
from classix import CLASSIX

# prepare data
data = pd.read_csv('/content/IRIS.csv')
data.drop(['species'], inplace=True, axis=1)

Now we just need to create the CLASSIX object with the parameters mentioned above and fit it to the data.

# initialize the clustering
clx = CLASSIX(radius=0.35, minPts=3,  group_merging='density')
# fitting the data
clx.fit(data)

After fitting, the method reports the clustering results, and the cluster assignments are available in clx.labels_.

Because we set minPts to 3, the algorithm merges any cluster with fewer than 3 points into a larger cluster. Now let’s check this visually.

# visualize the clusters
plt.figure(figsize=(5,5))
plt.scatter(data.values[:,0], data.values[:,2], c=clx.labels_)
plt.show()

In addition, CLASSIX can give a brief explanation of how it clustered the data through its .explain() method.

# explaining the clusters
clx.explain()

Final words

In this article, we discussed clustering and then looked at CLASSIX, a fast clustering approach based on sorting data points along their first principal component. The fast aggregation of neighbouring data points into groups is a crucial feature of CLASSIX. Because the aggregation and merging steps are simple, the clustering results can be explained, as we have shown. Further experiments on this dataset are available in the notebook linked in the references.

References 

Vijaysinh Lendave
Vijaysinh is an enthusiast in machine learning and deep learning. He is skilled in ML algorithms, data manipulation, handling and visualization, model building.
