How to perform fast and explainable clustering using CLASSIX?

Clustering is the process of grouping items so that members of the same group (cluster) are more similar to one another than to members of other groups.

A cluster is a group of homogeneous objects; in other words, objects with similar properties are collected in one cluster, while objects with dissimilar properties end up in another. Clustering is the process of categorizing objects into a number of groups such that the objects within each group are substantially similar to one another and dissimilar to those in other groups. Various clustering algorithms are in common use, such as k-means and mean-shift. In this article, we will discuss CLASSIX, a clustering toolbox that clusters data both precisely and quickly, and that can also explain how the clustering was carried out. Below are the major points that are discussed in this post.

Table of contents

  1. What is clustering?
  2. How does CLASSIX cluster the data?
  3. Implementing CLASSIX in Python

Let’s first discuss clustering.

What is clustering?

Clustering is the process of grouping items so that members of the same group (cluster) are more similar to one another than to members of other groups. Clustering looks at all of the input data at once and is a common unsupervised technique in machine learning (ML).

When machine learning practitioners build clusters, they examine all of the data points and group them according to the features they share; the chosen algorithm determines the clustering strategy.

Clustering procedures may be based on the distances between data points in the feature space, on regions of high data density, or on a pre-specified number of clusters. Ideally, clustering also makes the relationships between data points explicit and can explain why each data point belongs in its cluster.


How does CLASSIX cluster the data?

Distance-based clustering algorithms, such as k-means, take into account the pairwise distance between points when deciding whether or not they should be grouped together. DBSCAN and other density-based clustering algorithms adopt a more global approach, assuming that data occurs in continuous zones of high density surrounded by low-density regions. 

Many density-based clustering methods have the advantage of handling clusters of arbitrary shape without requiring the number of clusters to be specified in advance; on the other hand, they usually need more parameter tuning. A quick comparison of the two families is sketched below.
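
To make the contrast concrete, here is a minimal sketch using scikit-learn (assumed to be installed; the dataset, parameter values and cluster count below are illustrative choices, not taken from the CLASSIX work):

# illustrative comparison of a distance-based and a density-based method
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# k-means: the number of clusters must be fixed in advance
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: no cluster count needed, but eps and min_samples require tuning
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("k-means cluster labels:", sorted(set(kmeans_labels)))
print("DBSCAN cluster labels:", sorted(set(dbscan_labels)))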

CLASSIX is a method that shares characteristics of both distance-based and density-based methods. The approach is divided into two stages: aggregation and merging. During the aggregation phase, data points are sorted by their first principal coordinate and then grouped using a greedy aggregation technique.

Sorting is essential for traversing the data with almost linear complexity, provided the number of pairwise distance computations stays modest. While the initial sort has O(n log n) average-case complexity, it is performed only on scalar values, regardless of the dimensionality of the data points. As a result, the cost of this initial sorting is almost insignificant compared with the calculations on full-dimensional data. A simplified sketch of the aggregation idea follows.
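
As an illustration only (a simplified sketch of the idea, not the library's actual implementation), the aggregation phase can be pictured as follows: project the data onto its first principal component, sort by that scalar coordinate, and then greedily group points that lie within radius of a group's starting point, using the sorted coordinate to cut each scan short. Here X is assumed to be a NumPy array of the data.

# simplified sketch of CLASSIX-style greedy aggregation (illustration only)
import numpy as np

def greedy_aggregate(X, radius=0.35):
    # project onto the first principal component (after centering)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[0]          # first principal coordinate of every point
    order = np.argsort(proj)   # sort points along that coordinate

    labels = -np.ones(len(X), dtype=int)
    group = 0
    for pos, i in enumerate(order):
        if labels[i] != -1:
            continue           # already assigned to a group
        labels[i] = group      # point i becomes the group's starting point
        for j in order[pos + 1:]:
            # if the sorted coordinates already differ by more than radius,
            # the full-space distance must exceed radius too, so stop early
            if proj[j] - proj[i] > radius:
                break
            if labels[j] == -1 and np.linalg.norm(Xc[j] - Xc[i]) <= radius:
                labels[j] = group
        group += 1
    return labels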

Following the aggregation step, overlapping groups are merged into clusters using either a distance-based or a density-based criterion. Although the density-based merging criterion produces marginally better clusters than the distance-based one, the latter is significantly faster. CLASSIX is controlled by only two parameters, and its setup is simple.

In summary, the radius parameter specifies the tolerance for grouping in the aggregation phase, whereas the minPts parameter determines the smallest permissible cluster size. These parameters play a role similar to those of DBSCAN; however, thanks to the initial sorting of the data points, CLASSIX does not run a spatial range search for every data point.
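
As a conceptual sketch only (the 1.5 merge factor below is an assumed illustrative value, not necessarily the library's default), the distance-based merging step can be pictured as joining any two groups whose starting points lie close together relative to radius, for example with a small union-find:

# conceptual sketch of distance-based group merging (illustration only)
import numpy as np

def merge_groups(starting_points, radius, scale=1.5):
    n = len(starting_points)
    parent = list(range(n))    # union-find structure over the groups

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(starting_points[i] - starting_points[j])
            if d <= scale * radius:   # groups overlap, so merge them
                parent[find(i)] = find(j)

    # return a cluster id for every group
    return [find(i) for i in range(n)]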

Implementing CLASSIX in Python

In this section, we'll cluster the Iris dataset after removing the target column, turning it into a completely unsupervised problem. As discussed earlier, we use CLASSIX to cluster the data, setting the radius to 0.35, the group-merging method to density, and the minimum number of points per cluster to 3.

Now let's quickly install the library, import the dependencies, and prepare the dataset.

# install library
!pip install ClassixClustering 

# imports
import pandas as pd
import matplotlib.pyplot as plt
from classix import CLASSIX

# prepare data: load the Iris CSV and drop the target column
data = pd.read_csv('/content/IRIS.csv')
data.drop(['species'], inplace=True, axis=1)  # remove labels to make the task unsupervised
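
If the CSV used here is not available locally, an equivalent feature table can be loaded from scikit-learn instead (an alternative I am assuming for convenience, not part of the original walkthrough; the column names will differ from the CSV):

# alternative: load the Iris features directly from scikit-learn
from sklearn.datasets import load_iris

data = load_iris(as_frame=True).data  # four feature columns, no target column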

Now we just need to instantiate CLASSIX with the parameters mentioned above and fit it to the data.

# initialize the clustering with the chosen radius, minPts and merging criterion
clx = CLASSIX(radius=0.35, minPts=3, group_merging='density')
# fit the data
clx.fit(data)

Once the model is fitted, the clustering results are available on the fitted object; the label assigned to each data point is stored in clx.labels_.
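
To inspect the assignments programmatically, one simple check is to count how many points ended up in each cluster:

# count how many points were assigned to each cluster label
print(pd.Series(clx.labels_).value_counts())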

Since we set minPts to 3, any group that ends up with fewer than minPts points is merged into a larger cluster. Now let's check the result visually.

# visualize the clusters: plot the first and third feature columns,
# coloured by the assigned cluster label
plt.figure(figsize=(5,5))
plt.scatter(data.values[:,0], data.values[:,2], c=clx.labels_)
plt.show()

In addition, CLASSIX can give a brief explanation of how it clustered the data via its .explain() method.

# explaining the clusters
clx.explain()
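
If your installed version supports it, explain can also be called with sample indices to describe why two particular points do or do not end up in the same cluster (the indices below are arbitrary examples):

# explain the relationship between two specific samples (arbitrary indices)
clx.explain(0, 100)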

Final words

Through this article, we have discussed clustering and then looked at CLASSIX, a fast clustering approach based on sorting data points by their first principal coordinate. The quick aggregation of neighbouring data points into groups is a crucial feature of CLASSIX, and because the aggregation and merging steps are simple, the clustering results can be explained, as we have shown. Further experiments on this dataset are carried out in the notebook linked in the references.

References 
