CLARANS (Clustering Large Applications based on RANdomized Search) is a Data Mining algorithm designed to cluster spatial data. We have already covered K-Means and K-Medoids clustering algorithms in our previous articles. This article talks about another clustering technique called CLARANS along with its Pythonic demo code. CLARANS was introduced by Raymond T. Ng and Jiawei Han of IEEE Computer Society (research paper).
CLARANS is a partitioning method of clustering particularly useful in spatial data mining. We mean recognizing patterns and relationships existing in spatial data (such as distance-related, direction-relation or topological data, e.g. data plotted on a road map) by spatial data mining.
Why CLARANS algorithm?
As mentioned in our K-Medoids algorithm’s article, the K-Medoids clustering technique can resolve the limitation of the K-Means algorithm of being adversely affected by noise/outliers in the input data. But K-Medoids proves to be a computationally costly method for considerably large values of ‘k’ (number of clusters) and large datasets.
Sign up for your weekly dose of what's up in emerging technology.
The CLARA algorithm was introduced as an extension of K-Medoids. It uses only random samples of the input data (instead of the entire dataset) and computes the best medoids in those samples. It thus works better than K-Medoids for crowded datasets. However, the algorithm may give wrong clustering results if one or more sampled medoids are away from the actual best medoids.
CLARANS algorithm takes care of the cons of both K-Medoids and CLARA algorithms besides dealing with difficult-to-handle data mining data, i.e. spatial data. It maintains a balance between the computational cost and the influence of data sampling on clusters’ formation.
Steps of CLARANS algorithm
- Select ‘k’ random data points and label them as medoids for the time being.
- Select a random point say ‘a’ from the points picked in step (1), and another point say ‘b’ which is not included in those points.
- We would already have the sum of distances of point ‘a’ from all other points since that computation is required for selecting the points in step (1). Perform similar computation for point ‘b’.
- If the sum of distances from all other points for point ‘b’ turns out to be less than that for point ‘a’, replace ‘a’ by ‘b’.
- The algorithm performs such a randomized search of medoids ‘x’ times where ‘x’ denotes the number of local minima computed, i.e. number of iterations to be performed, which we specify as a parameter. The set of medoids obtained after such ‘x’ number of steps is termed as ‘local optimum’.
- A counter is incremented every time a replacement of points is made. The process of examining the points for possible replacement is repeated till the counter does not exceed the maximum number of neighbors to be examined (specified as a parameter).
- The set of medoids obtained when the algorithm stops is the best local optimum choice of medoids.
Image source: Research paper
Here’s a demonstration of using CLARANS algorithm on the sklearn library’s Breast Cancer Wisconsin dataset. Though the dataset is primarily used for binary classification tasks, we use it to show how CLARANS algorithm can form separate clusters of the constituent data points falling under one of the two target categories (‘malignant’ or ‘benign’). The pyclustering data mining library has been used here for Pythonic implementation of CLARANS. The code has been implemented using Google colab with Python 3.7.10 and pyclustering 0.10.1.2 versions. Step-wise explanation of the code is as follows:
- Install pyclustering library.
!pip install pyclustering
- Import required libraries and modules
#Class for implementing CLARANS algorithm from pyclustering.cluster.clarans import clarans #To execute a function with execution time recorded from pyclustering.utils import timedcall #sklearn package for using a toy dataset from sklearn import datasets #Class for plotting multi-dimensional data from pyclustering.cluster import cluster_visualizer_multidim
- Import the Breast Cancer dataset
bc_dataset = datasets.load_breast_cancer() #Display the dataset bc_dataset
Sample condensed output:
- Extract the data points from the loaded dataset.
bc_data = bc_dataset.data
Convert the dataset from a numpy array to a list because a list of lists is fed as an input to the CLARANS’ implementation of the pyclustering library.
bc_data = bc_data.tolist()
Display the data in the form of a list
Sample condensed output:
[[17.99, 10.38, 122.8, 1001.0, 0.1184, 0.2776, 0.3001, 0.1471, 0.2419, 0.07871, 1.095, 0.9053, 8.589, 153.4, 0.006399, 0.04904, 0.05373, 0.01587, 0.03003, 0.006193, 25.38, 17.33, 184.6, 2019.0, 0.1622, 0.6656, 0.7119, 0.2654, 0.4601, 0.1189], [20.57, 17.77, 132.9, 1326.0, 0.08474, 0.07864, 0.0869, 0.07017, 0.1812, 0.05667, 0.5435, 0.7339, 3.398, 74.08, 0.005225, 0.01308, 0.0186, 0.0134, 0.01389, 0.003532, 24.99, 23.41, 158.8, 1956.0, 0.1238, 0.1866, 0.2416, 0.186, 0.275, 0.08902],…
- Instantiate the clarans class.
clarans_obj = clarans(bc_data, 2, 3, 5) “”” where, ‘bc_data’ is the input data fed as a list of list of objects to be clustered ‘2’ is number of clusters to be formed (since there are 2 target categories); ‘3’ is the number of obtained local minima and ‘5’ is the maximum number of neighboring data points examined “””
- process() method analyzes the clusters as per the CLARANS algorithm. We call the process() method and encapsulate it in the call to timedcall() function so that the time taken for executing process() method also gets recorded.
#timedcall() returns a tuple containing the execution time and result of executing the function (tks, res) = timedcall(clarans_obj.process); #Print the execution time print("Execution time : ", tks, "\n");
Execution time : 80.50073736700006
- Get the clusters allocated by the algorithm
clst = clarans_obj.get_clusters()
- Get the list of medoids of the clusters allocated by the algorithm.
med = clarans_obj.get_medoids()
- Print the results
print("Index of clusters' points :\n",clst) print("\nLabel class of each point :\n ",bc_dataset.target) print("\nIndex of the best medoids : ",med)
Sample condensed output:
- Here, the input data has 30 features. Cluster_visualizer class of the pyclustering library can be used to visualize the 1D, 2D or 3D data. While for more than three-dimensional data, cluster_visualizer_multidim class can be used as follows:
vis = cluster_visualizer_multidim() “”” Append the list of clusters; specify the clusters list, list of data points, marker sign and marker size to be used for visualization “”” vis.append_clusters(clst,bc_data,marker="*",markersize=5) #Display the clusters formed in multiple dimensions vis.show(pair_filter=[[1,2],[1,3],[27,28],[27,29]],max_row_size=2) “”” pair_filter parameter specifies the list of feature pairs for which the clusters will be visualized. max_row_size parameter specifies the maximum number of rows across which the plot will be spread. “””
- Google colab notebook of the above implementation is available here.