Clustering is a part of unsupervised subject learning where the major task is to make groups of data points such that the grouped data points have similarities. The basic clustering algorithm procedure consists of a basic concept to determine the distance between the data points and divide them into groups. The closest points are being separated into one group. There are various algorithms available that work on only these basic phenomena or by improving the same phenomena. And most of the algorithms are very good at dealing with the data where they don’t have categorical values available on the data or the data is fully scattered in the space. We can’t specify that the data point belongs to which category. Here if a data has categorical values, the clustering for categorical or boolean attributes comes into the picture, which can be performed by the ROCK(a RObust Clustering using linKs).
As we know, clustering is useful for segregating a large set of data into different meaningful groups. Data points of the group have similar attributes or are close to each other when drawn in the space. Other algorithms use the distance values between the points for performing clustering. ROCK employs links when merging them into a cluster.
For example, in a market basket database, we collect the customer’s transaction and set the different items purchased by the customer in a transaction. Here we can perform the clustering of the customers such that customers of similar behaviour can be segregated into one cluster. Let’s go deeper. One cluster consists of those customers who are married and who buy baby products or any other daily need products frequently. While in the other cluster, it may consist of rich people groups who buy imported products. So in that situation, we first have to categorize the customer. Then, we can cluster them to perform targeted marketing according to the different segments of the customers.
Here the above-market basket database contains the data points in categorical value, not in numerical value. Transactions in the database can be viewed as records with boolean attributes, each attribute corresponding to a single item. For further uses, we can give true value to an item if it is presented in the transaction and false if the item is not presented in the transaction. This is just an example of categorical values which are not only limited to true and false. There can be many more categories in the dataset. Clustering of this kind of categorical value is the main focus of the ROCK.
What is the ROCK algorithm?
The following image describes the basic steps of the algorithm:
Here in the steps we can see in the first step it draws the random samples from the data in the space then in the second step, it follows the procedures of finding links between the data points which are randomly collected from the dataset and by employing those links it makes the clusters and saves the samples into the disk.
Let’s have an overview of the clustering algorithm of the ROCK. Mathematically the algorithm of ROCK can be represented as follows:
Where the algorithm accepts the S samples randomly collected from the data sets and the number of clusters K as input, then it computes the links between the S. in the next each sample is separated in the cluster and for each cluster i, it makes a local heap for all the clusters and maintains the heap till the execution of the algorithm then the heap contains every cluster(j) in such a way that the link[i,j] is non-zero. The clusters j in the local heap are ordered in decreasing order. Also, the algorithm maintains the global heap Q that contains all the clusters and the Q global heap orders every local heap in decreasing order then the clusters contained by the local heap are shared to the global heap, and the available clusters in their local heap are arranged in the decreasing order of their goodness measures.
The following function can represent computation of the links:
In computation of the link, the algorithm computes a list of neighbors for every data sample that makes the pairs of neighbors, and for each pair, the data sample generates and contributes one link. This process is repeated for every sample. Then at the end, the link counts for all pairs of points will be tainted.
Implementing ROCK in Python
Implementation of the ROCK can be done by using the library pyclustering, a python and c++ library for data mining tasks like clustering algorithms, oscillatory networks, neural networks etc. library is supported for Linux, windows and mac os operating systems.
SciPy, matplotlib, numpy, pillow libraries and python 3.6 or above. note(I am using google colab for performing the ROCK)
Installing the library:
!pip install pyclustering
Next in the article, I will be using the ROCK package from the pyclustering library on synthetic data and data provided by the library itself.
Importing the libraries
from pyclustering.cluster import cluster_visualizer,cluster_visualizer_multidim from pyclustering.cluster.rock import rock; from pyclustering.utils import read_sample; from random import random;
Here in the above code, I have called a package cluster_visualizer to visualise the clusters and the rock package for clustering and other prerequisites.
Generating one-dimensional synthetic data.
data = [ [random()] for i in range(10) ] + [ [random() + 3] for i in range(10) ] + [ [random() + 5] for i in range(10) ] + [ [random() + 8] for i in range(10) ] data
Fitting the data in the ROCK module:
rock_instance = rock(data, 1, 4, 0.5); rock_instance.process(); clusters = rock_instance.get_clusters();
Here we are telling the module rock_instance to produce four clusters of the data.
Visualizing the clusters:
visualizer = cluster_visualizer() visualizer.append_clusters(clusters, input_data) visualizer.show();
Here we can see the ROCK module has segregated the data points into 4 clusters. For this data it has performed well.
Next in the article, we will perform clustering using ROCK in three-dimensional data that the pyclustering library has provided in the packages.
Reading the data:
from pyclustering.samples.definitions import FCPS_SAMPLES from pyclustering.utils import read_sample data = read_sample(FCPS_SAMPLES.SAMPLE_HEPTA)
Fitting the ROCK in data.
# Create an instance of the ROCK algorithm for cluster analysis. Seven clusters should be allocated. rock = rock(data, 1.0, 7) # Run cluster analysis. rock.process() # Obtain results of clustering. clusters = rock.get_clusters()
Visualizing the clusters
visualizer = cluster_visualizer() visualizer.append_clusters(clusters, data) visualizer.show();
Here we can see that our clusters are prepared as instructed to generate 7 clusters of the dataset. The model has performed well with the dataset. We have seen how easy it was to implement with some basic requirements. Pyclustering is also a good library; they have covered most of the algorithms in their packages for performing clustering algorithms.