Clustering is the process of dividing huge data into smaller parts. It is an unsupervised learning problem. Mostly we perform clustering when the analysis is required to extract the information of an interesting pattern or the field, for example, extracting similar user behaviour in a customer database.
Many clustering algorithms are available to use, and all of them have their characteristics and use cases. We can not have all the similar time kinds of datasets. Some algorithms are made to use when the dataset is low in quantity, and some of them can be used when the dataset is in high quantity. Clustering algorithms are made to find the natural feature groups in the feature space of input data.
Sign up for your weekly dose of what's up in emerging technology.
Examples of clustering algorithms are:
- Agglomerative clustering
- K- means
- Spectral clustering
In this article, we are going to discuss the BIRCH clustering algorithm. The article assumes that the reader has the basic knowledge of clustering algorithms and their terminology.
BIRCH(Balanced Iterative Reducing and Clustering hierarchies)
Basic clustering algorithms like K means, agglomerative clustering are some of the most commonly used clustering algorithms. But when performing clustering on very large datasets, BIRCH and DBSCAN are the advanced clustering algorithms useful for performing precise clustering on large datasets. Moreover, BIRCH is very useful because of its easy implementation.
Before going on with the implementation, we will discuss the algorithms and features of the BIRCH.
Without going into the mathematics of BIRCH, more formally, BIRCH is a clustering algorithm that clusters the dataset first in small summaries, then after small summaries get clustered. It does not directly cluster the dataset. This is why BIRCH is often used with other clustering algorithms; after making the summary, the summary can also be clustered by other clustering algorithms.
It is provided as an alternative to MinibatchKMeans. It converts data to a tree data structure with the centroids being read off the leaf. And these centroids can be the final cluster centroid or the input for other cluster algorithms like AgglomerativeClustering.
BIRCH is a scalable clustering method based on hierarchy clustering and only requires a one-time scan of the dataset, making it fast for working with large datasets. This algorithm is based on the CF (clustering features) tree. In addition, this algorithm uses a tree-structured summary to create clusters. The tree structure of the given data is built by the BIRCH algorithm called the Clustering feature tree(CF tree).
In context to the CF tree, the algorithm compresses the data into the sets of CF nodes. Those nodes that have several sub-clusters can be called CF subclusters. These CF subclusters are situated in no-terminal CF nodes.
The CF tree is a height-balanced tree that gathers and manages clustering features and holds necessary information of given data for further hierarchical clustering. This prevents the need to work with whole data given as input. The tree cluster of data points as CF is represented by three numbers (N, LS, SS).
- N = Number of items in subclusters
- LS = vector sum of the data points
- SS = Sum of the squared data points
In the image below, we can easily understand the numbers.
Here we can see in the example how a CF generates there. In the image, five samples are available (3,4), (2, 6), (4, 5), (4, 7), (3, 8),.
So by the image N = 5, LS = (16,30) and SS = (54, 190).
The below image can represent the CF tree structure.
In the image, we can see that the root node has a non-leaf node where every non-leaf node is having B entries and the leaf node has L cluster feature(CF), so if every CF in the leaf node has L entries, it will satisfy the threshold value T is a maximum diameter of radius. So here, we can see that each leaf node is a sub-cluster, more formally saying it is a summary, not a data point.
Let’s see the basics of algorithms.
There are mainly four phases which are followed by the algorithm of BIRCH.
- Scanning data into memory.
- Condense data(resize data).
- Global clustering.
- Refining clusters.
In these four phases, two of them (resize data and refining clusters) are optional. They come in the process when more clarity is required. But scanning data is just like loading data into a model. After loading the data, the algorithm scans the whole data and fits them into the CF trees. In condensing, it resets and resizes the data for better fitting into the CF tree. In global clustering, it sends CF trees for clustering using existing clustering algorithms. Finally, refining fixes the problem of CF trees where the same valued points are assigned to different leaf nodes.
Scikit Learn provides the module for direct implementation of BIRCH under the cluster class packages. We need to provide values to the parameters according to the requirement.
There are three parameters in the BIRCH algorithm.
- Threshold – The maximum number of data samples to be considered in a subcluster of the leaf node in a CF tree.
- Branching_factor – It is the factor that is used to specify the number of CF sub-clusters that can be made in a node.
- N_clusters – number of clusters.
Implementation of the BIRCH using python
Importing the required libraries
import matplotlib.pyplot as plt from sklearn.datasets.samples_generator import make_blobs from sklearn.cluster import Birch
Generating a dataset using make blobs.
data, clusters = make_blobs(n_samples = 1000, centers = 12, cluster_std = 0.50, random_state = 0) data.shape
Creating a BIRCH model
model = Birch(branching_factor = 50, n_clusters = None, threshold = 1.5)
Fitting the dataset in the model.
Creating the prediction of the dataset using the generated model.
pred = model.predict(data)
Making the scatterplot for checking the results.
plt.scatter(data[:, 0], data[:, 1], c = pred)
Here in the output, we can see that we have created 12 clusters of randomly generated samples using make blob, and we can see the algorithm is working finely. The main feature of using the BIRCH is its CF-tree feature. There can be many problems related to huge data.
It is a good algorithm with the advantages of a single scan, and also, the CF-tree feature increases the quality of clusters, but one thing where it lags is it uses only numeric or vector data.