# Guide To BIRCH Clustering Algorithm (With Python Code)

The BIRCH clustering algorithm is provided as an alternative to MiniBatchKMeans. It converts the data into a tree data structure, with the centroids read off the leaves. These centroids can serve as the final cluster centroids or as input to other clustering algorithms such as AgglomerativeClustering.

Clustering is the process of dividing a large dataset into smaller groups of similar items. It is an unsupervised learning problem. We mostly perform clustering when the analysis requires extracting interesting patterns or groups, for example, finding users with similar behaviour in a customer database.

Many clustering algorithms are available, each with its own characteristics and use cases, because not all datasets are alike. Some algorithms are designed for small datasets, while others scale to very large ones. In general, clustering algorithms aim to find the natural groups in the feature space of the input data.

Examples of clustering algorithms are:


• Agglomerative clustering
• DBSCAN
• K-means
• Spectral clustering
• BIRCH

In this article, we are going to discuss the BIRCH clustering algorithm. The article assumes that the reader has the basic knowledge of clustering algorithms and their terminology.

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

Basic algorithms like K-means and agglomerative clustering are among the most commonly used clustering algorithms. But when clustering very large datasets, advanced algorithms such as BIRCH and DBSCAN are useful for performing precise clustering at scale. BIRCH is also attractive because it is easy to implement.

Before going on with the implementation, we will discuss the algorithm and features of BIRCH.

Without going into the mathematics of BIRCH, more formally, BIRCH is a clustering algorithm that first condenses the dataset into small summaries and then clusters those summaries; it does not cluster the dataset directly. This is why BIRCH is often combined with other clustering algorithms: once the summary is built, it can be clustered by another algorithm.

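To sketch that two-stage idea in scikit-learn, the `n_clusters` parameter of `Birch` can itself be another clustering estimator, which is then applied to the subcluster centroids. The dataset and parameter values below are only illustrative:

```python
from sklearn.cluster import AgglomerativeClustering, Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.7, random_state=1)

# BIRCH builds the CF-tree summary; the final (global) clustering step
# is delegated to AgglomerativeClustering on the subcluster centroids.
model = Birch(threshold=0.5, n_clusters=AgglomerativeClustering(n_clusters=5))
labels = model.fit_predict(X)
```

Here BIRCH acts purely as a summarisation step, and the global clustering method is swappable.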

BIRCH is a scalable clustering method based on hierarchical clustering that requires only a single scan of the dataset, making it fast on large data. The algorithm is built around a tree-structured summary of the input data, called the clustering feature tree (CF tree), which BIRCH constructs and then uses to create clusters.

In the CF tree, the algorithm compresses the data into sets of CF nodes. Nodes that contain several subclusters are called CF subclusters, and these are situated in the non-terminal (internal) CF nodes.

The CF tree is a height-balanced tree that gathers and manages clustering features, holding the information about the data needed for further hierarchical clustering. This avoids the need to work with the whole input data. Each cluster of data points is represented as a CF, a triple of three values (N, LS, SS):

• N = number of data points in the subcluster
• LS = linear sum (vector sum) of the data points
• SS = sum of the squared data points

As a small worked example of how a CF is computed, suppose a subcluster contains the five samples (3, 4), (2, 6), (4, 5), (4, 7) and (3, 8).

Then N = 5, LS = (16, 30) and SS = (54, 190).
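These values can be checked in a few lines of NumPy; this snippet is illustrative and not part of the original code:

```python
import numpy as np

# The five sample points from the worked example above
points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])

N = len(points)                 # number of points in the subcluster
LS = points.sum(axis=0)         # linear (vector) sum of the points
SS = (points ** 2).sum(axis=0)  # per-dimension sum of squared points
```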

A CF tree has the following structure.

The root points to non-leaf (internal) nodes, where every non-leaf node holds at most B entries, and each leaf node holds at most L CF entries. Every CF entry in a leaf must satisfy the threshold value T, the maximum diameter (or radius) of the subcluster. Each leaf entry is therefore a subcluster, or more formally a summary, not a single data point.
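A useful property of these triples, and the reason the tree can be built incrementally, is that CFs are additive: merging two subclusters just sums their triples component-wise. A minimal sketch (the helper `merge_cf` is hypothetical, not part of any library):

```python
import numpy as np

def merge_cf(cf_a, cf_b):
    """Merge two clustering features by summing (N, LS, SS) component-wise."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf_a, cf_b
    return (n1 + n2, ls1 + ls2, ss1 + ss2)

# CF of points (3,4), (2,6), (4,5) and CF of points (4,7), (3,8)
cf1 = (3, np.array([9, 15]), np.array([29, 77]))
cf2 = (2, np.array([7, 15]), np.array([25, 113]))

# Merging gives the same triple as summarising all five points at once
n, ls, ss = merge_cf(cf1, cf2)
```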

Let’s now look at the basic steps of the algorithm.

The BIRCH algorithm mainly proceeds in four phases:

• Scanning the data into memory.
• Condensing the data (resizing the CF tree).
• Global clustering.
• Refining the clusters.

Two of these four phases (condensing and refining) are optional; they come into play when more accuracy is required. Scanning the data is essentially loading it into the model: the algorithm scans the whole dataset and inserts the points into the CF tree. In the condensing phase, it rebuilds a smaller CF tree so the data fits better. In global clustering, the leaf entries of the CF tree are clustered using an existing clustering algorithm. Finally, refining fixes a limitation of the CF tree, where points with the same value may be assigned to different leaf nodes.

Scikit-learn provides a direct implementation of BIRCH in its cluster package. We need to set the parameters according to our requirements.

The BIRCH implementation takes three main parameters:

• threshold – the radius of the subcluster obtained by merging a new sample into its closest subcluster must stay below this value; a smaller threshold produces more, tighter subclusters.
• branching_factor – the maximum number of CF subclusters that each node can hold.
• n_clusters – the number of final clusters (it can also be None, or another clustering estimator).
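As a quick illustration of `threshold`: with `n_clusters=None` the model returns its subclusters directly, so a tighter threshold yields at least as many subclusters as a looser one. The dataset and values below are only for demonstration:

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

counts = {}
for t in (0.5, 2.0):
    # n_clusters=None skips the global clustering step and
    # returns the CF-tree subclusters as the final clusters.
    model = Birch(threshold=t, n_clusters=None).fit(X)
    counts[t] = len(model.subcluster_centers_)

# A tighter threshold produces at least as many subclusters.
```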

Implementation of BIRCH using Python

Importing the required libraries

Input:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in newer scikit-learn versions
from sklearn.cluster import Birch
```

Generating a dataset using make_blobs.

Input:

```python
data, clusters = make_blobs(n_samples=1000, centers=12, cluster_std=0.50, random_state=0)
data.shape
```

Output: (1000, 2)

Creating a BIRCH model

Input:

```python
model = Birch(branching_factor=50, n_clusters=None, threshold=1.5)
```

Fitting the dataset in the model.

Input:

```python
model.fit(data)
```

Output:

Creating the prediction of the dataset using the generated model.

Input:

```python
pred = model.predict(data)
```

Making the scatterplot for checking the results.

Input:

```python
plt.scatter(data[:, 0], data[:, 1], c=pred)
plt.show()
```

Output: a scatter plot of the coloured clusters.

In the output, we can see the 12 clusters of randomly generated samples created with make_blobs, and the algorithm separates them well. The main strength of BIRCH is its CF-tree summary, which sidesteps many of the problems of working with huge datasets.

It is a good algorithm with the advantage of a single scan, and the CF tree also improves the quality of the clusters; one limitation, however, is that it handles only numeric (vector) data.
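Because the CF tree is built in a single pass, scikit-learn's Birch also offers `partial_fit`, so very large datasets can be summarised chunk by chunk. The chunk count and parameter values below are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=6, cluster_std=0.5, random_state=42)

model = Birch(threshold=0.8, n_clusters=6)
for chunk in np.array_split(X, 4):   # feed the data in four chunks
    model.partial_fit(chunk)

labels = model.predict(X)
```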


Yugesh is a graduate in automobile engineering and has worked as a data analyst intern. He has completed several data science projects and has a strong interest in deep learning and in writing blogs on data science and machine learning.
