Search

# How To Do Hierarchical Clustering Using UPGMA

UPGMA is one of the most widespread hierarchical clustering algorithms because it is easy to understand and fast in practice.

UPGMA (Unweighted Pair Group Method with Arithmetic mean), introduced initially as average linkage analysis, is an agglomerative (bottom-up) hierarchical clustering approach. It is arguably one of the most widespread hierarchical clustering algorithms. This is because UPGMA is conceptually easy to understand and fast in practice, an important consideration when working with big datasets. It is generally used as a hierarchical clustering tool in bioinformatics and other data mining and pattern recognition areas. Furthermore, it is used in phylogenetics and taxonomy to build evolutionary trees and related fields, including ecology and metagenomics.

### Algorithm & Approach

Hierarchical clustering is an approach to cluster analysis that aims to group similar data points by building a hierarchy of clusters. Hierarchical clustering algorithms fall under two categories: agglomerative( bottom-up)  and divisive (top-down).  Like AGNES, UPGMA follows the bottom-up approach; each point starts in a cluster of its own. In every iteration, the two nearest/ most similar clusters are combined to form a new higher-level cluster. This is repeated until there’s only the desired number of clusters, `k`, left.

What differentiates UPGMA from AGNES or its weighted counterpart WPGMA is how the similarity of clusters is calculated. In UPGMA,  the distance between any two clusters A and B with cardinality |A| and |B| respectively is calculated as the mean distance between the points of each cluster, i.e., the average of all the distances d(x,y) between pairs of points x in A and y in B.

### Implementing UPGMA in Python

1. Import necessary library and classes
``` import math
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs```
1. Let’s first create the dataset we will be working with.
``` dataset = make_blobs(n_samples = 300,
n_features = 2,
centers = 3,
cluster_std = 2,
random_state = 42)
plt.figure(figsize= (12, 8))
plt.scatter(dataset[0][:,0], dataset[0][:,1], c = dataset[1]) ```

The `dataset` is a two-tuple containing the data points and their respective clusters, we only require the points.

`points = dataset[0]`
1. Next is to write a function that calculates the average distance between pair of points in two clusters.
``` def average_distance(cluster1, cluster2):
distance = 0.0
n1 = len(cluster1)
n2 = len(cluster2)
for i in range(n1):
for j in range(n2):
point1 = cluster1[i]
point2 = cluster2[j]
dimension = len(point1)
d = sum((point1[k]-point2[k])**2 for k in range(dimension))
d = math.sqrt(d)
distance += d
distance = distance / (n1*n2)
return distance ```
1. Finally, using the `average_distance` method, we can define the UPGMA clustering function.

Create a cluster for each data point.

``` def UPGMA(points, k):
clusters = []
n = len(points)
for i in range(len(points)):
cluster = [points[i]]
clusters.append(cluster) ```

Iteratively combine similar clusters until there are only k clusters left.

```     n_clusters = n
while n_clusters > k:
cluster1, cluster2 = [], []
index1, index2 = 0, 0 ```

Iterate through all possible cluster combinations to find the cluster pair with the least distance.

```         best_distance = math.inf
for i in range(n_clusters):
for j in range(i+1, n_clusters):
dis = average_distance(clusters[i], clusters[j])
if dis < best_distance:
best_distance = dis
cluster1 = clusters[i]
cluster2 = clusters[j]
index1 = i
index2 = j ```

Merge the two clusters with the least distance.

```         if index1 > index2:
clusters.pop(index1)
clusters.pop(index2)
else:
clusters.pop(index2)
clusters.pop(index1)
new_cluster = cluster1 + cluster2
clusters.append(new_cluster)
n_clusters = n_clusters - 1

return clusters ```
1. That’s it, now we can use our UPGMA function to perform clustering.
``` clusters = UPGMA(points, 3)
colors = ['ro', 'bo', 'go']
plt.figure(figsize= (12, 8))
[plt.plot([x[0] for x in clusters[i]], [x[1] for x in clusters[i]], colors[i])
for i in range(len(clusters))]
plt.axis('off') ```

### UPGMA clustering using SciPy

1. Import the hierarchical clustering class from SciPy
`import scipy.cluster.hierarchy as hier`
1. Use the `average()` method, which implements UPGMA to calculate the linkage matrix.
`matrix = hier.average(points)`
1. Pass this matrix to the `fcluster()` method to create `t` clusters.
``` scipy_clusters = hier.fcluster(matrix, t = 3, criterion= 'maxclust' )
plt.figure(figsize= (12, 8))
plt.scatter(dataset[0][:,0], dataset[0][:,1], c = scipy_clusters)
plt.axis('off') ```

The above implementation can be found in a Colab notebook here.

##### References
A machine learning enthusiast with a knack for finding patterns. In my free time, I like to delve into the world of non-fiction books and video essays.

### Telegram group

Discover special offers, top stories, upcoming events, and more.

### Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

### Is GPT-4 Really Better than Radiologists?

“Radiology report summaries created by GPT-4 are comparable, and in some cases, even preferred over

### TSMC: The Wizard Behind AI’s Curtain

TSMC anticipates a substantial CAGR of nearly 50% in the AI sector from 2022 to 2027.

Not really.

### Google Gemini To Arrive Sooner Than Expected

This is after announcing the AI at the Google I/O 2023, the company had postponed

### ByteDance to Launch Platform to Build Custom Chatbots

This comes just a few days after OpenAI had delayed its plan to launch a

### This New AI tool Could Mark the Beginning of the End for TikTok and Instagram Influencers

Alibaba Group announces a model framework that can transform still images into dynamic character videos

### Embracing Identity: The Journey of Sujoy Das

“Why is it that corporate diversity efforts are often limited to specific times of the

### The Biggest Data Breaches of 2023

The most significant breaches that impacted the global landscape in 2023.

### NVIDIA Planning Big Expansions in Japan

Prime Minister Fumio Kishida has extended billions of dollars in financial support to bolster TSMC

### Runway Partners with Getty to Build Video Generation Model for Enterprises

Runway enterprise users can refine RGM with their proprietary datasets, benefiting various industries like Hollywood,