MITB Banner

How To Do Hierarchical Clustering Using UPGMA

UPGMA is one of the most widespread hierarchical clustering algorithms because it is easy to understand and fast in practice.

Share

UPGMA

UPGMA (Unweighted Pair Group Method with Arithmetic mean), introduced initially as average linkage analysis, is an agglomerative (bottom-up) hierarchical clustering approach. It is arguably one of the most widespread hierarchical clustering algorithms. This is because UPGMA is conceptually easy to understand and fast in practice, an important consideration when working with big datasets. It is generally used as a hierarchical clustering tool in bioinformatics and other data mining and pattern recognition areas. Furthermore, it is used in phylogenetics and taxonomy to build evolutionary trees and related fields, including ecology and metagenomics. 

Algorithm & Approach

Hierarchical clustering is an approach to cluster analysis that aims to group similar data points by building a hierarchy of clusters. Hierarchical clustering algorithms fall under two categories: agglomerative( bottom-up)  and divisive (top-down).  Like AGNES, UPGMA follows the bottom-up approach; each point starts in a cluster of its own. In every iteration, the two nearest/ most similar clusters are combined to form a new higher-level cluster. This is repeated until there’s only the desired number of clusters, k, left. 

What differentiates UPGMA from AGNES or its weighted counterpart WPGMA is how the similarity of clusters is calculated. In UPGMA,  the distance between any two clusters A and B with cardinality |A| and |B| respectively is calculated as the mean distance between the points of each cluster, i.e., the average of all the distances d(x,y) between pairs of points x in A and y in B. 

Implementing UPGMA in Python

  1. Import necessary library and classes
 import math
 import matplotlib.pyplot as plt
 from sklearn.datasets import make_blobs
  1. Let’s first create the dataset we will be working with. 
 dataset = make_blobs(n_samples = 300,
                     n_features = 2,
                     centers = 3, 
                     cluster_std = 2,
                     random_state = 42)
 plt.figure(figsize= (12, 8))
 plt.scatter(dataset[0][:,0], dataset[0][:,1], c = dataset[1]) 
Dataset with three clusters

The dataset is a two-tuple containing the data points and their respective clusters, we only require the points.

points = dataset[0]
  1. Next is to write a function that calculates the average distance between pair of points in two clusters. 
 def average_distance(cluster1, cluster2):
     distance = 0.0
     n1 = len(cluster1)
     n2 = len(cluster2)
     for i in range(n1):
         for j in range(n2):
             point1 = cluster1[i]
             point2 = cluster2[j]
             dimension = len(point1)
             d = sum((point1[k]-point2[k])**2 for k in range(dimension))
             d = math.sqrt(d)
             distance += d
     distance = distance / (n1*n2)
     return distance 
  1. Finally, using the average_distance method, we can define the UPGMA clustering function.

Create a cluster for each data point.

 def UPGMA(points, k):
     clusters = []
     n = len(points)
     for i in range(len(points)):
         cluster = [points[i]]
         clusters.append(cluster) 

Iteratively combine similar clusters until there are only k clusters left.

     n_clusters = n
     while n_clusters > k:
         cluster1, cluster2 = [], []
         index1, index2 = 0, 0 

Iterate through all possible cluster combinations to find the cluster pair with the least distance.

         best_distance = math.inf
         for i in range(n_clusters):
             for j in range(i+1, n_clusters):
                 dis = average_distance(clusters[i], clusters[j])
                 if dis < best_distance:
                     best_distance = dis
                     cluster1 = clusters[i]
                     cluster2 = clusters[j]
                     index1 = i
                     index2 = j 

Merge the two clusters with the least distance.

         if index1 > index2:
             clusters.pop(index1)
             clusters.pop(index2)
         else:
             clusters.pop(index2)
             clusters.pop(index1)
         new_cluster = cluster1 + cluster2
         clusters.append(new_cluster)
         n_clusters = n_clusters - 1

     return clusters 
  1. That’s it, now we can use our UPGMA function to perform clustering.
 clusters = UPGMA(points, 3)
 colors = ['ro', 'bo', 'go']
 plt.figure(figsize= (12, 8))
 [plt.plot([x[0] for x in clusters[i]], [x[1] for x in clusters[i]], colors[i])
  for i in range(len(clusters))]
 plt.axis('off') 
Result of our UPGMA clustering function

UPGMA clustering using SciPy

  1. Import the hierarchical clustering class from SciPy
import scipy.cluster.hierarchy as hier
  1. Use the average() method, which implements UPGMA to calculate the linkage matrix.
matrix = hier.average(points)
  1. Pass this matrix to the fcluster() method to create t clusters.
 scipy_clusters = hier.fcluster(matrix, t = 3, criterion= 'maxclust' )
 plt.figure(figsize= (12, 8))
 plt.scatter(dataset[0][:,0], dataset[0][:,1], c = scipy_clusters)
 plt.axis('off') 

The above implementation can be found in a Colab notebook here.

References
Share
Picture of Aditya Singh

Aditya Singh

A machine learning enthusiast with a knack for finding patterns. In my free time, I like to delve into the world of non-fiction books and video essays.
Related Posts

CORPORATE TRAINING PROGRAMS ON GENERATIVE AI

Generative AI Skilling for Enterprises

Our customized corporate training program on Generative AI provides a unique opportunity to empower, retain, and advance your talent.

Upcoming Large format Conference

May 30 and 31, 2024 | 📍 Bangalore, India

Download the easiest way to
stay informed

Subscribe to The Belamy: Our Weekly Newsletter

Biggest AI stories, delivered to your inbox every week.

AI Courses & Careers

Become a Certified Generative AI Engineer

AI Forum for India

Our Discord Community for AI Ecosystem, In collaboration with NVIDIA. 

Flagship Events

Rising 2024 | DE&I in Tech Summit

April 4 and 5, 2024 | 📍 Hilton Convention Center, Manyata Tech Park, Bangalore

MachineCon GCC Summit 2024

June 28 2024 | 📍Bangalore, India

MachineCon USA 2024

26 July 2024 | 583 Park Avenue, New York

Cypher India 2024

September 25-27, 2024 | 📍Bangalore, India

Cypher USA 2024

Nov 21-22 2024 | 📍Santa Clara Convention Center, California, USA

Data Engineering Summit 2024

May 30 and 31, 2024 | 📍 Bangalore, India

Subscribe to Our Newsletter

The Belamy, our weekly Newsletter is a rage. Just enter your email below.