# Comprehensive Guide To K-Medoids Clustering Algorithm

K-Medoids is a clustering algorithm resembling the K-Means clustering technique. It falls under the category of unsupervised machine learning.

K-Medoids is a clustering algorithm resembling the K-Means clustering technique. It falls under the category of unsupervised machine learning. It majorly differs from the K-Means algorithm in terms of the way it selects the clusters’ centres. The former selects the average of a cluster’s points as its centre (which may or may not be one of the data points) while the latter always picks the actual data points from the clusters as their centres (also known as ‘exemplars’ or ‘medoids’). K-Medoids also differs in this respect from the K-Medians algorithm whic,h is the same as K-means, except that it chooses the medians (instead of means) of the clusters as centres.

We have already covered the basics of the K-Means clustering method in our previous article. Refer to it before proceeding if you are unfamiliar with the K-Means algorithm.

## Working of the K-Medoids approach

The steps followed by the K-Medoids algorithm for clustering are as follows:

1. Randomly choose ‘k’ points from the input data (‘k’ is the number of clusters to be formed). The correctness of the choice of k’s value can be assessed using methods such as silhouette method.
1. Each data point gets assigned to the cluster to which its nearest medoid belongs.
1. For each data point of cluster i, its distance from all other data points is computed and added. The point of ith cluster for which the computed sum of distances from other points is minimal is assigned as the medoid for that cluster.
1. Steps (2) and (3) are repeated until convergence is reached i.e. the medoids stop moving.

## Complexity of K-Medoids algorithm

The complexity of the K-Medoids algorithm comes to O(N2CT) where N, C and T denote the number of data points, number of clusters and number of iterations respectively. With similar notations, the complexity K-Means algorithm can be given as O(NCT).

Mean of the data points is a measure that gets highly affected by the extreme points. So in K-Means algorithm, the centroid may get shifted to a wrong position and hence result in incorrect clustering if the data has outliers because then other points will move away from  . On the contrary, a medoid in the K-Medoids algorithm is the most central element of the cluster, such that its distance from other points is minimum. Since medoids do not get influenced by extremities, the K-Medoids algorithm is more robust to outliers and noise than K-Means algorithm. The following figure explains how mean’s and medoid’s positions can vary in the presence of an outlier.

Image source

Besides, K-Medoids algorithm can be used with arbitrarily chosen dissimilarity measure (e.g. cosine similarity) or any distance metric, unlike K-Means which usually needs Euclidean distance metric to arrive at efficient solutions.

K-Medoids algorithm is found useful for practical applications such as face recognition. The medoid can correspond to the typical photo of the individual whose face is to be recognized. But if K-Means algorithm is used instead, some blurred image may get assigned as the centroid, which has mixed features from several photos of the individual and hence makes the face recognition task difficult.

## Practical Implementation

Here’s a demonstration of implementing K-Medoids algorithm on a dataset containing 8*8 dimensional images of handwritten digits. The task is to divide the data points into 10 clusters (for classes 0-9) using K-Medoids. The dataset used is a copy of the test set of the original dataset available on UCI ML Repository. The code here has been implemented in Google colab using Python 3.7.10 and scikit-learn-extra 0.1.0b2 versions. Step-wise explanation of the code is as follows:

1. Install scikit-learn-extra Python module, an extension of scikit-learn designed for implementing more advanced algorithms that cannot be used by mere inclusion of scikit-learn in the code.

`!pip install scikit-learn-extra`

1. Import required libraries and modules.
``` import numpy as np
import matplotlib.pyplot as plt
from sklearn_extra.cluster import KMedoids
#Import the digits’ dataset available in sklearn.datasets package
“””
Instead of using all 64 attributes of the dataset, we use Principal Component Analysis (PCA)  to reduce the dimensions of features set such that most of the useful information is covered.
“””
from sklearn.decomposition import PCA
“””
Import module for standardizing the dataset i.e. rescaling the data such that its has mean of 0 and standard deviation of 1
“””
from sklearn.preprocessing import scale ```
1. Prepare the input data
``` #Load the digits dataset
#Standardize the data
digit_data = scale(dataset.data)
“””
Compute number of output classes i.e. number of digits for which we have the data (here 10 (0-9))
“””
num_digits = len(np.unique(dataset.target)) ```
1. Reduce the dimensions of the data using PCA.
``` red_data = PCA(n_components=2).fit_transform(digit_data)
“””
PCA constructs new components by linear combinations of original features. ‘n_components’ parameter denotes the number of newly formed components to be considered. fit_transform() method fits the PCA models and performs dimensionality reduction on digit_data.
“”” ```
1. Plot the decision boundaries for each cluster. Assign a different color to each for differentiation.
``` h = 0.02 #step size of the mesh
#Minimum and maximum x-coordinates
xmin, xmax = red_data[:, 0].min() - 1, red_data[:, 0].max() + 1
#Minimum and maximum y-coordinates
ymin, ymax = red_data[:, 1].min() - 1, red_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(xmin, xmax, h), np.arange(ymin, ymax, h)) ```
1. Define an array of K-Medoids variants to be used. We have used  three different distance metrics (Manhattan distance, Euclidean distance and Cosine dissimilarity/distance) for computing the distance of each data point from every other data point while selecting the medoid. Visit this page to know about the distance metrics used in detail.

The parameters we have specified in the KMedoids() method have the following significance:

• metric – distance metric to be used (default: ‘euclidean’)
• n_clusters – number of clusters to be formed and hence the number of medoids (one per cluster) (default value: 8)
• init – ‘heuristic’ method used for medoid initialization

For each data point, itd distance from all other points is computed and the

distances are summed up. N_clusters number of points for which such a sum of

distances are minimum, are chosen as medoids.

• max_iter – maximum number of the algorithm’s iterations to be performed when fitting the data

The KMedoids() method of scikit-learn-extra by default used the PAM (Partition Around Medoids) algorithm for finding the medoids.

``` models = [
(
KMedoids(metric="manhattan", n_clusters=num_digits,
init="heuristic", max_iter=2),"Manhattan metric",
),
(
KMedoids(metric="euclidean", n_clusters=num_digits,
init="heuristic", max_iter=2),"Euclidean metric",
),
(KMedoids(metric="cosine", n_clusters=num_digits, init="heuristic",
max_iter=2), "Cosine metric", ),
] ```
1. Initialize the number of rows and columns of the plot for plotting subplots of each of the three metrics’ results.
``` #number of rows = integer(ceiling(number of model variants/2))
num_rows = int(np.ceil(len(models) / 2.0))
#number of columns
num_cols = 2 ```
1. Fit each of the model variants to the data and plot the resultant clustering.
``` #Clear the current figure first (if any)
plt.clf()
#Initialize dimensions of the plot
plt.figure(figsize=(15,10))
“””
The ‘models’ array defined in step (6) contains three tuples, each having a model variant’s parameters and its descriptive text. We iterate through each of the tuples, fit the data to the model and plot the results.
“””
for i, (model, description) in enumerate(models):
# Fit each point in the mesh to the model
model.fit(red_data)
#Predict the labels for points in the mesh
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result  into a color plot
Z = Z.reshape(xx.shape)
#Subplot for the ith model variant
plt.subplot(num_cols, num_rows, i + 1)
#Display the subplot
plt.imshow(
Z,    #data to be plotted
interpolation="nearest",
#bounding box coordinates (left,right,bottom,top)
extent=(xx.min(), xx.max(), yy.min(), yy.max()),
cmap=plt.cm.Paired,  #colormap
aspect="auto", #aspect ratio of the axes
origin="lower",  #set origin as lower left corner of the axes
)
plt.plot(
red_data[:, 0], red_data[:, 1], "k.", markersize=2, alpha=0.3
)
# Plot the centroids as white cross marks
centroids = model.cluster_centers_
plt.scatter(
centroids[:, 0],
centroids[:, 1],
marker="x",
s=169,  #marker’s size (points^2)
linewidths=3, #width of boundary lines
color="w",  #white color for centroids markings
zorder=10,  #drawing order of axes
)
#describing text of the tuple will be title of the subplot
plt.title(description)
plt.xlim(xmin, xmax)  #limits of x-coordinates
plt.ylim(ymin, ymax)  #limits of y-coordinates
plt.xticks(())
plt.yticks(())
#Upper title of the whole plot
plt.suptitle(
#Text to be displayed
"K-Medoids algorithm implemented with different metrics\n\n",
fontsize=20,  #size of the fonts
)
plt.show() ```

Output:

A zealous learner aspiring to advance in the domain of AI/ML. Eager to grasp emerging techniques to get insights from data and hence explore realistic Data Science applications as well.

## Our Upcoming Events

### Telegram group

Discover special offers, top stories, upcoming events, and more.

### Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

### Apple Launches iCringe with a Sustainability Twist

With Mother Nature in mind, Apple is making impactful strides towards carbon-neutral products. However, there is a slight hiccup

### Data Science Hiring Process at Zoho

Zoho has over 10 open positions for both freshers and experienced professionals.

### Will AGI Be Built in China?

AGIEval Seems to Think So

### NVIDIA Expands Cloud Business with Investments, Partnerships

With NVIDIA partnership, Hugging Face users get access to SOTA GPUs and infrastructure needed to rapidly train and finetune foundation models at scale and drive a new wave of enterprise LLM development.

### Intel Soon to be on Par with NVIDIA

A green CPU with a blue GPU might soon be possible.

### Shell Hackathon to Protect Against Cyber Threats

The aim of the Cyber Threat Detection Hackathon is to build a model capable of identifying code in a body of text.

### ChatGPT is Down, I Can’t Code Anymore

Don’t they know I have a product to ship?

### Decoding SAP Labs’ Generative AI Motto

The German ERP software provider is investing heavily in upskilling its employees.

### Why AI Tech Honchos are Meeting Behind Closed Doors

What transpired when the who’s who of tech leaders convened in Capitol Hill last week to discuss AI behind closed doors?

### AI Clock is Ticking: Wake Up Call for Education Institutions

It’s not too late