K-Medoids is a clustering algorithm resembling the K-Means clustering technique. It falls under the category of unsupervised machine learning. It majorly differs from the K-Means algorithm in terms of the way it selects the clusters’ centres. The former selects the average of a cluster’s points as its centre (which may or may not be one of the data points) while the latter always picks the actual data points from the clusters as their centres (also known as ‘exemplars’ or ‘medoids’). K-Medoids also differs in this respect from the K-Medians algorithm whic,h is the same as K-means, except that it chooses the medians (instead of means) of the clusters as centres.
We have already covered the basics of the K-Means clustering method in our previous article. Refer to it before proceeding if you are unfamiliar with the K-Means algorithm.
Working of the K-Medoids approach
The steps followed by the K-Medoids algorithm for clustering are as follows:
- Randomly choose ‘k’ points from the input data (‘k’ is the number of clusters to be formed). The correctness of the choice of k’s value can be assessed using methods such as silhouette method.
- Each data point gets assigned to the cluster to which its nearest medoid belongs.
- For each data point of cluster i, its distance from all other data points is computed and added. The point of ith cluster for which the computed sum of distances from other points is minimal is assigned as the medoid for that cluster.
- Steps (2) and (3) are repeated until convergence is reached i.e. the medoids stop moving.
Complexity of K-Medoids algorithm
The complexity of the K-Medoids algorithm comes to O(N2CT) where N, C and T denote the number of data points, number of clusters and number of iterations respectively. With similar notations, the complexity K-Means algorithm can be given as O(NCT).
Advantages of the technique
Mean of the data points is a measure that gets highly affected by the extreme points. So in K-Means algorithm, the centroid may get shifted to a wrong position and hence result in incorrect clustering if the data has outliers because then other points will move away from . On the contrary, a medoid in the K-Medoids algorithm is the most central element of the cluster, such that its distance from other points is minimum. Since medoids do not get influenced by extremities, the K-Medoids algorithm is more robust to outliers and noise than K-Means algorithm. The following figure explains how mean’s and medoid’s positions can vary in the presence of an outlier.
Besides, K-Medoids algorithm can be used with arbitrarily chosen dissimilarity measure (e.g. cosine similarity) or any distance metric, unlike K-Means which usually needs Euclidean distance metric to arrive at efficient solutions.
K-Medoids algorithm is found useful for practical applications such as face recognition. The medoid can correspond to the typical photo of the individual whose face is to be recognized. But if K-Means algorithm is used instead, some blurred image may get assigned as the centroid, which has mixed features from several photos of the individual and hence makes the face recognition task difficult.
Here’s a demonstration of implementing K-Medoids algorithm on a dataset containing 8*8 dimensional images of handwritten digits. The task is to divide the data points into 10 clusters (for classes 0-9) using K-Medoids. The dataset used is a copy of the test set of the original dataset available on UCI ML Repository. The code here has been implemented in Google colab using Python 3.7.10 and scikit-learn-extra 0.1.0b2 versions. Step-wise explanation of the code is as follows:
- Install scikit-learn-extra Python module, an extension of scikit-learn designed for implementing more advanced algorithms that cannot be used by mere inclusion of scikit-learn in the code.
!pip install scikit-learn-extra
- Import required libraries and modules.
import numpy as np import matplotlib.pyplot as plt from sklearn_extra.cluster import KMedoids #Import the digits’ dataset available in sklearn.datasets package from sklearn.datasets import load_digits “”” Instead of using all 64 attributes of the dataset, we use Principal Component Analysis (PCA) to reduce the dimensions of features set such that most of the useful information is covered. “”” from sklearn.decomposition import PCA “”” Import module for standardizing the dataset i.e. rescaling the data such that its has mean of 0 and standard deviation of 1 “”” from sklearn.preprocessing import scale
- Prepare the input data
#Load the digits dataset dataset = load_digits() #Standardize the data digit_data = scale(dataset.data) “”” Compute number of output classes i.e. number of digits for which we have the data (here 10 (0-9)) “”” num_digits = len(np.unique(dataset.target))
- Reduce the dimensions of the data using PCA.
red_data = PCA(n_components=2).fit_transform(digit_data) “”” PCA constructs new components by linear combinations of original features. ‘n_components’ parameter denotes the number of newly formed components to be considered. fit_transform() method fits the PCA models and performs dimensionality reduction on digit_data. “””
- Plot the decision boundaries for each cluster. Assign a different color to each for differentiation.
h = 0.02 #step size of the mesh #Minimum and maximum x-coordinates xmin, xmax = red_data[:, 0].min() - 1, red_data[:, 0].max() + 1 #Minimum and maximum y-coordinates ymin, ymax = red_data[:, 1].min() - 1, red_data[:, 1].max() + 1 xx, yy = np.meshgrid(np.arange(xmin, xmax, h), np.arange(ymin, ymax, h))
- Define an array of K-Medoids variants to be used. We have used three different distance metrics (Manhattan distance, Euclidean distance and Cosine dissimilarity/distance) for computing the distance of each data point from every other data point while selecting the medoid. Visit this page to know about the distance metrics used in detail.
The parameters we have specified in the KMedoids() method have the following significance:
- metric – distance metric to be used (default: ‘euclidean’)
- n_clusters – number of clusters to be formed and hence the number of medoids (one per cluster) (default value: 8)
- init – ‘heuristic’ method used for medoid initialization
For each data point, itd distance from all other points is computed and the
distances are summed up. N_clusters number of points for which such a sum of
distances are minimum, are chosen as medoids.
- max_iter – maximum number of the algorithm’s iterations to be performed when fitting the data
The KMedoids() method of scikit-learn-extra by default used the PAM (Partition Around Medoids) algorithm for finding the medoids.
models = [ ( KMedoids(metric="manhattan", n_clusters=num_digits, init="heuristic", max_iter=2),"Manhattan metric", ), ( KMedoids(metric="euclidean", n_clusters=num_digits, init="heuristic", max_iter=2),"Euclidean metric", ), (KMedoids(metric="cosine", n_clusters=num_digits, init="heuristic", max_iter=2), "Cosine metric", ), ]
- Initialize the number of rows and columns of the plot for plotting subplots of each of the three metrics’ results.
#number of rows = integer(ceiling(number of model variants/2)) num_rows = int(np.ceil(len(models) / 2.0)) #number of columns num_cols = 2
- Fit each of the model variants to the data and plot the resultant clustering.
#Clear the current figure first (if any) plt.clf() #Initialize dimensions of the plot plt.figure(figsize=(15,10)) “”” The ‘models’ array defined in step (6) contains three tuples, each having a model variant’s parameters and its descriptive text. We iterate through each of the tuples, fit the data to the model and plot the results. “”” for i, (model, description) in enumerate(models): # Fit each point in the mesh to the model model.fit(red_data) #Predict the labels for points in the mesh Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) # Put the result into a color plot Z = Z.reshape(xx.shape) #Subplot for the ith model variant plt.subplot(num_cols, num_rows, i + 1) #Display the subplot plt.imshow( Z, #data to be plotted interpolation="nearest", #bounding box coordinates (left,right,bottom,top) extent=(xx.min(), xx.max(), yy.min(), yy.max()), cmap=plt.cm.Paired, #colormap aspect="auto", #aspect ratio of the axes origin="lower", #set origin as lower left corner of the axes ) plt.plot( red_data[:, 0], red_data[:, 1], "k.", markersize=2, alpha=0.3 ) # Plot the centroids as white cross marks centroids = model.cluster_centers_ plt.scatter( centroids[:, 0], centroids[:, 1], marker="x", s=169, #marker’s size (points^2) linewidths=3, #width of boundary lines color="w", #white color for centroids markings zorder=10, #drawing order of axes ) #describing text of the tuple will be title of the subplot plt.title(description) plt.xlim(xmin, xmax) #limits of x-coordinates plt.ylim(ymin, ymax) #limits of y-coordinates plt.xticks(()) plt.yticks(()) #Upper title of the whole plot plt.suptitle( #Text to be displayed "K-Medoids algorithm implemented with different metrics\n\n", fontsize=20, #size of the fonts ) plt.show()
Join Our Discord Server. Be part of an engaging online community. Join Here.
Subscribe to our NewsletterGet the latest updates and relevant offers by sharing your email.
A zealous learner aspiring to advance in the domain of AI/ML. Eager to grasp emerging techniques to get insights from data and hence explore realistic Data Science applications as well.