Understanding Mahalanobis Distance And Its Use Cases

road-166543_1920

Advertisement

Last month, we celebrated the National Statistics Day to commemorate the 125th birth anniversary of India’s legendary statistician, PC Mahalanobis. His contributions to applied statistics are immense and he is best known for developing a statistical measure called Mahalanobis distance, which is used in multivariate regression analysis. Mahalanobis distance finds wide applications in the field of classification and clustering. In this article, we will explore the Mahalanobis distance (MD) and its significance in statistics.

Regression Analysis In Statistics

Regression analysis is crucial in machine learning due to the fact that ML deals with errors and relationships in the data that goes into the model. This topic of statistics is widely used to study the variables that account in the project. ML models and algorithms sometimes analyse a variety of data to figure out a consistent output.

THE BELAMY

Sign up for your weekly dose of what's up in emerging technology.

In multivariate analysis, the data under a study’s consideration, is usually delineated into a number of variables. These variables have to denote a relationship with respect to a standalone variable for the study. Multivariate analysis helps in addressing this challenge through methods that use correlation (strength of relationship) between the variables.

What Is Mahalanobis Distance?

Generally, variables (usually two in number) in the multivariate analysis are described in a Euclidean space through a coordinate (x-axis and y-axis) system. Suppose if there are more than two variables, it is difficult to represent them as well as measure the variables along the planar coordinates. This is where the Mahalanobis distance (MD) comes into picture. It considers the mean (sometimes called centroid) of the multivariate data as the reference.

The MD measures the relative distance between two variables with respect to the centroid. Therefore, farther the variable is from the centroid, the larger the MD is. The definition of the concept is given below.

The Mahalanobis distance of an observation x = (x1, x2, x3….xN)T from a set of observations with mean μ= (μ123….μN)T and covariance matrix S is defined as:

MD(x) = √{(xμ)TS-1 (xμ)

The covariance matrix provides the covariance associated with the variables (the reason covariance is followed is to establish the effect of two or more variables together).

Uses And Applications

As mentioned earlier, MD is primarily used in classification and clustering problems where there is a need to establish correlation between different groups/clusters of data. Another application of MD is discriminant analysis and pattern analysis, which are based on classification.

It has also found relevance in principal component analysis (PCA), where the correlated variables are transformed to a set of uncorrelated variables called principal components. If the value of MD is squared, it is found that the sum of squares of all non-zero principal components are equal to that of MD. This is helpful to ascertain the right components based on the requirement in data analysis.

Use Cases Of Mahalanobis Distance

Image processing: The aspect of MD in image processing has spurred researchers to bring in this concept to serve various areas of the field. One such study is the anomaly detection in hyperspectral images, which are used to detect surface materials in the ground. A group of researchers from IEEE have developed a method called low-rank and sparse matrix decomposition-based Mahalanobis distance method for anomaly detection. This uses MD to detect the probable anomalies lying in the images analysed from sparse matrix decomposition.

Precision Medicine: A specific study in RNA sequencing has applied MD to analyse molecules to predict breast cancer survival. The researchers make use of MD to establish statistical significance of deregulated pathways in a subject’s transcriptome. Results show that the pathways derived from MD has improvement over previous research on the study, and offer better clinical interpretation of cancer survival.

Neurocomputing: Another research study incorporates MD in the detection of arrhythmia in an electrocardiogram (ECG). In this study, MD resolves the clustering problems associated with traditional Euclidean Distance (ED) observed in clustering features in ECG. The method followed is a Fuzzy C-means (FCM) clustering based on MD. Results obtained in the study show that ECG program iterations are reduced by almost 50 percent when MD-based FCM is used.

Physics: In Physics, Positron Emission Tomography (PET) is one specific area which finds application in medicine as well as in research, mainly in neuroimaging. One novel study implements MD to analyse gamma quanta by reconstructing signals in the energy levels, which is necessary for high-resolution images. The use of MD improves the image resolution significantly with better accuracy.

Comments:

Mahalanobis Distance is a very useful statistical measure in multivariate analysis. Any application that incorporates multivariate analysis is bound to use MD for better results. Furthermore, it is important to check the variables in the proposed solution using MD since a large number might diminish the significance of MD.

More Great AIM Stories

Abhishek Sharma
I research and cover latest happenings in data science. My fervent interests are in latest technology and humor/comedy (an odd combination!). When I'm not busy reading on these subjects, you'll find me watching movies or playing badminton.

Our Upcoming Events

Conference, in-person (Bangalore)
MachineCon 2022
24th Jun

Conference, Virtual
Deep Learning DevCon 2022
30th Jul

Conference, in-person (Bangalore)
Cypher 2022
21-23rd Sep

3 Ways to Join our Community

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Telegram Channel

Discover special offers, top stories, upcoming events, and more.

Subscribe to our newsletter

Get the latest updates from AIM
MORE FROM AIM