Last month, we celebrated National Statistics Day to commemorate the 125th birth anniversary of India’s legendary statistician, PC Mahalanobis. His contributions to applied statistics are immense, and he is best known for developing a statistical measure called the Mahalanobis distance, which is used in multivariate analysis. The Mahalanobis distance finds wide application in classification and clustering. In this article, we will explore the Mahalanobis distance (MD) and its significance in statistics.
Regression Analysis In Statistics
Regression analysis is crucial in machine learning because ML models must account for errors in, and relationships among, the variables that go into a model. This branch of statistics is widely used to study which variables matter for a given problem, since ML models and algorithms often have to analyse a variety of data to produce a consistent output.
In multivariate analysis, the data under study is usually delineated into a number of variables, and these variables must be related to a single outcome variable of interest. Multivariate analysis addresses this challenge through methods that use the correlation (strength of relationship) between the variables.
What Is Mahalanobis Distance?
Generally, variables (usually two) in multivariate analysis are described in Euclidean space through a coordinate system (x-axis and y-axis). If there are more than two variables, it becomes difficult both to represent and to measure them along planar coordinates. This is where the Mahalanobis distance (MD) comes into the picture. It takes the mean (sometimes called the centroid) of the multivariate data as its reference point.
The MD measures the distance of an observation from the centroid, scaled by the covariance of the data. Therefore, the farther an observation is from the centroid, the larger the MD. The definition of the concept is given below.
The Mahalanobis distance of an observation x = (x₁, x₂, x₃, …, xN)ᵀ from a set of observations with mean μ = (μ₁, μ₂, μ₃, …, μN)ᵀ and covariance matrix S is defined as:
MD(x) = √((x − μ)ᵀ S⁻¹ (x − μ))
The covariance matrix captures how the variables vary together; using it accounts for the joint effect of two or more variables, rather than treating each variable in isolation.
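The definition above can be sketched directly in NumPy. This is a minimal illustration on toy, randomly generated data (all values here are hypothetical):

```python
import numpy as np

# Toy multivariate data: 200 observations of 3 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

mu = X.mean(axis=0)               # centroid of the data
S = np.cov(X, rowvar=False)       # covariance matrix
S_inv = np.linalg.inv(S)

def mahalanobis(x, mu, S_inv):
    """MD(x) = sqrt((x - mu)^T S^-1 (x - mu))."""
    d = x - mu
    return np.sqrt(d @ S_inv @ d)

x = np.array([1.0, 2.0, 3.0])
print(mahalanobis(x, mu, S_inv))
```

Note that the distance of the centroid from itself is zero, and every other point has a strictly positive MD, as the formula requires. In practice, `scipy.spatial.distance.mahalanobis` provides an equivalent ready-made implementation.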
Uses And Applications
As mentioned earlier, MD is primarily used in classification and clustering problems where there is a need to establish correlation between different groups/clusters of data. Other applications include discriminant analysis and pattern recognition, both of which build on classification.
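As a sketch of how MD supports classification, the snippet below assigns a point to whichever of two clusters has the smaller Mahalanobis distance to it, i.e. a nearest-centroid rule that accounts for each cluster's covariance. The clusters and the query point are hypothetical toy data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two hypothetical clusters with well-separated centroids.
A = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(300, 2))
B = rng.normal(loc=[5.0, 5.0], scale=1.0, size=(300, 2))

def md(x, data):
    """Mahalanobis distance of x from the centroid of `data`."""
    mu = data.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(data, rowvar=False))
    d = x - mu
    return np.sqrt(d @ S_inv @ d)

x = np.array([4.5, 4.8])
label = "A" if md(x, A) < md(x, B) else "B"
print(label)  # the point lies near cluster B's centroid
```

Unlike the plain Euclidean rule, this classifier automatically down-weights directions in which a cluster is widely spread, which is exactly why MD is preferred when the clusters have different shapes.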
It has also found relevance in principal component analysis (PCA), where correlated variables are transformed into a set of uncorrelated variables called principal components. If the MD is squared, it equals the sum of squares of the principal component scores, each standardized by its component's variance. This is helpful for choosing the right components for a given data-analysis requirement.
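This relationship between the squared MD and the principal components can be verified numerically. Below, the squared MD computed from the definition is compared against the sum of squared PC scores standardized by their eigenvalues (the component variances); the data is a hypothetical correlated sample:

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated toy data: 500 observations of 3 variables.
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 1.0, 0.0],
                                          [0.0, 1.0, 1.0],
                                          [0.0, 0.0, 3.0]])

mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)

# Squared MD computed directly from the definition.
x = X[0]
d = x - mu
md2_direct = d @ np.linalg.inv(S) @ d

# Same quantity via PCA: project the deviation onto the eigenvectors
# of S, then standardize each score by its eigenvalue (the variance
# of that principal component).
eigvals, eigvecs = np.linalg.eigh(S)
scores = eigvecs.T @ d
md2_pca = np.sum(scores**2 / eigvals)

print(np.isclose(md2_direct, md2_pca))  # True
```

The identity follows from the eigendecomposition S = VΛVᵀ, which gives dᵀS⁻¹d = Σᵢ (vᵢᵀd)²/λᵢ.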
Use Cases Of Mahalanobis Distance
Image processing: The usefulness of MD in image processing has spurred researchers to apply the concept across various areas of the field. One such study concerns anomaly detection in hyperspectral images, which are used to detect surface materials on the ground. A group of researchers, publishing with IEEE, developed a low-rank and sparse matrix decomposition-based Mahalanobis distance method for anomaly detection. It uses MD to detect probable anomalies in the images obtained from the sparse matrix decomposition.
Precision Medicine: A specific study in RNA sequencing applied MD to analyse molecules to predict breast cancer survival. The researchers used MD to establish the statistical significance of deregulated pathways in a subject’s transcriptome. Results show that the pathways derived using MD improve on previous research and offer better clinical interpretation of cancer survival.
Neurocomputing: Another study incorporates MD in the detection of arrhythmia in an electrocardiogram (ECG). Here, MD resolves the clustering problems associated with the traditional Euclidean distance (ED) when clustering ECG features. The method is a Fuzzy C-means (FCM) clustering based on MD. Results show that ECG program iterations are reduced by almost 50 percent when MD-based FCM is used.
Physics: In physics, positron emission tomography (PET) finds application in medicine as well as in research, mainly in neuroimaging. One novel study implements MD to analyse gamma quanta by reconstructing signals at different energy levels, which is necessary for high-resolution images. The use of MD significantly improves image resolution, with better accuracy.
Comments:
The Mahalanobis distance is a very useful statistical measure in multivariate analysis, and many applications involving multivariate analysis can benefit from it. That said, it is important to check the number of variables in a proposed solution before using MD, since a very large number of variables can diminish its significance.