Data comes in many forms: numerical and/or categorical data in tabular form, image data, video data, text data, and audio data. Whatever its form, the size of the data plays an important role in data science: it affects the choice of storage space, compute memory, and hardware and software configurations such as distributed processing. With vast amounts of new data generated every day, requiring petabytes of storage, smaller data is preferred in most cases. This raises some common questions: What decides the size of data? Is it a variable? If so, how can we reduce the size of a given dataset? This article discusses the answers to these questions.
Suppose a tabular dataset is generated for a task using a questionnaire that contains 100 questions. Another tabular dataset is generated with a second questionnaire that contains only the 20 most important questions for the same task. The former dataset is said to have 100 features, and the latter 20 features. In other words, 100 and 20 are the dimensions of the corresponding datasets. For the same number of respondents, 100 features need more memory than 20 features. Thus, the number of features (or dimensions) decides the size of the data. If the same task can be fulfilled with either 100 features or 20 features, the size of the data is a variable, roughly proportional to the number of features or dimensions.
We say that the 20 most important features can replace the 100 features. But how do we decide which 20 features are important? It is difficult to pick the most important features before collecting the data, because features dismissed as less important may contain valuable information. Hence, the usual practice is to collect data with a larger number of dimensions.
The exponential growth in the number of smartphones and high-resolution camera devices yields huge images and videos, so high-dimensional data collection is unavoidable. At the same time, a machine learning or deep learning model needs an enormous amount of data for training. The model and the hardware memory cannot always handle a large volume of such large data. Moreover, the higher the number of dimensions, the higher the chance of producing misleading patterns.
This is where dimensionality reduction comes in. Given high-dimensional data, dimensionality reduction is the process of extracting the most important dimensions and discarding the unimportant ones. Popular dimensionality reduction techniques follow solid mathematical procedures derived from statistics and/or linear algebra. Here, we discuss ideas from some valuable articles and tutorials that guide us in understanding and implementing the different dimensionality reduction techniques.
To Begin With
The problem of high dimensions goes back decades, and there have been numerous attempts to tackle it. Unnecessary dimensions act as noise that obscures the true pattern, and a model may learn the noise rather than the true pattern. Moreover, this noise can lead to non-convergence in some dynamic problems. Human eyes can comfortably visualize and understand data plotted in two dimensions and, to some extent, in three dimensions. Data analysis tools are mostly developed for low-dimensional data (2D or 3D). Thus, analysing and visualizing high-dimensional data is hard and less effective in most cases.
The following fields are badly affected by the high-dimensionality of the data:
The approaches to dimensionality reduction can be roughly classified into two categories. The first discards low-variance features. The second transforms all the features into a few high-variance features. In the former approach, we keep a few of the original features, which do not undergo any alteration. In the latter approach, we keep none of the original features; instead, we have a few mathematically transformed features.
The former approach is straightforward. In its simplest form, the Low Variance Filter, it measures the variance of each feature, assumes that a feature with minimal variance carries little pattern, and discards features in order of their variance from the lowest to the highest. Other popular techniques in this category, such as Backward Feature Elimination, Forward Feature Construction and Lasso Regression, also keep a subset of the original features, but they rank features by their contribution to a model rather than by variance alone.
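The low variance filter can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn's `VarianceThreshold` as the filter; the toy data and the threshold of 0.05 are hypothetical choices made only for the example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 4 features.
X = np.column_stack([
    rng.normal(0.0, 1.0, 200),    # clearly varying feature
    rng.normal(0.0, 2.0, 200),    # clearly varying feature
    np.full(200, 3.0),            # constant feature, zero variance
    rng.normal(0.0, 0.01, 200),   # almost constant feature
])

# Features whose variance does not exceed the threshold are dropped.
# The value 0.05 is an arbitrary choice for this illustration.
selector = VarianceThreshold(threshold=0.05)
X_reduced = selector.fit_transform(X)

kept = selector.get_support(indices=True)
print("per-feature variances:", selector.variances_)
print("kept column indices:", kept)
print("shape before and after:", X.shape, X_reduced.shape)
```

Variance depends on the scale of each feature, so in practice the features would usually need to be on comparable scales before such a filter is meaningful.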
The latter approach claims that even a less important feature may hold a small piece of valuable information. It does not agree with discarding features based on variance analysis. Rather, it generates a set of new low-dimensional features out of the original high-dimensional features through mathematical transformations developed with linear algebra and statistics. The new features are constructed to concentrate as much variance (or, in the case of LDA, as much class separation) as possible in a few features. Principal Component Analysis (PCA), Singular Value Decomposition (SVD) and Linear Discriminant Analysis (LDA) are the popular techniques that fall under this category.
Algorithms and Examples for Dimensionality Reduction
One of the most popular algorithms is Principal Component Analysis. It takes data in the form of a matrix and transforms the large set of variables into a smaller set while still retaining most of the information from the original matrix. High-dimensional data such as images and videos are usually represented as matrices. Moreover, the approach of PCA is derived from matrix algebra.
The eigenvalues and eigenvectors are calculated from the covariance matrix of the data. The sum of the eigenvalues is the total variance of the data. The eigenvector corresponding to the largest eigenvalue is the direction along which the data varies the most; the data projected onto it is the new feature technically termed the first Principal Component. The eigenvector with the second-largest eigenvalue gives the second Principal Component, and so on. Thus, the top few Principal Components account for most of the variance in the data.
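The eigenvalue view of PCA can be traced step by step with NumPy. The sketch below assumes the decomposition is applied to the covariance matrix of the centered data; the three-feature dataset is hypothetical and generated only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 500 samples, 3 features, two of them strongly correlated.
latent = rng.normal(size=(500, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(500, 1)),
    2.0 * latent + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

# Center the data and compute the covariance matrix of the features.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh suits symmetric matrices; it returns eigenvalues in ascending order,
# so reorder them from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance carried by each principal component.
explained_ratio = eigenvalues / eigenvalues.sum()

# Project onto the top k eigenvectors to get the first k principal components.
k = 2
X_projected = X_centered @ eigenvectors[:, :k]

print("explained variance ratio:", explained_ratio)
print("projected shape:", X_projected.shape)
```

The variance of each projected column equals the corresponding eigenvalue, which is why sorting by eigenvalue ranks the components by how much of the data's variance they carry.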
In Python, SciKit-Learn’s decomposition module offers dedicated methods for principal component-based dimensionality reduction, and PCA can be applied as a preprocessing step for a classification problem in machine learning.
PCA acts like a black box: it captures the variability among the independent variables without reference to the dependent variable, and it cannot be used to see which independent variables are more important for prediction.
Apart from PCA, many other techniques can be used to reduce the size of the data: Linear Discriminant Analysis (LDA), Singular Value Decomposition (SVD), Kernel PCA, etc.
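As a rough sketch of how PCA might sit in front of a classifier, the example below uses `sklearn.decomposition.PCA` inside a pipeline. The digits dataset, the choice of 20 components and the logistic regression classifier are all assumptions made for illustration, not a recommended configuration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 8x8 digit images flattened to 64 features each (bundled with scikit-learn).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale, reduce 64 features to 20 principal components, then classify.
# n_components=20 is an illustrative choice, not a tuned value.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    LogisticRegression(max_iter=2000),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("features before PCA:", X.shape[1])
print("features after PCA:", model.named_steps["pca"].n_components_)
print("test accuracy:", round(accuracy, 3))
```

Putting PCA inside the pipeline means it is fitted on the training split only, so the test split does not leak into the learned components.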
Linear Discriminant Analysis groups data based on the target (or classes), and it models the variations among those classes (inter-class variations) and the similarities within the classes (intra-class similarity). PCA, on the other hand, ignores the target: it processes only the independent features, computing the covariance matrix and its eigenvalues and eigenvectors to arrive at the principal components.
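The difference between the two shows up directly in how they are called. The following sketch assumes scikit-learn's `LinearDiscriminantAnalysis` and `PCA` classes and uses the bundled iris dataset purely as an example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# PCA is unsupervised: it only ever sees the features.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: fitting requires the class labels as well.
# It can produce at most (number of classes - 1) components, here 2.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print("PCA output shape:", X_pca.shape)
print("LDA output shape:", X_lda.shape)
```

Both calls return two new features per sample, but only the LDA features were constructed with the class labels in view.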
One of the major drawbacks of PCA is that it captures only linear structure in the data. Kernel PCA is a special form of PCA that helps model non-linear data. Depending on the kernel, the class boundaries obtained from the results of Kernel PCA can be quadratic, cubic, or higher-order curves, whereas those of PCA are straight lines.
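A small experiment can make the contrast concrete. The sketch below assumes scikit-learn's `KernelPCA` with an RBF kernel on a synthetic two-circles dataset; the value `gamma=2.0` is hand-picked for this toy data, and a linear classifier is used only as a probe of how separable the transformed features are.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: the classes cannot be separated by a straight line.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA can only rotate and rescale these two features.
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel; gamma=2.0 is an illustrative choice.
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0).fit_transform(X)

# Fit the same linear classifier on each representation and compare
# training accuracy as a rough measure of linear separability.
acc_linear = LogisticRegression().fit(X_pca, y).score(X_pca, y)
acc_kernel = LogisticRegression().fit(X_kpca, y).score(X_kpca, y)

print("linear classifier on PCA features:", round(acc_linear, 3))
print("linear classifier on Kernel PCA features:", round(acc_kernel, 3))
```

The outcome depends heavily on the kernel and its parameters, so these would normally be tuned for the dataset at hand.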
Singular Value Decomposition is a method for matrix decomposition. It is considered more numerically stable than eigenvalue decomposition. Singular Value Decomposition (SVD) is available as a function in NumPy’s linear algebra module (numpy.linalg.svd). It returns the singular values sorted in descending order, so the components are automatically arranged by importance. The top-ranked components contribute most to the original data.
For example, consider performing SVD on an image. SVD decomposes the original image into three components: a U matrix, a vector of singular values (Sigma), and a V matrix. Their entries are arranged according to the rank of the singular values. By keeping only the top few columns of U, the top few singular values and the corresponding rows of the transposed V matrix, and multiplying them back together, one can obtain a low-rank approximation of the image with most details preserved.
To continue with examples, we can also compress an image using PCA. An image is a matrix-like representation of pixel values. Pixel values are usually non-negative real numbers that denote the colour intensity of the corresponding points in the image. Viewing the image as a matrix, PCA calculates the covariance matrix and its eigenvalues and eigenvectors, and re-forms the entire image from a few principal components. Similar to SVD, PCA reconstructs a high-dimensional image from a low-dimensional representation while losing little important information.
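The image example can be sketched with `numpy.linalg.svd`. Since no image file accompanies this article, the code below builds a hypothetical 128x128 grayscale image from a gradient, a bright rectangle and a little noise; the rank `k = 10` is likewise an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grayscale image: smooth gradient, bright rectangle, mild noise.
rows, cols = np.mgrid[0:128, 0:128]
image = 0.5 * rows / 127 + 0.5 * cols / 127
image[32:96, 48:80] += 0.5
image += 0.02 * rng.normal(size=image.shape)

# Decompose: image = U @ diag(s) @ Vt, with s sorted in descending order.
U, s, Vt = np.linalg.svd(image, full_matrices=False)

# Keep only the top k singular values and the matching columns and rows.
k = 10
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Relative reconstruction error in the Frobenius norm.
rel_error = np.linalg.norm(image - approx) / np.linalg.norm(image)

# Numbers stored for the truncated factors versus the full image.
stored_values = k * (U.shape[0] + Vt.shape[1] + 1)

print("relative error:", round(rel_error, 4))
print("values stored:", stored_values, "instead of", image.size)
```

Real photographs usually need a larger `k` than this synthetic image, so the rank is best chosen by checking the error or by inspecting the result visually.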
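A comparable sketch with PCA treats each row of the image as one observation and each column as one feature. It assumes scikit-learn's `PCA` class; the 96x128 image is synthetic and hypothetical, and keeping 5 components is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical 96x128 grayscale image made of smooth waves plus mild noise.
y_idx, x_idx = np.mgrid[0:96, 0:128]
image = (
    0.5
    + 0.25 * np.sin(x_idx / 10.0)
    + 0.2 * np.sin(y_idx / 8.0) * np.cos(x_idx / 12.0)
    + 0.02 * rng.normal(size=(96, 128))
)

# Rows are samples, columns are features: compress 128 columns to 5 scores.
pca = PCA(n_components=5)
scores = pca.fit_transform(image)

# Map the compressed representation back to the original pixel space.
reconstructed = pca.inverse_transform(scores)

rel_error = np.linalg.norm(image - reconstructed) / np.linalg.norm(image)
explained = pca.explained_variance_ratio_.sum()

print("compressed shape:", scores.shape)
print("explained variance:", round(explained, 4))
print("relative error:", round(rel_error, 4))
```

The reconstruction also needs the stored components and the per-column mean, so the actual saving is smaller than the shape of `scores` alone suggests.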
Although PCA is great for data compression, there are a few situations where it cannot be applied directly or needs care:
- PCA can be applied to numerical data but not directly to categorical data. Categorical data must be converted into numeric form by one-hot encoding or another suitable method before applying PCA to it.
- PCA completely transforms the input data into new data. In other words, it replaces all the original features with new ones, so we cannot easily interpret which original features contribute to which portion.
- When each feature in the input data is important to the model, we might accidentally discard some crucial patterns by keeping too few principal components. It is important to choose the number of resulting principal components correctly.
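One common way to choose the number of components is to look at the cumulative explained variance. The sketch below assumes scikit-learn's `PCA` and the bundled digits dataset; the 95% target is an arbitrary illustrative threshold, not a rule.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance ratios.
full_pca = PCA(svd_solver="full").fit(X_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 95%.
n_for_95 = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# scikit-learn can make the same choice when n_components is a fraction.
auto_pca = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)

print("original features:", X.shape[1])
print("components for 95% variance:", n_for_95)
print("components chosen by PCA(n_components=0.95):", auto_pca.n_components_)
```

Explained variance is only a proxy: when the reduced data feeds a model, checking the model's validation score for several component counts is a useful complement.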
Interested in dimensionality reduction and wish to read more? Here are some good resources.
- The Curse of Dimensionality
- High Dimensional Geometry and Dimension Reduction
- Singular Value Decomposition
- Linear discriminant analysis
- SciKit-Learn’s Decomposition module
- SciKit-Learn’s Discriminant Analysis module
- NumPy’s Linear Algebra module