When working with large real-world datasets, the curse of dimensionality is a common problem, and dimensionality reduction techniques can help overcome it. A typical first step is to remove redundant and irrelevant features from the dataset so that only the relevant ones remain. Choosing a visualization method for such high-dimensional data is also a time-consuming task. To automate this process, we can use HyperTools, a Python-based tool designed specifically for visualizing high-dimensional data. In this article, we will discuss HyperTools in detail and see how it can help with this task. Below are the major points that we are going to discuss.
Table of contents
- Note on data visualization
- How does HyperTools visualize high-dimensional data?
- Visualizing high-dimensional data with HyperTools
Let’s first discuss data visualization.
Note on data visualization
Data visualization is a graphical representation of information and data. Data visualization techniques, such as charts, graphs, and maps, make it simple to see and understand trends, outliers, and patterns in data.
We have a lot of data in our hands nowadays, so data visualization tools and technologies are essential for analyzing massive amounts of data and making data-driven decisions. Data visualization is used in a variety of fields. It also lets us visualize phenomena that cannot be seen directly, such as weather patterns, medical conditions, or mathematical relationships, which helps when simulating complex events.
How does HyperTools visualize high-dimensional data?
While plotting the change in a variable over time as coordinates on a graph is a relatively simple way to visualize low-dimensional data, it is not always obvious how to visualize high-dimensional datasets in a similarly intuitive way.
In this section, we’ll look at HyperTools, an open-source Python toolbox that creates visualizations from high-dimensional datasets and handles the dimensionality reduction itself. It is built primarily on matplotlib, sklearn, and seaborn. At its core, the tool uses dimensionality reduction techniques to embed high-dimensional datasets in a lower-dimensional space, then plots the data through a simple yet powerful API that offers numerous data manipulation options (e.g., hyperalignment, clustering, normalizing) and plot styling options.
Data trajectories and point clouds are central to the toolbox’s design. HyperTools uses dimensionality reduction algorithms to create similar 2D and 3D trajectories for time series of high-dimensional observations. The trajectories can be visualized as animations or as interactive static plots. Static datasets (e.g. collections of observations or attributes) can benefit from the same dimensionality reduction and alignment algorithms.
In summary, the HyperTools toolbox provides a set of powerful functions for visualizing and manipulating high-dimensional data with the help of dimensionality reduction and data alignment algorithms. The toolbox is built with ease of use in mind, so even complex visualizations and analyses can often be done with a single line of code. Another major goal is to make it simple for users to create visually appealing, publication-quality plots.
The picture below has two parts. Part (a) depicts how the toolbox uses dimensionality reduction algorithms to map the data into 2D and 3D plots. Part (b) shows data from different coordinate systems: the left panel shows three trajectories with similar geometries but different coordinate systems, while the right panel shows how those trajectories can be aligned into a common space using hyperalignment (via linear transformations).
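To give a feel for what aligning trajectories through a linear transformation means, here is a minimal sketch in plain NumPy. It solves an orthogonal Procrustes problem, which is a simplified stand-in and not the HyperTools hyperalignment implementation. The helper name `procrustes_align` and the toy spiral are assumptions made for this illustration.

```python
import numpy as np

def procrustes_align(source, target):
    """Rotate `source` (S x F) so that it best matches `target` (S x F).

    Hypothetical helper for illustration: solves the orthogonal
    Procrustes problem with a singular value decomposition.
    """
    # Centre both trajectories so only the rotation has to be estimated
    source_c = source - source.mean(axis=0)
    target_c = target - target.mean(axis=0)
    # The SVD of the cross-covariance yields the optimal orthogonal matrix
    u, _, vt = np.linalg.svd(source_c.T @ target_c)
    rotation = u @ vt
    # Apply the rotation and move the result to the target's centre
    return source_c @ rotation + target.mean(axis=0)

# Toy data: a 3D spiral and a copy of it in a rotated coordinate system
t = np.linspace(0, 4 * np.pi, 200)
spiral = np.column_stack([np.cos(t), np.sin(t), t])
theta = np.pi / 3
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
rotated = spiral @ rot_z

aligned = procrustes_align(rotated, spiral)
max_error = np.abs(aligned - spiral).max()  # close to zero after alignment
```

Because the two trajectories here differ only by a rotation, the aligned copy lands on top of the original. Real datasets would only match approximately.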
Visualizing high-dimensional data with HyperTools
To use this toolbox, we first need to install it, which can be done with pip. Installing with pip without specifying a version pulls in the latest release, which currently runs into a version conflict with its dependencies. To avoid this, install version 0.6.3.
! pip install hypertools==0.6.3
To use the HyperTools plot functionality, first load the dataset to be analyzed into the Python workspace and convert it to a NumPy array or a Pandas data frame. The data should be laid out as samples (S) by features (F), that is, one row per sample and one column per feature. Once the dataset is in this format, simply import the library and call the plot function, as shown below for various plots. We’ll be using built-in datasets in this case.
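As a small sketch of the samples-by-features layout described above, the snippet below turns a list of rows into a NumPy array and reads off S and F. The values and the name `raw_rows` are made up for illustration only.

```python
import numpy as np

# Hypothetical raw measurements: 4 observations, each with 3 features
raw_rows = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
]

# Rows are samples (S) and columns are features (F)
data = np.asarray(raw_rows, dtype=float)
n_samples, n_features = data.shape
```

If the data arrives the other way round (features as rows), transposing it with `data.T` gives the expected orientation.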
For the first plot, we use the built-in spiral dataset, which contains NumPy array data of 3D spirals and is used to highlight linear transformations of data.
import hypertools as hyp

# Load the built-in sample data
spiral = hyp.load('spiral')

# Create the plot
spiral.plot(size=[7, 5])
This plot function will perform dimensionality reduction (by default, using PCA), convert the S x F data matrix to an S x 3 matrix, and then generate an interactive 3D line plot that can be visually explored as below.
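The reduction from an S x F matrix to an S x 3 matrix can be sketched in plain NumPy as follows. This is a minimal PCA through a singular value decomposition, written for illustration under the assumption of random toy data. It is not the code HyperTools runs internally, and `pca_reduce` is a hypothetical helper name.

```python
import numpy as np

def pca_reduce(data, ndims=3):
    """Project an S x F matrix onto its first `ndims` principal components."""
    # PCA works on mean-centred data
    centred = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:ndims].T

rng = np.random.default_rng(0)
high_dim = rng.normal(size=(100, 10))    # S = 100 samples, F = 10 features
low_dim = pca_reduce(high_dim, ndims=3)  # S x 3 matrix, ready for a 3D plot
```

Each row of `low_dim` still corresponds to the same sample as in `high_dim`, so the order of the observations (and therefore the shape of a trajectory) is preserved.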
To facilitate automatic data clustering, HyperTools includes the k-means clustering algorithm. Given a predetermined number of clusters, k, passed to the plot function through the n_clusters keyword argument, HyperTools uses k-means clustering to automatically assign each observation to a cluster and then colours each observation’s point based on its cluster membership.
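In this example, we use the digits dataset that ships with sklearn, in which every sample is described by 64 pixel values stored in a NumPy array. We plot the first five digits (0 to 4) and colour each point by its true label through the hue argument.
import hypertools as hyp
from sklearn import datasets

# Load the digits dataset, restricted to the classes 0 to 4
data = datasets.load_digits(n_class=5)
df = data.data                     # samples x 64 pixel features
hue = data.target.astype('str')    # true digit labels, used for colouring

# Reduce to 3 dimensions and draw each sample as a point
hyp.plot(df, '.', hue=hue, ndims=3, legend=[0, 1, 2, 3, 4], size=[7, 5])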
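To make the k-means step mentioned above more concrete, here is a rough sketch of how observations can be assigned to clusters in plain NumPy. It is a simplified illustration and not the clustering code used by HyperTools. The function name `kmeans_labels`, the farthest-first initialisation, and the two toy blobs are all assumptions made for this example.

```python
import numpy as np

def kmeans_labels(data, k, n_iter=50):
    """Assign each row of `data` (S x F) to one of k clusters."""
    # Farthest-first initialisation: deterministic, chosen for this sketch
    centres = [data[0]]
    for _ in range(1, k):
        nearest = np.min(
            [np.linalg.norm(data - c, axis=1) for c in centres], axis=0
        )
        centres.append(data[nearest.argmax()])
    centres = np.array(centres)

    for _ in range(n_iter):
        # Distance from every observation to every centre: shape (S, k)
        dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its members (keep it if empty)
        new_centres = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels

# Toy data: two well-separated groups of 4-dimensional observations
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 4))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 4))
points = np.vstack([blob_a, blob_b])

labels = kmeans_labels(points, k=2)  # one integer cluster id per observation
```

The resulting `labels` array plays the same role as a hue: every observation carries a group identifier that a plotting routine can map to a colour.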
To create a 3D animated plot, simply set the animate keyword of the plot function to True. This results in a 3D animated representation of the data, with the animation progressing across the rows of the data matrix.
To plot multiple datasets on a single plot, the user can pass a list of data matrices, as with static plots, and format strings and keyword arguments can be used to customize the plot appearance. Each frame of the animation depicts a subset of the total data trajectory contained within a cube.
The displayed portion of the data trajectory advances by a small amount in each frame, and the camera angle rotates around the cube, providing visual access to various aspects of the data as the animation progresses.
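The 3D animation can be obtained as follows:
import numpy as np
import hypertools as hyp

# Load the built-in 'weights' dataset, aligned into a common space
data = hyp.load('weights', align=True)

# Average the first 17 datasets and the datasets from index 18 onward
group1 = np.mean(data.data[:17], 0)
group2 = np.mean(data.data[18:], 0)

# Animate both averaged trajectories and save the result as a video
hyp.plot([group1, group2], animate=True, save_path='animation.mp4', zoom=2)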
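The frame-by-frame behaviour described earlier (a window over the trajectory that advances a little in each frame while the camera rotates) can be sketched in a few lines of plain Python. The names `trajectory_windows`, `window`, `step`, and `degrees_per_frame` are hypothetical and are not HyperTools parameters. The sketch only illustrates the idea.

```python
def trajectory_windows(n_samples, window, step):
    """Yield (start, stop) row ranges, one per animation frame."""
    start = 0
    while start + window <= n_samples:
        yield start, start + window
        start += step

# Each frame shows `window` consecutive rows and advances by `step` rows
frames = list(trajectory_windows(n_samples=10, window=4, step=2))

# Rotate the (hypothetical) camera by a fixed angle in every frame
degrees_per_frame = 90
azimuths = [(i * degrees_per_frame) % 360 for i in range(len(frames))]
```

With these toy numbers, consecutive frames overlap by two rows, which is what makes the displayed portion of the trajectory appear to slide smoothly.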
In this article, we discussed what data visualization is and, more importantly, saw how high-dimensional data can be visualized with HyperTools, a Python-based toolbox. By providing a simple yet powerful set of plotting functions and data manipulation tools, HyperTools aims to make high-dimensional data visualization easier, covering static plots, coloured groupings, and animations with very little code.