Last updated June 26, 2022

LUX: Python API for Automated Exploratory Data Analysis

Do you want to automate data analysis in your projects? LUX is an API that offers quick and efficient data analysis. Have a look at it.

Published on June 26, 2022

by Darshan M

Exploratory data analysis (EDA) is the process of studying a dataset to understand its key characteristics and the statistical importance of each feature. As the name suggests, it helps in exploring the data statistically and formulating hypotheses as required. EDA is generally considered a tedious task, and this is where LUX helps by automating the exploratory data analysis in a single step. In this article, let us see how to use the LUX Python API to perform exploratory data analysis.

The need for Automating EDA
Automating EDA using LUX
Implementation of LUX python API
Final words

Before starting with the LUX environment, let's first discuss the need for automated EDA.

The need for Automating EDA

Exploratory data analysis is the process of analyzing a dataset to summarize the key statistics of its features and to visualize the spread of each feature through appropriate plots. Visualizing every feature by hand is tedious when the dataset has many features, and checking the correlation between each pair of features is a lengthy process. This is where automating EDA plays a vital role: it reduces the overall time spent on data analysis, on optimal feature selection, and on outlier analysis.

Automated exploratory data analysis is supported by various Python libraries and APIs such as LUX, SweetViz, AutoViz, and many more. In this article, let us explore how to use the LUX Python API to automate the exploratory data analysis procedure.

Automating EDA using LUX

LUX is a simple Python API that enables quick and easy data exploration: simply displaying a data frame in a LUX-activated working environment produces easily interpretable plots. The visualizations appear in an interactive widget with several tabs to switch between in order to understand the characteristics of the data.

Some of the standard widgets supported under the LUX module are as follows.

Correlation
Distribution
Occurrence
Geographical

Correlation widget

The correlation widget helps in analyzing the correlation between two numerical features of the data in the form of a scatter plot. The numerical features are paired up, and each pair is plotted so that the pairs with a higher correlation can be identified.
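
To make the pairing idea concrete, here is a minimal sketch in plain Python that scores every pair of numerical columns with the Pearson correlation coefficient and lists the strongest pairs first. It only illustrates the idea and is not how LUX is implemented; the toy column values are hypothetical and not taken from the dataset used later in this article. With pandas, `df.corr()` gives the same coefficients.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: column name -> values.
columns = {
    "MonthlyIncome": [2000, 3000, 4000, 5000, 6000],
    "TotalWorkingYears": [1, 3, 4, 7, 9],
    "HourlyRate": [60, 45, 80, 50, 65],
}

# Score every pair of numerical columns, strongest relationship first.
ranked_pairs = sorted(
    ((a, b, pearson(columns[a], columns[b])) for a, b in combinations(columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)

for a, b, r in ranked_pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```

Sorting by the absolute value of the coefficient keeps strong negative relationships near the top as well, which is usually what we want when scanning for related features.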

Distribution widget

The distribution widget of the LUX Python API generates histograms for all the numerical features, showing the count of values that fall into each histogram bin. The distribution widget mainly helps in analyzing the frequency distribution of numerical features.
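
As a rough illustration of what a histogram bin count is, the sketch below splits a numerical column into equal-width bins and counts the values in each. This is plain Python written for this explanation, with a hypothetical helper name and made-up ages; LUX chooses its own binning.

```python
def histogram_counts(values, bins=5):
    """Count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(values), max(values)
    # Fall back to a width of 1 when all values are identical.
    width = (high - low) / bins or 1
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts


# Hypothetical ages, not taken from the article's dataset.
ages = [22, 25, 27, 31, 34, 35, 38, 41, 45, 52]
edges, counts = histogram_counts(ages, bins=3)

for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```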

Occurrence widget

The occurrence widget of the LUX Python API generates horizontal bar plots showing how often each value of a categorical feature occurs in the data. For every categorical feature, the frequency of each of its classes is shown under the occurrence widget.
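
The numbers behind such a bar plot are simple frequency counts. Here is a small plain-Python sketch, assuming a hypothetical list of rows and a hypothetical `occurrence` helper; with pandas, `df[column].value_counts()` produces the same counts.

```python
from collections import Counter

# Hypothetical rows; the column names only mimic an HR-style dataset.
rows = [
    {"Department": "Sales", "Attrition": "Yes"},
    {"Department": "Sales", "Attrition": "No"},
    {"Department": "Research", "Attrition": "No"},
    {"Department": "Research", "Attrition": "No"},
    {"Department": "Research", "Attrition": "Yes"},
    {"Department": "HR", "Attrition": "No"},
]


def occurrence(rows, column):
    """Return (category, count) pairs for one column, most frequent first."""
    return Counter(row[column] for row in rows).most_common()


for column in ("Department", "Attrition"):
    print(column, occurrence(rows, column))
```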

Geographical widget

The geographical widget of the LUX API shows choropleth maps for the geographical locations in the dataset. The mean of certain numerical features is computed for each region on the map, and hovering over a region displays its mean value.
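
The value shown for each region is a per-region mean. The sketch below computes such means in plain Python; the records, keys and helper name are hypothetical and only illustrate the aggregation, not the map rendering. With pandas, `df.groupby("state")["value"].mean()` does the same job.

```python
from collections import defaultdict


def mean_by_region(records, region_key, value_key):
    """Average `value_key` over all records that share the same region."""
    totals = defaultdict(lambda: [0.0, 0])  # region -> [running sum, count]
    for record in records:
        entry = totals[record[region_key]]
        entry[0] += record[value_key]
        entry[1] += 1
    return {region: total / count for region, (total, count) in totals.items()}


# Hypothetical measurements per state.
records = [
    {"state": "California", "value": 10},
    {"state": "California", "value": 20},
    {"state": "Texas", "value": 30},
    {"state": "New York", "value": 5},
    {"state": "New York", "value": 15},
    {"state": "New York", "value": 25},
]

region_means = mean_by_region(records, "state", "value")
print(region_means)
```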

Implementation of LUX python API

In this section, let us see how to use the LUX Python API to automate the exploratory data analysis process. To use the LUX Python API, we first have to install it in the working environment.

!pip install lux-api

After installing the LUX API, let us import it into the working environment along with the pandas module, which is used to read the dataset.

import lux
import pandas as pd

In certain working environments, the widgets used for visualization have to be enabled explicitly before they can be displayed. Here, let us see how to enable the LUX widget in Google Colab.

from google.colab import output
output.enable_custom_widget_manager()

Once the widgets have been enabled, the final step is to read the dataset using the pandas module.

df = pd.read_csv('/content/drive/MyDrive/Colab notebooks/EDA using LUX/WA_Fn-UseC_-HR-Employee-Attrition.csv')
df

This is how, simply by displaying the dataframe in a LUX-activated working environment, the exploratory data analysis process is automated and the various widgets are produced.

Correlation widget output interpretation

In the above image, let us consider the first plot, for Monthly Income and Total Working Years, where we can see how these two features are correlated with each other.

Distribution widget output interpretation

In the above image, each plot shows the frequency distribution of one numerical feature in the dataset. If we consider the first plot, we can easily read off how often the values of that feature fall into each bin.

Occurrence widget output interpretation

In the above image, if we consider the first plot, we can clearly see that the feature PerformanceRating has two categories, and we can also analyze how frequently each category occurs.

Custom feature visualization using LUX

Instead of visualizing the entire dataset, LUX also offers the flexibility to analyze only the features of interest, as shown below. For such custom features, three additional widgets are supported by the LUX API: Enhance, Filter and Generalize.

df.intent = ["YearsAtCompany","HourlyRate"]
df

Here, two numerical features are selected from the data to understand their characteristics using the LUX API.

When specific features are selected from the dataset, this is the visualization widget generated by the LUX API. Let us try to understand what each widget conveys.

Enhance widget output interpretation

The enhance widget shows how additional features of the dataset affect the relationship between the two selected features. In the above output, we can see how HourlyRate and YearsAtCompany are related to other features of the dataset such as StandardHours and Attrition.

Filter widget output interpretation

The filter widget takes the two custom features and produces correlation plots for various subsets of the data, showing how the relationship between the selected features looks within each subset.

Generalize widget output interpretation

The generalize widget considers only the custom features selected, removes any filter constraints on them, and shows a histogram of each selected feature.
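
One way to read the three widgets is as small edits to the current selection: Enhance adds a feature, Filter adds a constraint, and Generalize removes something. The plain-Python sketch below enumerates such candidate selections. It is only an interpretation for illustration, not LUX's implementation; the helper names, the `column=value` strings and the category values are assumptions made for this example.

```python
def enhance(intent, columns):
    """Candidates that add one column not yet part of the selection."""
    return [intent + [column] for column in columns if column not in intent]


def filter_candidates(intent, categories):
    """Candidates that keep the selection but restrict rows to one category value."""
    return [
        intent + [f"{column}={value}"]
        for column, values in categories.items()
        for value in values
    ]


def generalize(intent):
    """Candidates that drop one element of the selection at a time."""
    return [intent[:i] + intent[i + 1:] for i in range(len(intent))]


intent = ["YearsAtCompany", "HourlyRate"]
# Hypothetical subset of columns and category values.
columns = ["YearsAtCompany", "HourlyRate", "StandardHours", "MonthlyIncome"]
categories = {"Attrition": ["Yes", "No"]}

print("Enhance:", enhance(intent, columns))
print("Filter:", filter_candidates(intent, categories))
print("Generalize:", generalize(intent))
```

Dropping one of two numerical features leaves a single feature, which is why the generalize view ends up as a histogram.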

Analyzing geographical data using LUX API

To analyze geographical data using the LUX API, a population dataset covering various states was used.

df = pd.read_csv("https://github.com/covidvis/covid19-vis/blob/master/data/interventionFootprintByState.csv?raw=True",index_col=0)
df.head()

Later, to obtain the analysis through the LUX API, the dataframe was simply displayed in the working environment.

df

Interpreting the Geographical widget

In the above image, we can see that the mean values of various numerical features were computed for each state in the dataset. Just by hovering over the map, we can read the mean of the corresponding numerical feature for each state.

Final words

Automating exploratory data analysis helps in cutting down 60% of the work that goes into data cleaning and analysis. With automated exploratory data analysis, optimal feature selection and checking the correlation among features become easier, so more time can be spent on building more general and reliable models for the task at hand. Among the various automated exploratory data analysis APIs available in Python, LUX is one where the analysis is obtained simply by reading the data in a LUX-activated environment to generate suitable insights.

Darshan M

Darshan holds a Master's degree in Data Science and Machine Learning and is an everyday learner of the latest trends in these fields. He has a keen interest in learning new things, implementing them, and curating rich content on Data Science, Machine Learning, NLP and AI.