“Always plot your data.” - Martin Henze
Data collection is tricky, data curation is tricky, and everything about data is messy. Exploratory data analysis (EDA) via visual inspection makes data less terrifying: it exposes outliers and pulls back the curtain on human biases. But performing EDA is a skill in itself, and to demonstrate it, Martin Henze spoke at MLDS 2021, organised by Analytics India Magazine.
In his talk titled “Effective EDA and data visualization”, Martin Henze discussed the importance of plotting data when making data-driven decisions. Martin is an astrophysicist by training who ventured into machine learning out of a fascination with data. His contributions on Kaggle forums are second to none, and his notebooks have earned a reputation for being tremendously educational.
Why Bother About EDA And Data Viz
Martin began his talk by bringing up the popular Anscombe’s quartet. The quartet (shown above) was constructed in 1973 by the English statistician Francis Anscombe. He demonstrated that data sets with nearly identical simple descriptive statistics can have very different distributions and appear very different when plotted. Martin, too, emphasised the importance of graphing data before analysing it, and the effect that outliers and other influential observations have on statistical properties.
When building an end-to-end pipeline, Martin recommends getting three things in place as early as possible: basic preprocessing, a simple baseline model (or something slightly better), and outputs shaped for their intended downstream use. He would then iterate over the different parts of the pipeline, focussing on cycling quickly through the first iterations.
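The point of the quartet can be checked numerically. The sketch below uses Anscombe's published values and only the Python standard library; the helper name `pearson` is ours, not something from the talk. It prints the mean of x, the mean of y and the correlation for each of the four sets.

```python
from statistics import mean

# Anscombe's quartet: sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient (illustrative helper)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Summary statistics per set: (mean of x, mean of y, correlation).
summary = {
    name: (mean(xs), mean(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, r) in summary.items():
    print(f"Set {name}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

The four rows of output are nearly indistinguishable, yet a scatter plot of each set looks completely different, which is exactly why the summary numbers alone are not enough.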
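As a rough illustration of that "thin end-to-end pipeline first" idea, here is a minimal sketch. The function names, the record layout and the tiny input are all hypothetical and chosen only for illustration; the baseline simply predicts the historical mean and is not the model Martin used.

```python
from statistics import mean


def preprocess(rows):
    """Basic preprocessing: drop records whose target value is missing."""
    return [r for r in rows if r["sales"] is not None]


def fit_baseline(rows):
    """Baseline 'model': predict the historical mean for every future day."""
    avg = mean(r["sales"] for r in rows)
    return lambda horizon: [avg] * horizon


def format_output(preds):
    """Shape predictions for downstream use (here: one dict per day)."""
    return [{"day": i + 1, "forecast": round(p, 2)} for i, p in enumerate(preds)]


# Invented toy input, only to exercise the pipeline from start to finish.
raw = [{"sales": 3}, {"sales": None}, {"sales": 5}, {"sales": 4}]

model = fit_baseline(preprocess(raw))
submission = format_output(model(3))
print(submission)
```

With every stage present, however crude, each one can then be improved in turn without breaking the hand-off to the next.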
Here are Martin’s guidelines to getting good at EDA:
- Interpret your findings.
- Narration, narration, narration
- Craft engaging visuals
- Explain your approach on blogs, forums, etc.
Let’s take a look at one of Martin’s popular Kaggle notebooks where he masterfully performs EDA to gain more insights:
The plot above is based on the data provided for the M5 competition on Kaggle, in which participants have to forecast sales for the retail giant Walmart 28 days into the future. The data comprises 3,049 individual products from three categories and seven departments, sold in ten stores in three states.
Martin plotted the data for the ten stores and three categories in the same non-interactive layout, as shown above. This allowed him to gain insights such as these: “Foods” is the most common category, and the number of “Household” rows is closer to the number of “Foods” rows than the corresponding sales figures are, indicating that more “Foods” units are sold per row than “Household” ones.
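The comparison behind that insight, row counts versus units sold per category, can be sketched as follows. The records are invented toy values, not the real M5 data, and the variable names are ours.

```python
from collections import defaultdict

# Invented toy records of (category, units_sold); NOT the real M5 data.
records = [
    ("FOODS", 9), ("FOODS", 7), ("FOODS", 8),
    ("HOUSEHOLD", 2), ("HOUSEHOLD", 3),
    ("HOBBIES", 1),
]

rows = defaultdict(int)   # number of rows per category
units = defaultdict(int)  # total units sold per category
for category, sold in records:
    rows[category] += 1
    units[category] += sold

# If the gap in units is wider than the gap in rows,
# each "FOODS" row sells more units on average.
row_ratio = rows["FOODS"] / rows["HOUSEHOLD"]
unit_ratio = units["FOODS"] / units["HOUSEHOLD"]
print(f"rows ratio: {row_ratio:.2f}, units ratio: {unit_ratio:.2f}")
```

Plotting both counts side by side, as in the notebook, makes the same gap visible at a glance.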
Check the full EDA here.
In one of our previous interviews, when asked how he would approach a data science problem, Martin said that he always starts his projects with a comprehensive EDA. “I’m a visual learner,” said Martin. His EDA approach typically includes lots of plots that help him scrutinise the relationships and oddities within the data. “It is a mistake to jump too quickly into modelling.” For real-world data, EDA steps will usually include quite a bit of data cleaning and wrangling. According to Martin, data cleaning provides important information on the kind of challenges your model might face on unseen data. “Question your assumptions carefully and you will gain a better understanding of the data and the context in which it is extracted,” says Martin.
Throughout his highly informative talk, Martin emphasised how important it is to plot the data. The better the EDA, the higher the chances of mitigating biases in the dataset. Here are a few quick takes by Martin:
- Apply log transformations when you see a skewed distribution.
- Beware of bias, for example when people with darker skin are undersampled in the data. ML is very powerful.
- Communication is key, so that you can draw on the appropriate domain expertise.
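The first of these takes can be illustrated with a small sketch. The values are made up to mimic a long right tail, and `skewness` is our own helper computing the standard moment-based sample skewness; `log1p` is used instead of a plain log so that zeros would not cause an error.

```python
from math import log1p
from statistics import mean


def skewness(values):
    """Moment-based sample skewness (illustrative helper)."""
    m = mean(values)
    n = len(values)
    second = sum((v - m) ** 2 for v in values) / n
    third = sum((v - m) ** 3 for v in values) / n
    return third / second ** 1.5


# Made-up, heavily right-skewed values (e.g. counts with a few huge outliers).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100, 1000]
logged = [log1p(v) for v in raw]

skew_raw = skewness(raw)
skew_logged = skewness(logged)
print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_logged:.2f}")
```

The transformed values are still skewed, but much less so, which usually makes histograms and scatter plots far easier to read.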