“Like a plucked apple, research in theoretical statistics tends to dry up after it has been removed from its source of nourishment.” (Gelman et al.)
Data science is a product of statistics. Some even call it glorified statistics. While that debate isn’t going away anytime soon, here we take a look at a few of the statistical ideas that researchers at Columbia University have deemed the “most important” of the last 50 years. These ideas also have significant implications for data science.
Exploratory Data Analysis
EDA was popularised by John W. Tukey through his book “Exploratory Data Analysis”, published in 1977. Tukey argued that emphasis needed to be placed on using data to suggest hypotheses to test. “For nearly 60 years, statistics, science and the nation benefited enormously from the efforts of John W. Tukey,” says David C. Hoaglin, who was a student of Tukey. Such was the influence of EDA, and we can still see it flourishing in the realm of data science.
EDA offered graphical techniques that help analysts better understand and diagnose problems with the new, complex probability models being fit to data. EDA, write the researchers at Columbia University, deviated from the norm of hypothesis testing and emphasised the discovery aspect of the process. EDA has been influential both in the development of specific graphical methods and in moving the field of statistics away from theorem-proving and toward a more open, healthier perspective on the role of learning from data in science, the researchers stated.
Data scientist and Kaggle Grandmaster Martin Henze strongly recommends beginning any data science problem with a comprehensive exploratory data analysis. “I’m a visual learner,” said Martin. “It is a mistake to jump too quickly into modelling. Question your assumptions carefully, and you will gain a better understanding of the data and the context in which it is extracted,” he added.
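A first pass at EDA can be sketched in a few lines. The sketch below uses a synthetic dataset with hypothetical columns (`age`, `income`) standing in for whatever a real project would load; the point is the workflow of inspecting summaries and relationships before any modelling, not these particular variables.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a real dataset (columns are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=500),
    "income": rng.lognormal(mean=10, sigma=0.5, size=500),
})

# Step 1: summary statistics before any modelling.
print(df.describe())

# Step 2: look at relationships between variables, not just marginals.
print(df.corr())

# Step 3 would be plots -- df.hist(), df.plot.scatter(...) -- which is
# where Tukey-style graphical discovery happens.
```

In practice this loop (summarise, plot, question assumptions, repeat) comes before any model is fit, exactly as Henze advises.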
Counterfactual Causal Inference
Frontrunners like Judea Pearl have combined causality and computer science. His research has gained significance of late in AI, which gets a bad rap for dishing out uninterpretable results. Historically, there has been research on models for causal attribution across multiple dimensions by the likes of Pearl. Gelman and Vehtari write that a common thread has been modelling causal questions in terms of counterfactuals or potential outcomes. The counterfactual framework, state Gelman and his co-author, places causal inference within a statistical or predictive framework in which causal estimands are precisely defined and expressed in terms of unobserved data within a statistical model, connecting to ideas in survey sampling and missing-data imputation.
The study of causality offers a way to rule out plausible alternative explanations. Empowering a machine to reason in terms of causality leads to a form of intelligence closer to how humans think.
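The potential-outcomes view can be made concrete with a toy simulation (illustrative, not from the survey). Each unit has two potential outcomes, but only one is ever observed, which is exactly the "missing data" framing the survey connects to imputation; in simulation we can generate both and check that a randomized comparison recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Both potential outcomes exist for every unit -- only in simulation
# can we see both.
y0 = rng.normal(loc=0.0, scale=1.0, size=n)  # outcome without treatment
y1 = y0 + 2.0                                # outcome with treatment (true effect = 2)

t = rng.integers(0, 2, size=n)               # randomized treatment assignment
y_obs = np.where(t == 1, y1, y0)             # the counterfactual stays unobserved

true_ate = (y1 - y0).mean()                  # knowable only because we simulated
est_ate = y_obs[t == 1].mean() - y_obs[t == 0].mean()
print(true_ate, round(est_ate, 2))
```

With randomization, the difference in observed group means is an unbiased estimate of the average treatment effect, even though each unit's counterfactual is missing.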
Overparameterized Model Fitting
We are now living in a world where AI labs launch models with billions of parameters (think: GPT-3), and they are now closing in on the one-trillion mark. While much of this progress can be attributed to the latest hardware advances, the idea of fitting models with large numbers of parameters has been doing the rounds for a while. According to Gelman et al., since the 1970s statisticians have taken on the challenge of fitting overparameterized models, sometimes with more parameters than data points, with the help of regularization procedures that yield stable estimates and good predictions. Regularization can be implemented as a penalty function on the parameters or on the predicted curve.
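A minimal sketch of the overparameterized setting, with sizes chosen for illustration: 100 parameters but only 20 observations. Ordinary least squares is underdetermined here, but a ridge penalty on the parameters makes the normal equations solvable and the estimate stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                       # more parameters than data points
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0                  # only a few truly non-zero effects
y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 1.0                            # regularization strength (penalty on parameters)
# Ridge solution: (X'X + lam*I)^{-1} X'y. The penalty term lam*I makes the
# otherwise singular X'X invertible, giving a unique, stable estimate.
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_hat[:5])
```

Without the `lam * np.eye(p)` term, `X.T @ X` has rank at most 20 and the solve would fail; the penalty is what turns an ill-posed fit into a usable one.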
A recent work, titled “Meta-Learning without Memorization”, explores regularization objectives that let algorithms successfully use data from non-mutually-exclusive tasks to adapt to novel tasks efficiently.
In a work titled “Causal Regularizer”, the authors propose a causal regularizer that steers predictive models towards causally interpretable solutions. Their analysis of a large-scale electronic health records (EHR) dataset showed that the causally regularized model outperforms other methods in causal accuracy and is competitive in predictive performance. This holds great promise in healthcare, where many causal factors can coincide to affect the target variable.
Bootstrapping And Simulation-Based Inference
Improved data collection strategies (think sensors, the Internet) have resulted in enormous datasets. Yet data collection and curation still consume nearly 80% of a data engineer’s typical day. Data is still a problem, and it was even more so a couple of decades ago. The idea behind the bootstrap is to use the distribution of resampled statistics as an approximation to the data’s sampling distribution. According to the researchers, parametric bootstrapping, prior and posterior predictive checking, and simulation-based calibration allow datasets to be replicated from a model instead of resampling directly from the data. Calibrated simulation in the face of uncertain data volumes is a standard procedure rooted in statistics and helps in analysing complex models or algorithms.
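The core nonparametric bootstrap idea fits in a few lines: resample the observed data with replacement many times, recompute the statistic each time, and treat the spread of those recomputed values as an approximation to the sampling distribution. The sketch below (sample size and resample count are illustrative) does this for the mean and compares the bootstrap standard error to the textbook formula.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=200)   # one observed sample (illustrative)

# Resample with replacement and recompute the statistic each time.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])

# Spread of the bootstrap replicates approximates the sampling distribution.
se_boot = boot_means.std(ddof=1)
se_theory = data.std(ddof=1) / np.sqrt(data.size)
print(round(se_boot, 3), round(se_theory, 3))
```

For the mean the two standard errors agree closely; the value of the bootstrap is that the same resampling recipe works for statistics with no closed-form standard error.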
Gelman and Vehtari believe future research will lean more towards inferential methods, taking ideas such as unit testing from software engineering and applying them to problems of learning from noisy data. “As our statistical methods become more advanced, there will be a continuing need to understand the links between data, models, and substantive theory,” concluded the authors.
The ideas mentioned above have laid the foundation for modern-day deep learning and other such tools. Even something as elementary as decision making is considered a product of statistics. Bayesian optimization, reinforcement learning, and A/B testing are a few other examples.
Game Changers From Statistics: 100 Years In The Making
- Sampling theory
- Bayesian inference
- Confidence intervals
- Hypothesis testing
- Maximum likelihood
- Exploratory data analysis
- Adaptive decision analysis
- Counterfactual causal inference
Check the full survey here.