The Role Of Statistics In The Era Of Big Data?

Techniques in AI/ML have made great advances and it is a great exploratory step. But for now, it has not matured enough to reach an inferential step.

The concepts in statistics and mathematics are the building blocks of the techniques and tools we use to gain deeper insights into structured and unstructured data. Statistical concepts lie at the heart of data science.

In a session at SkillUp 2021, a two-day event organised by Analytics India Magazine, Rajeeva Karandikar of Chennai Mathematical Institute presented a few examples from history to explain how to make the most of the available data and enormous computing power by combining statistical ideas with modern AI/ML tools.

Rajeeva Karandikar is the Director at Chennai Mathematical Institute. He is a Fellow of the Indian Academy of Sciences and Indian National Science Academy. His research interests include probability theory and stochastic processes, applications of statistics and cryptography. 

Statistics quo

“Perhaps in 90% of the problems that need some decision based on available data, the standard tools in artificial intelligence or machine learning and statistics will yield the best or nearly the best answer. But the remaining 10% will need something more than just the tools,” said Karandikar.

He said that not all data problems have the benefit of big data; examples include opinion polls, quality control, vaccine identification and approval, and drug discovery and approval. Statistical ideas and techniques remain essential in such cases.

Karandikar drew on instances from history to show the significant role of statistics and data. His first example was Sir Francis Galton, a cousin of Charles Darwin, who studied the inheritance of traits.

“It appeared from these experiments that the offspring did not tend to resemble their parent seeds in size, but always to be more mediocre than they - to be smaller than the parents, if the parents were large; to be larger than the parents if the parents were very small” - Galton

Karandikar explained that Galton then obtained data on the heights of parents and their grown-up sons, choosing height because such data was easy to collect. His analysis of the data confirmed his hypothesis.

“Today we can obtain data on the heights of a large number of individuals and their fathers’ heights (say from the Indian passport database). It can be seen that any data-driven tool will confirm the conclusion reached by Galton. However, interchanging the roles of the heights of sons and fathers leads to an exactly opposite conclusion. This behaviour can also be seen in simulated data,” Karandikar said.
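
The reversal Karandikar describes can be reproduced with simulated data. The sketch below is only an illustration, not the analysis shown in the session: the mean height (170 cm), standard deviation (7 cm) and father-son correlation (0.5) are assumed values, and `regression_slopes` is a hypothetical helper name. It draws paired heights from a bivariate normal distribution and computes the least-squares slope in both directions.

```python
import numpy as np

def regression_slopes(n=100_000, mean=170.0, sd=7.0, rho=0.5, seed=0):
    """Simulate father/son heights and return both least-squares slopes.

    All parameter values are illustrative assumptions, not real survey data.
    """
    rng = np.random.default_rng(seed)
    # Bivariate normal: equal means and variances, correlation rho.
    cov = [[sd**2, rho * sd**2], [rho * sd**2, sd**2]]
    fathers, sons = rng.multivariate_normal([mean, mean], cov, size=n).T
    # The least-squares slope for predicting y from x is cov(x, y) / var(x).
    c = np.cov(fathers, sons)
    son_on_father = c[0, 1] / c[0, 0]
    father_on_son = c[0, 1] / c[1, 1]
    return son_on_father, father_on_son

son_on_father, father_on_son = regression_slopes()
print(f"slope, sons predicted from fathers: {son_on_father:.2f}")
print(f"slope, fathers predicted from sons: {father_on_son:.2f}")
```

Under these assumptions both slopes come out close to the assumed correlation, that is, well below 1. Read one way, tall fathers have sons who are on average closer to the mean; read the other way, tall sons have fathers who are on average closer to the mean. Neither generation is actually shrinking towards the average, which is why the direction of the regression has to be interpreted with care.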

Correlation and regression

Next, Karandikar discussed two central topics in statistics: correlation and regression. Most data-driven analysis tries to discover relationships among different variables, and this is what correlation and regression are all about. It is also important to understand that correlation does not imply causation.

Correlation and regression are techniques for discovering linear relationships; to capture more complex relationships, one needs to transform the variables first. Artificial intelligence, machine learning and other data-driven techniques likewise try to find relations, linear or otherwise, among the variables.
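
As a rough illustration of how a transformation exposes a non-linear relationship, the sketch below generates data from an assumed exponential law (`y = 2 * exp(0.5 * x)` with multiplicative noise; the constants are made up for the example) and compares the Pearson correlation before and after taking logarithms.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 200)
# Hypothetical exponential relationship with multiplicative noise.
y = 2.0 * np.exp(0.5 * x) * np.exp(rng.normal(0.0, 0.1, size=x.size))

# Pearson correlation measures only the linear part of the relationship.
r_raw = np.corrcoef(x, y)[0, 1]
# After a log transform the relationship is linear: log(y) = log(2) + 0.5 * x.
r_log = np.corrcoef(x, np.log(y))[0, 1]

# A straight-line fit on the log scale recovers the growth rate.
slope, intercept = np.polyfit(x, np.log(y), 1)

print(f"correlation of x with y:      {r_raw:.3f}")
print(f"correlation of x with log(y): {r_log:.3f}")
print(f"fitted growth rate (slope):   {slope:.3f}")
```

In this setup the correlation on the log scale is higher than on the raw scale, and the fitted slope lands near the assumed rate of 0.5. Choosing the right transformation is exactly where knowledge of the underlying process helps.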

Karandikar said one must not use any such tools without understanding the domain of the task. He illustrated this with several examples involving correlation and standard error, such as predictions for the 2007 Cricket ODI World Cup and the relationship between IIT-JEE/CAT scores and performance at IIT/IIM.

Karandikar also covered several important statistical concepts, including spurious (or nonsense) correlations, standard error, Simpson’s paradox (also called the amalgamation paradox), omitted variable bias, GIGO and GMGO.
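
Simpson’s paradox is easiest to see with numbers. The counts below are entirely made up for illustration (they are not from the session): treatment A has the higher success rate within each subgroup, yet treatment B looks better once the subgroups are pooled, because the two treatments were applied to very different mixes of cases.

```python
# Hypothetical (successes, trials) counts for two treatments in two subgroups.
counts = {
    "mild":   {"A": (90, 100),   "B": (800, 1000)},
    "severe": {"A": (300, 1000), "B": (20, 100)},
}

def rate(successes, trials):
    """Success rate as a fraction between 0 and 1."""
    return successes / trials

# Success rates within each subgroup.
within = {
    group: {arm: rate(*result) for arm, result in arms.items()}
    for group, arms in counts.items()
}

# Success rates after pooling the subgroups together.
overall = {}
for arm in ("A", "B"):
    successes = sum(counts[group][arm][0] for group in counts)
    trials = sum(counts[group][arm][1] for group in counts)
    overall[arm] = rate(successes, trials)

for group, rates in within.items():
    print(f"{group:>7}: A = {rates['A']:.2f}, B = {rates['B']:.2f}")
print(f"overall: A = {overall['A']:.2f}, B = {overall['B']:.2f}")
```

Here case severity plays the role of the omitted variable: leaving it out of the comparison flips the apparent ranking of the two treatments.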

“Techniques in AI/ML have made great advances and it is a great exploratory step. But for now, it has not matured enough to reach an inferential step. When used in conjunction with domain knowledge, we can do wonders,” said Karandikar while concluding the session. 

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
