Last updated January 14, 2022
Council Post: The big data myth in deep learning

Published on January 14, 2022

by Padmakumar Nambiar

Deep learning, a subset of machine learning, is arguably one of the hottest topics in tech circles. A deep learning model is a neural network with multiple hidden layers, and its guiding goal is to mimic the “learning capacity” of the human brain, allowing machines to learn from data. What most distinguishes deep learning from traditional machine learning is the quality and volume of data it requires.

A supervised deep learning model is fed a large volume of labelled data covering diverse examples, from which it learns which features to look for. Using these extracted features, the model generates its output. Deep learning has benefited immensely from the arrival of GPUs, which use parallelism to outperform CPUs in matrix computations. Because a GPU can carry out many computational tasks simultaneously, training work can be distributed across its cores and operations sped up. GPUs also pack many cores into a single device, each using few resources, without compromising efficiency or power.
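To see why matrix computations suit parallel hardware, note that each row of a matrix product depends only on one row of the first matrix and on the whole second matrix, so the rows can be computed independently. The sketch below is only an illustration of that independence: it runs on the CPU with Python's standard thread pool, not on a GPU, and the function names (`matmul_row`, `matmul_sequential`, `matmul_parallel`) are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_row(row, b):
    """Compute one output row; it needs only `row` and the matrix `b`."""
    columns = list(zip(*b))  # transpose b so each column is a tuple
    return [sum(x * y for x, y in zip(row, col)) for col in columns]


def matmul_sequential(a, b):
    """Compute the product one row after another."""
    return [matmul_row(row, b) for row in a]


def matmul_parallel(a, b, workers=4):
    """Hand each row to a worker; no row waits on any other row's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: matmul_row(row, b), a))


a = [[1, 2], [3, 4], [5, 6]]        # 3 x 2
b = [[7, 8, 9], [10, 11, 12]]       # 2 x 3

result_seq = matmul_sequential(a, b)
result_par = matmul_parallel(a, b)

print(result_seq)                   # [[27, 30, 33], [61, 68, 75], [95, 106, 117]]
print(result_seq == result_par)     # True
```

Both versions give the same answer because no row depends on another. A GPU exploits the same property at a much finer grain, across thousands of cores, which is where the speed-up in training comes from.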

Ideal deep learning framework

Deep learning models consume enormous amounts of data. Their accuracy and efficiency depend heavily on the volume and variety of the data they are fed. Advances in big data and powerful GPUs have boosted the potential of deep learning at scale. One example of a large deep learning model is DeepFace from Facebook (now Meta), which used a training set of 4 million facial images to reach an accuracy of 97.35 percent.

That said, a lot of the data fed into these models is irrelevant or of poor quality. This makes one wonder: do our models really need large volumes of data, especially when a sizable part of it is sub-par?

To answer this question, I would like to draw a parallel with human intelligence. We all have people in our lives whom we admire for their intelligence and their knowledge of a wide range of subjects. Their ability to think quickly and connect ideas often leaves us pleasantly surprised. But was all this knowledge and intelligence developed at once? The answer is an emphatic no. Acquiring knowledge takes time. We all come across both useful and irrelevant information; the trick lies in picking the most relevant pieces to build our knowledge base.

Similarly, machines need to be fed relevant and useful data. Subsequently, they must be taught to decide for themselves which information best suits the task at hand.

Debunking the data myth

The idea that deep learning requires a large volume of data and computational resources is a myth. The community is slowly realising this, and a shift from model building focussed on “big data” to model building focussed on “good data” is underway. Data-centric AI is a good case in point.
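As a rough illustration of what a “good-data” focus can look like in practice, the sketch below curates a small labelled data set before any training happens: it merges duplicate examples and drops items whose labels disagree too often. The `curate` function, the `min_agreement` threshold of 0.75 and the sample reviews are all hypothetical, chosen only for this example, and real data-centric pipelines are considerably more involved.

```python
from collections import Counter, defaultdict


def curate(examples, min_agreement=0.75):
    """Keep one (text, label) pair per distinct text, and only when the
    majority label reaches `min_agreement` of the votes for that text."""
    votes = defaultdict(Counter)
    for text, label in examples:
        # Normalise case and whitespace so near-identical copies are merged.
        votes[text.strip().lower()][label] += 1

    curated = []
    for text, counts in votes.items():
        label, top = counts.most_common(1)[0]
        if top / sum(counts.values()) >= min_agreement:
            curated.append((text, label))
    return curated


# Hypothetical raw data: duplicates plus conflicting labels.
raw = [
    ("Great battery life", "positive"),
    ("great battery life ", "positive"),
    ("Great battery life", "positive"),
    ("Screen cracked in a week", "negative"),
    ("Screen cracked in a week", "positive"),   # annotators disagree 1 to 1
    ("Arrived on time", "positive"),
    ("Arrived on time", "negative"),
    ("Arrived on time", "positive"),
    ("Arrived on time", "positive"),
]

curated = curate(raw)
print(len(raw), "->", len(curated))   # 9 -> 2
print(curated)
# [('great battery life', 'positive'), ('arrived on time', 'positive')]
```

Nine raw rows shrink to two trustworthy ones. The review with an even split of labels is dropped rather than guessed at, which is the trade the “good-data” approach makes: fewer examples, but ones the model can rely on.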

Many organisations are building models capable of navigating the world using common sense. These deep learning models are meant to understand everyday objects and actions, handle unforeseen situations and learn from experience. Many institutions are pumping money into these endeavours. AI2 (the Allen Institute for AI) is developing a portfolio of tasks against which a machine's progress can be measured. Big tech companies like Microsoft are committing resources to resolving ambiguities and helping models learn from inferences.

Leading names from the world of AI and machine learning have also been vocal critics of the mad race to collect as much data as possible in order to improve models. Andrew Ng, a co-founder of Google Brain, has said one must not buy into the hype around large volumes of data. There is a growing belief in the AI and deep learning community that the next frontier of innovation in these fields will come not from large volumes of data but from efficient algorithms built using smaller, but relevant, data sets.

This article is written by a member of the AIM Leaders Council. AIM Leaders Council is an invitation-only forum of senior executives in the Data Science and Analytics industry.

Padmakumar Nambiar

Padmakumar is a hands-on senior management executive in data science and machine learning, with over 27 years of experience developing and leading enterprise middleware and big data software for consumer experience domains such as healthcare, retail and construction engineering. He is highly experienced in all aspects of software engineering, project management, building teams from the ground up, and customer support.