Stop Making the ‘Data Scarcity’ Excuse For Your Problems

Yann LeCun told AIM that the problem is not so much a scarcity of data as a scarcity of ways to take advantage of it.

Published on January 6, 2023

by Mohit Pandey

Data is one of the most important ingredients of generative AI, including image generators like DALL-E, Midjourney, and Stable Diffusion, and large language models like GPT and PaLM, which have tens or hundreds of billions of parameters. Most people in AI believe that increasing the size and quality of datasets is the only way forward, and quibble about how to speed up the data flywheel for training their models. But it is time they started looking beyond data scarcity and LLMs to create intelligent AI systems.

Speaking to Analytics India Magazine, Yoshua Bengio agreed that “the bigger, the better” logic works for AI, but said it is not feasible in the long run. Taking the latest architectures and simply scaling compute power, in the hope of also increasing the data, is a brute-force technique, he said, and not one that tackles all the other problems.

Quality versus Quantity

For quite some time now, there has been a debate about the quality versus the quantity of the data used to build and deploy AI models. Google’s Francois Chollet said that increasing the size of a dataset without improving its quality can ruin models rather than tune them better.

It is also important to note the trade-off. Larger datasets can mean poorer data quality, more fine-tuning of parameters, and heavier workloads, while smaller datasets need less fine-tuning and minimal computational resources but can introduce more bias. The bias problem remains unaddressed, and can probably be solved using a multi-modal approach. As the term is used here, a multi-modal approach means combining multiple models or methods to achieve a goal or solve a problem.

For instance, Yann LeCun suggests an approach in which several modules work together on a single problem, mimicking an animal’s brain. He proposes a configurator module, a perception module, a world model module, a cost module, a short-term memory module, and an actor module.
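To make the idea of cooperating modules concrete, here is a toy Python sketch. It is only an illustration, not LeCun's actual design: the `Agent` class, the `configure` function, the number-line "world", and every interface below are assumptions made up for this example. It shows one simple way a perception step, a world model, a cost function, a short-term memory, and an actor could be wired together by a configurator.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Toy assumptions: the world state is a single number, and an action is a
# step added to that number. None of this comes from LeCun's proposal.
State = float
Action = float


@dataclass
class Agent:
    perceive: Callable[[float], State]          # perception module
    predict: Callable[[State, Action], State]   # world model module
    cost: Callable[[State], float]              # cost module
    candidate_actions: List[Action]             # options the actor considers
    memory: List[State] = field(default_factory=list)  # short-term memory

    def act(self, observation: float) -> Action:
        """Actor: choose the action whose predicted outcome costs least."""
        state = self.perceive(observation)
        self.memory.append(state)
        return min(
            self.candidate_actions,
            key=lambda a: self.cost(self.predict(state, a)),
        )


def configure(target: float) -> Agent:
    """Configurator: set up the other modules for one task (reach `target`)."""
    return Agent(
        perceive=lambda obs: float(obs),
        predict=lambda s, a: s + a,
        cost=lambda s: abs(s - target),
        candidate_actions=[-1.0, 0.0, 1.0],
    )


agent = configure(target=3.0)
state = 0.0
for _ in range(5):
    state += agent.act(state)  # apply the chosen step to the toy world

print(state, agent.memory)
```

The point of the sketch is the division of labour: the actor never sees the target directly, it only asks the world model what each action would lead to and lets the cost module score the result.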

Who is Enabling Data Scarcity?

A Reddit user who goes by the name ‘Top-Avocado-2564’ recently said that current AI/ML systems are built in a way that requires large amounts of data. This happens because of a lack of diversity in deep learning research. The research is led by big tech companies that have the infrastructure and computing capabilities to process large volumes of data. As a result, there is a one-directional belief that increasing the amount of data will probably solve the current problems or limitations.

Further in the thread, another Reddit user by the name ‘piyabati’ said that a decade ago, hardware limitations led researchers to work with comparatively smaller datasets, which let models operate and infer with a degree of freedom, even though they sometimes predicted incorrect results. When the hardware improved, researchers had a lot of labelled data to work with, which allowed improvements in the models.

This has led companies to believe that making datasets larger, without actually changing or improving the scientific understanding, is enough to make progress in AI. It tells us that the ‘lack of diversity’ in how the available data is used is as big a problem as ‘data scarcity’.

Data Scarcity, Really?

Yann LeCun told AIM that the problem is not so much a scarcity of data as a scarcity of ways to take advantage of it. When we compare the workings of a machine to those of a human being, which is essentially the goal, there is an obvious difference in training. Unlike machines, humans do not need to know a billion words to form a sentence.

Been Kim, research scientist at Google Brain, said that science and engineering should go hand in hand. There is no grand unified theory that can assess when, or even if, a machine has become conscious. Until then, we have to rely on mathematical optimisation and build machines that are a fraction of a percent better than the last one.

The Need for a Narrow AI Approach

If you look at present-day AI systems like ChatGPT or DALL-E 2, they were not built with the intention of solving a specific problem. Their goal was to take steps towards machines that can be trained on large amounts of data to produce ‘statistically’ and ‘mathematically’ better output, which is nothing close to human intelligence. In other words, they are ‘human-task-imitating machines’ and not ‘human-like machines’.

However, in the last few years, we have seen GPT-like models being used in healthcare and for protein-folding prediction. These are examples of Narrow AI, where AI is built to focus on a specific problem or use case. A narrow approach of this kind looks more plausible than a broad one.

Arguably, generative models like ChatGPT and DALL-E might be good for fun and entertainment, but they fall short in the greater scheme of things. Examples of AI solving a specific issue, such as climate change, healthcare, or industrial applications, are somewhat missing. And in none of these cases can the scarcity of data be the limiting factor.

PS: The story was written using a keyboard.

Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.
