Stop Making the ‘Data Scarcity’ Excuse For Your Problems

Yann LeCun told AIM that there is not so much a scarcity of data as a scarcity of ways to take advantage of it.

Data is one of the most important ingredients in generative AI, from image generators like DALL-E, Midjourney, and Stable Diffusion to large language models like GPT and PaLM, which are trained with tens or hundreds of billions of parameters. Most people in AI believe that increasing the size and quality of datasets is the only way forward, and obsess over building a bigger data flywheel for training their models. But it is time they started looking beyond data scarcity and LLMs to create intelligent AI systems.

Speaking to Analytics India Magazine, Yoshua Bengio also agreed that the ‘bigger is better’ logic for AI works, but is not feasible in the long run. He said that taking the latest architectures and simply scaling compute, in the hope of also scaling data, is a brute-force technique, not one that tackles all the other problems.

Quality versus Quantity

For quite some time now, there has been a debate about the quality versus the quantity of the data used to build and deploy AI models. Google’s Francois Chollet has said that increasing the size of a dataset while neglecting its quality can degrade a model rather than tune it better.
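
To make this concrete, here is a minimal, hypothetical sketch of trading raw quantity for quality: deduplicating a text corpus and dropping low-signal documents before training. The heuristics and thresholds are illustrative assumptions, not techniques endorsed by any of the researchers quoted here.

```python
import hashlib

def filter_corpus(documents, min_words=20, max_repeat_ratio=0.3):
    """Drop exact duplicates and documents that fail simple quality
    heuristics. Thresholds are illustrative, not tuned values."""
    seen = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        words = doc.split()
        if len(words) < min_words:
            continue  # too short to carry much signal
        # Crude repetition check: frequency of the most common word.
        top = max(words.count(w) for w in set(words))
        if top / len(words) > max_repeat_ratio:
            continue  # likely boilerplate or spam
        seen.add(digest)
        kept.append(doc)
    return kept

corpus = [
    "the cat sat on the mat " * 10,  # repetitive, gets dropped
    "a reasonably varied document with enough distinct words to pass "
    "the simple length and repetition checks that this sketch applies "
    "to every document in the corpus",
]
print(len(filter_corpus(corpus)))  # -> 1
```

A real pipeline would add near-duplicate detection, language identification, and toxicity filters, but the point stands: smaller, cleaner data can beat indiscriminate scale.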

It is also important to note the trade-off: larger language models tend to pull in lower-quality data and demand heavy workloads and extensive fine-tuning of parameters, while smaller datasets need minimal computational resources and less fine-tuning but can introduce more bias. Either way, the bias problem remains unaddressed, and one candidate for solving it is a modular, multi-model approach: the use of multiple specialised models or modules working together to achieve a goal or solve a problem.

For instance, Yann LeCun suggests one such modular architecture for solving problems the way an animal’s brain does. He proposes a system composed of a configurator module, a perception module, a world model, a cost module, a short-term memory module, and an actor module.
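
LeCun lays out these modules conceptually in his position paper ‘A Path Towards Autonomous Machine Intelligence’; no reference implementation exists. The sketch below is a hypothetical illustration of how such modules might be wired into a perception-action loop, with every class and method name invented for this example.

```python
# Hypothetical wiring of the six modules LeCun describes; names are
# invented for illustration and do not come from any implementation.

class Perception:
    def encode(self, observation):
        # Turn raw sensory input into an abstract state representation.
        return {"state": observation}

class WorldModel:
    def predict(self, state, action):
        # Predict the next state if `action` were taken (stubbed).
        return {**state, "last_action": action}

class Cost:
    def evaluate(self, state):
        # Score how undesirable a predicted state is; lower is better.
        return 0.0 if state.get("last_action") == "noop" else 1.0

class ShortTermMemory:
    def __init__(self):
        self.buffer = []
    def store(self, state):
        # Keep recent states for the world model to draw on.
        self.buffer.append(state)

class Actor:
    def propose(self, state):
        # Offer candidate actions for the world model to simulate.
        return ["noop", "move"]

class Configurator:
    def configure(self, task):
        # In LeCun's proposal this tunes the other modules per task;
        # stubbed out here.
        pass

def step(obs, perception, world_model, cost, memory, actor):
    state = perception.encode(obs)
    memory.store(state)
    # Choose the action whose predicted outcome has the lowest cost.
    return min(actor.propose(state),
               key=lambda a: cost.evaluate(world_model.predict(state, a)))

print(step("frame_0", Perception(), WorldModel(), Cost(),
           ShortTermMemory(), Actor()))  # -> "noop"
```

The key design idea, as LeCun frames it, is that the system acts by imagining outcomes with a world model and minimising a cost, rather than by pattern-matching over ever-larger datasets.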

Who is Enabling Data Scarcity? 

A Redditor who goes by the username ‘Top-Avocado-2564’ recently argued that current AI/ML systems are built in a way that requires large amounts of data, and that this stems from a lack of diversity in deep learning research. The research agenda is led by big tech companies that have the infrastructure and computing capability to handle large volumes of data. Hence the unidirectional belief that if we can increase the amount of data, we can probably solve the current problems and limitations.

Further in the thread, another Reddit user, ‘piyabati’, noted that a decade ago hardware limitations forced researchers to work with comparatively small datasets, which let models operate and infer with a degree of freedom, even if they sometimes predicted incorrect results. When the hardware improved, researchers had a lot of labelled data to work with, which allowed improvements in the models.

Now, this has led companies to believe that simply making datasets larger, without actually changing or improving the scientific understanding, is enough to make progress in AI. This tells us that ‘data scarcity’ is as big a problem as the ‘lack of diversity’ in how the available data is used.

Data Scarcity, Really?

Yann LeCun told AIM that there is not so much a scarcity of data as a scarcity of ways to take advantage of it. When we compare the workings of a machine to those of a human being, which is essentially the goal, there is an obvious difference in training: humans do not need the knowledge of a billion words to form a sentence the way machines do.

Been Kim, research scientist at Google Brain, said that science and engineering should go hand in hand. There is no grand unified theory that can assess when, or even if, a machine has become conscious. Until then, we have to rely on mathematical optimisation and build machines that are a fraction of a percent better than the last one.

The Need for a Narrow AI Approach

Present-day AI systems like ChatGPT or DALL-E 2 were not built with the intention of solving a specific problem. Their goal was to take steps towards machines that can be trained on large amounts of data to produce ‘statistically’ and ‘mathematically’ better output, which is nothing close to human intelligence. In other words, ‘human-task-imitating machines’, not ‘human-like machines’.

However, in the last few years, we have seen GPT-like models being applied to healthcare or to protein structure prediction. These are examples of narrow AI, where a system is built to focus on a specific problem or use case. This focused approach is likely a more plausible path forward than a broader one.
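
As a sketch of what this narrow approach can look like in practice, the following assumes the Hugging Face transformers and datasets libraries and fine-tunes a general-purpose pretrained model on a single, domain-specific classification task. The dataset path "hospital/triage_notes" and its "text" and "label" columns are placeholders invented for this example, not a real dataset.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"  # small general-purpose model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)  # e.g. urgent vs. non-urgent notes

# Placeholder dataset: assumed to have "text" and "label" columns.
dataset = load_dataset("hospital/triage_notes")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="triage-model", num_train_epochs=3),
    train_dataset=tokenized["train"],
)
trainer.train()
```

Much of the domain-specific work mentioned above follows this shape: a general pretrained model, a narrow dataset, and a task-specific head, rather than ever more data thrown at a general-purpose system.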

Arguably, generative models like ChatGPT and DALL-E are good for fun and entertainment, but they fall short in the greater scheme of things. Examples of AI tackling a specific issue, such as climate change, healthcare, or industrial applications, are still largely missing. And for any of these examples, the scarcity of data cannot be the limiting factor.

PS: The story was written using a keyboard.
Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.