Stop Making the ‘Data Scarcity’ Excuse For Your Problems

Yann LeCun told AIM that the problem is not so much a scarcity of data as a scarcity of ways to take advantage of the data.

Data is one of the most important ingredients of generative AI, from image generators such as DALL-E, Midjourney, and Stable Diffusion to large language models such as GPT and PaLM, which have tens or hundreds of billions of parameters. Most people in AI believe that increasing the size and quality of datasets is the only way forward, and they fixate on spinning the data flywheel faster to train their models. But it is time they started looking beyond data scarcity and LLMs to create intelligent AI systems.

Speaking to Analytics India Magazine, Yoshua Bengio agreed that “the bigger, the better” logic for AI is good, but not feasible in the long run. He said that taking the latest architectures and simply scaling compute power, in the hope of also increasing data, is a brute-force technique that does not tackle all the other problems.

Quality versus Quantity

There has long been a debate about the quality versus the quantity of the data used to create and deploy AI models. Google’s Francois Chollet said that increasing the size of the data rather than its quality can ruin models instead of tuning them better.

It is also important to note the trade-off: larger datasets can bring poorer data quality, more fine-tuning of parameters, and heavier workloads, while smaller datasets need less fine-tuning and minimal computational resources but can introduce more bias. The bias problem remains unaddressed, though it can probably be solved using a multi-modal approach. Here, a multi-modal approach refers to the use of multiple models or methods to achieve a goal or solve a problem.

For instance, Yann LeCun suggests a different, multi-modal approach in which several modules work together on a single problem, mimicking an animal’s brain. He proposes an architecture made up of a configurator module, a perception module, a world model module, a cost module, a short-term memory module, and an actor module.

Who is Enabling Data Scarcity? 

A Reddit user who goes by the name ‘Top-Avocado-2564’ recently said that current AI/ML systems are built in a way that requires large amounts of data. This happens because of a lack of diversity in deep learning research, which is led by big tech companies that have the infrastructure and computing capabilities to process large volumes of data. The result is a one-directional belief that if we can increase the amount of data, we can probably solve the current problems or limitations.

Further down the thread, another Reddit user, ‘piyabati’, said that a decade ago, hardware limitations led researchers to work with comparatively smaller datasets, which let models operate and infer with a degree of freedom, even though they sometimes predicted incorrect results. When the hardware improved, researchers had a lot of labelled data to work with, which allowed improvements in the models.

This has led companies to believe that making datasets larger, without actually changing or improving the scientific understanding, is enough to make progress in AI. It tells us that ‘data scarcity’ is as big a problem as the ‘lack of diversity’ in how the available data is used.

Data Scarcity, Really?

Yann LeCun told AIM that the problem is not so much a scarcity of data as a scarcity of ways to take advantage of the data. When we compare the workings of a machine to those of a human being, which is essentially the goal, there is an obvious difference in training. Unlike machines, humans do not need to know a billion words to form a sentence.

Been Kim, research scientist at Google Brain, said that science and engineering should go hand in hand. There is no grand unified theory that can assess when, or even if, a machine has become conscious. Until then, we have to rely on mathematical optimisation and build machines that are a fraction of a percent better than the last one.

The Need for a Narrow AI Approach

Present-day AI systems like ChatGPT or DALL-E 2 were not built with the intention of solving a specific problem. Their goal was to take steps towards building machines that can be trained on large amounts of data to produce ‘statistically’ and ‘mathematically’ better output, which is nothing close to human intelligence. In other words, they are ‘human-task-imitating machines’ and not ‘human-like machines’.

However, in the last few years, we have seen GPT-like models being used in healthcare and for protein-folding prediction. These are examples of Narrow AI, where AI is built to focus on specific problems or use cases. This is likely to be a more plausible way forward than a broader approach.

Arguably, generative models like ChatGPT and DALL-E might be good for fun and entertainment purposes, but they fall short in the greater scheme of things. Examples of AI solving a specific issue, such as climate change, healthcare, or industrial applications, are somewhat missing. And for any of these examples, the scarcity of data cannot be the limiting factor.
