
What is Stopping Generative AI from Achieving Growth?

Scaling language models is becoming increasingly difficult, with the lack of high-quality data emerging as the biggest challenge.


Generative AI has been on the rise, and researchers are pushing to scale these models to ever more parameters in pursuit of better performance. For example, GPT-3, used in many enterprise-level applications, has 175 billion parameters. Similarly, DeepMind’s Gopher with 280 billion parameters, NVIDIA and Microsoft’s Megatron-Turing NLG with 530 billion, and Google’s PaLM with 540 billion are other notable examples. 

But it begs the question: how do we get machine learning architectures to scale performance while ensuring a fair trade-off between compute, dataset size, and model parameters? 

OpenAI’s paper on scaling laws, published in 2020, demonstrated that increasing model size even with relatively little data resulted in better performance, hence the trend observed with the influx of large language models. However, a recent study by DeepMind showed that, for a fixed compute budget, model size and dataset size (in tokens) should increase proportionately. Furthermore, the findings showed that a smaller model trained on more data can outperform a much larger model trained on less data. 

“New scaling law states that if you get a 10x increase in compute, you can make your model 3x bigger, and the data you train should be more than 3x larger,” Manu Joseph, creator of PyTorch Tabular, told AIM.
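A back-of-the-envelope sketch in Python makes this rule of thumb concrete. The C ≈ 6·N·D compute approximation and the roughly 20-tokens-per-parameter ratio used below are commonly cited readings of the DeepMind (Chinchilla) results rather than exact figures from the paper, so treat the output as illustrative only.

```python
# Rough sketch of compute-optimal (Chinchilla-style) scaling.
# Assumes the commonly cited approximation C ~ 6 * N * D (FLOPs),
# with both N_opt and D_opt growing roughly as sqrt(C).

def compute_optimal(compute_flops):
    """Return (parameters, tokens) that roughly exhaust a FLOP budget."""
    # ~20 training tokens per parameter is a frequently quoted
    # Chinchilla rule of thumb (an assumption here, not an exact figure).
    tokens_per_param = 20.0
    # Solve 6 * N * (20 * N) = C  =>  N = sqrt(C / 120)
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

base_c = 1e23                          # an arbitrary FLOP budget
n1, d1 = compute_optimal(base_c)
n2, d2 = compute_optimal(10 * base_c)  # 10x more compute
print(f"model grows by {n2 / n1:.1f}x, data grows by {d2 / d1:.1f}x")
# Both ratios come out to about 3.2x, matching the "10x compute gives a
# ~3x bigger model and more than 3x more data" rule of thumb quoted above.
```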

Dearth of training data

An AI model is trained on both high-quality and low-quality data, and with the rate at which dataset sizes are growing, the existing stock of data is struggling to keep up. An Epoch paper predicted that we are within one order of magnitude of exhausting high-quality data, and that the stock will run out between 2023 and 2027. Therefore, the need is to improve data efficiency or make new data sources available to keep up with the growth of AI. 
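To see how that kind of exhaustion window is reasoned about, here is a toy projection. Every number in it (the stock size, current demand, and growth rate) is a placeholder assumption for illustration, not a figure from the Epoch paper.

```python
# Toy projection of when training-data demand overtakes the available
# stock of high-quality text. All numbers are illustrative placeholders.

STOCK_TOKENS = 1e13      # assumed stock of high-quality tokens
DEMAND_2022 = 1e12       # assumed tokens used by the largest 2022-era runs
ANNUAL_GROWTH = 2.0      # assumed yearly growth factor in dataset size

year, demand = 2022, DEMAND_2022
while demand < STOCK_TOKENS:
    year += 1
    demand *= ANNUAL_GROWTH

print(f"Under these assumptions, demand exceeds the stock around {year}.")
# With 2x yearly growth and one order of magnitude of headroom, the
# crossover lands only a few years out, the same qualitative picture
# as the 2023-2027 window cited above.
```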

As Manu says, “Collecting more data to train LLMs is a challenge since there is a shortage of good-quality text data, and most of the text on the internet is duplicated.” Hence, it remains unclear how to feed LLMs a sufficiently varied supply of training data. 
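The duplication problem is usually attacked by deduplicating the corpus before training. The snippet below is a minimal sketch of exact-match deduplication; production pipelines typically rely on near-duplicate methods such as MinHash, which this toy version does not attempt.

```python
import hashlib

def deduplicate(documents):
    """Drop exact duplicates by hashing normalised text.
    Real corpus pipelines go further (near-duplicate detection with
    MinHash/LSH), but the goal is the same: shrink the web's redundancy."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["The cat sat on the mat.", "the cat sat on the mat. ", "A new sentence."]
print(deduplicate(corpus))   # keeps only two of the three documents
```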

Along similar lines, Google’s François Chollet made the same point while discussing the latest sensation in AI, ChatGPT. Chollet says that if much of the web becomes inundated with GPT-generated content, the performance of generative text models will degrade as they start training on their own output. He adds, “it might be that the dataset sizes for text models have already peaked, simply because the [signal-to-noise] S2N ratio will start declining.” If so, the rapid growth of generative AI will eventually slow down. 

Increasing stock data

While researchers claim that the stock of data is running out, there are certain limitations to these estimates:

  • Researchers are increasingly using synthetic data to train neural models. Synthetic data is labelled information generated by computer simulations or algorithms, and it offers an alternative to real-world data that is particularly useful for generating data at scale. However, there is still uncertainty around its usefulness, as studies show it has yet to solve the data-quality problem. 
  • Government or corporate players can facilitate the production of large quantities of data through widespread screen recordings, video surveillance, or video recordings (for example, from self-driving cars) that continuously stream real-world data. As Manu points out, with models like OpenAI’s Whisper, transcribed videos can also serve as a source of additional training data for LLMs (a minimal transcription sketch follows this list). 
  • Further, high-quality data can be generated from low-quality sources, for instance by introducing robust automatic quality metrics. A recent MIT research effort proposed a framework for generating high-quality text data by optimising a critique score that combines fluency, similarity, and misclassification metrics. Existing frameworks allow adversarial attacks to make classifiers misclassify input data. The Rewrite & Rollback (R&R) framework instead explores multiple word substitutions and, based on the critique score, only approves a rewrite if it does not lead to misclassification. In this manner, it can improve the quality of text classifiers. 
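As a minimal sketch of the transcription route mentioned above, the snippet below uses the open-source openai-whisper package to turn an audio file into text that could feed a training corpus. The file path and checkpoint size are placeholders, and a real pipeline would add quality filtering and deduplication on top.

```python
# pip install openai-whisper
import whisper

# "base" is one of Whisper's smaller checkpoints; larger ones transcribe
# more accurately at higher compute cost.
model = whisper.load_model("base")

# Placeholder path: in practice this would be a stream of podcast,
# lecture, or video audio.
result = model.transcribe("talk_recording.mp3")

# The transcript text could then be cleaned, deduplicated, and added
# to an LLM training corpus.
print(result["text"][:500])
```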

To get better text models, it will be important to develop processes that ensure the quality of data sources. While data remains paramount, merely piling on more of it is not the solution and is ultimately likely to produce much worse models. Still, until we develop more efficient ways of learning from data, large quantities of training data will remain essential for LLMs.

Scaling large models, the Google way

Google researchers have developed a method called UL2 Repair (UL2R), which can improve the scaling properties of LLMs with negligible extra computation and almost no new data sources. UL2R is a second stage of pre-training that uses a mixture-of-denoisers objective. The mixture includes both causal language modelling objectives (used by GPT-3 and PaLM), which are better for long-form generation, and denoising objectives (used by T5), which are better for fine-tuning, leading to better performance in both scenarios. Hence, with UL2R a language model can continue pre-training on a different objective at low computational cost, without starting from scratch. 
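The mixture-of-denoisers idea can be sketched as a toy data-construction step. The example below is a simplified illustration with made-up sentinel tokens, not Google’s UL2 implementation: it only shows the two flavours of objective being mixed, prefix/causal language modelling and T5-style span corruption.

```python
import random

def make_denoising_example(tokens, mode):
    """Toy construction of (input, target) pairs for two UL2-style objectives.
    'causal' mimics prefix-LM / long-form generation; 'span' mimics T5-style
    span corruption. Sentinel token names are made up for illustration."""
    if mode == "causal":
        split = len(tokens) // 2
        return tokens[:split], tokens[split:]           # predict the continuation
    if mode == "span":
        start = random.randrange(0, len(tokens) - 3)
        span = tokens[start:start + 3]                   # corrupt a short span
        inp = tokens[:start] + ["<extra_id_0>"] + tokens[start + 3:]
        tgt = ["<extra_id_0>"] + span
        return inp, tgt
    raise ValueError(mode)

sentence = "the quick brown fox jumps over the lazy dog".split()
for mode in ("causal", "span"):
    x, y = make_denoising_example(sentence, mode)
    print(mode, "->", x, "|", y)
```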

Further, the findings showed that, in scaling experiments on downstream few-shot NLP tasks, adapting PaLM with UL2R was roughly twice as efficient at the 540B scale, matching the performance of the final PaLM 540B model with only half the computation and saving up to 4.4 million TPUv4 hours.

Additionally, in a second paper, researchers show that instruction fine-tuning, dubbed ‘Flan’, requires only a small fraction of the compute cost of pre-training. The method involves fine-tuning the model on a collection of NLP datasets phrased as instructions. The results showed that Flan-PaLM 540B achieved an average 9.4% performance improvement over PaLM 540B after fine-tuning on 1.8K additional tasks.
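To make “phrased as instructions” concrete, here is a small sketch of how a plain classification example might be turned into an instruction-style prompt. The template wording is invented for illustration; Flan draws on many different templates per dataset.

```python
def to_instruction(example):
    """Turn a plain (text, label) classification example into an
    instruction-style prompt/target pair. Template text is illustrative,
    not Flan's actual templates."""
    prompt = (
        "Classify the sentiment of the following movie review as "
        "positive or negative.\n\n"
        f"Review: {example['text']}\nSentiment:"
    )
    return {"prompt": prompt, "target": example["label"]}

sample = {"text": "A beautifully shot but ultimately hollow film.", "label": "negative"}
print(to_instruction(sample)["prompt"])
```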

Ayush Jain

Ayush is interested in knowing how technology shapes and defines our culture, and our understanding of the world. He believes in exploring reality at the intersections of technology and art, science, and politics.