Looking back, 2021 can surely be labelled the year of large language models, with all the tech giants releasing models to stay ahead in the innovation game. In December alone, we saw back-to-back releases: DeepMind’s 280-billion-parameter transformer language model, Gopher; Google’s Generalist Language Model (GLaM), a trillion-weight model that uses sparsity; and LG AI Research’s language model “Exaone”, with 300 billion parameters. With innovation in language models accelerating at such a pace, could we see a 100-trillion-parameter (100T) language model in the very near future?
Already halfway through
The idea is surely not too far-fetched if we look at how quickly tech companies have brought out improved versions of today’s models, in a span of just a few years. After OpenAI released the GPT-3 autoregressive language model with 175 billion machine learning parameters in 2020 (its predecessor, GPT-2, was over 100 times smaller, at 1.5 billion parameters), the tech mammoths have poured major efforts into building more such models.
AI21 Labs released Jurassic-1, which has 178 billion parameters. In a landmark event, Microsoft and NVIDIA collaborated to bring out the Megatron-Turing Natural Language Generation model (MT-NLG), calling it the largest and most powerful monolithic transformer language model trained to date, with 530 billion parameters. Google released Switch Transformers, a technique to train language models with over a trillion parameters.
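Switch Transformers reach trillion-parameter scale through sparsity: each token is routed to just one “expert” sub-network, so the total parameter count can grow without growing the per-token compute. The toy sketch below illustrates that top-1 routing idea only; it is not Google’s implementation, and all names and dimensions are made up for illustration.

```python
import numpy as np

# Toy sketch of top-1 ("switch") routing. Each token activates a single
# expert, so adding experts grows total parameters but not per-token cost.
rng = np.random.default_rng(0)
d_model, n_experts = 8, 4

# One tiny "expert" per slot (just a weight matrix here, for brevity).
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))  # gating weights

def switch_layer(token):
    logits = token @ router
    k = int(np.argmax(logits))      # pick exactly one expert (top-1 gating)
    return token @ experts[k]       # only that expert's weights are touched

token = rng.standard_normal(d_model)
out = switch_layer(token)
print(out.shape)  # (8,)
```

Only one of the four expert matrices is multiplied per token, which is why a sparse model can carry far more parameters than a dense one at the same inference cost.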
The Chinese government-backed Beijing Academy of Artificial Intelligence (BAAI) has introduced Wu Dao 2.0, which uses 1.75 trillion parameters to simulate conversational speech, write poems, understand pictures, and more, outdoing OpenAI’s GPT-3 and Google’s Switch Transformer in size. The lead researcher behind the project, Tang Jie, said that Wu Dao 2.0 aims to make machines think like humans and to achieve cognitive abilities that go beyond the Turing test.
GPT-4 is coming soon
There is a buzz that GPT-4 is coming soon, and expectations are understandably high. Looking at the timeline, the three models were released roughly a year apart: GPT-1 in 2018, GPT-2 in 2019, and GPT-3 in 2020. So the chances of seeing GPT-4 in 2022 or early 2023 are very high.
But Sam Altman, the CEO of OpenAI, had something different to say about the future release of GPT-4. In a question-and-answer session at the AC10 online meetup, he said that though GPT-4 will not be any bigger than GPT-3, it will use more compute resources.
Scaling up to such numbers is not easy
Though scaling up to 100T would be a monumental achievement and could greatly benefit areas like search, chatbots and virtual assistants, and gaming experiences, reaching such massive figures is not easy. Even today’s models, with far fewer parameters, are already difficult and expensive to train and fine-tune. A major factor to consider is the financial and technical cost of a 100T model. As transformative as it sounds, building a 100T language model will require massive computational resources, and correspondingly high technical costs. Such models could only be developed by the big names with the capability to invest resources on that scale.
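To make the scale of the problem concrete, here is a rough back-of-envelope estimate for a hypothetical dense 100T model. The token count and the common “≈ 6 × parameters × tokens” training-FLOPs approximation are assumptions borrowed from GPT-3-era scaling discussions, not figures from the article; all numbers are illustrative only.

```python
# Back-of-envelope resource estimate for a hypothetical dense
# 100-trillion-parameter model. All figures are illustrative.

PARAMS = 100e12          # 100 trillion parameters (hypothetical)
BYTES_PER_PARAM = 2      # fp16/bf16 storage per weight

weight_bytes = PARAMS * BYTES_PER_PARAM
print(f"Weights alone: {weight_bytes / 1e12:.0f} TB")   # 200 TB

# Common rule of thumb: training cost ~ 6 * N * D FLOPs,
# where N = parameter count and D = training tokens.
TOKENS = 300e9           # GPT-3-scale token count, assumed for comparison
train_flops = 6 * PARAMS * TOKENS
print(f"Training compute: ~{train_flops:.1e} FLOPs")
```

Even before any optimizer state, activations or data pipeline are counted, 200 TB of weights alone is far beyond the memory of any single accelerator, which is why only a handful of organisations could attempt such a run.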
We also have to look at the environmental impact of building something so huge. “Training a single BERT base model (without hyperparameter tuning) on GPUs was estimated to require as much energy as a trans-American flight,” noted the paper titled “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”. One can only imagine the toll on the environment of building a 100T language model. The researchers further stated that while some of the energy comes from renewable sources, or cloud compute companies’ use of carbon credit-offset sources, a major chunk of cloud compute providers’ energy is not renewable. Renewable energy remains costly, and data centres with ever-increasing computation requirements take away from other potential uses of green energy. The focus, therefore, needs to be on building energy-efficient model architectures and training paradigms.
Protection of sensitive information and removal of bias will be key
Bias and truthfulness also need taking care of: it is common knowledge that large language models exhibit various biases owing to the vast data they collect from different sources. Various studies and papers indicate how these biases span caste, religion and ethnicity, and the kind of societal impact they can have. Another area that needs focus is the protection of sensitive information. In a study last year, co-authors from Google, Apple, Stanford University, OpenAI, the University of California, Berkeley, and Northeastern University showed that large language models can be prompted to reveal sensitive, private information when fed certain words and phrases. With so much data feeding a mammoth model of that size, protecting sensitive information will be crucial.
The future is not just text, but video, audio and images too
Though GPT-3 relied largely on text, a Stanford University discussion titled “How Large Language Models Will Transform Science, Society, and AI” suggests the future will be different. Upcoming models will be trained on data from other modalities as well: video, audio, images and more. This will bring greater diversity and faster learning.
Progress in this area is already visible in OpenAI’s DALL·E, a 12-billion-parameter version of GPT-3 trained to generate images from text descriptions using a dataset of text-image pairs.
Before a 100T model release, develop norms and rules for the deployment of LLMs
If we are to see such massive 100T language models in the near future, what needs to be done urgently is to develop rules covering the various aspects of deploying such models. Who builds them? Who is responsible for their ethical oversight? If there are biases or leaks of sensitive information, who bears the brunt? These are the questions that need answers before models are scaled to even bigger numbers.