With A Rush To Create Larger Language Models, Are We Defeating Their Purpose?

Language Models trained on large, uncurated, static datasets from the Web encode hegemonic views that are harmful to marginalised populations.


Israel-based AI company AI21 Labs has released Jurassic-1 Jumbo, a language model with 178 billion parameters, making it slightly larger than the previous record-holder, OpenAI’s GPT-3, at 175 billion parameters.

Turing-NLG from Microsoft, GPT-3 from OpenAI, and Wu Dao 2.0 from the Beijing Academy of Artificial Intelligence (BAAI): all three language models were introduced within a span of one and a half years, signalling a rush to come out with the “biggest” model. The trend, however, demands an answer: is it worth having such large language models at all? Let’s find out.



To start with, Yann LeCun, Chief AI Scientist at Facebook, equates building ever-larger language models with building high-altitude aeroplanes in the hope of landing on the Moon: they might break altitude records, but reaching the Moon demands an entirely different approach.

Too Risky to Have Large Models

In a paper titled “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”, researchers outlined the dangers of going too big. The team includes renowned computer scientist Timnit Gebru, who works on algorithmic bias and data mining. Citing earlier work, the paper notes that training a Transformer (large) model with neural architecture search was estimated to emit 284 tonnes of CO2, nearly 57 times the average human’s annual emissions of 5 tonnes of carbon dioxide.
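The comparison holds up to back-of-the-envelope arithmetic; the two figures below are simply the ones quoted above:

```python
# Back-of-the-envelope check of the comparison quoted above.
training_emissions_t = 284  # tonnes of CO2 estimated for the training run
human_annual_t = 5          # tonnes of CO2 an average person emits per year

ratio = training_emissions_t / human_annual_t
print(f"Roughly {ratio:.0f}x an average person's annual emissions")  # ~57x
```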

Another study, published by University of California, Berkeley researchers, found that even huge language models like OpenAI’s GPT-3 can solve just 2.9 to 6.9 per cent of problems from a dataset of over 12,500 mathematics problems. High financial and environmental costs combined with low problem-solving ability cause more harm than good.

Uneven internet penetration is another cause for worry. The amount of data available on the web might be large, but that does not guarantee diversity. Consider, for example, that internet penetration in Africa is a mere 39.3 per cent, while in Europe it is above 90 per cent. “LMs trained on large, uncurated, static datasets from the Web encode hegemonic views that are harmful to marginalised populations,” the researchers warn.

Privacy issues are yet to be resolved. Large datasets, often hundreds of gigabytes drawn from different sources, contain sensitive and personally identifiable information (PII), including names, addresses, contact numbers and gender, even when sourced from publicly available data. Models trained on such datasets may therefore reproduce personal details in their output, and data leaks from such models can have disastrous consequences. Training data extraction attacks have been termed a realistic threat to state-of-the-art large language models. Indeed, researchers from Apple, Stanford, OpenAI, Google, Berkeley and Northeastern University have demonstrated in a collaboration that, given only the ability to query a pre-trained language model, it is possible to extract specific pieces of training data that the model has memorised.
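As a minimal sketch of why PII lingers in web-scraped corpora, the toy scan below flags only obvious machine-readable patterns; the regexes and sample text are illustrative assumptions, not any production filter. Names, addresses and gender markers rarely match a fixed pattern at all, which is one reason scrubbing at scale remains unreliable.

```python
import re

# Toy, illustrative PII scan of a scraped text snippet. These two patterns
# are deliberately simplistic: real corpora need far more robust detection.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d \-]{8,}\d"),
}

def find_pii(text):
    """Return (kind, match) pairs for every pattern hit in the text."""
    hits = []
    for kind, pattern in PII_PATTERNS.items():
        hits.extend((kind, m) for m in pattern.findall(text))
    return hits

sample = "Contact Jane at jane.doe@example.com or +44 20 7946 0958."
print(find_pii(sample))
```

Even on this contrived sample the scan catches the email address and phone number but has no way to flag “Jane” as a name, which mirrors the gap the researchers describe.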

Finally, the major issue with large language models, their encoded bias, needs to be examined. It is well known by now that large LMs exhibit various kinds of bias, including stereotypical associations and negative sentiment towards specific groups. For instance, one study found that BERT associates phrases referencing persons with disabilities with more negative sentiment words, and that gun violence, homelessness and drug addiction are over-represented in texts discussing mental illness.

Way Forward

NLP researchers need to take all of these risks into consideration, with greater emphasis on whether a model’s benefits outweigh them. The moment calls for the research community to direct more resources and effort towards building effective models rather than ever more data-hungry ones. Language models offer enormous utility and flexibility but, like other innovations, they also pose risks and carry limitations. Identifying those risks and mitigating the associated issues can lead to better language models.

Kumar Gandharv
Kumar Gandharv, PGD in English Journalism (IIMC, Delhi), is setting out on a journey as a tech journalist at AIM. A keen observer of national and IR-related news.
