Behind HuggingFace’s BigScience Project that crowdsources research on large language models

Companies like Google and Facebook are deploying large language models (LLMs) for translation and content moderation. Meanwhile, OpenAI’s GPT-2 and GPT-3 are among the most powerful language models available: they can write exceptionally convincing passages of text in a range of styles, complete musical compositions, and finish computer code. Start-ups, too, are building dozens of their own LLM products and services on top of the models created by these tech giants.

Very soon, all our digital interactions will likely be filtered through LLMs, which is bound to have a fundamental impact on our society. 

Still, despite the proliferation of this technology, very little research is being done into the environmental, ethical, and societal concerns it raises. Because AI research is expensive and the tech giants are the ones with deep pockets, they currently hold all the power in determining how this transformative technology develops, including the power to censor or encumber research that casts them in a bad light.

Dangers of LLMs 

There are a number of concerns surrounding the rapid growth of LLMs that many leaders in the AI community believe are being under-researched by big tech firms. These include: 

  • The data used to build these models is often unethically and nonconsensually sourced. 
  • The models are conversationally fluent and believably human, but they do not understand what they’re saying and often propagate racism, sexism, encouragement of self-harm, and other dangerous views.
  • Many of the advanced capabilities of LLMs today are only available in English, which makes their application for content moderation in non-English speaking countries dangerous. 
  • When fake news, hate speech, and death threats aren’t moderated out of a dataset, they become training data for the next generation of LLMs, which allows toxic linguistic patterns on the internet to continue (or worsen).

What is the BigScience project? 

The BigScience project, led by Hugging Face, is a year-long research workshop that takes inspiration from large-scale collaborative efforts in other sciences (such as CERN in particle physics) to address the lack of research on multilingual models and datasets. The leaders of the project don’t believe that they can put a pause on the hype surrounding large language models, but they hope to nudge it in a direction that will make it more beneficial to society.

The idea is for the program’s participants (who are all there as volunteers) to investigate the capabilities and limitations of these datasets and models from all angles. The central question they seek to answer is how and when LLMs should be developed and deployed in order for us to be able to enjoy their benefits without having to confront the challenges they pose. 

To do this, the group of researchers aims to create a very large multilingual neural network language model and a very large multilingual text dataset on a supercomputer that has been provided to them by the French government.

How is BigScience doing a better job than tech companies? 

Unlike the research conducted at tech companies, where researchers have primarily technical expertise, BigScience has brought in researchers from a much broader range of countries and disciplines. Its participants specialise in AI, NLP, social sciences, law, ethics, and public policy, which makes the model-construction process a truly collaborative effort.

As of now, the program consists of 600 researchers from 50 countries and more than 250 institutions. They have all been divided into a dozen working groups, each tackling different facets of model development and investigation: one group is measuring the model’s environmental impact, one is developing and evaluating the model’s “multilinguality,” another is developing responsible ways to source training data, and yet another is transcribing historical radio archives or podcasts. 

If things work out, the project could inspire people within the industry (many of whom are involved in the project) to incorporate some of these approaches into their own LLM strategy and create new norms within the NLP community.

Srishti Mukherjee

Drowned in reading sci-fi, fantasy, and classics in equal measure, Srishti carries her bond with literature head-on into the world of science and tech, learning and writing about the fascinating possibilities in the fields of artificial intelligence and machine learning. Making hyperrealistic paintings of her dog Pickle and going through succession memes are her ideas of fun.
