
Behind HuggingFace’s BigScience Project that crowdsources research on large language models

Companies like Google and Facebook already deploy large language models (LLMs) for translation and content moderation, while OpenAI’s GPT-2 and GPT-3 are among the most powerful language models yet built, able to write convincing passages of text in a range of styles, complete musical compositions, and finish writing computer code. Start-ups, too, are creating dozens of their own LLM products and services on top of the models built by these tech giants.

Very soon, all our digital interactions will likely be filtered through LLMs, which is bound to have a fundamental impact on our society. 

Still, despite the proliferation of this technology, very little research is being done into the environmental, ethical, and societal concerns it raises. Because AI research is expensive, the tech giants with the deepest pockets largely determine how this transformative technology develops, and that same power lets them censor or encumber research that casts them in a bad light.

Dangers of LLMs 

There are a number of concerns surrounding the rapid growth of LLMs that many leaders in the AI community believe are being under-researched by big tech firms. These include: 

  • The data used to build these models is often unethically and nonconsensually sourced. 
  • The models are conversationally fluid and believably human, but they do not understand what they’re saying and often reproduce racist, sexist, and other harmful content, including material that encourages self-harm. 
  • Many of the advanced capabilities of LLMs today are only available in English, which makes their application for content moderation in non-English speaking countries dangerous. 
  • When fake news, hate speech, and death threats aren’t moderated out of the data set, they are used as training data to build the next generation of LLMs—allowing for the continuation (or worsening) of toxic linguistic patterns on the internet. 

What is the BigScience project? 

The BigScience project, led by Hugging Face, is a year-long research workshop inspired by large-scale scientific collaborations such as CERN in particle physics. It aims to address the lack of research on multilingual models and datasets. The project’s leaders don’t believe they can pause the hype surrounding large language models, but they hope to nudge it in a direction that is more beneficial to society.

The idea is for the program’s participants, all of whom are volunteers, to investigate the capabilities and limitations of these datasets and models from all angles. The central question they seek to answer is how and when LLMs should be developed and deployed so that society can enjoy their benefits without suffering the harms they pose.

To do this, the researchers aim to build a very large multilingual neural language model and a very large multilingual text dataset, training the model on a supercomputer provided by the French government.

How is BigScience doing a better job than tech companies? 

Unlike research conducted at tech companies, where researchers tend to have primarily technical expertise, BigScience has brought in researchers from a much broader range of countries and disciplines. Specialists in AI, NLP, the social sciences, law, ethics, and public policy all take part, making the model-construction process a genuinely collaborative effort.

As of now, the program consists of 600 researchers from 50 countries and more than 250 institutions, divided into a dozen working groups, each tackling a different facet of model development and investigation: one group is measuring the model’s environmental impact, another is developing and evaluating its “multilinguality,” a third is devising responsible ways to source training data, and yet another is transcribing historical radio archives and podcasts.

If things work out, the project could inspire people within the industry (many of whom are involved in the project) to incorporate some of these approaches into their own LLM strategies and to create new norms within the NLP community.
