BigScience is an open collaboration bootstrapped by Hugging Face, GENCI and IDRIS, and organised as a research workshop. With the help of around 1,000 academics and researchers around the world, BigScience has developed BLOOM to counter big tech’s hold on large language models.
An early version of the BLOOM language model was released on June 17, 2022. BLOOM will be open source and will be the first model of its scale to be multilingual.
The goal of the BigScience project is to have a group of researchers participate in developing and training an open large language model.
“Everything happens entirely in the open: anyone can participate, and we share all research artefacts with the entire research community. This initiative is designed as an interdisciplinary research workshop, gathering researchers with diverse research interests, from social sciences to Machine Learning. The main idea is to show that there is an open, collaborative and interdisciplinary way to create Large Language Models,” Giada Pistilli, Ethicist at Hugging Face, said.
BLOOM is being trained with USD 7 million worth of publicly funded computing time. The team also got free access to the 28-petaflop Jean Zay (IDRIS) supercomputer facility outside Paris.
What is BigScience?
BigScience is neither a consortium nor an officially incorporated entity. The collaboration is inspired by scientific projects such as the Large Hadron Collider, in which open collaboration within the scientific community has produced large-scale artefacts useful to the entire community.
BigScience consists of independent, academic and industrial researchers interested in AI, NLP, the social sciences, law, ethics and public policy.
Talking about the significance of an initiative like BigScience, Pistilli said there is a growing need for research to be conducted in a setting where stakeholders can have input and influence over the design process. This process will allow them to help shape the values and priorities of the research project and decide what data and evaluations should be used.
“BigScience aims to make meaningful progress toward solving complex issues and create tools and processes to help a greater diversity of participants. Another important aspect was the joint effort we conducted to help assign a place for Large Language Models within society,” Pistilli said.
Open for all
The fully trained BLOOM model is accessible to all researchers; however, to run it, you need significant hardware capacity. Unfortunately, not every researcher has such capacity. Hence, the team will publish smaller, less hardware-intensive versions. In addition, a distributed system will be created to allow labs to share the model across their servers.
Further, Hugging Face will release a web application that will enable anyone to query BLOOM without downloading.
Models are only as good as the data sets they are trained on. Therefore, selecting the text to train the model was a key decision for the team. Most major models are fed with language directly from the web and publicly open dialogues from sources such as Reddit. However, the team has hand-picked around two-thirds of the 341-billion-word data set that BLOOM has been trained with.
Big firms monopolising LLMs?
Large language models are quickly changing the AI landscape. But not many organisations have the resources to train such large models. In a fair world, technology should be accessible to all and used for humanity’s betterment. However, there is a growing concern that big firms could monopolise such models for vested interests.
Also, given the potential impact of such language models, it is imperative that the broader community has a good understanding of how they are constructed, how they function and how they can be improved.
Popular models like OpenAI’s GPT-3 are not open source. This means the inner workings of such models are known to only a handful of people.
“New AI technologies are having a considerable impact on society. Just think of the large language models used to make content moderation, text prediction in our email interfaces, or machine translations. The fundamental dichotomy here lies in opening or closing these models. There are security reasons to keep them closed, but doing so maintains its construction in terms of dataset and training equally closed and therefore inaccessible.
“This choice represents the old philosophical tension between security and freedom. So this closed preference makes it difficult for the research community to study and evaluate these models to measure their impact on society. The reasons for opening specific processes become a matter of transparency and, therefore, accountability,” Pistilli added.
Besides trying to break the stranglehold of big firms on large language models, BLOOM aims to address the biases that large language models inherit from the datasets they train on.
“Due to the complementarity of law and ethics, we drafted an ethical charter that inspired the governance of the project and its legal, ethical, and social aspect. As a society impacted, directly or indirectly, by these technologies, we must educate ourselves about the ethical implications of these new technologies on our daily lives.
“We need to inform people that technology is never neutral and is inscribed within a socio-political framework. As an instrument of power, it can come to alter power relations within society and amplify its injustices. We must also be careful not to fall into the trap of technosolutionism, that is, claiming to solve the underlying problems of society through technological tools,” Pistilli said.