Finally, a large language model that’s open source

We must be careful not to fall into the trap of technosolutionism.
BigScience is an open collaboration bootstrapped by Hugging Face, GENCI and IDRIS, and organised as a research workshop. With the help of around 1,000 academics and researchers around the world, BigScience has developed BLOOM to counter big tech’s stranglehold on large language models.

An early version of the BLOOM language model was released on June 17, 2022. BLOOM will be open source and will be the first model of its scale to be multilingual.


The goal of the BigScience project is to have a group of researchers participate in developing and training an open large language model. 

“Everything happens entirely in the open: anyone can participate, and we share all research artefacts with the entire research community. This initiative is designed as an interdisciplinary research workshop, gathering researchers with diverse research interests, from social sciences to Machine Learning. The main idea is to show that there is an open, collaborative and interdisciplinary way to create Large Language Models,” Giada Pistilli, Ethicist at Hugging Face, said.

BLOOM is being trained with USD 7 million worth of publicly funded computing time. The team also got free access to the 28-petaflop Jean Zay (IDRIS) supercomputer facility outside Paris.

What is BigScience?

BigScience is neither a consortium nor an officially incorporated entity. The collaboration is inspired by scientific projects such as the Large Hadron Collider, in which open collaboration across the scientific community has facilitated the creation of large-scale artefacts useful for the entire community.

BigScience brings together independent, academic and industrial researchers interested in AI, NLP, social sciences, law, ethics and public policy.

Talking about the significance of an initiative like BigScience, Pistilli said there is a growing need for research to be conducted in a setting where stakeholders can have input and influence over the design process. This process will allow them to help shape the values and priorities of the research project and decide what data and evaluations should be used.

“BigScience aims to make meaningful progress toward solving complex issues and create tools and processes to help a greater diversity of participants. Another important aspect was the joint effort we conducted to help assign a place for Large Language Models within society,” Pistilli said.

Open for all

The fully trained BLOOM model is accessible to all researchers, but running it requires significant hardware capacity that not every researcher has. Hence, the team will publish smaller, less hardware-intensive versions. In addition, a distributed system will be created to allow labs to share the model across their servers.
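
To give a rough sense of what "significant hardware capacity" can mean, the sketch below estimates the memory needed just to hold a model's weights. It is a back-of-the-envelope illustration, not the BigScience team's own sizing. The figures are assumptions that do not come from this article: 176 billion parameters is the size widely reported for the full BLOOM model, the 7-billion-parameter variant and the 80 GB accelerator are illustrative, and 2 bytes per parameter assumes 16-bit weights. Real deployments need additional memory for activations and other overhead.

```python
import math


def weight_memory_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_devices(num_params: int, device_memory_gb: float, bytes_per_param: int = 2) -> int:
    """Smallest number of devices whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / device_memory_gb)


# Assumed figures, not taken from the article.
FULL_MODEL_PARAMS = 176_000_000_000   # widely reported size of the full model
SMALL_MODEL_PARAMS = 7_000_000_000    # illustrative smaller variant
DEVICE_MEMORY_GB = 80                 # illustrative accelerator memory

for name, params in [("full", FULL_MODEL_PARAMS), ("small", SMALL_MODEL_PARAMS)]:
    gb = weight_memory_gb(params)
    devices = min_devices(params, DEVICE_MEMORY_GB)
    print(f"{name}: about {gb:.0f} GB of weights, at least {devices} device(s)")
```

Under these assumptions the full model needs several high-end accelerators for its weights alone, while the smaller variant fits on one, which is why smaller releases and shared, distributed hosting matter for labs with modest budgets.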

Further, Hugging Face will release a web application that will enable anyone to query BLOOM without downloading. 

Models are only as good as the data sets they are trained on. Therefore, selecting the text to train the model was a key decision for the team. Most major models are fed with language directly from the web and publicly open dialogues from sources such as Reddit. However, the team has hand-picked around two-thirds of the 341-billion-word data set that BLOOM has been trained with. 

Big firms monopolising LLMs?

Large language models are quickly changing the AI landscape. But not many organisations have the resources to train such large models. In a fair world, technology should be accessible to all and used for humanity’s betterment. However, there is a growing concern that big firms could monopolise such models for vested interests.

Also, given the potential impact of such language models, it is imperative that the broader community have a good understanding of how they are constructed, how they function and how they can be improved.

Popular models like OpenAI’s GPT-3 are not open source. This means the inner workings of such models are known to only a handful of people.

“New AI technologies are having a considerable impact on society. Just think of the large language models used to make content moderation, text prediction in our email interfaces, or machine translations. The fundamental dichotomy here lies in opening or closing these models. There are security reasons to keep them closed, but doing so maintains its construction in terms of dataset and training equally closed and therefore inaccessible. 

“This choice represents the old philosophical tension between security and freedom. So this closed preference makes it difficult for the research community to study and evaluate these models to measure their impact on society. The reasons for opening specific processes become a matter of transparency and, therefore, accountability,” Pistilli added.

Besides trying to break the stranglehold of big firms on large language models, BLOOM aims to address the biases that large language models inherit from the datasets they train on. 

“Due to the complementarity of law and ethics, we drafted an ethical charter that inspired the governance of the project and its legal, ethical, and social aspect. As a society impacted, directly or indirectly, by these technologies, we must educate ourselves about the ethical implications of these new technologies on our daily lives.

“We need to inform people that technology is never neutral and is inscribed within a socio-political framework. As an instrument of power, it can come to alter power relations within society and amplify its injustices. We must also be careful not to fall into the trap of technosolutionism, that is, claiming to solve the underlying problems of society through technological tools,” Pistilli said.

Pritam Bordoloi
I have a keen interest in creative writing and artificial intelligence. As a journalist, I deep dive into the world of technology and analyse how it’s restructuring business models and reshaping society.
