BigScience’s open-source LLM BLOOM has landed

BigScience is uniquely people-first and participation-driven, bringing together perspectives from hundreds of multidisciplinary researchers around the world.
BLOOM is the first multilingual Large Language Model (LLM) trained in complete transparency by the largest collaboration of AI researchers ever involved in a single research project.

With 176 billion parameters (more than OpenAI’s GPT-3 and Meta AI’s OPT), BLOOM can generate text in 46 natural languages and dialects and 13 programming languages. It was trained on 1.6TB of text data, the equivalent of 320 times the complete works of Shakespeare.

Researchers can now download, run and study BLOOM. Any individual or institution that agrees to the terms of the model’s Responsible AI License can use the model on a local machine or in the cloud.
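To give a sense of what running BLOOM on a local machine involves, the sketch below estimates how much memory the raw weights alone would occupy. The 176 billion parameter count comes from the article; the bytes-per-parameter values are assumptions about common storage precisions, not figures from the BLOOM release, and the estimate ignores activations and any other runtime overhead.

```python
# Rough weight-memory estimate for a 176-billion-parameter model.
# The bytes-per-parameter values are assumptions about storage precision,
# not figures published with BLOOM.
PARAMS = 176_000_000_000

BYTES_PER_PARAM = {
    "float32": 4,
    "float16/bfloat16": 2,
    "int8": 1,
}


def weight_gigabytes(num_params: int, bytes_per_param: int) -> float:
    """Return the size of the raw weights in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 10**9


if __name__ == "__main__":
    for precision, size in BYTES_PER_PARAM.items():
        print(f"{precision}: {weight_gigabytes(PARAMS, size):,.0f} GB")
```

Even under the 16-bit assumption the weights come to roughly 352 GB, which suggests that "local machine" here likely means a multi-GPU server rather than a laptop.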

“BLOOM is a demonstration that the most powerful AI models can be trained and released by the broader research community with accountability and in an actual open way, in contrast to the typical secrecy of industrial AI research labs”

Teven Le Scao, BLOOM Training co-lead

Although it was never trained specifically on these tasks, BLOOM can be asked to produce summaries or translations of text in 46 natural languages, output code from instructions in 13 programming languages, and follow prompts to perform original tasks such as writing recipes, extracting information from a news article, or composing sentences using a newly defined, invented word.

“BigScience is uniquely people-first and participation-driven, bringing together perspectives from hundreds of multidisciplinary researchers around the world. This was the best way to promote values of inclusion and responsibility in concert with the people actually using this technology,” said Yacine Jernite, BigScience Data Chair. 

Ushering in a new era of open-source LLMs

The BigScience research project started in early 2021 and was a collaborative effort involving more than 1,000 researchers from 60+ countries and 250+ institutions. The BLOOM model was trained on the Jean Zay supercomputer, located south of Paris, France.

A Jean Zay 4 V100 node.

“Large ML models have changed the world of AI research over the last two years but the huge compute cost necessary to train them resulted in very few teams actually having the ability to train and research them”

BigScience co-lead & Hugging Face co-founder Thomas Wolf

The BigScience collaboration that created BLOOM was bootstrapped and led by Hugging Face, with strong support from GENCI, the IDRIS team at the CNRS, the Megatron team at NVIDIA and the DeepSpeed team at Microsoft, as well as the more than 250 entities (universities, startups and enterprises) supporting the various participants of BigScience, including EleutherAI and the Allen Institute for AI.

“We adopted a data-first approach to make sure the training corpus was aligned with our values. The multidisciplinary and international makeup of BigScience enabled us to critically reflect on every step of the process from multiple vantage points: ethical, legal, environmental, linguistic and technical. That meant we were able to mitigate ethical concerns without compromising on performance or scale,” said Christopher Akiki, research scientist at Leipzig University & BigScience researcher. 

Core values 

Openness: Anyone can view all of BigScience’s meeting notes, discussions, and code. The progress of model training was publicly visible throughout the process. BigScience takes an “open first” approach.

Accessibility: BigScience believes the model should be openly accessible to researchers everywhere and is in the process of developing an easy-to-use API.

Multilinguality: BLOOM is multilingual, in contrast with monolingual models such as LaMDA (Google) and GPT-3 (OpenAI). BLOOM was trained on data from 46 natural languages and 13 programming languages.

BigScience plans to add more languages, shrink the model so it is easier to use without losing performance, and support community efforts to expand it. BLOOM is a living family of models that will grow, not a one-and-done model.

You can test the model here

PS: Do not address BLOOM as if it were an entity. It is not a chatbot but a completion model for webpages, blogs and articles. Instead, mimic the first few words of a webpage similar to the type of content you want to generate. Start a sentence as if you were writing a blog, a webpage, a math post or a coding article, and BLOOM will generate a coherent follow-up.
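The difference between a chat-style request and a completion-style prompt can be sketched in a few lines. The helper name `completion_prompt` and the recipe text are hypothetical, invented purely for illustration; the resulting string would then be passed to whichever text-generation interface you use to run BLOOM.

```python
def completion_prompt(title: str, opening: str) -> str:
    """Format a prompt as the start of a blog post rather than as a question.

    Hypothetical helper for illustration only: it joins a title and the
    first words of the body, leaving the text open for the model to continue.
    """
    return f"{title.strip()}\n\n{opening.strip()}"


# Chat-style request: addresses the model as an entity, which the
# article advises against.
chat_style = "Can you write me a recipe for tomato soup?"

# Completion-style prompt: reads like the beginning of a page that
# already exists and stops mid-sentence.
completion_style = completion_prompt(
    "Easy Weeknight Tomato Soup",
    "This soup takes about thirty minutes. To make it, you will need",
)

print(completion_style)
```

Ending the prompt mid-sentence is a deliberate choice in this sketch: it leaves an obvious continuation point for a completion model to pick up.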

Sri Krishna
Sri Krishna is a technology enthusiast with a professional background in journalism. He believes in writing on subjects that evoke a thought process towards a better world. When not writing, he indulges his passion for automobiles and poetry.
