BLOOM is the first multilingual Large Language Model (LLM) trained in complete transparency by the largest collaboration of AI researchers ever involved in a single research project.
With its 176 billion parameters (larger than OpenAI’s GPT-3 and Meta AI’s OPT), BLOOM can generate text in 46 natural languages and dialects and 13 programming languages. It was trained on 1.6TB of text data, the equivalent of 320 times the complete works of Shakespeare.
Researchers can now download, run and study BLOOM. Any individual or institution who agrees to the terms of the model’s Responsible AI License can use the model on a local machine or on the cloud.
“BLOOM is a demonstration that the most powerful AI models can be trained and released by the broader research community with accountability and in an actual open way, in contrast to the typical secrecy of industrial AI research labs,” said Teven Le Scao, BLOOM training co-lead.
Although it was never trained on any of those specific tasks, BLOOM can be asked to produce summaries or translations of text in 46 natural languages, output code from instructions in 13 programming languages, and follow prompts to perform original tasks such as writing recipes, extracting information from a news article, or composing sentences using a newly-defined invented word.
“BigScience is uniquely people-first and participation-driven, bringing together perspectives from hundreds of multidisciplinary researchers around the world. This was the best way to promote values of inclusion and responsibility in concert with the people actually using this technology,” said Yacine Jernite, BigScience Data Chair.
Check out the model’s generation abilities here.
Ushering in a new era of open-source LLMs
The BigScience research project started in early 2021 and was a collaborative effort involving more than 1,000 researchers from 60+ countries and 250+ institutions. The BLOOM model was trained on the Jean Zay supercomputer, located south of Paris, France.
“Large ML models have changed the world of AI research over the last two years but the huge compute cost necessary to train them resulted in very few teams actually having the ability to train and research them,” said Thomas Wolf, BigScience co-lead and Hugging Face co-founder.
The BigScience collaboration that created BLOOM was bootstrapped and led by Hugging Face, with strong support from GENCI, the IDRIS team at the CNRS, the Megatron team at NVIDIA, and the DeepSpeed team at Microsoft, as well as the more than 250 entities (universities, startups and enterprises) supporting the various participants of BigScience, including EleutherAI and the Allen Institute for AI.
“We adopted a data-first approach to make sure the training corpus was aligned with our values. The multidisciplinary and international makeup of BigScience enabled us to critically reflect on every step of the process from multiple vantage points: ethical, legal, environmental, linguistic and technical. That meant we were able to mitigate ethical concerns without compromising on performance or scale,” said Christopher Akiki, research scientist at Leipzig University & BigScience researcher.
Openness: Anyone can publicly view all of BigScience’s meeting notes, discussions, and code. The progress of model training was publicly visible throughout the process. BigScience takes an “open first” approach.
Accessibility: BigScience believes the model should be openly accessible to researchers everywhere and is in the process of developing an easy-to-use API.
Multilinguality: BLOOM is multilingual, in contrast with monolingual models such as LaMDA (Google) and GPT-3 (OpenAI). BLOOM was trained on data from 46 natural languages and 13 programming languages.
BigScience is slated to add more languages, make the model smaller so it’s easier to use at the same level of performance, and support community efforts to expand it. BLOOM is a living family of models that will grow, not a one-and-done model.
You can test the model here.
PS: Do not talk to BLOOM as an entity. It’s not a chatbot but a webpage/blog/article completion model. Instead, mimic a few words of a webpage similar to the type of content you want to generate. Start a sentence as if YOU were writing a blog, webpage, math post, or coding article, and BLOOM will generate a coherent follow-up.
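The prompting advice above can be sketched in a few lines of Python. This is a minimal illustration, not an official BigScience recipe: the helper names `build_completion_prompt` and `generate` are hypothetical, and the checkpoint name `bigscience/bloom-560m` (a small BLOOM variant on the Hugging Face Hub) and the use of the Transformers `pipeline` API are assumptions not mentioned in this article. The idea is simply to hand the model the opening of a page rather than a question addressed to it.

```python
def build_completion_prompt(title: str, opening: str) -> str:
    """Frame a request as the start of a web article for a completion model."""
    # A title line followed by the first words of the body, written as if
    # the author of the page had just started typing.
    return f"{title.strip()}\n\n{opening.strip()}"


def generate(prompt: str,
             model_name: str = "bigscience/bloom-560m",  # assumed small checkpoint
             max_new_tokens: int = 60) -> str:
    """Continue the prompt with a BLOOM checkpoint (downloads weights on first use)."""
    # Imported lazily so the prompt helper works without the library installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_name)
    result = generator(prompt, max_new_tokens=max_new_tokens, do_sample=True)
    # The pipeline returns a list of dicts; the text includes the prompt itself.
    return result[0]["generated_text"]


# Written as the start of a blog post, not as a question to a chatbot.
prompt = build_completion_prompt(
    "A Beginner's Guide to Baking Sourdough Bread",
    "Baking sourdough at home is easier than it looks. The first step is",
)
print(prompt)
```

Calling `generate(prompt)` would then ask the model to continue the article from where the opening sentence stops. A prompt such as "Hi BLOOM, can you tell me how to bake bread?" would instead be continued as if it were a line from some webpage, which is usually not what the user wants.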