
Open Source RedPajama-Data-v2 with 30 Trillion Tokens is Here

Together.AI says that the dataset offers a solid foundation for advancing open LLMs such as Llama, Mistral, Falcon, and MPT.


RedPajama has unveiled the latest version of its dataset, RedPajama-Data-v2, a colossal repository of web data aimed at advancing language model training. The dataset encompasses a staggering 30 trillion tokens, filtered and deduplicated from a raw pool of over 100 trillion tokens sourced from 84 CommonCrawl data dumps, covering five languages: English, French, Spanish, German, and Italian.

Click here to check out the GitHub repository.
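
For readers who want to explore the data directly, the dataset is also published on the Hugging Face Hub as `togethercomputer/RedPajama-Data-V2`. Below is a minimal sketch of streaming a sample, following the configuration arguments shown on the dataset card at the time of writing; verify against the card, as the interface may change:

```python
# Minimal sketch: stream a slice of RedPajama-Data-v2 from the Hugging Face Hub.
# The repo id and config arguments follow the dataset card; check it before
# relying on this snippet.
from datasets import load_dataset

ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="default",
    partition="head_middle",   # higher-quality bucket with quality signals
    snapshots=["2023-06"],     # one of the 84 CommonCrawl dumps
    languages=["en"],          # en, fr, es, de, it are available
    streaming=True,            # avoid downloading the full corpus
)

# Inspect the first document: raw text plus pre-computed quality signals.
doc = next(iter(ds["train"]))
print(doc.keys())
```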

RedPajama-Data-v2 also ships with 40+ pre-computed data quality annotations, giving users valuable tools for further filtering and weighting the data.

Over the past six months, RedPajama’s previous release, RedPajama-1T, has had a profound impact on the language model community. The 5TB dataset of high-quality English tokens has been downloaded by more than 190,000 users, who have put it to work in creative ways.

RedPajama-1T served as a stepping stone towards the goal of creating open datasets for language model training, but RedPajama-Data-v2 takes this ambition to new heights with its mammoth 30-trillion-token web dataset.

RedPajama-Data-v2 stands out as the largest public dataset specifically crafted for LLM training. Most notably, it introduces 40+ pre-computed quality annotations, empowering the community to filter and re-weight the data as needed. The release encompasses over 100 billion text documents derived from 84 CommonCrawl data dumps, totalling more than 100 trillion raw tokens.
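
Because the annotations ship alongside the raw text, users can impose their own quality cutoffs rather than accepting a fixed filtered corpus. A minimal sketch, reusing the streamed dataset `ds` from the earlier snippet; the signal names come from the published signal list, while the thresholds are arbitrary illustrations, not recommended values:

```python
import json

def passes_quality(doc, max_ppl=300.0, min_words=50):
    """Keep documents whose quality signals clear illustrative thresholds.

    `ccnet_perplexity` and `rps_doc_word_count` are names from the published
    RedPajama-v2 signal list; the thresholds are examples only.
    """
    signals = json.loads(doc["quality_signals"])
    # Each signal is a list of (start, end, value) spans over the document;
    # document-level signals carry a single span.
    ppl = signals["ccnet_perplexity"][0][2]
    n_words = signals["rps_doc_word_count"][0][2]
    return (ppl is not None and ppl < max_ppl
            and n_words is not None and n_words >= min_words)

# Lazily filter the stream from the previous snippet.
filtered = (doc for doc in ds["train"] if passes_quality(doc))
print(next(filtered)["raw_content"][:200])
```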

Together.AI says that the dataset offers a solid foundation for advancing state-of-the-art open LLMs such as Llama, Mistral, Falcon, MPT, and the RedPajama models. 

RedPajama-Data-v2 primarily focuses on CommonCrawl data, while sources such as Wikipedia are available in RedPajama-Data-v1. To further enrich the dataset, users are encouraged to integrate The Stack (by BigCode) for code-related content and S2ORC (by AI2) for scientific articles. RedPajama-Data-v2 is built from publicly available web data and comprises three core elements: plain-text source data, 40+ quality annotations, and deduplication clusters.
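
One way to combine such sources is to interleave them at load time with the `datasets` library. A rough sketch under stated assumptions: the repo ids, column names, and mixing weights below are illustrative, and The Stack may require accepting its terms on the Hub:

```python
from datasets import load_dataset, interleave_datasets

# Repo ids and column names are illustrative; check each dataset card.
web = load_dataset(
    "togethercomputer/RedPajama-Data-V2", name="default",
    partition="head_middle", snapshots=["2023-06"],
    languages=["en"], streaming=True,
)["train"].select_columns(["raw_content"]).rename_column("raw_content", "text")

code = load_dataset(
    "bigcode/the-stack-dedup", streaming=True,  # gated: accept terms on the Hub
)["train"].select_columns(["content"]).rename_column("content", "text")

# Sample roughly 90% web text and 10% code; the weights are arbitrary examples.
mixed = interleave_datasets([web, code], probabilities=[0.9, 0.1], seed=42)
print(next(iter(mixed))["text"][:200])
```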

The creation of the source data begins with each CommonCrawl snapshot passing through the CCNet pipeline, chosen for its light-touch processing, which preserves the integrity of the raw data. This step yields over 100 billion individual text documents, in keeping with the overarching principle of preserving as much raw information as possible for downstream filtering.
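
Deduplication at this scale is typically done with MinHash-style fuzzy matching over document shingles. As a rough pure-Python illustration of the underlying idea (a sketch of the general technique, not Together’s exact pipeline):

```python
import hashlib

def shingles(text, n=5):
    """Yield word n-grams ("shingles") from a document."""
    words = text.split()
    for i in range(max(len(words) - n + 1, 1)):
        yield " ".join(words[i:i + n])

def minhash_signature(text, num_perm=64):
    """Per-seed minimum hash over the document's shingles."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big")
            for s in shingles(text)
        ))
    return sig

def jaccard_estimate(sig_a, sig_b):
    """Fraction of matching slots approximates the true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Near-duplicate documents produce near-identical signatures.
a = minhash_signature("the quick brown fox jumps over the lazy dog " * 20)
b = minhash_signature("the quick brown fox leaps over the lazy dog " * 20)
print(f"estimated similarity: {jaccard_estimate(a, b):.2f}")
```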


Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.