
Open Source RedPajama-Data-v2 with 30 Trillion Tokens is Here

Together.AI says that the dataset offers a solid foundation for advancing open LLMs such as Llama, Mistral, Falcon, and MPT.


RedPajama has unveiled the latest version of its dataset, RedPajama-Data-v2, a colossal repository of web data aimed at advancing language model training. The dataset encompasses a staggering 30 trillion tokens, filtered and deduplicated from a raw pool of over 100 trillion tokens sourced from 84 CommonCrawl data dumps, covering five languages: English, French, Spanish, German, and Italian.

Click here to check out the GitHub repository.
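
For readers who want to explore the data directly, the dataset is also published on the Hugging Face Hub as `togethercomputer/RedPajama-Data-V2`. Below is a minimal sketch of streaming a sample, following the configuration arguments shown on the dataset card at the time of writing; verify against the card, as the interface may change:

```python
# Minimal sketch: stream a slice of RedPajama-Data-v2 from the Hugging Face Hub.
# The repo id and config arguments follow the dataset card; check it before
# relying on this snippet.
from datasets import load_dataset

ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="default",
    partition="head_middle",   # higher-quality bucket with quality signals
    snapshots=["2023-06"],     # one of the 84 CommonCrawl dumps
    languages=["en"],          # en, fr, es, de, it are available
    streaming=True,            # avoid downloading the full corpus
)

# Inspect the first document: raw text plus pre-computed quality signals.
doc = next(iter(ds["train"]))
print(doc.keys())
```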

RedPajama-Data-v2 also ships with 40+ pre-computed data quality annotations, giving users valuable tools for further filtering and weighting the data.

Over the past six months, RedPajama’s previous release, RedPajama-1T, has had a profound impact on the language model community. The 5TB dataset of high-quality English tokens has been downloaded by more than 190,000 users, who have put it to work in creative ways.

RedPajama-1T served as a stepping stone towards the goal of creating open datasets for language model training, but RedPajama-Data-v2 takes this ambition to new heights with its mammoth 30-trillion-token web dataset.

RedPajama-Data-v2 stands out as the largest public dataset specifically crafted for LLM training. Most notably, it introduces 40+ pre-computed quality annotations, empowering the community to filter and re-weight the data as needed. The release encompasses over 100 billion text documents derived from 84 CommonCrawl data dumps, totalling more than 100 trillion raw tokens.
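
Because the annotations ship alongside the raw text, users can impose their own quality cutoffs rather than accepting a fixed filtered corpus. A minimal sketch, reusing the streamed dataset `ds` from the earlier snippet; the signal names come from the published signal list, while the thresholds are arbitrary illustrations, not recommended values:

```python
import json

def passes_quality(doc, max_ppl=300.0, min_words=50):
    """Keep documents whose quality signals clear illustrative thresholds.

    `ccnet_perplexity` and `rps_doc_word_count` are names from the published
    RedPajama-v2 signal list; the thresholds are examples only.
    """
    signals = json.loads(doc["quality_signals"])
    # Each signal is a list of (start, end, value) spans over the document;
    # document-level signals carry a single span.
    ppl = signals["ccnet_perplexity"][0][2]
    n_words = signals["rps_doc_word_count"][0][2]
    return (ppl is not None and ppl < max_ppl
            and n_words is not None and n_words >= min_words)

# Lazily filter the stream from the previous snippet.
filtered = (doc for doc in ds["train"] if passes_quality(doc))
print(next(filtered)["raw_content"][:200])
```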

Together.AI says that the dataset offers a solid foundation for advancing state-of-the-art open LLMs such as Llama, Mistral, Falcon, MPT, and the RedPajama models. 

RedPajama-Data-v2 primarily focuses on CommonCrawl data, while sources such as Wikipedia are available in RedPajama-Data-v1. To further enrich the dataset, users are encouraged to integrate The Stack (by BigCode) for code-related content and S2ORC (by AI2) for scientific articles. RedPajama-Data-v2 is built from publicly available web data and comprises three core elements: plain-text source data, 40+ quality annotations, and deduplication clusters.
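
One way to combine such sources is to interleave them at load time with the `datasets` library. A rough sketch under stated assumptions: the repo ids, column names, and mixing weights below are illustrative, and The Stack may require accepting its terms on the Hub:

```python
from datasets import load_dataset, interleave_datasets

# Repo ids and column names are illustrative; check each dataset card.
web = load_dataset(
    "togethercomputer/RedPajama-Data-V2", name="default",
    partition="head_middle", snapshots=["2023-06"],
    languages=["en"], streaming=True,
)["train"].select_columns(["raw_content"]).rename_column("raw_content", "text")

code = load_dataset(
    "bigcode/the-stack-dedup", streaming=True,  # gated: accept terms on the Hub
)["train"].select_columns(["content"]).rename_column("content", "text")

# Sample roughly 90% web text and 10% code; the weights are arbitrary examples.
mixed = interleave_datasets([web, code], probabilities=[0.9, 0.1], seed=42)
print(next(iter(mixed))["text"][:200])
```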

The creation of the source data begins with each CommonCrawl snapshot passing through the CCNet pipeline, chosen for its light-touch processing, which preserves the integrity of the raw data. This step yields over 100 billion individual text documents, in keeping with the overarching principle of preserving as much raw information as possible for downstream filtering.
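
Deduplication at this scale is typically done with MinHash-style fuzzy matching over document shingles. As a rough pure-Python illustration of the underlying idea (a sketch of the general technique, not Together’s exact pipeline):

```python
import hashlib

def shingles(text, n=5):
    """Yield word n-grams ("shingles") from a document."""
    words = text.split()
    for i in range(max(len(words) - n + 1, 1)):
        yield " ".join(words[i:i + n])

def minhash_signature(text, num_perm=64):
    """Per-seed minimum hash over the document's shingles."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big")
            for s in shingles(text)
        ))
    return sig

def jaccard_estimate(sig_a, sig_b):
    """Fraction of matching slots approximates the true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Near-duplicate documents produce near-identical signatures.
a = minhash_signature("the quick brown fox jumps over the lazy dog " * 20)
b = minhash_signature("the quick brown fox leaps over the lazy dog " * 20)
print(f"estimated similarity: {jaccard_estimate(a, b):.2f}")
```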


Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.