
Hugging Face Introduces Cosmopedia, the Largest Open Synthetic Dataset 

The dataset consists of over 30 million samples and 25 billion tokens, generated by Mixtral.


Hugging Face has released Cosmopedia v0.1, the largest open synthetic dataset, consisting of over 30 million samples generated by Mixtral 8x7B. It contains various types of content, such as textbooks, blog posts, stories, and WikiHow articles, totalling 25 billion tokens. 

The dataset aims to compile global knowledge by mapping information from web datasets like RefinedWeb and RedPajama. Each sample includes the prompt, the synthetic content, the seed data source, the token length, the text format (e.g., textbook, blog post), and the target audience. A comprehensive breakdown of splits, distributions, and the creation methodology is also provided, offering researchers insight into the dataset’s structure and potential applications.

Inspired by Phi-1.5’s work, this initial version of Cosmopedia provides a foundation for research in the synthetic data domain. It serves as a comprehensive resource for diverse topics, with further enhancement planned in subsequent iterations.

The dataset is structured into eight splits, each derived from distinct seed samples. These include web_samples_v1 and web_samples_v2, which together constitute approximately 75% of the dataset and are sourced from an internal web dataset akin to RefinedWeb. 

The stanford split utilizes course outlines scraped from stanford.edu, while the stories split incorporates narratives generated with UltraChat and OpenHermes2.5. Additionally, the WikiHow, OpenStax, KhanAcademy, and AutoMathText splits involve prompts related to their respective sources.

To facilitate access, users can load specific splits with a short code snippet, as sketched below. A smaller subset, Cosmopedia-100k, is also available for those seeking a reduced dataset. Furthermore, a 1B-parameter model, Cosmo-1B, has been trained on Cosmopedia, demonstrating the dataset's scale and versatility.
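A minimal sketch of such a loading snippet using the Hugging Face datasets library is shown below; it assumes the Hub repository IDs HuggingFaceTB/cosmopedia and HuggingFaceTB/cosmopedia-100k and the "stories" configuration name.

```python
from datasets import load_dataset

# Load a single Cosmopedia split (here "stories") from the Hugging Face Hub.
# Repository ID and configuration name assumed from the dataset card.
ds = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train")

# Inspect one record: fields cover the generation prompt, the synthetic text,
# the seed data source, the token length, the format, and the target audience.
sample = ds[0]
print(sample.keys())
print(sample["text"][:500])

# The reduced 100k-sample subset is published as a separate dataset.
ds_small = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")
```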

The dataset creation process involves a topic clustering method for web samples, refining prompts iteratively, and addressing contamination issues. The objective is to maximize diversity by tailoring prompt styles and audiences, significantly reducing duplicate content.
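The article does not publish the clustering code; the following is a hypothetical illustration of how web seed samples could be grouped into topics with sentence embeddings and k-means, not Hugging Face's actual pipeline. The model name all-MiniLM-L6-v2 and the toy samples are assumptions for the sketch.

```python
# Hypothetical sketch of topic clustering for web seed samples;
# not the actual Cosmopedia pipeline, only an illustration of the idea.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

web_samples = [
    "How to bake sourdough bread at home",
    "An introduction to linear regression",
    "Tips for maintaining a vegetable garden",
    "Gradient descent explained with examples",
]

# Embed each sample, then group the samples into topic clusters.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(web_samples)
labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(embeddings)

# Each cluster can then seed prompts with a distinct style and audience,
# which is the diversity lever the article describes.
for text, label in zip(web_samples, labels):
    print(label, text)
```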


Shritama Saha

Shritama (she/her) is a technology journalist at AIM who is passionate about exploring the influence of AI on different domains, including fashion, healthcare and banking.