Hugging Face has released Cosmopedia v0.1, the largest open synthetic dataset to date, with over 30 million samples generated by Mixtral-8x7B-Instruct-v0.1. It contains textbooks, blog posts, stories, and WikiHow articles, totaling 25 billion tokens.
The dataset aims to cover world knowledge by mapping the information found in web datasets like RefinedWeb and RedPajama. Each sample records the prompt, the synthetic content generated from it, the seed data source, the token length, the text format (e.g., textbook, blog post), and the target audience. The release also documents the splits, their distributions, and the creation methodology, giving researchers insight into the dataset’s structure and potential applications.
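The per-sample fields listed above can be pictured as a small record type. The following is a minimal sketch, not the dataset's official schema: the field names (`prompt`, `text`, `seed_data`, `token_length`, `format`, `audience`) are assumed from the description, and the `filter_records` helper is hypothetical.

```python
from typing import Iterable, Iterator, Optional, TypedDict


class CosmopediaRecord(TypedDict):
    """One sample; field names are assumptions based on the description above."""
    prompt: str        # prompt sent to the generator model
    text: str          # synthetic content produced from the prompt
    seed_data: str     # source of the seed sample
    token_length: int  # length of the generated text in tokens
    format: str        # e.g. "textbook", "blog post"
    audience: str      # intended readership


def filter_records(
    records: Iterable[CosmopediaRecord],
    fmt: Optional[str] = None,
    audience: Optional[str] = None,
    max_tokens: Optional[int] = None,
) -> Iterator[CosmopediaRecord]:
    """Yield records matching every criterion that is not None (hypothetical helper)."""
    for record in records:
        if fmt is not None and record["format"] != fmt:
            continue
        if audience is not None and record["audience"] != audience:
            continue
        if max_tokens is not None and record["token_length"] > max_tokens:
            continue
        yield record
```

Filtering on format and audience in this way is one plausible first step when selecting, say, only textbook-style samples for a pretraining mix.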
Inspired by the Phi-1.5 work, this initial version of Cosmopedia provides a foundation for research on synthetic data. It covers a broad range of topics and leaves room for improvement in subsequent iterations.
The dataset is structured into eight splits, each built from a different source of seed samples. Two of them, web_samples_v1 and web_samples_v2, make up approximately 75% of the dataset and are seeded from an internal web dataset akin to RefinedWeb.
The Stanford split uses course outlines scraped from stanford.edu, while the stories split contains narratives generated from seed samples taken from UltraChat and OpenHermes2.5. The WikiHow, OpenStax, KhanAcademy, and AutoMathText splits use prompts built from their respective sources.
Each split can be loaded individually with the Hugging Face datasets library. A smaller subset, Cosmopedia-100k, is also available for those who want a reduced dataset. In addition, a language model, Cosmo-1B, has been trained on Cosmopedia as a demonstration of the dataset in use.
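Loading one split might look like the sketch below. It relies on `load_dataset` from the `datasets` library; the repository ID `HuggingFaceTB/cosmopedia` and the exact configuration names are assumptions that should be checked against the dataset page, and the two helper functions are hypothetical.

```python
# Assumed Hub repository ID; verify against the dataset page.
COSMOPEDIA_REPO = "HuggingFaceTB/cosmopedia"

# Assumed configuration names for the eight splits described in the text.
SPLITS = (
    "web_samples_v1",
    "web_samples_v2",
    "stanford",
    "stories",
    "wikihow",
    "openstax",
    "khanacademy",
    "auto_math_text",
)


def load_kwargs(split_name: str, streaming: bool = True) -> dict:
    """Build keyword arguments for datasets.load_dataset (hypothetical helper)."""
    if split_name not in SPLITS:
        raise ValueError(f"unknown split {split_name!r}; expected one of {SPLITS}")
    return {
        "path": COSMOPEDIA_REPO,
        "name": split_name,      # dataset configuration, one per split
        "split": "train",
        "streaming": streaming,  # stream to avoid downloading the full split
    }


def load_cosmopedia_split(split_name: str, streaming: bool = True):
    """Load one split; requires `pip install datasets` and network access."""
    from datasets import load_dataset  # imported lazily so the helpers work offline

    return load_dataset(**load_kwargs(split_name, streaming))
```

Calling `load_cosmopedia_split("stories")` and iterating over the result would yield samples one at a time. Streaming is the default here because the web splits are large.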
The dataset creation process involves clustering web samples by topic, iteratively refining prompts, and addressing contamination issues. The objective is to maximize diversity by tailoring prompt styles and audiences, which significantly reduces duplicate content.
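The idea of varying style and audience for the same seed can be illustrated with a toy sketch. This is not the pipeline used to build Cosmopedia: the audience list, format list, prompt template, and both helper functions are hypothetical, and the duplicate check is a plain exact match on normalized text rather than whatever method the authors used.

```python
import hashlib
from itertools import product

# Illustrative values only; the real prompt styles and audiences may differ.
AUDIENCES = ("young children", "high school students", "college students", "researchers")
FORMATS = ("textbook", "blog post", "story", "WikiHow article")


def build_prompts(seed_extract: str, audiences=AUDIENCES, formats=FORMATS) -> list[str]:
    """Create one prompt per (format, audience) pair for a single seed extract."""
    return [
        f"Write a piece in the style of a {fmt}, aimed at {audience}, "
        f"based on the following extract:\n{seed_extract}"
        for fmt, audience in product(formats, audiences)
    ]


def dedupe_exact(texts: list[str]) -> list[str]:
    """Drop texts that are identical after lowercasing and whitespace normalization."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in texts:
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)  # keep the first occurrence unchanged
    return unique
```

With four formats and four audiences, a single seed extract yields sixteen distinct prompts, which is the intuition behind using style and audience to reduce duplicate generations.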