10 Brilliant Datasets Based on ChatGPT Outputs

Since the release of GPT-4, AI researchers have been using the model's outputs to train their own language models and to build datasets for benchmarking.

By now, no one on the internet has remained untouched by the power of ChatGPT, built on GPT-3.5 and GPT-4, the models driving Silicon Valley's favourite chatbot. With over 100 million users, the OpenAI chatbot has also captivated the research community. Since the release of GPT-4, AI researchers have been using the model's outputs to train their own language models and to build datasets for benchmarking.

Here are 10 datasets built from ChatGPT and GPT-4 outputs, handpicked for GPT-4 enthusiasts!

LIMA

Researchers at Meta AI have unveiled ‘LIMA: Less Is More for Alignment’, a small dataset containing 1,000 examples (available on Hugging Face). The study suggests that LIMA can push forward research on developing proficient LLMs. Notably, the researchers demonstrated that a 65B LLaMA model, fine-tuned only on these 1,000 examples using a supervised approach, achieved competitive performance compared to ChatGPT.
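For readers who want to inspect the examples themselves, below is a minimal sketch of loading LIMA with the Hugging Face datasets library. The repository id GAIR/lima and the conversations field name are assumptions here, so verify them against the dataset card, which may also require accepting a license before access is granted.

```python
# Minimal sketch: pull the LIMA examples from Hugging Face.
# The repo id "GAIR/lima" and the "conversations" field are assumptions;
# check the dataset card (access may require accepting a license).
from datasets import load_dataset

lima = load_dataset("GAIR/lima", split="train")
print(len(lima))                 # roughly 1,000 curated examples
print(lima[0]["conversations"])  # one prompt/response exchange
```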


Find the repository here.

MiniGPT-4

Researchers from Vision-CAIR introduced MiniGPT-4, pre-trained and aligned with Vicuna-7B. The updated model shows a significant reduction in GPU memory consumption, requiring as little as 12 GB. The researchers also propose a novel approach for generating quality image-text pairs using the model itself together with ChatGPT. This methodology yields a compact yet high-quality dataset of roughly 3,500 pairs.
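As a rough illustration of that second curation stage, the snippet below asks ChatGPT to rewrite a raw, model-generated image description into a cleaner caption. The prompt wording and model name are illustrative assumptions, not MiniGPT-4's actual pipeline scripts.

```python
# Illustrative sketch of ChatGPT-assisted caption refinement: a noisy caption
# produced by the vision-language model is rewritten into a polished one.
# The prompt and model name below are assumptions, not MiniGPT-4's own code.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

raw_caption = "a image of dog it is running on grass, the dog brown, sky is blue"

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Fix grammar and remove repetition in the "
                                      "image description. Keep every visual detail."},
        {"role": "user", "content": raw_caption},
    ],
)
refined_caption = resp.choices[0].message.content
print(refined_caption)  # paired with its image to form one of the ~3,500 examples
```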

Find the GitHub repository here.

Dolly 

Dolly, a groundbreaking open-source project by Databricks, shows that a pre-existing, outdated open-source LLM can be transformed into a ChatGPT-like system that follows instructions. This is made possible by a mere 30-minute training process on a single machine, utilising high-quality training data.

Notably, the underlying model in Dolly comprises only 6 billion parameters, far fewer than many comparable models. The researchers later released a successor, Dolly 2.0, which was lauded by the open-source community.

Find the GitHub repository here.

Code Alpaca 

The Code Alpaca project aims to build and distribute an instruction-following model, based on Meta AI’s LLaMA, designed specifically for code generation. The repository builds upon Stanford’s Alpaca, with the only modification being the data used for training; the training method remains the same as the original approach.

The Code Alpaca models are 7B and 13B LLaMA models fine-tuned on a dataset of 20,000 instruction-following examples, generated with techniques inspired by the Self-Instruct paper, with some adaptations for better outputs.
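To give a sense of how such examples are consumed during training, here is a sketch of the Alpaca-style prompt template from the original Stanford recipe, which Code Alpaca reuses. The instruction, input, and output fields are assumed to follow that original format, so verify them against the JSON data shipped in the repository.

```python
# Sketch of the Alpaca-style prompt format assumed to be used by Code Alpaca.
# Field names and template mirror Stanford Alpaca's released recipe.
record = {
    "instruction": "Write a Python function that reverses a string.",
    "input": "",
    "output": "def reverse_string(s):\n    return s[::-1]",
}

def build_prompt(rec: dict) -> str:
    """Format one record into the text the fine-tuning script trains on."""
    if rec["input"]:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{rec['instruction']}\n\n"
            f"### Input:\n{rec['input']}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{rec['instruction']}\n\n### Response:\n"
    )

print(build_prompt(record) + record["output"])  # full training sequence
```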

Find the GitHub repository here. 

Instruction Tuning with GPT-4

The primary objective of GPT-4-LLM is to share data produced by GPT-4 that can be used to build instruction-following LLMs through supervised fine-tuning and reinforcement learning.

The project pushes the boundaries of instruction tuning, as it is one of the first efforts to leverage GPT-4’s capabilities to generate instruction-following data specifically tailored for LLM fine-tuning. Notably, the development holds the potential to advance the state of the art in language model training.

Find the GitHub repository here. 

LLaVA-Instruct-150K

LLaVA Visual Instruct 150K is a collection of multimodal instruction-following data generated by prompting GPT-4. The dataset is curated for visual instruction tuning, to support the development of large multimodal models with advanced vision and language capabilities approaching those of GPT-4. It holds great promise for research at the intersection of vision and language and for building capable multimodal models.

Find the GitHub repository here.

UltraChat

UltraChat offers valuable open-source, large-scale, multi-round dialogue data powered by ChatGPT Turbo APIs. To prioritise privacy protection, the data collection process does not directly use any internet-based prompts. Furthermore, to maintain high standards of generation quality, a dual-API approach is used.

One API operates as the user, generating queries, while the other API assumes the role of generating responses. This approach ensures a reliable dialogue generation process, promoting advancements in conversational AI while also prioritising privacy and data integrity.
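A minimal sketch of that dual-API loop is given below: one call impersonates the user and drafts the next query, a second call answers it, and both turns are appended to a shared history. The model name, prompts, and number of rounds are illustrative assumptions rather than UltraChat's actual generation scripts.

```python
# Sketch of a dual-API dialogue loop in the spirit of UltraChat.
# Model name, prompts, and round count are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def complete(system: str, history: list[dict]) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system}] + history,
    )
    return resp.choices[0].message.content

def flip(history: list[dict]) -> list[dict]:
    # From the user-simulator's point of view, the assistant's replies are the
    # incoming messages and its own past questions are its previous outputs.
    swap = {"user": "assistant", "assistant": "user"}
    return [{"role": swap[m["role"]], "content": m["content"]} for m in history]

topic = "techniques for compressing large language models"  # assumed seed topic
dialogue: list[dict] = []

for _ in range(3):  # three user/assistant rounds
    question = complete(
        f"You are a curious user chatting about {topic}. "
        "Ask one short follow-up question.",
        flip(dialogue),
    )
    dialogue.append({"role": "user", "content": question})

    answer = complete("You are a helpful, knowledgeable assistant.", dialogue)
    dialogue.append({"role": "assistant", "content": answer})

print(dialogue)  # one synthetic multi-round conversation
```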

Find the GitHub repository here. 

GPTeacher

GPTeacher is a compilation of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer. Each dataset serves a specific purpose, and together they form a valuable resource for researchers. With GPT-4’s data-generation prowess, these datasets showcase the model’s versatility and contribute to the landscape of language modelling.

Find the GitHub repository here.

ShareGPT 

A collection of around 70k user-shared conversations, gathered through public APIs, has served as the foundational dataset for Vicuna-13B, an open-source chatbot. The data comes from ShareGPT, an open-source Chrome extension that users relied on to share their ChatGPT conversations before OpenAI introduced a sharing feature in the chatbot itself.

Find the Hugging Face repository here. 

HC3

The HC3 (Human ChatGPT Comparison Corpus) dataset is an extensive collection of approximately 40k questions, each paired with responses from both human experts and ChatGPT.

The primary aim of the dataset is to analyse and compare ChatGPT’s responses with human-generated answers. The questions span a range of subjects, including open-domain, financial, medical, legal, and psychological topics.
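For a quick look at the corpus, the sketch below loads HC3 and prints one question alongside a human answer and a ChatGPT answer. The repository id Hello-SimpleAI/HC3, the "all" configuration, and the field names are assumptions to be checked against the dataset card.

```python
# Minimal sketch: load HC3 and compare a human answer with a ChatGPT answer.
# The repo id, config name, and field names are assumptions; verify them
# against the dataset card on Hugging Face before relying on them.
from datasets import load_dataset

hc3 = load_dataset("Hello-SimpleAI/HC3", "all", split="train")
row = hc3[0]
print("Question:", row["question"])
print("Human:   ", row["human_answers"][0][:200])
print("ChatGPT: ", row["chatgpt_answers"][0][:200])
```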

Find the Hugging Face repository here.

Tasmia Ansari
Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.
