By now, no one on the internet has remained untouched by the power of ChatGPT (based on GPT-3.5 and GPT-4), the driving force behind Silicon Valley’s favourite chatbot. With over 100 million users, the OpenAI model has also captivated the research community. Since the release of GPT-4, AI researchers have been using the model’s outputs to train their own language models and to build datasets for benchmarking.
Here are 10 datasets built from GPT-4 outputs, handpicked for GPT-4 enthusiasts!
Researchers at Meta AI have unveiled ‘LIMA: Less Is More for Alignment’, a small dataset containing 1,000 examples (available on Hugging Face). The study suggests that LIMA can push forward research into developing proficient LLMs. Notably, the researchers demonstrated that a 65B LLaMA model, fine-tuned only on these 1,000 examples using a supervised approach, achieved competitive performance compared to ChatGPT.
Researchers from Vision-CAIR introduced MiniGPT-4, pre-trained and aligned with Vicuna-7B. The updated model shows a significant reduction in GPU memory consumption — as low as 12GB. The researchers propose a novel approach for generating quality image-text pairs using the model itself together with ChatGPT. This methodology allows for the creation of a compact yet superior dataset, consisting of a total of 3,500 pairs.
Dolly, a groundbreaking open-source project by Databricks, demonstrates that a pre-existing, dated open-source LLM can be transformed into a ChatGPT-like instruction-following system. This is made possible by a mere 30-minute training process on a single machine, utilising high-quality training data.
Notably, the underlying model in Dolly comprises only 6 billion parameters, compared to other models with far larger parameter counts. The researchers also released a successor to the model, Dolly 2.0, which was lauded by the open-source community.
The Code Alpaca project aims to construct and distribute an instruction-following model based on Meta AI’s LLaMA, designed specifically for code generation. The repository is built upon Stanford’s Alpaca, with the only modification being the data used for training; the training method remains the same as the original approach.
For refining the Code Alpaca models, 7B and 13B LLaMA models were used. These models were then fine-tuned on a dataset of 20,000 instruction-following examples, generated through techniques inspired by the Self-Instruct paper, with certain adaptations for better outputs.
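Alpaca-style instruction-following examples are typically stored as JSON records with `instruction`, `input`, and `output` fields, which are then assembled into a fine-tuning prompt. The sketch below illustrates that record format; the template wording paraphrases the Stanford Alpaca release and the sample record is hypothetical.

```python
import json

# Hypothetical Alpaca-style training record: "instruction" describes the
# task, "input" gives optional context, "output" is the target completion.
example = {
    "instruction": "Write a Python function that reverses a string.",
    "input": "",
    "output": "def reverse_string(s):\n    return s[::-1]",
}

# Prompt templates paraphrasing the Alpaca format: one for records with an
# input field, one for records without.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(record):
    """Assemble the fine-tuning prompt; the training target is record['output']."""
    template = PROMPT_WITH_INPUT if record["input"] else PROMPT_NO_INPUT
    return template.format(**record)

prompt = build_prompt(example)
print(json.dumps(example, indent=2))
print(prompt)
```

During supervised fine-tuning, the model is trained to produce `output` given the assembled prompt, which is how a base LLaMA checkpoint learns to follow instructions.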
Instruction Tuning with GPT-4
The primary objective of GPT-4-LLM is to facilitate the sharing of data produced by GPT-4, which can be used for building instruction-following LLMs through supervised and reinforcement learning techniques.
This project pushes the boundaries of instruction tuning in the LLM world, as it is one of the first efforts to leverage OpenAI’s GPT-4 for generating instruction-following data specifically tailored to LLM fine-tuning. Notably, the development holds the potential to advance the state of the art in language model training.
LLaVA Visual Instruct 150K is a collection of multimodal instruction-following data, generated using GPT-4. The dataset is curated for visual instruction tuning, to enhance the development of large multimodal models with advanced vision and language capabilities, geared towards the GPT-4 vision/language framework. The dataset holds great promise for research at the intersection of vision and language for creating capable multimodal models.
UltraChat offers valuable open-source, large-scale, multi-round dialogue data powered by ChatGPT Turbo APIs. To prioritise privacy protection, the data collection process does not directly use any internet-based prompts. Furthermore, to maintain high standards of generation quality, a dual-API approach is used.
One API operates as the user, generating queries, while the other API assumes the role of generating responses. This approach ensures a reliable dialogue generation process, promoting advancements in conversational AI while also prioritising privacy and data integrity.
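The dual-API loop described above can be sketched as follows. This is a minimal simulation under stated assumptions: the two stub functions are hypothetical stand-ins for separate ChatGPT API sessions, and in the real pipeline each would be a chat-completion call with its own system prompt.

```python
# Minimal sketch of UltraChat's dual-API dialogue generation. The stubs
# below are hypothetical placeholders, not real API calls.

def user_api(history):
    # Stand-in for the API session playing the *user*: it invents an
    # opening query, then follow-up questions based on the history.
    if not history:
        return "What is instruction tuning?"
    return "Can you give a concrete example?"

def assistant_api(history):
    # Stand-in for the API session playing the *assistant*, which
    # answers the most recent user turn.
    return f"(model answer to: {history[-1]['content']})"

def generate_dialogue(num_rounds=3):
    """Alternate the two roles to build one multi-round dialogue."""
    history = []
    for _ in range(num_rounds):
        history.append({"role": "user", "content": user_api(history)})
        history.append({"role": "assistant", "content": assistant_api(history)})
    return history

dialogue = generate_dialogue()
```

Because the user-role API never sees internet-scraped prompts, only its own generated queries and the assistant’s replies, this design keeps the collected dialogues self-contained.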
GPTeacher is a compilation of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer. Each dataset serves a specific purpose, and together they form a valuable resource for researchers. With GPT-4’s data generation prowess, these datasets showcase the model’s versatility and contribute to the landscape of language modelling.
A collection of 70k user-shared conversations gathered through public APIs has served as the foundational dataset for Vicuna-13B, an open-source chatbot. The dataset comes from ShareGPT, an open-source Chrome extension that users relied on to share their ChatGPT conversations before OpenAI introduced a sharing feature in the chatbot.
The HC3 (Human ChatGPT Comparison Corpus) dataset is an extensive collection of approximately 40k questions paired with corresponding responses from both human experts and ChatGPT.
The primary aim of this dataset is to enable analysis and comparison of ChatGPT’s responses against human-generated answers. The questions span subjects including open-domain, financial, medical, legal, and psychological areas.