Top Used Datasets for Text to Image Synthesis Models

Abundant image datasets are among the most crucial ingredients for training and testing computer-vision-based image synthesis models.

Text-to-image models use computer vision algorithms to analyse, label, and interpret images. Image generation looks set to be a defining technology of the coming years, and the broader computer vision field has already produced breakthroughs such as facial recognition and autonomous vehicles.

When it comes to training and testing these models, datasets play a huge role in the comprehensiveness, accuracy, and variety of the generated images. Here is a list of the datasets most commonly used by image synthesis models, which you can also use to build your own models, just like the pros!


MS-COCO

Used by DALL-E for testing, MS-COCO is a large-scale object detection, captioning, and segmentation dataset that consists of around 120,000 images across 91 object categories. Each image comes with five different captions, which makes it an ideal dataset for testing image synthesis models.

Click here to go to the GitHub repository.
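In the COCO captions annotation file, images and captions live in separate JSON arrays linked by an `image_id`, so a common first step is grouping the five captions under each image. A minimal sketch of that grouping, using a tiny invented snippet that mirrors the real schema rather than the actual (large) annotation file:

```python
import json
from collections import defaultdict

# Toy snippet mirroring the MS-COCO captions schema; the real file is
# e.g. annotations/captions_train2014.json. Image id and captions invented.
coco_json = json.loads("""
{
  "images": [{"id": 42, "file_name": "COCO_train2014_000000000042.jpg"}],
  "annotations": [
    {"image_id": 42, "caption": "A dog runs across a grassy field."},
    {"image_id": 42, "caption": "A brown dog playing outdoors."}
  ]
}
""")

def captions_by_image(coco):
    """Group all captions under their image_id, as text-to-image pipelines do."""
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[ann["image_id"]].append(ann["caption"])
    return dict(grouped)

pairs = captions_by_image(coco_json)
print(len(pairs[42]))  # 2
```

The same grouping works unchanged on the full annotation file, just with five captions per image instead of the two shown here.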


LAION-5B

An AI training dataset containing more than five billion image-text pairs, LAION-5B is 14x larger than its predecessor, LAION-400M. Released by the Large-scale Artificial Intelligence Open Network (LAION), it is one of the largest image-text datasets available free for everyone.

Click here for the dataset.
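LAION-5B is distributed as metadata (image URL, caption, and a CLIP image-text similarity score) rather than raw images, and a routine preprocessing step is dropping pairs whose similarity falls below a threshold. A minimal sketch of that filtering, with invented rows and an assumed threshold value:

```python
# LAION-style metadata rows: URL, caption, CLIP similarity score.
# The rows and the 0.28 threshold are illustrative, not from the release.
rows = [
    {"url": "https://example.com/a.jpg", "caption": "a red bicycle", "similarity": 0.34},
    {"url": "https://example.com/b.jpg", "caption": "IMG_0001", "similarity": 0.12},
]

def filter_pairs(rows, threshold=0.28):
    """Keep only image-text pairs whose similarity clears the threshold."""
    return [r for r in rows if r["similarity"] >= threshold]

kept = filter_pairs(rows)
print(len(kept))  # 1
```

Filtering like this trades dataset size for caption quality, which is why LAION publishes the similarity score alongside each pair.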

Conceptual Captions 12M

CC12M is a dataset of 12 million text-image pairs and is one of the datasets used to train OpenAI's DALL-E 2. It builds on Google's previous dataset of 3 million text-image pairs, CC3M, and has been used for both pre-training and end-to-end training of image models.

Click here to check out the 2.5GB dataset.
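CC12M ships as a tab-separated file with one image URL and one caption per line, so loading it needs nothing beyond a TSV reader. A small sketch with two invented rows in place of the real file:

```python
import csv
import io

# CC12M format: one "url<TAB>caption" per line. These two rows are invented.
tsv_text = (
    "https://example.com/cat.jpg\ta cat sitting on a windowsill\n"
    "https://example.com/car.jpg\ta vintage car parked on a street\n"
)

def load_cc12m(fileobj):
    """Yield (url, caption) pairs from a CC12M-style TSV stream."""
    reader = csv.reader(fileobj, delimiter="\t")
    for url, caption in reader:
        yield url, caption

pairs = list(load_cc12m(io.StringIO(tsv_text)))
print(pairs[0][1])  # a cat sitting on a windowsill
```

For the real 2.5GB file you would open it with `open(path, newline="")` and stream it the same way instead of reading it into memory.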

Filtered YFCC100M

One of the biggest datasets for multimedia research, YFCC100M consists of 100 million media objects: 99.2 million images and 0.8 million videos. The photos carry Creative Commons licences, and each comes with identifying metadata such as the Flickr identifier and owner name, covering uploads from Flickr's inception in 2004 until 2014.

Click here for more information.


ImageNet

Google's Language-Image Mixture of Experts (LIMoE), a 5.6-billion-parameter model, was evaluated with zero-shot classification on ImageNet, a database organised according to the WordNet hierarchy. Currently covering only nouns, the hierarchy depicts each node with hundreds to thousands of images.

Click here and visit the website.
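Because ImageNet classes are WordNet synsets, an image logically belongs to every ancestor of its synset: the images under "dog" include those of all dog breeds below it. A toy sketch of that roll-up over an invented two-level hierarchy with made-up counts and synset IDs:

```python
# Toy WordNet-style hierarchy: synset id -> child synset ids.
# IDs and image counts are invented for illustration.
children = {
    "n_dog": ["n_toy_dog", "n_hunting_dog"],
    "n_toy_dog": [],
    "n_hunting_dog": [],
}
direct_images = {"n_dog": 0, "n_toy_dog": 1200, "n_hunting_dog": 900}

def total_images(synset):
    """Count a node's images plus everything below it in the hierarchy."""
    return direct_images[synset] + sum(total_images(c) for c in children[synset])

print(total_images("n_dog"))  # 2100
```

This hierarchical roll-up is why higher nodes in ImageNet appear to hold thousands of images even when few images are attached to them directly.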


Multi-Modal CelebA-HQ

A large-scale face image dataset built for text-guided image manipulation, face generation and editing, and visual question answering (VQA). The dataset has 30,000 images in total, split into 24,000 for training and 6,000 for testing, with ten captions per image, making it a broad dataset.

Click here for the image dataset.


CelebA-Dialog

Another large-scale visual-language face dataset with rich fine-grained labels, which classify each attribute into multiple degrees according to its semantic meaning. The dataset has nearly 200,000 images of 10,000 identities, with five fine-grained attribute annotations per image.

Click here to download the dataset.


DeepFashion

Used for training and testing many image synthesis models, DeepFashion is a dataset with rich multi-modal annotations, fine-grained labels, and textual descriptions. It consists of 800,000 diverse fashion images spanning a large variety of garments and poses.

Click here to visit their website.

MNIST Database

Proposed by Yann LeCun, the dataset has 60,000 training examples and a test set of 10,000 images. It is mostly used for trying out learning techniques and pattern-recognition methods on real-world data. The digits in the dataset are size-normalised and centred in a fixed-size image.

Visit the website to know more.
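The MNIST image files use the simple IDX binary format: a big-endian header (magic number, image count, rows, columns) followed by raw uint8 pixels. A minimal parser, run here on a tiny synthetic in-memory file rather than a downloaded one:

```python
import struct

# Build a synthetic IDX image file: magic 0x00000803, then count/rows/cols,
# then raw pixels. Two all-black 28x28 images stand in for the real data.
header = struct.pack(">IIII", 0x00000803, 2, 28, 28)
idx_bytes = header + bytes(2 * 28 * 28)

def parse_idx_images(data):
    """Return (count, rows, cols, flat_pixels) from an IDX image file."""
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    assert magic == 0x00000803, "not an IDX image file"
    return count, rows, cols, data[16:]

n, r, c, px = parse_idx_images(idx_bytes)
print(n, r, c, len(px))  # 2 28 28 1568
```

The real `train-images-idx3-ubyte` file parses the same way once gunzipped; only the count (60,000) differs.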


CompCars

This dataset covers 163 car makes and around 1,716 car models, annotated and labelled with five attributes each, including maximum speed, number of seats, and displacement.

Click here to access the database.


CIFAR-10

A dataset of 60,000 colour images at 32×32 resolution, divided into ten object classes such as airplanes, birds, and trucks. The dataset is split into five training batches and one test batch of 10,000 images.

Click here to see the dataset.
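Each CIFAR-10 batch is a pickled dict whose `data` entry stores every image as a flat 3,072-value row: the first 1,024 values are the red channel, then green, then blue, each row-major. A sketch of unflattening one row, using a dummy row in place of a real batch file:

```python
# One CIFAR-10 row is 3 * 32 * 32 = 3072 uint8 values, channel-major
# (1024 red, 1024 green, 1024 blue). This row is dummy data.
row = list(range(256)) * 12  # 3072 values

def row_to_image(row):
    """Unflatten a CIFAR-10 row into [channel][y][x] form (3 x 32 x 32)."""
    assert len(row) == 3 * 32 * 32
    return [[row[c * 1024 + y * 32 : c * 1024 + y * 32 + 32]
             for y in range(32)] for c in range(3)]

img = row_to_image(row)
print(len(img), len(img[0]), len(img[0][0]))  # 3 32 32
```

With a real batch you would first `pickle.load` the file (with `encoding="bytes"`) and apply this to each row of the `data` array; in practice a NumPy `reshape(3, 32, 32)` does the same job in one call.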

Google’s Open Images

Featuring 9 million image URLs with annotations, it is one of the largest labelled image datasets. The labels span roughly 6,000 categories, making it a widely used dataset for many prominent image generation models.

Click here to check out the description.


One of the larger datasets based on videos, Youtube-8M contains millions of labelled video IDs with annotations of 3,800 visual entities, excluding movies and TV series for copyright protection.

Check out the research here.

Mohit Pandey
