Text-to-image models turn natural-language descriptions into images, building on computer vision techniques for analysing, labelling, and interpreting visual data. Image generation is widely seen as a technology of the future, and the computer vision advances underpinning it have already powered breakthroughs such as facial recognition and autonomous vehicles.
When it comes to training and testing these models, datasets play a huge role in the comprehensiveness, accuracy, and variety of the generated images. Here’s a list of the datasets most commonly used by image-synthesis models, which you can use to build your own models as well, just like the pros!
Used by DALL-E for testing, MS-COCO is a large-scale object detection, captioning, and segmentation dataset consisting of around 120,000 images across 91 object categories. Each image comes with five different captions, which makes it an ideal dataset for evaluating image-synthesis models.
Click here to go to the GitHub repository.
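MS-COCO’s caption annotations ship as a single JSON file containing parallel "images" and "annotations" arrays. The sketch below groups the per-image captions by image id; it uses a tiny inline sample in place of a real annotation file, and the file name and captions shown are illustrative, not taken from the dataset.

```python
import json
from collections import defaultdict

# A tiny inline sample following the MS-COCO captions annotation schema;
# real annotation files have the same top-level layout, only much larger.
sample = json.loads("""
{
  "images": [{"id": 1, "file_name": "000000000001.jpg"}],
  "annotations": [
    {"image_id": 1, "caption": "A dog runs across a grassy field."},
    {"image_id": 1, "caption": "A brown dog playing outdoors."}
  ]
}
""")

# Group captions by image id for easy lookup (real images have five each).
captions = defaultdict(list)
for ann in sample["annotations"]:
    captions[ann["image_id"]].append(ann["caption"])

for img in sample["images"]:
    print(img["file_name"], captions[img["id"]])
```

The same grouping works unchanged on the full annotation files, since only the size of the two arrays differs.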
An AI training dataset containing more than five billion image-text pairs, LAION-5B is roughly 14 times larger than its predecessor, LAION-400M. Built by the Large-scale Artificial Intelligence Open Network (LAION), it is one of the largest image-text datasets freely available to everyone.
Click here for the dataset.
Conceptual 12M (CC12M)
CC12M is a dataset of 12 million text-image pairs and is one of the datasets used to train OpenAI’s DALL-E 2. It builds on CC3M, the earlier dataset of three million text-image pairs, and has been used for various pre-training and end-to-end training of image models.
Click here to check out the 2.5GB dataset.
One of the biggest datasets for multimedia research, YFCC100M consists of 100 million objects: 99.2 million images and 0.8 million videos. The photos carry Creative Commons licenses, and each comes with identifying metadata such as the Flickr identifier and owner name, spanning Flickr’s history from its inception in 2004 through 2014.
Click here for more information.
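YFCC100M’s metadata is distributed as tab-separated records. The sketch below reads such records into dictionaries; the column names and their order here are a simplified, hypothetical subset chosen for illustration (the real dump has many more fields), as is the sample record itself.

```python
import csv
import io

# Hypothetical, simplified subset of YFCC100M's tab-separated metadata;
# the real field list is longer and ordered differently.
FIELDS = ["photo_id", "owner", "date_taken", "license", "download_url"]

raw = "123456\talice\t2009-07-04 10:00:00\tCC BY 2.0\thttp://example.com/123456.jpg\n"

rows = [dict(zip(FIELDS, rec))
        for rec in csv.reader(io.StringIO(raw), delimiter="\t")]
print(rows[0]["owner"], rows[0]["license"])
```

Streaming the file line by line like this avoids loading all 100 million records into memory at once.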
Google’s Language-Image Mixture of Experts (LIMoE), a 5.6-billion-parameter model, achieved strong zero-shot classification results on ImageNet, a database organised according to the WordNet hierarchy. Currently covering only nouns, each node of the hierarchy is depicted by hundreds to thousands of images.
Click here and visit the website.
A large-scale face image dataset supporting text-guided image manipulation, face generation and editing, and visual question answering (VQA). It contains 30,000 images in total, 24,000 for training and 6,000 for testing, with ten captions per image, making it a broad dataset.
Click here for the image dataset.
Another large-scale vision-language face dataset, this one with rich fine-grained labels that classify a single attribute into multiple degrees according to its semantic meaning. It contains nearly 200,000 images of 10,000 identities, with five fine-grained attribute annotations per image.
Click here to download the dataset.
Used for training and testing many image-synthesis models, DeepFashion is a dataset with rich multi-modal annotations, fine-grained labels, and textual descriptions. It consists of 800,000 diverse fashion images spanning different poses, props, and settings.
Click here to visit their website.
Proposed by Yann LeCun, MNIST has 60,000 training examples and a test set of 10,000 images. The dataset is mostly used for trying learning techniques and pattern-recognition methods on real-world data. The digits have been size-normalised and centred in a fixed-size image.
Visit the website to know more.
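The MNIST files use a simple big-endian IDX binary layout: a magic number, the image count and dimensions, then raw pixel bytes. Below is a minimal parser sketch; it runs on a synthetic two-image buffer here rather than the real training file, so the pixel values are placeholders.

```python
import struct

def parse_idx_images(buf: bytes):
    """Parse the IDX format used by MNIST image files
    (big-endian header: magic 2051, count, rows, cols)."""
    magic, n, rows, cols = struct.unpack(">IIII", buf[:16])
    assert magic == 2051, "not an IDX image file"
    pixels = buf[16:]
    # one flat run of rows*cols bytes per image
    return [pixels[i * rows * cols:(i + 1) * rows * cols] for i in range(n)]

# Synthetic two-image buffer with 28x28 zeroed pixels, standing in
# for a real MNIST image file.
fake = struct.pack(">IIII", 2051, 2, 28, 28) + bytes(2 * 28 * 28)
images = parse_idx_images(fake)
print(len(images), len(images[0]))  # two images of 784 pixels each
```

The label files follow the same layout with magic number 2049 and no row/column fields.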
This dataset contains 163 car makes and around 1,716 car models, each annotated and labelled with five attributes, including maximum speed, number of seats, and displacement.
Click here to access the database.
A larger dataset of 60,000 colour images at 32×32 resolution, CIFAR-10 is divided into ten separate classes. The images are split into five training batches and one test batch, each containing 10,000 images.
Click here to see the dataset.
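The Python version of the CIFAR-10 batches is distributed as pickled dicts whose b'data' rows store each 32×32 image as 1024 red, then 1024 green, then 1024 blue byte values. The sketch below reshapes such a batch into image tensors; a synthetic stand-in batch keeps it self-contained without the download.

```python
import numpy as np

def batch_to_images(batch):
    """Convert one CIFAR-10 python-version batch dict (keys b'data',
    b'labels') into N x 32 x 32 x 3 uint8 images plus labels."""
    data = np.asarray(batch[b"data"], dtype=np.uint8)
    # rows are channel-major (R, G, B planes), so reshape then move
    # the channel axis to the end for the usual HWC image layout
    images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, batch[b"labels"]

# Real batches load with pickle.load(f, encoding="bytes"); this zeroed
# stand-in batch mimics that structure for illustration.
fake = {b"data": np.zeros((4, 3072), dtype=np.uint8), b"labels": [0, 1, 2, 3]}
images, labels = batch_to_images(fake)
print(images.shape)  # (4, 32, 32, 3)
```

Keeping the conversion in one helper makes it easy to apply to all five training batches and the test batch alike.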
Google’s Open Images
Featuring over 9 million image URLs with annotations, it is one of the largest labelled image datasets available. The images span around 6,000 categories, making it a widely used dataset for many prominent image-generation models.
Click here to check out the description.
One of the larger video-based datasets, YouTube-8M contains millions of labelled video IDs annotated with a vocabulary of 3,800 visual entities, excluding movies and TV series for copyright reasons.
Check out the research here.