Computer vision is one of the fastest-growing subfields of AI right now. Researchers and companies are using computer vision techniques to solve a variety of problems in areas such as manufacturing, security, medical image analysis and diagnostics, and autonomous driving, among many others.
Here we look at the top open-source datasets available for computer vision projects:
ImageNet
It is an image dataset organised according to the WordNet hierarchy. There are more than 100,000 synsets in WordNet, of which the majority (over 80,000) are nouns. ImageNet aims to provide, on average, 1,000 images to illustrate each synset. Two important needs in computer vision research inspired it: the need to establish a clear North Star problem for the field, and a critical need for more data to enable more generalisable machine learning methods.
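For a sense of how the dataset is consumed in practice, here is a minimal PyTorch loading sketch. It assumes the ImageNet archives have already been downloaded into a local ./imagenet directory, since torchvision cannot fetch the dataset automatically:

```python
import torchvision.transforms as T
from torchvision.datasets import ImageNet

# Assumes the ILSVRC archives were downloaded manually into ./imagenet;
# torchvision cannot download ImageNet on its own.
dataset = ImageNet(
    root="./imagenet",
    split="train",
    transform=T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()]),
)
image, class_index = dataset[0]          # image tensor, integer label
print(dataset.classes[class_index])      # human-readable synset names
```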
IMDB-Wiki
It is one of the largest open-source datasets of face images with gender and age labels for training. It contains 523,051 face images in total: 460,723 crawled from the IMDB pages of 20,284 celebrities and 62,328 from Wikipedia.
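The released metadata is a MATLAB file from which the age labels can be recomputed; the sketch below assumes the published imdb.mat layout (field names dob and photo_taken), so treat those names as assumptions:

```python
from datetime import datetime, timedelta
from scipy.io import loadmat

# "imdb.mat" and its field names are assumptions based on the released
# IMDB-Wiki metadata layout.
meta = loadmat("imdb.mat")["imdb"][0, 0]
dob = meta["dob"][0]                  # MATLAB serial date numbers (birth dates)
photo_taken = meta["photo_taken"][0]  # year in which each photo was taken

def datenum_to_year(datenum):
    # MATLAB datenums count days from year 0; Python ordinals start at year 1.
    return (datetime.fromordinal(int(datenum)) - timedelta(days=366)).year

ages = [year - datenum_to_year(d) for d, year in zip(dob, photo_taken)]
```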
MS COCO
It is a large-scale object detection, segmentation, and captioning dataset. It has 330K images (more than 200K labelled), 1.5 million object instances, 80 object categories, 91 stuff categories, five captions per image and 250,000 people with keypoints.
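The annotations are typically queried through the official pycocotools API; a short sketch, assuming the 2017 annotation file has been unpacked locally under annotations/:

```python
from pycocotools.coco import COCO

# Assumes the 2017 instances annotation file was downloaded locally.
coco = COCO("annotations/instances_train2017.json")

cat_ids = coco.getCatIds(catNms=["person"])   # category id(s) for "person"
img_ids = coco.getImgIds(catIds=cat_ids)      # images containing people
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[0], catIds=cat_ids))
print(len(img_ids), "images;", len(anns), "person annotations in the first one")
```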
Flickr-30k
It is a collection for sentence-based image description and search, consisting of 30,000 images, each paired with five different captions that provide clear descriptions of the salient entities and events. The images were manually selected from six different Flickr groups and usually do not contain any well-known people or locations.
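The captions ship as a tab-separated file pairing an image name and caption index with the caption text; a minimal parsing sketch, where the file name and the image.jpg#k<TAB>caption line format are assumptions about the standard release:

```python
from collections import defaultdict

captions = defaultdict(list)
# Assumed line format of the released captions file: "image.jpg#k<TAB>caption".
with open("results_20130124.token", encoding="utf-8") as f:
    for line in f:
        image_tag, caption = line.rstrip("\n").split("\t", 1)
        image_name = image_tag.split("#")[0]   # drop the caption index "#k"
        captions[image_name].append(caption)

# Each image should now map to its five descriptions.
first = next(iter(captions))
print(first, captions[first][:2])
```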
Berkeley DeepDrive
It is a driving dataset for heterogeneous multitask learning. It has 100K driving videos collected from more than 50K rides; each video is about 40 seconds long at 30 fps. It covers diverse scene types, such as city streets, residential areas and highways, across different weather conditions at different times of the day. It can be helpful for lane detection, object detection, semantic segmentation, instance segmentation, multi-object tracking, etc.
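The detection labels are plain JSON; the sketch below assumes the layout used by the BDD100K release (one record per image, with a labels list carrying a category and an optional box2d box), so both the file name and the keys are assumptions:

```python
import json

# File name and keys are assumptions about the BDD100K detection-label JSON.
with open("bdd100k_labels_images_train.json") as f:
    records = json.load(f)

for record in records[:3]:
    print(record["name"])                 # image file name
    for label in record.get("labels", []):
        box = label.get("box2d")
        if box is not None:               # lane/drivable-area labels lack boxes
            print(" ", label["category"],
                  (box["x1"], box["y1"], box["x2"], box["y2"]))
```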
LSUN
The Large-scale Scene Understanding (LSUN) classification dataset contains 10 scene categories, such as bedroom, kitchen, outdoor church and dining room. Each category has a large number of images, ranging from around 120,000 to 3,000,000.
The validation data includes 300 images, and the test data has 1000 images for each category.
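Since the images ship as per-category LMDB archives, torchvision can read them in place; a minimal sketch, assuming the archives (and the lmdb Python package) are available under a local ./lsun directory:

```python
import torchvision.transforms as T
from torchvision.datasets import LSUN

# Assumes the per-category LMDB archives (e.g. bedroom_train_lmdb) have been
# downloaded into ./lsun; torchvision reads them in place via the lmdb package.
dataset = LSUN(
    root="./lsun",
    classes=["bedroom_train", "kitchen_train"],
    transform=T.Compose([T.Resize(256), T.CenterCrop(256), T.ToTensor()]),
)
image, class_index = dataset[0]
print(len(dataset), dataset.classes[class_index])
```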
MPII Human Pose
The dataset includes around 25K images containing over 40K people with annotated body joints. The images were collected using an established taxonomy of everyday human activities. In total, the dataset covers 410 human activities, and each image is provided with an activity label. Each image is extracted from a YouTube video and comes with preceding and following unannotated frames.
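The annotations are distributed as a MATLAB struct; a rough loading sketch, where the field names follow the released mpii_human_pose_v1_u12_1.mat layout and should be treated as assumptions:

```python
from scipy.io import loadmat

# Field names ("RELEASE", "annolist", "act") are assumptions based on the
# released mpii_human_pose_v1_u12_1.mat annotation struct.
mat = loadmat("mpii_human_pose_v1_u12_1.mat",
              struct_as_record=False, squeeze_me=True)
release = mat["RELEASE"]
annolist = release.annolist      # one entry per annotated image
activities = release.act         # activity label per image

print(annolist[0].image.name)    # image file name
print(activities[0].act_name)    # everyday-activity label, when present
```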
CIFAR-10 & CIFAR-100
The CIFAR-10 dataset consists of 60,000 32×32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. The CIFAR-100 is similar to the CIFAR-10 but has 100 classes containing 600 images each.
The CIFAR-10 dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5,000 images from each class.
In the CIFAR-100, the 100 classes are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs).
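The batches themselves are Python pickle files; the loader below follows the routine published on the dataset homepage, and the byte-string keys (data, labels, fine_labels, coarse_labels) are part of the released format:

```python
import pickle

def unpickle(path):
    # Each CIFAR batch is a pickled dict of raw image bytes and labels.
    with open(path, "rb") as f:
        return pickle.load(f, encoding="bytes")

batch = unpickle("cifar-10-batches-py/data_batch_1")
images = batch[b"data"].reshape(-1, 3, 32, 32)   # 10,000 RGB 32x32 images
labels = batch[b"labels"]                        # integers 0-9

# CIFAR-100 stores both label granularities for every image.
train = unpickle("cifar-100-python/train")
fine, coarse = train[b"fine_labels"], train[b"coarse_labels"]
```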
Kinetics-700
It is a large-scale, high-quality dataset of URL links to up to 650,000 video clips covering 400/600/700 human action classes, depending on the dataset version. The videos include human-object interactions, such as playing instruments, as well as human-human interactions. Each clip is human-annotated with a single action class and lasts around 10 seconds.
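Since the dataset is distributed as links rather than videos, working with it usually starts from the annotation CSVs; in the sketch below, the file name and column names are assumptions about those CSVs:

```python
import csv

# File name and column names are assumptions about the Kinetics annotation
# CSVs, whose rows typically carry a YouTube id, trim times, and a label.
with open("kinetics_700_train.csv", newline="") as f:
    for row in csv.DictReader(f):
        url = f"https://www.youtube.com/watch?v={row['youtube_id']}"
        start, end = float(row["time_start"]), float(row["time_end"])
        action = row["label"]
        # Downstream tooling would download the video and trim [start, end].
        print(action, url, end - start)
        break
```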