Popular Open Source Datasets You Need For Computer Vision Projects

Here are a few open source datasets you can use for your computer vision projects

Computer vision is one of the fastest-growing subfields in the AI space right now. Researchers and companies are using computer vision techniques to solve problems in areas such as manufacturing, security, medical imaging and diagnostics, and autonomous driving.

Here we look at the top open-source datasets available for computer vision projects: 


ImageNet

ImageNet is an image dataset organised according to the WordNet hierarchy. WordNet contains more than 100,000 synsets, the majority of which (over 80,000) are nouns, and ImageNet aims to provide on average 1,000 images to illustrate each synset. The dataset was inspired by two important needs in computer vision research: the need to establish a clear North Star problem for the field, and a critical need for more data to enable more generalisable machine learning methods.
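
On disk, ImageNet is commonly distributed with one folder per synset, named by its WordNet ID (wnid). A minimal sketch of recovering that grouping from file paths, assuming this layout (the paths and the `group_by_synset` helper below are hypothetical, not official tooling):

```python
from collections import defaultdict

def group_by_synset(paths):
    """Group image paths by their synset (wnid) folder name.

    Assumes the common ImageNet layout <split>/<wnid>/<image>.JPEG,
    where each wnid (e.g. 'n02084071') names a WordNet synset.
    """
    by_synset = defaultdict(list)
    for p in paths:
        parts = p.split("/")
        by_synset[parts[-2]].append(parts[-1])
    return dict(by_synset)

# Hypothetical file list illustrating the layout
files = [
    "train/n02084071/img_0001.JPEG",
    "train/n02084071/img_0002.JPEG",
    "train/n01440764/img_0001.JPEG",
]
groups = group_by_synset(files)
```

Grouping by the wnid folder is what lets the WordNet hierarchy double as the class taxonomy: every folder name resolves to a synset.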


IMDB-WIKI

It is one of the largest open-source datasets of face images labelled with gender and age for training. The dataset has 523,051 face images: 460,723 obtained from 20,284 celebrity pages on IMDb and 62,328 from Wikipedia.
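
The age labels are derived from metadata such as each subject's date of birth and the year the photo was taken. A simplified sketch of that derivation, assuming only the birth year is known (the `age_label` helper and its plausibility bounds are hypothetical, not the dataset's official preprocessing):

```python
def age_label(dob_year, photo_year):
    """Approximate an age label from birth year and photo year.

    A simplification: the real metadata stores full dates, and
    the plausibility range used here is an assumption.
    """
    age = photo_year - dob_year
    if not (0 <= age <= 100):
        return None  # discard implausible labels
    return age
```

Filtering out implausible ages matters in practice, since crawled metadata inevitably contains mismatched photos and birth dates.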

MS COCO

It is a large-scale object detection, segmentation, and captioning dataset. It has 330K images (more than 200K labelled), 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, and 250,000 people annotated with keypoints.
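
COCO ships its labels as a single JSON file with `images`, `annotations`, and `categories` arrays. A minimal sketch of counting object instances per category from that structure (the tiny annotation dict below is hand-made illustration, not real data, and real files also carry segmentation masks, captions, and keypoints):

```python
# A minimal, hand-made annotation snippet in COCO's JSON layout
coco = {
    "images": [{"id": 1, "file_name": "000001.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 50, 80]},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [5, 5, 30, 30]},
        {"id": 12, "image_id": 1, "category_id": 1, "bbox": [60, 60, 20, 40]},
    ],
    "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "dog"}],
}

def instances_per_category(data):
    """Count annotated object instances per category name."""
    names = {c["id"]: c["name"] for c in data["categories"]}
    counts = {}
    for ann in data["annotations"]:
        name = names[ann["category_id"]]
        counts[name] = counts.get(name, 0) + 1
    return counts
```

The same id-joining pattern (annotations referencing `image_id` and `category_id`) is how most COCO tooling links boxes back to images and class names.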


Flickr30k

It is a collection for sentence-based image description and search, consisting of 30,000 images, each paired with five different captions that provide clear descriptions of the salient entities and events. The images were manually selected from six different Flickr groups and usually do not contain any well-known people or locations.
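
Because every image comes with several captions, a toy caption index is enough to sketch sentence-based image search. The `search` helper and the two-image dictionary below are hypothetical stand-ins, not the dataset's official tooling:

```python
def search(captions, query):
    """Return ids of images whose captions contain every query word.

    `captions` maps an image id to its list of crowd-sourced captions,
    a toy stand-in for the Flickr30k annotation file.
    """
    words = query.lower().split()
    hits = []
    for img, caps in captions.items():
        text = " ".join(caps).lower()
        if all(w in text for w in words):
            hits.append(img)
    return hits

# Hand-made example entries (real images have five captions each)
toy = {
    "img1.jpg": ["A dog runs on the beach", "A brown dog near the sea"],
    "img2.jpg": ["Two people shake hands", "A handshake at a meeting"],
}
```

Even this naive keyword match illustrates why multiple captions per image help retrieval: different annotators surface different salient entities.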

Berkeley DeepDrive

It is a driving dataset for heterogeneous multitask learning. It has 100K driving videos collected from more than 50K rides. Each video is 40 seconds long at 30 fps. The dataset covers diverse scene types, such as city streets, residential areas, and highways, across different weather conditions and times of day. It is useful for lane detection, object detection, semantic segmentation, instance segmentation, multi-object tracking, and more.


LSUN

The Large-scale Scene Understanding (LSUN) classification dataset contains 10 scene categories, including bedroom, kitchen, outdoor church, and dining room. Each category has a large number of images, ranging from around 120,000 to 3,000,000.

The validation data includes 300 images, and the test data has 1000 images for each category.

MPII Human Pose

The dataset includes around 25K images containing over 40K people with annotated body joints. The images were collected using an established taxonomy of everyday human activities. In total, the dataset covers 410 human activities, and each image is provided with an activity label. Each image is extracted from a YouTube video and comes with preceding and following unannotated frames.

CIFAR-10 & CIFAR-100

The CIFAR-10 dataset consists of 60,000 32×32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. The CIFAR-100 is similar to the CIFAR-10 but has 100 classes containing 600 images each.

The CIFAR-10 dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5,000 images from each class.

In the CIFAR-100, the 100 classes are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs).
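
In the Python version of the CIFAR archives, each batch stores images as flat 3,072-value rows: 1,024 red values, then 1,024 green, then 1,024 blue, each in row-major 32×32 order. A sketch of turning such rows back into height-width-channel images (the synthetic batch below stands in for a real unpickled one):

```python
import numpy as np

def batch_to_images(flat):
    """Convert flat CIFAR pixel rows into (N, 32, 32, 3) images.

    Each 3,072-value row holds the red plane, then green, then blue,
    each plane in row-major 32x32 order.
    """
    return flat.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)

# Synthetic stand-in for a two-image batch (real batches hold 10,000 rows)
fake_batch = np.arange(2 * 3072).reshape(2, 3072)
images = batch_to_images(fake_batch)
```

The `reshape` splits each row into three 32×32 colour planes, and the `transpose` reorders channels-first data into the channels-last layout most image libraries expect.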


Kinetics

It is a collection of large-scale, high-quality datasets of URL links to up to 650,000 video clips, covering 400, 600, or 700 action classes depending on the dataset version. The videos include human-object interactions, such as playing instruments, as well as human-human interactions. Each clip is human-annotated with a single action class and lasts around 10 seconds.
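
Because Kinetics is distributed as annotations pointing at online clips rather than as raw video, working with it usually starts from a CSV of clip records. A sketch assuming a layout of label, video id, start time, end time, and split (the column names and rows below are hypothetical illustration, not real annotations):

```python
import csv
import io

# A tiny, hand-made snippet in an assumed Kinetics-style CSV layout
raw = """label,youtube_id,time_start,time_end,split
playing guitar,a_fake_id01,10,20,train
shaking hands,a_fake_id02,33,43,val
"""

def clip_durations(text):
    """Parse clip records and return each clip's duration in seconds."""
    rows = csv.DictReader(io.StringIO(text))
    return [int(r["time_end"]) - int(r["time_start"]) for r in rows]
```

The uniform durations reflect the description above: each annotated clip is a roughly 10-second window cut from a longer video.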

Sreejani Bhattacharyya