Over the years, computer vision has evolved dramatically: from plain image classification to localization and segmentation, drawing bounding boxes around detected objects with proper labels, and now onward to visual-to-speech and visual-to-text conversion. These advances have inspired many deep learning models built for prediction tasks.
Developed by Google in collaboration with CMU and Cornell University, the Open Images dataset has set a benchmark for visual recognition. Open Images contains nearly 9 million images with annotations covering bounding boxes, segmentation masks, relationships among objects and localized narratives, spanning more than 600 categories. As of version 6, the automated annotations are provided by the Google Cloud Vision API; humans have manually verified these labels, and the developers have worked to remove false positives. Before this, annotations were provided by professional annotators for consistency and accuracy. Many of the images contain complex visual scenes that carry multiple labels.
As of version 4, the training set available through the TensorFlow Datasets API contains 1.7M images annotated with 14.6M bounding boxes across 600 classes. The validation set contains 41,620 images, and the test set 125,436 images.
Six versions of Open Images have been released to date.
V1 – Released in 2016. A pretrained Inception V2 model trained on the dataset was released with it, and the annotations were made queryable through Google BigQuery. Later, an Inception V3 model was trained on the data and fine-tuned for applications such as DeepDream.
V2 – Released in 2017. A ResNet 101 image classification model was trained on it. This release updated the training set with 2M bounding boxes across 600 object classes and 4.3M manually verified image-level labels. The Common Visual Data Foundation (CVDF) provided a data visualizer for the dataset.
V3 – Released in 2017. An Inception ResNet V2 object detection model was trained on it and made part of the TensorFlow Object Detection API. The training set was updated with 3.7M bounding boxes and 9.7M positive image-level labels. The dataset could be downloaded from the Common Visual Data Foundation (CVDF).
V4 – Released in 2018. Google AI held a competition on it with automatic object detection and visual relationship detection tracks.
Download size: 565.11 GiB.
Code snippet (with the TensorFlow Datasets API):
import tensorflow_datasets as tfds

# Load the Open Images V4 train and test splits
train, test = tfds.load('open_images_v4', split=['train', 'test'])
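If you want to verify the split sizes quoted above, TFDS exposes them through the dataset's info object before anything is downloaded. A minimal sketch, assuming the 'open_images_v4' catalog entry:

import tensorflow_datasets as tfds

# Build the dataset handle; this only reads catalog metadata
builder = tfds.builder('open_images_v4')

# Number of examples recorded for each split
for name, split in builder.info.splits.items():
    print(name, split.num_examples)

# Feature structure (image, labels, bounding boxes, ...)
print(builder.info.features)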
There are three variants:

original pixels and quality,
roughly 200,000 pixels at 72 JPEG quality, and
roughly 300,000 pixels at 72 JPEG quality.
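In TFDS, these variants are selected with a config suffix on the dataset name. A minimal sketch, assuming the config names ('original', '200k', '300k') listed in the TFDS catalog:

import tensorflow_datasets as tfds

# The '200k' config stores images downsampled to roughly 200,000 pixels
# at JPEG quality 72, which is far lighter to download than 'original'
train = tfds.load('open_images_v4/200k', split='train')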
V5 – Released in 2019, with 15.8M bounding boxes and 391k visual relationships. This version introduced segmentation masks on 2.7M images over 350 categories. An Open Images Challenge for object detection was held with this version as well.
Download size: 535.63 GiB.
Code snippet (with the TensorFlow Datasets API):
import tensorflow_datasets as tfds

# Load the Open Images 2019 challenge detection train and test splits
train, test = tfds.load('open_images_challenge_2019_detection', split=['train', 'test'])
There are two variants:

roughly 200,000 pixels at 72 JPEG quality, and
roughly 300,000 pixels at 72 JPEG quality.
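As with V4, a config suffix picks the variant, and each loaded split is an ordinary tf.data pipeline. A minimal sketch (the 'image' feature key follows the TFDS catalog entry):

import tensorflow_datasets as tfds

# Load the lighter 200k-pixel variant of the challenge test split
ds = tfds.load('open_images_challenge_2019_detection/200k', split='test')

# Each element is a dict of tensors; inspect the first decoded image
for example in ds.take(1):
    print(example['image'].shape)  # (height, width, 3)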
V6 – Released in 2020. This version introduced localized narratives for 500K images; 123K images from the COCO dataset were also given localized narratives. It added 23.5M new manually verified labels, bringing the total to 59.9M labels across nearly 20,000 categories.
Google maintains an official Open Images website with a visualizer, downloads, documentation, challenges, news and other related information.
Object Detection
Image Segmentation
Visual Relationship
Open Images V6 increased the types of visual relationship annotations to around 1.4k, adding, for example, "dog catching a flying disk", "man riding a skateboard" and "man and woman holding hands".
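The relationship triplets are distributed as CSV files on the Open Images site and are easy to explore with pandas. A minimal sketch, assuming the column names (ImageID, LabelName1, LabelName2, RelationshipLabel) of the published VRD annotation format and an illustrative file name; check both against the file you download:

import pandas as pd

# Each row is one (object1, relationship, object2) triplet with two boxes;
# file name and column names are assumptions from the published VRD format
vrd = pd.read_csv('oidv6-train-annotations-vrd.csv')

# Most frequent relationship labels
print(vrd['RelationshipLabel'].value_counts().head(10))

# Object labels are MID codes (e.g. '/m/04yx4'); map them to readable names
# with the class-description CSV from the same download page if needed
print(vrd[['ImageID', 'LabelName1', 'LabelName2']].head())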
Localized Narrative
Advances in applied machine learning and deep learning have enhanced computer vision, from systems that automatically caption images to applications that generate natural language replies to shared photos and power interactive chatbots. Achievements like these could also assist visually impaired people.
Introduced in February 2020, localized narratives provide a high-level understanding of visuals and bridge vision and language through image captioning: an annotator describes the scene by voice while hovering the mouse over the part of the image being described, and the narration is also transcribed as text.
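Localized narratives are published as JSON Lines files, one annotation per line. A minimal sketch of reading one, assuming the field names (image_id, caption, traces) of the published data format and an illustrative file name; verify both against the actual download:

import json

# One localized narrative per line; field names are assumptions
# based on the published Localized Narratives data format
with open('open_images_localized_narratives.jsonl') as f:
    for line in f:
        narrative = json.loads(line)
        print(narrative['image_id'])
        print(narrative['caption'])  # the transcribed narration
        # 'traces' holds mouse-trace segments, each a list of
        # {'x', 'y', 't'} points synchronized with the spoken words
        print(len(narrative['traces']), 'trace segments')
        break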
Source: Google AI blog
Conclusion
Many research papers have been published on work in and around the Open Images dataset, and it has inspired projects such as DeepDream and artistic style transfer. Open Images remains under active contribution and development, and we can expect more impressive work in the future that helps the computer vision community grow.