Google Releases 3D Object Detection Dataset: Complete Guide To Objectron (With Implementation In Python)

Google's research dataset team recently released Objectron, a new state-of-the-art 3D video dataset for object detection. The dataset was published two months before this article was written, under the C-UDA (Computational Use of Data Agreement) license. It contains short object-centric video clips that capture objects from different angles. Each clip comes with AR (augmented reality) session metadata (extra information about the data) that includes camera angles, poses, sparse point clouds, and surface planes.


Object detection has a long history. Its evolution began in the late 1990s, and over time we have seen frameworks such as VJ Det. (P. Viola et al., 2001), HOG Det. (N. Dalal et al., 2005), AlexNet, R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, SSD, and YOLO.

object detection milestones

Over the years, the number of publications in the object detection domain has increased tremendously, as shown in the figure below:

MediaPipe Objectron

The object detection frameworks mentioned above are all based on 2D images and predict 2D bounding boxes, but we perceive the world and its objects in 3D. Google therefore extended prediction to 3D, so that one can capture an object's size, position, and orientation in the world. This can lead to a variety of applications in self-driving cars, robotics, and of course AR (augmented reality).

On March 11, 2020, Google announced MediaPipe Objectron, a real-time 3D object detection pipeline built with MediaPipe, an open-source cross-platform framework for building machine learning pipelines to process perceptual data. It computes oriented 3D bounding boxes of objects in real time on mobile devices.

At the time, 3D annotated data and tools were scarce, so the MediaPipe team developed a novel data pipeline that uses mobile augmented reality (AR) session data. Most smartphones today have AR capabilities and can capture additional information during an AR session, including the camera angle, pose, sparse 3D point clouds, lighting, and planar surfaces.

They also built a tool that makes it easy to annotate objects using AR data, so that annotators can quickly label 3D bounding boxes for objects. The interface of the tool is shown below:

  • Right: 3D bounding boxes are annotated in the 3D world 
  • Left: Projections of annotated 3D boxes are visualized to make it easy to validate the annotation.


MediaPipe Objectron was built as a single-stage model. To predict the pose, size, and orientation of an object, the model uses a backbone with the following design:

  • An encoder-decoder architecture built upon Google's MobileNetV2.
  • Joint prediction of an object's shape with detection and regression.
  • The shape task predicts the object's shape signals (optional).
  • The regression task estimates the 2D projections of the eight box vertices.
  • The final 3D coordinates are obtained using a pose estimation algorithm (EPnP).
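
To make the regression target concrete, the sketch below builds the nine keypoints of a box (the center plus eight vertices, the same convention the dataset code later in this article uses) and projects them to 2D with a simple pinhole camera. The function names, the vertex ordering, the focal length, and the principal point are illustrative assumptions, not Objectron's actual implementation. EPnP addresses the inverse problem: recovering the 3D pose from such 2D projections.

```python
from itertools import product


def box_keypoints(center, size):
    """Return 9 keypoints: the box center followed by its 8 vertices."""
    cx, cy, cz = center
    sx, sy, sz = size
    points = [(cx, cy, cz)]
    # Each vertex is the center shifted by +/- half the size along every axis.
    for dx, dy, dz in product((-0.5, 0.5), repeat=3):
        points.append((cx + dx * sx, cy + dy * sy, cz + dz * sz))
    return points


def project(point, focal=500.0, px=240.0, py=320.0):
    """Pinhole projection of a camera-space 3D point (z > 0) to pixel coordinates."""
    x, y, z = point
    return (focal * x / z + px, focal * y / z + py)


# A 20 cm cube placed 1 m in front of the camera (hypothetical values).
keypoints_3d = box_keypoints(center=(0.0, 0.0, 1.0), size=(0.2, 0.2, 0.2))
keypoints_2d = [project(p) for p in keypoints_3d]
```

With these assumed intrinsics, the box center lands on the principal point of a 480 x 640 image, and the eight vertices spread around it.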

The model is light enough to run in real time on mobile devices, at 26 frames per second (FPS) on an Adreno 650 mobile GPU.



Limitations of the first MediaPipe Objectron release:

  • Lower accuracy
  • Recognizes only two classes of objects: shoes and chairs
  • Small architecture



The predecessor, MediaPipe Objectron for mobile, was a lighter model for annotating and detecting objects in 3D, built on a single-stage architecture. The new approach uses an updated model architecture and can recognize nine object classes: bike, book, bottle, camera, cereal_box, chair, cup, laptop, and shoe.

By releasing the Objectron dataset, Google hopes to enable the research community to push the limits of 3D object geometry understanding, and to foster new research and applications such as view synthesis, improved 3D representation, and unsupervised learning.

Some of the features of the Objectron dataset are as follows:

  • The dataset consists of 15,000 annotated video clips, supplemented with over 4 million annotated images.
  • It covers nine object categories: bike, book, bottle, camera, cereal_box, chair, cup, laptop, and shoe.
  • More accurate than the previous version.
  • Most objects in this dataset are household objects.
  • The dataset is geo-diverse: the data was collected from 10 countries across five continents.
  • It includes scripts to download, load, evaluate, and visualize the data in TensorFlow and PyTorch.

Objectron Architecture

It uses a two-stage architecture:

  • The first stage consists of a TensorFlow Object Detection model that finds the 2D crop of the object.
  • The second stage uses that image crop to estimate the 3D bounding box and simultaneously computes the 2D crop of the object for the next frame.
  • It runs at 83 FPS on the same GPU as the predecessor.
objectron architecture
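
The control flow described above can be sketched as follows. Both stage functions are hypothetical stubs standing in for the real networks. The sketch only illustrates that the 2D detector runs when no crop is available, while the second stage returns both the 3D box and the crop for the next frame.

```python
def detect_2d_crop(frame):
    """Stage 1 stub (hypothetical): a 2D detector returning a crop (x, y, w, h)."""
    return (0, 0, 100, 100)


def estimate_3d_box(frame, crop):
    """Stage 2 stub (hypothetical): returns a 3D box and the crop for the next frame."""
    box_3d = {"frame": frame, "crop_used": crop}
    x, y, w, h = crop
    next_crop = (x + 1, y + 1, w, h)  # pretend the object moved slightly
    return box_3d, next_crop


def run_pipeline(frames):
    """Run stage 1 only when no crop is available; stage 2 runs on every frame."""
    crop = None
    detector_calls = 0
    boxes = []
    for frame in frames:
        if crop is None:
            crop = detect_2d_crop(frame)
            detector_calls += 1
        box_3d, crop = estimate_3d_box(frame, crop)
        boxes.append(box_3d)
    return boxes, detector_calls


boxes, detector_calls = run_pipeline(["frame0", "frame1", "frame2"])
```

Because the second stage supplies the crop for the next frame, the detector in this sketch runs once for three frames. A real pipeline would also need to re-run the detector when tracking is lost, which is omitted here.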

Objectron dataset license

The C-UDA license allows the data holder to make their data available to anyone for computational purposes, such as artificial intelligence, machine learning, and text and data mining.

Downloading Objectron Dataset

The dataset is stored in the objectron bucket on Google Cloud Storage and includes the following assets:

  • Video sequences (gs://objectron/videos/class/batch-i/j/video.MOV)
  • Annotation labels (gs://objectron/annotations/class/batch-i/j.pbdata)
  • Metadata
  • Processed dataset (tf.records)
  • Index of all available samples
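
As a small illustration of this layout, the helper below assembles the per-sample paths from a sample ID of the form class/batch-i/j. The function name, the dictionary keys, and the example ID are hypothetical. The path patterns follow the ones used in this article: video and geometry files under videos/, annotations under annotations/.

```python
def sample_paths(sample_id, root="gs://objectron"):
    """Build storage paths for one sample ID such as 'cup/batch-1/0'."""
    return {
        "video": f"{root}/videos/{sample_id}/video.MOV",
        "geometry": f"{root}/videos/{sample_id}/geometry.pbdata",
        "annotation": f"{root}/annotations/{sample_id}.pbdata",
    }


paths = sample_paths("cup/batch-1/0")  # hypothetical sample ID
```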

The annotations and videos can also be accessed through a public HTTP API. Both download approaches are described here.

To browse and download the dataset, we will first use gsutil, the command-line tool for Google Cloud Storage. Its subcommands mirror familiar shell commands such as ls (list files) and cp (copy files). Use the command below to list the data files.

  1. Downloading using gsutil
!gsutil ls gs://objectron/v1/records_shuffled
files of objectron dataset
  2. Download data using Public HTTP API:
import requests
# Set this to the public HTTP base URL of the objectron bucket.
public_url = ""
blob_path = public_url + "/v1/index/cup_annotations_test"
video_ids = requests.get(blob_path).text
video_ids = video_ids.split('\n')
# Download the first video in the cup test dataset (increase the range to fetch more)
for i in range(1):
    video_filename = public_url + "/videos/" + video_ids[i] + "/video.MOV"
    metadata_filename = public_url + "/videos/" + video_ids[i] + "/geometry.pbdata"
    annotation_filename = public_url + "/annotations/" + video_ids[i] + ".pbdata"
    # video.content contains the video file.
    video = requests.get(video_filename)
    metadata = requests.get(metadata_filename)
    annotation = requests.get(annotation_filename)
    # Write the downloaded bytes to disk; the context manager closes the file.
    with open("video1.MOV", "wb") as file:
        file.write(video.content)
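
One detail worth noting: splitting the index text on newlines leaves an empty entry if the file ends with a newline. The sketch below shows a slightly more defensive way to turn the index text into a list of IDs. The helper name and the sample index content are made up for illustration.

```python
def parse_index(index_text):
    """Split an index file's text into sample IDs, dropping blank lines."""
    return [line.strip() for line in index_text.splitlines() if line.strip()]


sample_index = "cup/batch-1/0\ncup/batch-1/5\n"  # hypothetical index content
naive_ids = sample_index.split("\n")             # keeps a trailing empty string
clean_ids = parse_index(sample_index)
```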

Play video inside your notebook

from IPython.display import HTML
from base64 import b64encode
# Read the downloaded video and embed it in the notebook as a base64 data URL
mp4 = open('/content/video1.MOV','rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=400 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)
visualizing dataset
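
The embedding trick used above is a base64 data URL: the file's bytes are encoded as text and placed directly in the src attribute. Here is a minimal sketch of that encoding, with a hypothetical helper name and placeholder bytes instead of a real video:

```python
from base64 import b64decode, b64encode


def to_data_url(raw_bytes, mime_type="video/mp4"):
    """Encode raw bytes as a base64 data URL."""
    return "data:{};base64,{}".format(mime_type, b64encode(raw_bytes).decode())


payload = b"not a real video"  # placeholder bytes
data_url = to_data_url(payload)
header, encoded = data_url.split(",", 1)
decoded = b64decode(encoded)   # decoding recovers the original bytes
```

Base64 inflates the payload by roughly one third, so this approach suits short clips better than long recordings.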

Visualize objectron Dataset

There are two ways to plot the 3D annotations. First we will parse the SequenceExamples, and then we will read the shuffled tf.records with TensorFlow.


1. Parsing Objectron’s SequenceExamples

SequenceExamples hold an entire video sequence together with its annotations. They are very useful for training video models and multi-view models, as well as for tracking objects in 3D.

Clone the repository and change into the objectron folder, since we will use files from the Objectron repo later. Also install the dependencies that Objectron needs.

!git clone
%cd Objectron
%cd objectron
!pip install frozendict

Import modules and objectron utilities

objectron_buckett = 'gs://objectron'
# Importing the necessary modules. We will run this notebook locally.
import glob
import os
import sys
import cv2
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from IPython.display import display, HTML
# Make the parent folder importable so that the objectron package can be found.
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
from objectron.schema import features
from objectron.dataset import box
from objectron.dataset import graphics

Data pipeline for parsing the sequence examples. In the following example, we grab a few frames from each sequence.

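NUM_KEYPOINTS = 9  # the box center plus the 8 box vertices
NUM_FRAMES = 4     # number of random frames grabbed from each sequence

def parse_tfrecord(example):
    context, data = tf.io.parse_single_sequence_example(
                            example,
                            sequence_features = features.SEQUENCE_FEATURE_MAP,
                            context_features = features.SEQUENCE_CONTEXT_MAP
                        )
    # Number of frames in the video
    num_examples = context['count']
    # The unique sequence id (class/batch-i/j)
    video_id = context['sequence_id']
    # Pick NUM_FRAMES random frame indices in [0, num_examples)
    rand = tf.random.uniform([NUM_FRAMES], 0, num_examples, tf.int64)
    data['frame_ids'] = rand
    # Grabbing four random frames from the sequence and decode them for processing
    for i in range(NUM_FRAMES):
        id = rand[i]
        image_tag = 'image-{}'.format(i)
        data[image_tag] = data[features.FEATURE_NAMES['IMAGE_ENCODED']][id]
        data[image_tag] = tf.image.decode_png(data[image_tag], channels=3)
        data[image_tag].set_shape([640, 480, 3])
    return context, data

# List the book test shards, read them, and parse each sequence.
shards = tf.io.gfile.glob(objectron_buckett + '/v1/sequences/book/book_test*')
dataset = tf.data.TFRecordDataset(shards)
dataset = dataset.map(parse_tfrecord)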
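
The sampling step in parse_tfrecord draws frame indices uniformly from [0, num_examples) with replacement, so the same frame can be picked more than once. A plain-Python analogue of that step, with a hypothetical helper name and made-up sequence length, looks like this:

```python
import random


def sample_frame_ids(num_examples, num_frames, seed=None):
    """Draw `num_frames` frame indices uniformly from [0, num_examples), with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(num_examples) for _ in range(num_frames)]


# Four random frames from a hypothetical 120-frame sequence.
frame_ids = sample_frame_ids(num_examples=120, num_frames=4, seed=0)
```

If distinct frames are required, sampling without replacement (for example with random.sample) would be the alternative; the code in this article does not do that.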


num_rows = 5
for context, data in dataset.take(num_rows):
    fig, ax = plt.subplots(1, NUM_FRAMES, figsize = (12, 16))
    for i in range(NUM_FRAMES):
        num_frames = context['count']
        id = data['frame_ids'][i]
        image = data['image-{}'.format(i)].numpy()
        num_instances = data[features.FEATURE_NAMES['INSTANCE_NUM']][id].numpy()[0]
        keypoints = data[features.FEATURE_NAMES['POINT_2D']].values.numpy().reshape(num_frames, num_instances, NUM_KEYPOINTS, 3)
        # Draw the 3D box of every annotated object onto the frame
        for instance_id in range(num_instances):
            image = graphics.draw_annotation_on_image(image, keypoints[id, instance_id, :, :], [9])
        # Show the annotated frame without axes
        ax[i].imshow(image)
        ax[i].axis('off')
    fig.tight_layout()
    plt.show()

2. Grabbing samples from shared Tensorflow (tf.records)

We are going to read the shuffled tf.records with a TensorFlow input pipeline. For more information, check out TFRecord and tf.train.Example.

objectron_buckett = 'gs://objectron/v1/records_shuffled'
NUM_PARALLEL_CALLS = 4  # assumed value; tune for your machine
WIDTH = 480
HEIGHT = 640
NUM_CHANNELS = 3
# The 3D bounding box has 9 vertices, 0: is the center, and the 8 vertices of the 3D box.
NUM_KEYPOINTS = 9
BATCH_SIZE = 4  # assumed value; the visualization code indexes samples within a batch

def parse(example):
  """Parses a single tf.Example and decode the `png` string to an array."""
  data = tf.io.parse_single_example(example, features = features.FEATURE_MAP)
  data['image'] = tf.image.decode_png(data[features.FEATURE_NAMES['IMAGE_ENCODED']], channels=NUM_CHANNELS)
  data['image'].set_shape([HEIGHT, WIDTH, NUM_CHANNELS])
  return data
def augment(data):
  # Placeholder: no augmentation is applied in this example.
  return data
def normalize(data):
  """Convert `image` from [0, 255] -> [-1., 1.] floats."""
  data['image'] = tf.cast(data['image'], tf.float32) * (2. / 255.) - 1.0
  return data
def load_tf_record(input_record):
  dataset = tf.data.TFRecordDataset(input_record)
  dataset = dataset.map(parse, num_parallel_calls = NUM_PARALLEL_CALLS)\
                   .map(augment, num_parallel_calls = NUM_PARALLEL_CALLS)\
                   .map(normalize, num_parallel_calls = NUM_PARALLEL_CALLS)
  # Our TF.records are shuffled in advance. If you re-generate the dataset from the video files, you'll need to
  # shuffle your examples. Keep in mind that you cannot shuffle the entire datasets using dataset.shuffle, since
  # it will be very slow.
  dataset = dataset.shuffle(100)\
                   .batch(BATCH_SIZE)
  return dataset
training_shards   = tf.io.gfile.glob(objectron_buckett + '/chair/chair_train*')
dataset = load_tf_record(training_shards)
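
The normalization step and its inverse (used later for display) can be checked on plain numbers. This sketch mirrors the arithmetic in normalize. The helper names are hypothetical, and the inverse rounds before converting to an integer, whereas a plain cast to an unsigned integer truncates.

```python
def normalize_pixel(value):
    """Map a pixel value from [0, 255] to [-1.0, 1.0]."""
    return value * (2.0 / 255.0) - 1.0


def denormalize_pixel(value):
    """Map a value from [-1.0, 1.0] back to an integer in [0, 255]."""
    return int(round((value + 1.0) / 2.0 * 255.0))


# Every 8-bit pixel value should survive a normalize/denormalize round trip.
round_trip = [denormalize_pixel(normalize_pixel(v)) for v in range(256)]
```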

Let's grab a few rows (7) from the dataset and visualize their 3D bounding boxes. The Objectron features are defined in the schema module (imported above as features). In this example we only use the 2D keypoints, but each sample contains a lot more information, such as 3D keypoints, the object name, and pose information.

The code below uses the graphics utility from the dataset module to draw the 3D bounding box on the image.

num_rows = 7
for data in dataset.take(num_rows):
  fig, ax = plt.subplots(1, BATCH_SIZE, figsize = (12, 16))
  number_objects_batch = data[features.FEATURE_NAMES['INSTANCE_NUM']]
  num_obj_cumsum = np.sum(number_objects_batch)
  image_width = data[features.FEATURE_NAMES['IMAGE_WIDTH']]
  image_height = data[features.FEATURE_NAMES['IMAGE_HEIGHT']]
  keypoints = data[features.FEATURE_NAMES['POINT_2D']].values.numpy().reshape(np.sum(number_objects_batch), NUM_KEYPOINTS, 3)
  # The object annotation is a list of 3x1 keypoints for all the annotated
  # objects. The objects can have a varying number of keypoints. First we split
  # the list according to the number of keypoints for each object. This
  # also leaves an empty array at the end of the list.
  batch_keypoints = np.split(keypoints, np.array(np.cumsum(number_objects_batch)))
  # Visualize every image/keypoint pair in the batch
  for id in range(BATCH_SIZE):
    w = image_width.numpy()[id][0]
    h = image_height.numpy()[id][0]
    # DeNormalize the image (for visualization purpose only)
    image = tf.cast((data['image'] + 1.0) / 2.0 * 255, tf.uint8).numpy()[id]
    num_instances = number_objects_batch[id].numpy()[0]
    keypoints_per_sample = batch_keypoints[id]
    for instance_id in range(num_instances):
      image = graphics.draw_annotation_on_image(image, keypoints_per_sample[instance_id, :, :], [9])
    # Show the annotated image without axes
    ax[id].imshow(image)
    ax[id].axis('off')
  fig.tight_layout()
  plt.show()
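
The splitting step can be hard to picture, so here is a pure-Python sketch of the same idea: a flat list of per-object entries is cut at the cumulative object counts. The helper name and the toy data are assumptions. Like np.split with cumulative indices, it yields one trailing empty chunk.

```python
from itertools import accumulate


def split_by_counts(items, counts):
    """Split `items` at the cumulative sums of `counts` (np.split-style)."""
    boundaries = list(accumulate(counts))
    chunks, start = [], 0
    for end in boundaries:
        chunks.append(items[start:end])
        start = end
    chunks.append(items[start:])  # trailing chunk, empty when counts cover all items
    return chunks


# Three images containing 2, 1 and 3 annotated objects respectively.
objects = ["a0", "a1", "b0", "c0", "c1", "c2"]
chunks = split_by_counts(objects, [2, 1, 3])
```

Here chunks[i] holds the objects of image i, which is why the loop above can index batch_keypoints with the position of the sample in the batch.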


Object detection is a crucial step for universal object recognition APIs. As computer vision techniques become more mature, many new use cases are opening up for researchers and businesses.

We have covered the history and evolution of object detection, how Google's Objectron improves on its predecessor, and a coding walkthrough of the Objectron dataset: how to download it using two different approaches, and how to visualize it using SequenceExamples and TensorFlow tf.records.

Copyright Analytics India Magazine Pvt Ltd
