
Google’s DeepMind Sets a New Benchmark in Point Tracking

The benchmark is composed of real-world videos with human annotations of point tracks, along with synthetic videos with perfect ground-truth point tracks.

It is important for AI to perceive how objects move and to gain a physical understanding of rotation and shape change, particularly in contexts such as surveillance and self-driving vehicles. To evaluate this capability, researchers from Google-owned DeepMind have introduced a new benchmark, ‘TAP-Vid’, for tracking points on physical surfaces in videos, presented in their paper ‘TAP-Vid: A Benchmark for Tracking Any Point in a Video’.

The benchmark is composed of real-world videos with human annotations of point tracks, along with synthetic videos with perfect ground-truth point tracks. The team has also proposed a simple end-to-end point tracking model, TAP-Net, which is trained on synthetic data and outperforms all prior methods on the benchmark.

The code and data are publicly available.


The researchers have introduced the problem of Tracking Any Point (TAP) in a given video, along with the TAP-Vid dataset, to drive progress in this under-studied domain.
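To make the task concrete, here is a minimal sketch of the input/output contract a TAP model works with: it consumes a video and a set of query points on physical surfaces, and produces a per-frame position and an occlusion flag for every queried point. The function and array shapes below are illustrative assumptions, not DeepMind’s actual API.

```python
import numpy as np

def track_points(video: np.ndarray, queries: np.ndarray):
    """Illustrative TAP interface (hypothetical, not DeepMind's API).

    video:   (num_frames, height, width, 3) uint8 RGB frames.
    queries: (num_points, 3) array of (t, y, x) points, each marking a
             physical surface point in frame t that should be tracked.

    Returns:
      tracks:   (num_points, num_frames, 2) predicted (x, y) per frame.
      occluded: (num_points, num_frames) bool, True where the point is
                not visible in that frame.
    """
    num_points, num_frames = queries.shape[0], video.shape[0]
    # Placeholder outputs: a real tracker such as TAP-Net would infer these.
    tracks = np.zeros((num_points, num_frames, 2), dtype=np.float32)
    occluded = np.zeros((num_points, num_frames), dtype=bool)
    return tracks, occluded
```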

However, TAP still has limitations. The paper reads, “We cannot handle liquids or transparent objects, and for real data, annotators cannot be perfect, as they are limited to textured points and even then may make occasional errors due to carelessness.”

The team believes the ethical concerns around the dataset are minimal. However, since the real data comes from existing public sources, biases must be treated with care to ensure the fairness of the final algorithm. Advances in TAP could potentially solve many interesting challenges, such as better handling of dynamic or deformable objects in structure-from-motion (SfM) and allowing semantic keypoint-based methods to be applied to generic objects.

Other methods

Another interesting benchmark for tracking any object is TAO, a large-scale benchmark developed by researchers from Carnegie Mellon University, Inria and Argo AI. They introduced a diverse, COCO-like dataset consisting of 2,907 high-resolution videos captured in diverse environments, averaging 30 seconds in length. Besides TAO and COCO, other benchmarks include DAVIS, GOT-10k, YouTube-BB, ScanNet, and others.

However, with its latest benchmark, DeepMind has introduced the problem of Tracking Any Point (TAP), alongside the TAP-Vid dataset, setting a new standard in this under-studied domain. “By training on synthetic data, TAP-Net performs better on our benchmark than prior methods,” said the researchers.
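For a sense of how such a benchmark scores trackers, TAP-Vid compares predicted tracks against ground truth using, among other measures, position accuracy averaged over several pixel thresholds on frames where the point is visible. The sketch below is a simplified illustration of that idea under assumed array shapes; the function name and details are ours, not the benchmark’s official evaluation code.

```python
import numpy as np

def position_accuracy(pred_tracks, gt_tracks, gt_visible,
                      thresholds=(1, 2, 4, 8, 16)):
    """Fraction of visible ground-truth points predicted within each
    pixel threshold, averaged over thresholds (a simplified sketch).

    pred_tracks, gt_tracks: (num_points, num_frames, 2) (x, y) positions.
    gt_visible:             (num_points, num_frames) bool visibility mask.
    """
    err = np.linalg.norm(pred_tracks - gt_tracks, axis=-1)  # pixel distance
    fractions = []
    for thresh in thresholds:
        within = (err < thresh) & gt_visible  # accurate and visible
        fractions.append(within.sum() / max(gt_visible.sum(), 1))
    return float(np.mean(fractions))
```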



Bhuvana Kamath

I am fascinated by technology and AI’s implementation in today’s dynamic world. Being a technophile, I am keen on exploring the ever-evolving trends around applied science and innovation.
