It is important for AI systems to perceive how objects move and to gain a physical understanding of rotation and shape change, particularly in contexts such as surveillance and self-driving vehicles. To advance this line of research, researchers from Google-owned DeepMind have introduced a new benchmark, ‘TAP-Vid’, in their paper, ‘TAP-Vid: A Benchmark for Tracking Any Point in a Video’, for tracking points on physical surfaces in videos.
The benchmark comprises real-world videos with human annotations of point tracks, alongside synthetic videos with perfect, ground-truth point tracks. The team has also proposed a simple end-to-end point-tracking model, TAP-Net, which, trained on synthetic data, outperforms all prior methods on the benchmark.
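To make the evaluation concrete, the sketch below shows one way point-track predictions can be scored against ground truth: the fraction of visible points predicted within a set of pixel thresholds, averaged over thresholds. This is an illustrative metric in the spirit of the benchmark's position-accuracy evaluation, not DeepMind's exact implementation; the function name, array shapes, and thresholds here are assumptions for the example.

```python
import numpy as np

def position_accuracy(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """Illustrative point-tracking metric (not the official TAP-Vid code).

    pred, gt: (num_points, num_frames, 2) arrays of (x, y) pixel positions.
    visible:  (num_points, num_frames) boolean mask of ground-truth visibility.
    Returns the fraction of visible points whose predicted position lies
    within each pixel threshold, averaged over the thresholds.
    """
    # Per-point, per-frame Euclidean error in pixels.
    dists = np.linalg.norm(pred - gt, axis=-1)
    # Only score frames where the ground-truth point is visible.
    accs = [(dists[visible] < t).mean() for t in thresholds]
    return float(np.mean(accs))

# Toy example: 2 points over 3 frames, every prediction off by ~0.71 px,
# so all thresholds are satisfied and the score is 1.0.
gt = np.zeros((2, 3, 2))
pred = gt + np.array([0.5, 0.5])
vis = np.ones((2, 3), dtype=bool)
print(position_accuracy(pred, gt, vis))  # -> 1.0
```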
The researchers formalise the problem of Tracking Any Point (TAP) in a given video and introduce the TAP-Vid dataset to drive progress in this under-studied domain.
However, TAP still has limitations. The paper reads, “We cannot handle liquids or transparent objects, and for real data, annotators cannot be perfect, as they are limited to textured points and even then may make occasional errors due to carelessness.”
The team believes that the ethical concerns of the dataset are minimal. However, the real data comes from existing public sources, which means that biases must be treated with care to ensure fairness of the final algorithm. Advances in TAP could potentially address many interesting challenges, such as better handling of dynamic or deformable objects in structure-from-motion (SfM) and extending semantic keypoint-based methods to generic objects.
Another interesting benchmark for object tracking is TAO, a large-scale benchmark for tracking any object, developed by researchers from Carnegie Mellon University, Inria and Argo AI. They introduced a diverse dataset, similar to COCO, consisting of 2,907 high-resolution videos captured in diverse environments and averaging 30 seconds in length. Besides TAO and COCO, other benchmarks include DAVIS, GOT-10K, YouTube BB, ScanNet, and others.
DeepMind’s latest benchmark, in contrast, targets individual points rather than whole objects, and the TAP-Vid dataset sets a new standard in this under-studied domain. “By training on synthetic data, TAP-Net performs better on our benchmark than prior methods,” said the researchers.