Meta AI proposes a new approach to improve object detection

Researchers from Meta AI and the University of Texas at Austin have proposed Detic, a method that trains a detector's classifiers on image classification data and thereby significantly expands the detector's vocabulary. Unlike earlier approaches, Detic does not assign image labels to boxes based on model predictions, which makes it easier to implement and compatible with a range of detection architectures and backbones.

What makes Detic different? 

Object detection consists of two subproblems: finding the object (localisation) and naming it (classification). Detection datasets are usually much smaller than image classification datasets, both in number of images and in vocabulary (object classes). For example, the recent LVIS detection dataset has about 120,000 images covering approximately 1,000 classes.

The researchers propose a simple classification loss that applies image-level supervision to the proposal with the largest spatial size, without supervising any other outputs for image-labelled data.
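The idea of supervising only the largest proposal can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy boxes and logits, and the choice of a per-class binary cross-entropy are assumptions made for illustration.

```python
import math

def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def largest_proposal_index(proposals):
    """Index of the proposal with the largest spatial size."""
    return max(range(len(proposals)), key=lambda i: box_area(proposals[i]))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def image_label_loss(proposals, logits, image_labels):
    """Binary cross-entropy on the largest proposal only (assumed loss form).

    proposals: list of (x1, y1, x2, y2) boxes
    logits: one list of per-class logits per proposal
    image_labels: set of class indices present in the image
    """
    idx = largest_proposal_index(proposals)
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for cls, logit in enumerate(logits[idx]):
        p = sigmoid(logit)
        target = 1.0 if cls in image_labels else 0.0
        loss -= target * math.log(p + eps) + (1.0 - target) * math.log(1.0 - p + eps)
    return idx, loss

# Toy example: three proposals, three classes, image labelled with classes 0 and 2.
proposals = [(10, 10, 50, 50), (0, 0, 200, 150), (120, 30, 180, 90)]
logits = [[2.0, -1.0, 0.5], [0.3, -2.0, 1.2], [-0.5, 0.8, 0.1]]
idx, loss = image_label_loss(proposals, logits, {0, 2})
print(idx, round(loss, 4))
```

Because only the largest proposal's logits enter the loss, the choice of region does not depend on what the classifier currently predicts.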

Weakly supervised object detection 

Weakly supervised object detection (WSOD) is gaining traction because collecting large amounts of data with accurate object-level annotations, as fully supervised detection requires, is costly.

Existing weakly supervised detection techniques train object detectors using only image-level labels. They use this weakly labelled data to supervise both the localisation and the classification parts of object detection, without any box supervision.

Since image classification data has no box labels, WSOD techniques have to devise label-to-box assignment methods to obtain boxes. However, these assignments require solid initial detections. This leads to a chicken-and-egg problem: good label assignment requires a good detector, but a good detector requires many boxes to be trained.

Semi-supervised WSOD, on the other hand, uses bounding box supervision in conjunction with image labels. For instance, YOLO9000 mixes detection data and classification data in the same mini-batch, and assigns classification labels to the anchors with the highest predicted scores.
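Prediction-based assignment of this kind can be sketched as follows. This is a simplified illustration, not YOLO9000's actual code: the function name and the toy score tables are hypothetical.

```python
def assign_label_by_prediction(scores, label):
    """Pick the anchor whose predicted score for `label` is highest.

    scores: one list of per-class scores per anchor (hypothetical model output)
    label: index of the image-level class to assign
    """
    return max(range(len(scores)), key=lambda i: scores[i][label])

# The same image, scored by two different model states (toy numbers).
early_model_scores = [[0.40, 0.10], [0.35, 0.20], [0.30, 0.05]]
later_model_scores = [[0.20, 0.10], [0.90, 0.20], [0.30, 0.05]]

early_choice = assign_label_by_prediction(early_model_scores, label=0)
later_choice = assign_label_by_prediction(later_model_scores, label=0)
print(early_choice, later_choice)
```

The assigned anchor changes with the model's current predictions, which is the dependency behind the chicken-and-egg problem described above.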

Detic is a semi-supervised WSOD method. It avoids the label-assignment step by supervising the classification subproblem independently when training on classification data. This lets it learn detectors for new classes that prediction-based methods could not predict and therefore could not assign.

Open vocabulary object detection 

Open-vocabulary object detection aims to detect objects whose classes are not part of the training vocabulary. The standard approach is to replace the final classification layer with language embeddings of the class names.
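A minimal sketch of classifying with class-name embeddings is shown below. The three-dimensional vectors are made-up stand-ins for real text embeddings, and the function names are hypothetical. A region is scored by cosine similarity against each class embedding.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def classify_region(region_feature, class_embeddings):
    """Score a region feature against each class-name embedding by cosine similarity."""
    feature = l2_normalize(region_feature)
    scores = {
        name: sum(a * b for a, b in zip(feature, l2_normalize(embedding)))
        for name, embedding in class_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (assumed values); a real system would take them from a text encoder.
vocabulary = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
region = [0.1, 0.2, 0.9]

before, _ = classify_region(region, vocabulary)

# Extending the vocabulary only requires a new embedding, not a retrained layer.
vocabulary["zebra"] = [0.0, 0.0, 1.0]
after, _ = classify_region(region, vocabulary)
print(before, after)
```

In this setup the classifier's weights are the embeddings themselves, so the vocabulary can change at test time.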

Earlier work improved the classifier embeddings by introducing additional text data or by using contrastive learning to pre-train the detector on image-text pairs. ViLD upgraded the language embedding to CLIP and distilled region features from CLIP image features. Detic, like ViLD, uses CLIP embeddings as the classifier. However, instead of using distillation, it integrates additional image-labelled data for co-training.

Large vocabulary object detection

Large-vocabulary object detection requires detecting 1,000+ classes. Previous work has addressed the resulting long-tail problem with repeat factor sampling (which oversamples classes with fewer annotations), as well as equalization losses and Seesaw losses (which re-weight the per-class loss by balancing the gradients or the number of samples). Detic instead tackles the problem by adding image-labelled data.
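Repeat factor sampling can be sketched as below, assuming the commonly used form in which a category's repeat factor is max(1, sqrt(t / f)), where f is the fraction of images containing the category and t is a threshold. The threshold value and the toy dataset are hypothetical, and this is not taken from Detic's code.

```python
import math
from collections import Counter

def image_repeat_factors(image_labels, threshold):
    """Per-image repeat factors; rare categories get factors above 1.

    image_labels: list of non-empty sets of category names, one per image
    threshold: frequency below which a category is oversampled (assumed value)
    """
    num_images = len(image_labels)
    image_counts = Counter(c for labels in image_labels for c in labels)
    category_factor = {
        c: max(1.0, math.sqrt(threshold / (count / num_images)))
        for c, count in image_counts.items()
    }
    # An image is repeated according to its rarest category.
    return [max(category_factor[c] for c in labels) for labels in image_labels]

# Toy dataset: "person" appears in all 10 images, "unicycle" in only one.
dataset = [{"person"} for _ in range(9)] + [{"person", "unicycle"}]
factors = image_repeat_factors(dataset, threshold=0.4)
print(factors)
```

Images containing the rare category are drawn about twice as often here, while the rest keep a factor of one.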

Language supervision for object detection 

Detic uses language data in a manner similar to Cap2Det, which first learns a mapping from sentences to image labels in the detector's object classes, and then applies WSOD. Unlike Cap2Det, Detic extracts image labels from captions with a simple text match.
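One way such a text match might look is sketched below. The exact matching rule is not given here, so the whole-word, case-insensitive match with a naive plural suffix is an assumption, and the function name and sample caption are hypothetical.

```python
import re

def labels_from_caption(caption, vocabulary):
    """Return the vocabulary classes whose names appear as whole words in the caption."""
    text = caption.lower()
    found = set()
    for name in vocabulary:
        # Whole-word match, allowing a simple plural suffix (assumed rule).
        pattern = r"\b" + re.escape(name.lower()) + r"(s|es)?\b"
        if re.search(pattern, text):
            found.add(name)
    return found

vocabulary = ["dog", "frisbee", "bench", "cat", "hot dog"]
caption = "Two dogs chase a frisbee near a park bench."
labels = labels_from_caption(caption, vocabulary)
print(sorted(labels))
```

The extracted names can then serve as image-level labels for that image, with no learned sentence-to-label mapping.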


Detic is a simple way to apply image-level supervision to large-vocabulary object detection. However, one major limitation is that it does not consider overall dataset statistics and supervises all image labels with the same region. Furthermore, open-vocabulary generalisation may not work in extreme domains.

Nevertheless, Detic has successfully extended large-vocabulary detection across an assortment of weak data sources, classifiers, detector architectures, and training methods. Its generalisation capabilities also benefit from CLIP's large-scale pretraining. The researchers hope Detic will make object detection easier to deploy and advance research in open-vocabulary detection.

Srishti Mukherjee
Drowned in reading sci-fi, fantasy, and classics in equal measure; Srishti carries her bond with literature head-on into the world of science and tech, learning and writing about the fascinating possibilities in the fields of artificial intelligence and machine learning. Making hyperrealistic paintings of her dog Pickle and going through succession memes are her ideas of fun.
