How To Detect Objects With Detection Transformers?


Pretraining is an ingenious concept that has enabled us to improve model performance across many tasks. We train a model on a pretext task and use the resulting weights to initialize the model for the actual task, transferring knowledge from the pretext task and its data to the new task. Pretraining with an unsupervised/self-supervised task is especially beneficial because such tasks do not need labelled datasets, and gathering a good-quality labelled dataset is a cumbersome task for most problems. In this article, let's explore the Unsupervised Pre-Training of the DETR (Detection Transformer) model, known as UP-DETR.


DETR (Detection Transformer) is an end-to-end object detection model that performs object classification and localization, i.e. bounding box detection. It is a simple encoder-decoder Transformer with a novel loss function that allows us to formulate the complex object detection problem as a set prediction problem. It is very simple compared to other Transformer models used for vision: 50 lines of code are all you need to run this model.


Architecture of DETR

This model performs on par with complex hand-designed models like Faster R-CNN. Following is a high-level overview of the DETR model's architecture.

DETR Architecture

The backbone in the above model is a CNN that extracts useful feature maps from images. The CNN outputs a large number of low-resolution feature maps. The number of feature maps is brought down using a 1×1 convolutional layer, and positional encodings are added to incorporate spatial information into the activations.
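A minimal sketch of this step in PyTorch, using illustrative shapes only (the 2048-channel input and hidden size of 256 match DETR's ResNet-50 defaults; the random positional tensor stands in for the real sine encodings):

```python
import torch
import torch.nn as nn

# A ResNet-50 backbone maps an image to 2048 low-resolution feature maps;
# DETR projects these down to the transformer's hidden size (256) with a
# 1x1 convolution before the encoder.
features = torch.randn(1, 2048, 25, 25)        # (batch, C, H, W) from the CNN
input_proj = nn.Conv2d(2048, 256, kernel_size=1)
projected = input_proj(features)               # (1, 256, 25, 25)

# The spatial grid is flattened into a sequence for the encoder, and
# positional encodings of the same shape are added element-wise.
seq = projected.flatten(2).permute(2, 0, 1)    # (H*W, batch, hidden) = (625, 1, 256)
pos = torch.randn_like(seq)                    # stand-in for sine positional encodings
encoder_input = seq + pos
```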

Now a permutation-invariant transformer encoder takes these sums of feature maps and positional encodings as input and generates a representation for each pixel of the flattened feature maps.

A transformer decoder takes learned positional embeddings called object queries as inputs and generates a vector for each object query. These object queries can be thought of as inputs asking the decoder to pay attention to specific regions.

Visualization of Object Queries

These are the visualizations of the bounding boxes predicted for each object query. We can see that each query concentrates on different regions of the image.

These inputs are passed through the following transformer decoder which uses the encoder’s output to generate outputs for each query.


The decoder’s results are fed into a feed-forward network that predicts a class and a bounding box for each object query, all in parallel. To represent that no object is present, a special class label Ø is used.
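A sketch of these prediction heads, assuming DETR's usual sizes (hidden dimension 256, 91 COCO classes plus the Ø label, 100 queries); the weights here are untrained stand-ins:

```python
import torch
import torch.nn as nn

hidden_dim, num_classes, num_queries = 256, 91, 100
decoder_out = torch.randn(num_queries, hidden_dim)     # one vector per object query

# A linear layer scores the classes (plus one extra logit for Ø), and a
# small MLP regresses normalized (cx, cy, w, h) boxes in [0, 1].
class_head = nn.Linear(hidden_dim, num_classes + 1)
box_head = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
    nn.Linear(hidden_dim, 4), nn.Sigmoid(),
)

logits = class_head(decoder_out)   # (100, 92): class scores per query
boxes = box_head(decoder_out)      # (100, 4): normalized boxes per query
```

Note that all N queries are scored in one pass; there is no autoregressive decoding step.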


The predicted results are a set of N predictions. Comparing these N predictions to the ground-truth set is non-trivial. To compute the loss, an optimal bipartite matching is created between predictions and ground truth: among all permutations, we pick the one that minimizes the total matching cost between each matched prediction and ground-truth pair.

LMatch is given by 
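Reconstructed from the DETR paper (the original equation image did not survive the page): σ(i) is the index of the prediction matched to ground-truth object i, c_i its target class, and p̂ the predicted class probability.

```latex
\mathcal{L}_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)}) =
  -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i)
  + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\!\left(b_i, \hat{b}_{\sigma(i)}\right)
```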

Once we have this bipartite mapping we can calculate a loss called Hungarian Loss. 
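As given in the DETR paper, with σ̂ the optimal assignment found by the bipartite matching above:

```latex
\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) =
  \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i)
  + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\!\left(b_i, \hat{b}_{\hat{\sigma}(i)}\right) \right]
```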

Lbox is given by
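Reconstructed from the DETR paper (λ_iou and λ_L1 are weighting hyperparameters):

```latex
\mathcal{L}_{\mathrm{box}}\!\left(b_i, \hat{b}_{\sigma(i)}\right) =
  \lambda_{\mathrm{iou}}\, \mathcal{L}_{\mathrm{iou}}\!\left(b_i, \hat{b}_{\sigma(i)}\right)
  + \lambda_{L1}\, \left\lVert b_i - \hat{b}_{\sigma(i)} \right\rVert_1
```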

Liou is the generalized intersection-over-union (GIoU) loss between the bounding boxes.

Unsupervised Pre-Training of DETR

The DETR model, despite being very simple, is not easily approachable because of its huge training cost. It takes a week to train this model on ImageNet data even on a state-of-the-art system with 8 V100 GPUs. Without some degree of pretraining, it is almost impossible for most people to use this model. A novel unsupervised pretext task was introduced by Zhigang Dai, Bolun Cai, Yugeng Lin and Junying Chen, researchers at South China University of Technology and the Tencent WeChat AI team, in a paper published on 18th November 2020.

Patches are randomly cropped from the input image and the model is used to reconstruct these patches.

Single Query Patch

Patches are passed through a frozen CNN, and the outputs are summed with object queries. The result is used as input to the DETR decoder. Outputs of the decoder are matched with the original cropped patches, and the following loss is minimized.

There are only two classes: class 0 when the predicted bounding box does not match the cropped patch, and class 1 when it does. The Lrec term is a reconstruction loss between the patch feature generated by the model and the feature of the original patch. It is given by
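A reconstruction of the formula, as I read it in the UP-DETR paper (f_i is the CNN feature of cropped patch i, f̂ the matched prediction; both are ℓ2-normalised before taking the squared distance):

```latex
\ell_{\mathrm{rec}}\!\left(f_i, \hat{f}_{\hat{\sigma}(i)}\right) =
  \left\lVert \frac{f_i}{\lVert f_i \rVert_2}
  - \frac{\hat{f}_{\hat{\sigma}(i)}}{\lVert \hat{f}_{\hat{\sigma}(i)} \rVert_2} \right\rVert_2^{2}
```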

To provide the model with the ability to localize more than one object at a time, multiple cropped patches are used during pre-training. Following is the approach for using multiple patches at once.

Multiple Query Patches

Compared to a single query patch, this is a little more complex. Because we have multiple patches, we need a scheme to select the object queries to use for each patch. This is done by dividing the N object queries into M equal groups and assigning each group to an image patch.
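A tiny sketch of this grouping, with hypothetical sizes N = 100 queries and M = 10 patches (so each patch owns N/M = 10 queries):

```python
import torch

N, M = 100, 10
query_ids = torch.arange(N)
# groups[p] lists the object queries assigned to cropped patch p
groups = query_ids.reshape(M, N // M)
# e.g. groups[0] holds queries 0..9, groups[3] holds queries 30..39
```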

Two new problems arise because of this. Let’s see what these problems are and how to solve them.

Independence of Query Patches

The cropped patches are randomly selected and are independent of each other. This independence must be preserved throughout the decoder, i.e. object queries assigned to one patch must not interact with object queries from other patches. This can be enforced using an attention mask. The mask is added to the Q·K similarity while calculating attention. Its value is −∞ when Q and K belong to different image patches and 0 when they belong to the same image patch.
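A minimal sketch of such a mask, again assuming N = 100 queries in M = 10 groups: entries are 0 within a group and −∞ across groups, so cross-patch attention weights vanish after the softmax.

```python
import torch

N, M = 100, 10
group = torch.arange(N) // (N // M)                 # group id of each object query

# 0 where query i and key j share a patch group, -inf otherwise
mask = torch.where(group[:, None] == group[None, :],
                   torch.tensor(0.0), torch.tensor(float('-inf')))

# The mask is added to the Q.K similarity before the softmax:
scores = torch.randn(N, N) + mask
attn = scores.softmax(dim=-1)                       # cross-group weights become 0
```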


Diversity of Object Queries

If we always assigned the first N/M object queries to the same image patch, all of them would learn the same parameters. To keep the object queries from converging to the same value, we shuffle their order so that each query sees a different set of image patches and learns different positional embeddings.
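The shuffle itself is a one-liner; a sketch with stand-in embeddings (the 100×256 shape is illustrative):

```python
import torch

N = 100
object_queries = torch.randn(N, 256)   # learned query embeddings (stand-in values)
perm = torch.randperm(N)               # fresh random order each pre-training iteration
shuffled_queries = object_queries[perm]
```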

Performance Boost

Unsupervised pre-training of DETR improves the speed of fine-tuning as well as the precision of DETR. According to the paper, UP-DETR makes the model converge much faster and improves the Average Precision by a large margin.



Inference with the UP-DETR model is the same as inference with DETR. Let's see how to do inference using a model pre-trained on ImageNet and fine-tuned on the COCO dataset.

First, we need to get the code from this GitHub repository.

 !git clone
 !mv ./up-detr/* ./ 

Now we need to download the fine-tuned model from here. The following boilerplate code will help you load the checkpoint into Google Colab.

 !pip install -U -q PyDrive
 from pydrive.auth import GoogleAuth
 from import GoogleDrive
 from google.colab import auth
 from oauth2client.client import GoogleCredentials
 # Authenticate and create the PyDrive client.
 # This only needs to be done once per notebook.
 auth.authenticate_user()
 gauth = GoogleAuth()
 gauth.credentials = GoogleCredentials.get_application_default()
 drive = GoogleDrive(gauth)
 file_id = '1_YNtzKKaQbgFfd6m2ZUCO6LWpKqd7o7X' # Drive id of the UP-DETR fine-tuned model.
 downloaded = drive.CreateFile({'id': file_id})
 downloaded.GetContentFile('up-detr-coco-fine-tuned-300ep.pth') # save the checkpoint locally

The following code builds the DETR model and loads the pre-trained weights.

 import torch
 from models.backbone import Backbone, Joiner
 from models.detr import DETR, PostProcess
 from models.position_encoding import PositionEmbeddingSine
 from models.segmentation import DETRsegm, PostProcessPanoptic
 from models.transformer import Transformer
 import torchvision.transforms as T
 def build_detr(num_classes=91):
     hidden_dim = 256
     backbone = Backbone("resnet50", train_backbone=False, return_interm_layers=False, dilation=False)
     pos_enc = PositionEmbeddingSine(hidden_dim // 2, normalize=True)
     backbone_with_pos_enc = Joiner(backbone, pos_enc)
     backbone_with_pos_enc.num_channels = backbone.num_channels
     transformer = Transformer(d_model=hidden_dim, normalize_before=True, return_intermediate_dec=True)
     model = DETR(
         backbone_with_pos_enc,
         transformer,
         num_classes=num_classes,
         num_queries=100,  # DETR's default number of object queries
     )
     return model
 model = build_detr()
 checkpoint = torch.load('up-detr-coco-fine-tuned-300ep.pth',map_location='cpu')['model']
 msg = model.load_state_dict(checkpoint,strict=False)

On successful restoration of the model from the checkpoint, the following message will be printed:

<All keys matched successfully>

Now we should get a sample image and prepare it for the model.

 from PIL import Image
 import requests
 url = ''
 image =, stream=True).raw)
Input Image
 image_transform = T.Compose([
     T.Resize(800),
     T.ToTensor(),
     T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
 ])
 # run the model; rescale_bboxes and plot_results are helper functions
 # defined in the repository's demo notebook
 img = image_transform(image).unsqueeze(0)
 with torch.no_grad():
     outputs = model(img)
 probas = outputs['pred_logits'].softmax(-1)[0, :, :-1]
 keep = probas.max(-1).values > 0.9
 # convert boxes from [0; 1] to image scales
 bboxes_scaled = rescale_bboxes(outputs['pred_boxes'][0, keep], image.size)
 plot_results(image, probas[keep], bboxes_scaled)
Detected Objects

We would have wanted the two rightmost cats to be detected as two separate cats, but the overlap is too great for the model to distinguish them.

