Guide To 6D Object Pose Estimation Using PoseCNN

Mohit Maithani

PoseCNN (Pose Convolutional Neural Network) is an end-to-end framework for 6D object pose estimation. It estimates the 3D translation of an object by localizing the object's center in the image and predicting its distance from the camera, and it estimates the 3D rotation by regressing to a quaternion representation. PoseCNN was proposed by Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox in collaboration with NVIDIA research. The authors also introduced a novel loss function that helps PoseCNN handle symmetric objects. To benchmark their results, they created the YCB-Video dataset, which provides 6D poses of 21 objects across 92 videos with almost 133k frames. PoseCNN handles symmetric objects well and can estimate poses accurately using only a single image as input.
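The symmetric-object loss mentioned above (called ShapeMatch-Loss in the paper) measures, for each point on the 3D model rotated by the predicted quaternion, the distance to the closest point on the model rotated by the ground-truth quaternion, so any pose in the object's symmetry group incurs no penalty. A minimal PyTorch sketch of the idea (the function names are my own, and the brute-force `cdist` matching is for illustration only):

```python
import torch

def quat_to_rotmat(q):
    # q: (4,) unit quaternion (w, x, y, z) -> (3, 3) rotation matrix
    w, x, y, z = q
    return torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])

def shapematch_loss(q_pred, q_gt, points):
    """points: (m, 3) points sampled on the object's 3D model.
    Average over predicted-pose points of the squared distance to the
    nearest ground-truth-pose point, halved as in the paper."""
    p_pred = points @ quat_to_rotmat(q_pred).T   # model under predicted rotation
    p_gt = points @ quat_to_rotmat(q_gt).T       # model under ground-truth rotation
    d2 = torch.cdist(p_pred, p_gt).pow(2)        # (m, m) pairwise squared distances
    return d2.min(dim=1).values.mean() / 2.0
```

For a point set that is symmetric under a 180-degree rotation, this loss is zero even when the predicted and ground-truth quaternions differ by that rotation, which is exactly why it suits symmetric objects.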

Network Architecture

The PoseCNN network has two stages. The first stage consists of 13 convolutional layers and four max-pooling layers, which extract feature maps from the input image at different resolutions. This stage is the primary backbone of the network.

The second stage is the embedding step, which maps the high-dimensional feature maps generated by the first stage into low-dimensional, task-specific features. The network then performs, and is trained on, three different tasks:

  1. Semantic labeling.
  2. 3D translation estimation.
  3. 3D rotation regression.
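The first stage described above matches the convolutional part of VGG16 (13 conv layers), cut off after the fourth max-pooling layer so that it yields 512-channel feature maps at 1/8 and 1/16 of the input resolution. A minimal PyTorch sketch under that assumption (class and helper names are my own):

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    # n_convs 3x3 convolutions with ReLU, keeping spatial size
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class FeatureBackbone(nn.Module):
    """Stage 1: 13 conv layers and 4 max-pooling layers (VGG16-style).
    Returns 512-channel maps at 1/8 and 1/16 of the input resolution."""
    def __init__(self):
        super().__init__()
        self.b1 = vgg_block(3, 64, 2)
        self.b2 = vgg_block(64, 128, 2)
        self.b3 = vgg_block(128, 256, 3)
        self.b4 = vgg_block(256, 512, 3)
        self.b5 = vgg_block(512, 512, 3)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.pool(self.b1(x))        # 1/2 resolution
        x = self.pool(self.b2(x))        # 1/4 resolution
        x = self.pool(self.b3(x))        # 1/8 resolution
        f8 = self.b4(x)                  # 1/8 resolution, 512 channels
        f16 = self.b5(self.pool(f8))     # 1/16 resolution, 512 channels
        return f8, f16
```

The two returned maps are the inputs consumed by the three task branches described next.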

1. Semantic labeling

Semantic labeling detects objects in the image by classifying each input pixel into an object class. Compared with 6D pose estimation techniques that rely on object detection with bounding boxes, semantic labeling gives richer information about the objects in the image and handles occlusions better.

As shown in the figure above, this branch takes two feature maps with 512 channels as input, at 1/8 and 1/16 of the original image resolution. It first reduces the channel dimension of both feature maps to 64 using convolutional layers. A deconvolutional layer then doubles the resolution of the 1/16 feature map so it can be summed with the 1/8 feature map. Another deconvolutional layer increases the resolution of the fused map by 8x, back to the input size. Finally, a convolutional layer produces semantic labeling scores for the image pixels.

During training, a softmax cross-entropy loss is applied to these scores; at test time, the softmax function is used to predict the class of each pixel.
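The steps above can be sketched in a few lines of PyTorch. This is a minimal illustration of the reduce-upsample-fuse pattern, not the exact PoseCNN layer configuration (the kernel sizes of the deconvolutions are my own choice, picked to give exactly 2x and 8x upsampling):

```python
import torch
import torch.nn as nn

class SemanticHead(nn.Module):
    """Labeling branch sketch: reduce both 512-channel maps to 64 channels,
    upsample the 1/16 map to 1/8, sum, upsample 8x, then classify pixels."""
    def __init__(self, num_classes):
        super().__init__()
        self.reduce8 = nn.Conv2d(512, 64, 1)
        self.reduce16 = nn.Conv2d(512, 64, 1)
        self.up2 = nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1)   # 2x
        self.up8 = nn.ConvTranspose2d(64, 64, 16, stride=8, padding=4)  # 8x
        self.score = nn.Conv2d(64, num_classes, 1)

    def forward(self, f8, f16):
        fused = self.reduce8(f8) + self.up2(self.reduce16(f16))  # 1/8 resolution
        return self.score(self.up8(fused))  # per-pixel class logits, full resolution
```

The returned logits are what the softmax cross-entropy loss is applied to during training.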

2. 3D translation estimation

3D translation estimation localizes the 2D object center in the image and estimates the object's distance from the camera. Each pixel votes for the location of its object's center through a Hough voting layer, and the network also regresses the depth of the center; the full 3D translation is then recovered using the camera intrinsics.
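Given the voted 2D center and the regressed depth, recovering the translation is a direct application of the pinhole camera model. A small sketch (the function name is my own):

```python
def translation_from_center(cx, cy, tz, fx, fy, px, py):
    """Recover T = (Tx, Ty, Tz) from the voted 2D object center (cx, cy)
    and the regressed depth tz, by inverting the pinhole projection
        cx = fx * Tx / Tz + px,   cy = fy * Ty / Tz + py
    where fx, fy are focal lengths and (px, py) is the principal point."""
    tx = (cx - px) * tz / fx
    ty = (cy - py) * tz / fy
    return tx, ty, tz
```

Estimating the center and depth rather than regressing (Tx, Ty, Tz) directly is what lets the approach cope with objects moving around the image and with occluded centers.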

3. 3D rotation regression

The lower part of the architecture diagram above shows the 3D rotation regression branch. The object bounding boxes predicted by the Hough voting layer are used by two RoI pooling layers to crop and pool the feature maps generated by the first stage of the network for 3D rotation regression.

The pooled feature maps are fed into three fully connected (FC) layers. The first two FC layers have dimension 4096, and the last FC layer has dimension 4 x n, where n is the number of object classes, so that it outputs one quaternion per class.
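The FC stack above can be sketched as follows in PyTorch. The 7x7 RoI output size and 512-channel input are assumptions on my part; the quaternion normalization reflects that a rotation must be represented by a unit quaternion:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotationHead(nn.Module):
    """Rotation branch sketch: RoI-pooled features through two FC-4096
    layers, then an FC layer emitting 4 values (a quaternion) per class."""
    def __init__(self, num_classes, roi_feat_dim=512 * 7 * 7):
        super().__init__()
        self.num_classes = num_classes
        self.fc1 = nn.Linear(roi_feat_dim, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, 4 * num_classes)

    def forward(self, roi_feats):
        x = F.relu(self.fc1(roi_feats.flatten(1)))
        x = F.relu(self.fc2(x))
        q = self.fc3(x).view(-1, self.num_classes, 4)
        return F.normalize(q, dim=-1)  # unit quaternion per object class
```

At test time, only the quaternion for the class assigned by the semantic labeling branch is used.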


The dataset used for this approach is the YCB-Video dataset; it consists of 80 videos for training, and 2,949 key frames extracted from the remaining 12 videos are used for testing.


The code is trained and tested on Ubuntu 16.04 with PyTorch 0.4.1+ and CUDA 9.1.

  1. Install PyTorch
  2. Install Eigen from GitHub here
  3. Install Sophus from GitHub here
 git clone
 pip install -r requirement.txt
 git submodule update --init --recursive
 ## Compile the new layers under $ROOT/lib/layers
 cd $ROOT/lib/layers
 sudo python setup.py install
 ## Compile the Cython modules
 cd $ROOT/lib/utils
 python setup.py build_ext --inplace
 ## Compile ycb_render in $ROOT/ycb_render
 cd $ROOT/ycb_render
 sudo python setup.py develop


  • Download the 3D models of the YCB objects from here and save them under $ROOT/data.
  • Download the pre-trained checkpoints from here and likewise save them under $ROOT/data.
  • Real-world images with pose annotations for 20 YCB objects can be downloaded from here (53 GB).

Running the demo

  1. Download the 3D models and the pre-trained checkpoints, and set up the environment.
  2. Run the following command:

Train and Test on the YCB-Video Dataset

First, download the YCB-Video dataset from here, and then create a symlink to it using the commands below:

 cd $ROOT/data/YCB_Video
 ln -s $ycb_data data
Let's train and test on the YCB-Video dataset:
 cd $ROOT
 # multi-GPU training, use 1 GPU or 2 GPUs
 ./experiments/scripts/
 # testing, $GPU_ID can be 0, 1, etc.
 ./experiments/scripts/ $GPU_ID


We learned about a new method for object pose estimation: PoseCNN decouples the estimation of 3D rotation and 3D translation. It localizes the object center in the image and predicts the object's distance from the camera. To learn more, you can refer to the original paper and the official code repository.


Copyright Analytics India Magazine Pvt Ltd
