Guide To 6D Object Pose Estimation Using PoseCNN


PoseCNN is an end-to-end convolutional neural network (CNN) framework for 6D object pose estimation. It estimates the 3D translation of an object by localizing the object's center in the image and predicting its distance from the camera, and it estimates the 3D rotation by regressing to a quaternion representation. PoseCNN was proposed by Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox in collaboration with NVIDIA Research. The authors also introduced a novel loss function that enables PoseCNN to handle symmetric objects. To produce their results, they created a custom dataset, the YCB-Video dataset, which provides 6D poses of 21 objects in 92 videos with roughly 133k frames. PoseCNN handles symmetric objects well and can estimate poses using only a single image as input.

Network Architecture

The PoseCNN network contains two stages. The first stage consists of 13 convolutional layers and four max-pooling layers, which extract feature maps at different resolutions from the input image. This stage is the backbone of the network.
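The effect of the four max-pooling layers on resolution can be sketched with a little arithmetic. This is a minimal illustration, not the authors' code: it assumes each pooling layer halves the height and width (the usual stride-2 convention), and the 480 x 640 input size and the function name `feature_map_sizes` are hypothetical choices made for the example.

```python
def feature_map_sizes(height, width, num_pools=4):
    """Return the (height, width) of the feature map after each max-pool.

    Assumption: every pooling layer halves both spatial dimensions.
    """
    sizes = []
    for _ in range(num_pools):
        height, width = height // 2, width // 2
        sizes.append((height, width))
    return sizes


# Hypothetical 480 x 640 input image.
sizes = feature_map_sizes(480, 640)
print(sizes)  # [(240, 320), (120, 160), (60, 80), (30, 40)]
```

Under these assumptions, the last two entries correspond to 1/8 and 1/16 of the input resolution.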

The second stage is an embedding step that maps the high-dimensional feature maps generated by the first stage into low-dimensional, task-specific features. The network then performs, and is trained on, three tasks:

  1. Semantic labeling.
  2. 3D translation estimation.
  3. 3D rotation regression.

1. Semantic labeling

Semantic labeling classifies each pixel of the input image into an object class, rather than only detecting objects. Compared with 6D pose estimation techniques that rely on bounding-box object detection, semantic labeling provides richer information about the objects in the image and handles occlusions better.

The semantic labeling branch takes two feature maps with channel dimension 512 as inputs, as shown in the architecture figure. Their resolutions are 1/8 and 1/16 of the original input image. The branch first reduces the channel dimension of both feature maps to 64 using convolutional layers. It then doubles the resolution of the 1/16 feature map using a deconvolutional layer. After that, the two feature maps are summed, and another deconvolutional layer increases the resolution by 8x to match the original image size. Finally, a convolutional layer produces semantic labeling scores for the image pixels.
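The shape bookkeeping of this branch can be traced step by step. The sketch below only tracks (channels, height, width) tuples and performs no actual convolution. The function name, the 480 x 640 input size, and the class count of 22 (21 objects plus a background class) are assumptions made for illustration, and it follows the PoseCNN paper's description in which the two maps are summed before the final upsampling.

```python
def labeling_branch_shapes(height, width, num_classes, embed_dim=64):
    """Trace (channels, height, width) through the semantic labeling branch."""
    feat_8 = (512, height // 8, width // 8)      # 1/8-resolution backbone features
    feat_16 = (512, height // 16, width // 16)   # 1/16-resolution backbone features

    # Convolutions reduce both maps from 512 to embed_dim channels.
    feat_8 = (embed_dim,) + feat_8[1:]
    feat_16 = (embed_dim,) + feat_16[1:]

    # A deconvolution doubles the resolution of the 1/16 map.
    feat_16_up = (embed_dim, feat_16[1] * 2, feat_16[2] * 2)
    if feat_16_up != feat_8:
        raise ValueError("feature maps must match before they are summed")

    # Element-wise sum, then a deconvolution upsamples by 8x.
    fused = (embed_dim, feat_8[1] * 8, feat_8[2] * 8)

    # A final convolution maps embed_dim channels to per-class scores.
    return (num_classes, fused[1], fused[2])


# Hypothetical input size and class count.
scores_shape = labeling_branch_shapes(480, 640, num_classes=22)
print(scores_shape)  # (22, 480, 640)
```

The output has one score per class for every pixel of the original image.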

During training, a softmax cross-entropy loss is used; during testing, a softmax function computes the class probabilities of each pixel.
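For a single pixel, the relationship between the softmax and the cross-entropy loss can be sketched in plain Python. This is a generic illustration of the two functions, not code from the PoseCNN repository; the function names and the example scores are made up.

```python
import math


def softmax(scores):
    """Convert raw class scores for one pixel into probabilities."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target):
    """Training loss for one pixel: negative log-probability of the true class."""
    return -math.log(softmax(scores)[target])


def predict(scores):
    """Test-time label for one pixel: the class with the highest probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical scores for a pixel over three classes.
pixel_scores = [1.0, 3.0, 0.5]
label = predict(pixel_scores)
print(label)  # 1
```

The loss is small when the true class already has the highest score and grows as the network assigns it less probability.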

2. 3D translation estimation

3D translation is estimated by localizing the 2D object center in the image and estimating the object's distance from the camera.
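Once the 2D center and the distance are known, the full 3D translation follows from the pinhole camera model. The sketch below assumes a standard pinhole projection with focal lengths `fx`, `fy` and principal point `(px, py)`. The function names and the intrinsic values are hypothetical and are not taken from the YCB-Video calibration.

```python
def recover_translation(cx, cy, tz, fx, fy, px, py):
    """Back-project a 2D object center (cx, cy) at depth tz into 3D.

    Inverts the pinhole projection cx = fx * tx / tz + px (same for y).
    """
    tx = (cx - px) * tz / fx
    ty = (cy - py) * tz / fy
    return (tx, ty, tz)


def project(tx, ty, tz, fx, fy, px, py):
    """Project a 3D point onto the image plane (the forward pinhole model)."""
    return (fx * tx / tz + px, fy * ty / tz + py)


# Hypothetical intrinsics and a predicted center 2 m from the camera.
intrinsics = dict(fx=1000.0, fy=1000.0, px=320.0, py=240.0)
translation = recover_translation(420.0, 190.0, 2.0, **intrinsics)
print(translation)
```

Projecting the recovered translation back with `project` returns the original 2D center, which is a quick sanity check for the intrinsics being used.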

3. 3D rotation regression

The lower part of the architecture diagram shows the 3D rotation regression branch. Using the object bounding boxes predicted by the Hough voting layer, the network applies two RoI pooling layers to crop and pool the visual features generated by the first stage for 3D rotation regression.

The pooled feature maps are added together and fed into three fully connected (FC) layers. The first two FC layers have dimension 4096, and the last FC layer has dimension 4 x n, where n is the number of object classes, so that it outputs one quaternion per class.
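To make the 4 x n output concrete, the sketch below picks the four values belonging to one class, normalizes them to a unit quaternion, and converts the result to a rotation matrix. The class-major layout of the output vector and the (w, x, y, z) component order are assumptions made for this example, as are the function names and the sample values.

```python
import math


def class_quaternion(fc_output, class_index):
    """Slice the 4 values for one class from a flat 4*n output and normalize.

    Assumption: values are stored class by class, in (w, x, y, z) order.
    """
    q = fc_output[4 * class_index: 4 * class_index + 4]
    norm = math.sqrt(sum(v * v for v in q))
    return [v / norm for v in q]


def quaternion_to_matrix(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


# Hypothetical FC output for n = 2 classes.
fc_output = [2.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
q = class_quaternion(fc_output, 1)
rotation = quaternion_to_matrix(q)
```

In this example the quaternion for class 1 describes a 90-degree rotation about the z-axis, so `rotation` is approximately `[[0, -1, 0], [1, 0, 0], [0, 0, 1]]`.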


The dataset used for this approach is the YCB-Video dataset. It uses 80 videos for training, and testing is done on 2,949 key frames extracted from the remaining 12 videos.


The code is trained and tested on Ubuntu 16.04 with PyTorch 0.4.1+ and CUDA 9.1.

  1. Install PyTorch
  2. Install Eigen from Github here
  3. Install Sophus from Github here
 git clone
 pip install -r requirement.txt
 git submodule update --init --recursive
 ## Compile the new layers under $ROOT/lib/layers
 cd $ROOT/lib/layers
 sudo python setup.py install
 ## Compile the Cython components
 cd $ROOT/lib/utils
 python setup.py build_ext --inplace
 ## Compile ycb_render in $ROOT/ycb_render
 cd $ROOT/ycb_render
 sudo python setup.py develop


  • Download the 3D models of the YCB objects from here and save them under $ROOT/data.
  • Download the pre-trained checkpoints from here and save them under $ROOT/data as well.
  • Real-world images with pose annotations for 20 YCB objects can be downloaded from here (53 GB).

Running the demo

  1. Download the 3D models and the pre-trained checkpoints, and set up the environment.
  2. Run the following command

Train and Test on the YCB-Video Dataset

First, download the YCB-Video dataset from here, then create a symlink to it using the commands below:

cd $ROOT/data/YCB_Video
 ln -s $ycb_data data 
Now train and test on the YCB-Video dataset:
cd $ROOT
 # multi-gpu training, use 1 GPU or 2 GPUs
 ./experiments/scripts/
 # testing, $GPU_ID can be 0, 1, etc.
 ./experiments/scripts/ $GPU_ID 


In this article, we looked at PoseCNN, a method for 6D object pose estimation that decouples the estimation of 3D rotation and 3D translation. It localizes the object center in the image and predicts the center's distance from the camera. To learn more, you can follow the resources given below:

Mohit Maithani
Mohit is a Data & Technology Enthusiast with good exposure to solving real-world problems in various avenues of IT and Deep learning domain. He believes in solving human's daily problems with the help of technology.
