PoseCNN (Convolutional Neural Network) is an end-to-end framework for 6D object pose estimation. It estimates the 3D translation of an object by localizing the object's center in the image and predicting its distance from the camera, and it estimates the 3D rotation by regressing to a quaternion representation. PoseCNN was proposed by Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox in collaboration with NVIDIA Research. The authors also introduced a novel loss function (called ShapeMatch-Loss in the paper) that helps PoseCNN handle symmetric objects. To produce their results they created the YCB-Video dataset, which provides 6D poses of 21 objects in 92 videos with roughly 133k frames. PoseCNN handles symmetric objects well and can estimate pose from a single image as input.
The PoseCNN network has two stages. The first stage consists of 13 convolutional layers and four max-pooling layers, which extract feature maps at different resolutions from the input image. This stage is the backbone of the network.
The second stage is an embedding step that maps the high-dimensional feature maps generated by the first stage into low-dimensional, task-specific features. The network is then trained to perform three tasks:
- Semantic labeling.
- 3D translation estimation.
- 3D rotation regression.
1. Semantic labeling
Semantic labeling detects objects by classifying each pixel of the input image into an object class. Compared with 6D pose estimation techniques that rely on bounding-box object detection, semantic labeling provides richer information about the objects in the image and handles occlusions better.
The labeling branch takes as input two feature maps with channel dimension 512, as shown in the architecture figure. Their resolutions are 1/8 and 1/16 of the original input image. The branch first reduces the channel dimension of both feature maps to 64 using convolutional layers. It then doubles the resolution of the 1/16 feature map with a deconvolutional layer. After that, the two feature maps are summed and another deconvolutional layer increases the resolution by 8x, back to the original image size. Finally, a convolutional layer produces semantic labeling scores for the image pixels.
During training, a softmax cross-entropy loss is used; during testing, a softmax function computes the class probabilities of each pixel.
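The resolution bookkeeping in this branch can be traced with a few lines of arithmetic. The sketch below is not the authors' code: the 480 x 640 input size and the helper name `labeling_branch_shapes` are assumptions chosen for illustration, and it tracks only tensor shapes, not the learned layers. Following the paper's description, it assumes the two maps are summed before the final 8x upsampling.

```python
def labeling_branch_shapes(height, width, embed_dim=64):
    """Trace (channels, H, W) through the semantic labeling branch.

    Only shapes are tracked; no convolution is actually computed.
    """
    # Two backbone feature maps with 512 channels at 1/8 and 1/16 resolution.
    feat_8 = (512, height // 8, width // 8)
    feat_16 = (512, height // 16, width // 16)

    # A convolution reduces each map to `embed_dim` channels.
    emb_8 = (embed_dim, feat_8[1], feat_8[2])
    emb_16 = (embed_dim, feat_16[1], feat_16[2])

    # A deconvolution doubles the resolution of the 1/16 map.
    up_16 = (embed_dim, emb_16[1] * 2, emb_16[2] * 2)

    # The two maps can only be summed if their shapes agree.
    if up_16 != emb_8:
        raise ValueError("input height and width must be divisible by 16")
    summed = emb_8

    # A second deconvolution upsamples by 8x, back to the input size.
    full = (embed_dim, summed[1] * 8, summed[2] * 8)
    return {"feat_8": feat_8, "feat_16": feat_16, "summed": summed, "full": full}


# Hypothetical 480 x 640 input image.
shapes = labeling_branch_shapes(480, 640)
print(shapes["summed"])  # (64, 60, 80)
print(shapes["full"])    # (64, 480, 640)
```

The shape check makes one practical constraint visible: the sum only works when the input height and width are multiples of 16.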
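To make the training/testing distinction concrete, here is a minimal per-pixel sketch in plain Python. It is an illustration only: the function names and the three-class score vector are hypothetical, and a real implementation would apply the same operations to every pixel of a score map at once.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one pixel's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target):
    """Training: negative log-probability of the true class for one pixel."""
    return -math.log(softmax(scores)[target])


def predict_label(scores):
    """Testing: pick the class with the highest softmax probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical scores for one pixel over three classes.
pixel_scores = [0.5, 2.0, -1.0]
print(predict_label(pixel_scores))  # 1
# The loss is lower when the true class is the one with the highest score.
print(cross_entropy(pixel_scores, 1) < cross_entropy(pixel_scores, 2))  # True
```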
2. 3D translation estimation
3D translation estimation localizes the 2D object center in the image and estimates the object's distance from the camera; together, these determine the 3D translation.
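Once the 2D center and the distance are known, the full translation follows from the pinhole camera model, under which the center projects as cx = fx * Tx / Tz + px and cy = fy * Ty / Tz + py. The sketch below simply inverts that projection. The function names and the intrinsics (focal lengths `fx`, `fy` and principal point `px`, `py`) are hypothetical values for illustration, not taken from the YCB-Video camera.

```python
def recover_translation(cx, cy, tz, fx, fy, px, py):
    """Invert the pinhole projection to get (Tx, Ty, Tz) in camera coordinates."""
    if tz <= 0:
        raise ValueError("depth Tz must be positive")
    tx = (cx - px) * tz / fx
    ty = (cy - py) * tz / fy
    return tx, ty, tz


def project_center(tx, ty, tz, fx, fy, px, py):
    """Forward pinhole projection of the 3D object center to pixel coordinates."""
    return fx * tx / tz + px, fy * ty / tz + py


# Hypothetical intrinsics and a predicted center (in pixels) and depth (in meters).
fx, fy, px, py = 1000.0, 1000.0, 320.0, 240.0
translation = recover_translation(420.0, 190.0, 0.8, fx, fy, px, py)
print(translation)

# Projecting the recovered translation should give back the predicted center.
print(project_center(*translation, fx, fy, px, py))
```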
3. 3D rotation regression
The lower part of the architecture diagram shows the 3D rotation regression branch. Using the object bounding boxes predicted by the Hough voting layer, two RoI pooling layers crop and pool the feature maps generated by the first stage of the network for 3D rotation regression.
The pooled feature maps are added together and fed into three fully connected (FC) layers. The first two FC layers have dimension 4096, and the last FC layer has dimension 4 x n, where n is the number of object classes; for each class, it outputs a 3D rotation represented as a quaternion.
The dataset used for this approach is the YCB-Video dataset: 80 videos are used for training, and testing is done on 2,949 key frames extracted from the remaining 12 videos.
The code has been tested on Ubuntu 16.04 with PyTorch 0.4.1+ and CUDA 9.1.
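As an illustration of how a 4 x n output might be consumed, the sketch below picks the four values belonging to one class, normalizes them to a unit quaternion, and converts the result to a rotation matrix. This is not the repository's code: the function names, the flat class-major layout of the output vector, and the (w, x, y, z) component order are all assumptions made for the example.

```python
import math


def select_quaternion(fc_output, class_id):
    """Slice the 4 values for `class_id` out of a flat 4 x n output and normalize."""
    if class_id < 0 or 4 * class_id + 4 > len(fc_output):
        raise IndexError("class_id out of range for this output vector")
    q = fc_output[4 * class_id: 4 * class_id + 4]
    norm = math.sqrt(sum(v * v for v in q))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero quaternion")
    return [v / norm for v in q]


def quaternion_to_matrix(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


# Hypothetical output for n = 2 classes: class 0 is the identity rotation,
# class 1 is an unnormalized quaternion for a 90-degree turn about the z-axis.
fc_output = [1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 2.0]
q = select_quaternion(fc_output, class_id=1)
rotation = quaternion_to_matrix(q)
for row in rotation:
    print([round(v, 6) for v in row])
```

Normalizing before conversion matters because a regression layer has no built-in guarantee of producing unit-length quaternions.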
git clone https://github.com/NVlabs/PoseCNN-PyTorch.git
pip install -r requirement.txt
git submodule update --init --recursive
## Compile the new layers under $ROOT/lib/layers
cd $ROOT/lib/layers
sudo python setup.py install
## Compile cython
cd ..
cd $ROOT/lib/utils
python setup.py build_ext --inplace
## Compile the ycb_render in $ROOT/ycb_render
cd ..
cd $ROOT/ycb_render
sudo python setup.py develop
- Download the 3D models of the YCB objects from here and save them under $ROOT/data.
- Download the pre-trained checkpoints from here and likewise save them under $ROOT/data.
- Real-world images with pose annotations for 20 YCB objects can be downloaded from here (53 GB).
Running the demo
- Download the 3D models and the pre-trained checkpoints, and set up the environment.
- Run the following command
Train and test on the YCB-Video dataset
First, download the YCB-Video dataset from here, then create a symlink to it using the command below:
cd $ROOT/data/YCB_Video
ln -s $ycb_data data
Now train and test on the YCB-Video dataset:
cd $ROOT
# multi-gpu training, use 1 GPU or 2 GPUs
./experiments/scripts/ycb_video_train.sh
# testing, $GPU_ID can be 0, 1, etc.
./experiments/scripts/ycb_video_test.sh $GPU_ID
We have covered PoseCNN, a method for 6D object pose estimation that decouples the estimation of 3D rotation and 3D translation. It localizes the object center in the image and predicts the distance of that center from the camera. To learn more, see the resources below:
- PoseCNN (GitHub)
- Research paper
- The YCB-Video Dataset ~ 265G
- The YCB-Video 3D Models ~ 367M
- The YCB-Video Dataset Toolbox (GitHub)
Mohit is a Data & Technology Enthusiast with good exposure to solving real-world problems in various avenues of IT and Deep learning domain. He believes in solving human's daily problems with the help of technology.