
Guide To VIBE: Video Inference for 3D Human Body Pose and Shape Estimation

VIBE (Video Inference for 3D Human Body Pose and Shape Estimation) uses CNNs, RNNs (GRUs), and GANs, along with a self-attention layer, to achieve state-of-the-art results.

Pose estimation has become a major research area. Until now, most developments have relied on 2D keypoint annotations of the human body, and most solutions target a single image or 2D motion; significantly less work addresses 3D motion, which involves more challenges. The primary challenge is the scarcity of ground-truth 3D-annotated training data. The 3D motion methods proposed so far are not fully satisfactory and suffer from several drawbacks. They are also mostly frame-based, ignoring temporal context, which increases error rates on video.

In February 2020 (updated in April), Muhammed Kocabas and Nikos Athanasiou, PhD students at the Max Planck Institute for Intelligent Systems, together with director Michael J. Black, presented their CVPR paper "VIBE: Video Inference for Human Body Pose and Shape Estimation". VIBE uses CNNs, RNNs (GRUs), and GANs, along with a self-attention layer, to achieve its state-of-the-art results. A monocular video is processed as a sequence of frames. The authors use both 2D keypoint-annotated video data and AMASS (Archive of Motion Capture as Surface Shapes), a dataset of unpaired static 3D human motions containing shapes and poses. The model is evaluated on the 3DPW and MPI-INF-3DHP datasets and sets new benchmark results.

The paper's qualitative comparisons show a state-of-the-art video pose-estimation approach failing to produce accurate 3D body poses. To address these limitations, VIBE trains a motion discriminator on a large-scale motion-capture dataset using an adversarial approach. VIBE can produce realistic and accurate pose and shape, beating previous methods on standard benchmarks; the project repository shows example results as GIFs.

VIBE uses a CNN to extract per-frame image features. The CNN output is fed to a recurrent neural network, which models the sequential nature of human motion. A temporal encoder and a regressor then predict the body parameters for the whole input sequence. This part is referred to as the generator (G). The AMASS dataset provides realistic 3D human motion for adversarial training of a motion discriminator (DM). The motion discriminator takes both the predicted pose sequences and pose sequences sampled from AMASS, and tries to differentiate fake from real motions by outputting a real/fake probability for each input sequence, which pushes the generator towards realistic motion. The output of the method is in the standard SMPL body model format: a sequence of pose and shape parameters.
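
To make this data flow concrete, here is a minimal PyTorch-style sketch of the generator described above. It is an illustration under assumptions (a stand-in backbone instead of ResNet-50, a plain feed-forward regressor instead of the iterative SMPL regressor), not the repository's implementation.

# Illustrative sketch of the generator's data flow (not the repository code; layer choices are assumptions)
import torch
import torch.nn as nn

class GeneratorSketch(nn.Module):
    def __init__(self, feat_dim=2048, hidden_size=1024, smpl_dim=85):
        super().__init__()
        # stand-in for the ResNet-50 backbone that maps each frame to a feature vector
        self.cnn = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # temporal encoder: a GRU over the T per-frame features
        self.gru = nn.GRU(feat_dim, hidden_size, num_layers=2, batch_first=True)
        # regressor mapping each hidden state to SMPL pose, shape and camera parameters
        self.regressor = nn.Sequential(
            nn.Linear(hidden_size, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, smpl_dim),
        )

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))      # (B*T, feat_dim)
        hidden, _ = self.gru(feats.view(B, T, -1))  # (B, T, hidden_size)
        return self.regressor(hidden)               # (B, T, smpl_dim) SMPL parameters per frame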

Code Snippet

Source Code – https://github.com/mkocabas/VIBE

The code is implemented in PyTorch; below is a walkthrough of the train.py file.

# importing libraries

import torch
import pprint
import random
import numpy as np
from torch.utils.tensorboard import SummaryWriter
from lib.core.loss import VIBELoss
from lib.core.trainer import Trainer
from lib.core.config import parse_args
from lib.utils.utils import prepare_output_dir
from lib.models import VIBE, MotionDiscriminator
from lib.dataset.loaders import get_data_loaders
from lib.utils.utils import create_logger, get_optimizer
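
Before the steps below, the repository's train.py parses the experiment configuration and prepares logging; the cfg object and the TensorBoard writer used further down come from code roughly along these lines (paraphrased, details may differ from the current repository version):

# parse the experiment config, prepare output/log dirs and the TensorBoard writer (paraphrased)
cfg, cfg_file = parse_args()
cfg = prepare_output_dir(cfg, cfg_file)

logger = create_logger(cfg.LOGDIR, phase='train')
logger.info(pprint.pformat(cfg))

# fix random seeds for reproducibility
random.seed(cfg.SEED_VALUE)
np.random.seed(cfg.SEED_VALUE)
torch.manual_seed(cfg.SEED_VALUE)

writer = SummaryWriter(log_dir=cfg.LOGDIR)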

# Dataloaders

data_loaders = get_data_loaders(cfg)

# compiling loss

loss = VIBELoss(
       e_loss_weight=cfg.LOSS.KP_2D_W,
       e_3d_loss_weight=cfg.LOSS.KP_3D_W,
       e_pose_loss_weight=cfg.LOSS.POSE_W,
       e_shape_loss_weight=cfg.LOSS.SHAPE_W,
       d_motion_loss_weight=cfg.LOSS.D_MOTION_LOSS_W,
   )
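
Conceptually, these weights scale the individual loss terms (2D keypoints, 3D joints, SMPL pose and shape, and the adversarial motion loss) before they are summed. A simplified sketch of that weighting, with hypothetical per-term losses already computed, would be:

# simplified sketch of how the weight constants combine the loss terms (not lib/core/loss.py itself)
def total_generator_loss(l_kp2d, l_kp3d, l_pose, l_shape, l_adv, cfg):
    return (cfg.LOSS.KP_2D_W * l_kp2d            # 2D keypoint reprojection loss
            + cfg.LOSS.KP_3D_W * l_kp3d          # 3D joint loss
            + cfg.LOSS.POSE_W * l_pose           # SMPL pose parameter loss
            + cfg.LOSS.SHAPE_W * l_shape         # SMPL shape parameter loss
            + cfg.LOSS.D_MOTION_LOSS_W * l_adv)  # adversarial loss from the motion discriminator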

# Initializing networks – the CNN backbone is ResNet-50; the sequence length is T = 16 (the best value found after experimenting with different lengths) with a mini-batch size of 32; the temporal encoder is a 2-layer GRU with a hidden size of 1024; the regressor has two fully-connected layers with 1024 neurons each, followed by a final output layer

generator = VIBE(
       n_layers=cfg.MODEL.TGRU.NUM_LAYERS,
       batch_size=cfg.TRAIN.BATCH_SIZE,
       seqlen=cfg.DATASET.SEQLEN,
       hidden_size=cfg.MODEL.TGRU.HIDDEN_SIZE,
       pretrained=cfg.TRAIN.PRETRAINED_REGRESSOR,
       add_linear=cfg.MODEL.TGRU.ADD_LINEAR,
       bidirectional=cfg.MODEL.TGRU.BIDIRECTIONAL,
       use_residual=cfg.MODEL.TGRU.RESIDUAL,
   ).to(cfg.DEVICE)

# Initializing optimizers – Adam with learning rates of 5e-5 for G and 1e-4 for DM

gen_optimizer = get_optimizer(
       model=generator,
       optim_type=cfg.TRAIN.GEN_OPTIM,
       lr=cfg.TRAIN.GEN_LR,
       weight_decay=cfg.TRAIN.GEN_WD,
       momentum=cfg.TRAIN.GEN_MOMENTUM,
   )

# Initializing the discriminator – a GRU over the pose sequence with self-attention to amplify the most distinctive frames, followed by 2 MLP layers of 1024 neurons each with a tanh activation function

motion_discriminator = MotionDiscriminator(
       rnn_size=cfg.TRAIN.MOT_DISCR.HIDDEN_SIZE,
       input_size=69,
       num_layers=cfg.TRAIN.MOT_DISCR.NUM_LAYERS,
       output_size=1,
       feature_pool=cfg.TRAIN.MOT_DISCR.FEATURE_POOL,
       attention_size=None if cfg.TRAIN.MOT_DISCR.FEATURE_POOL !='attention' else cfg.TRAIN.MOT_DISCR.ATT.SIZE,
       attention_layers=None if cfg.TRAIN.MOT_DISCR.FEATURE_POOL !='attention' else cfg.TRAIN.MOT_DISCR.ATT.LAYERS,
       attention_dropout=None if cfg.TRAIN.MOT_DISCR.FEATURE_POOL !='attention' else cfg.TRAIN.MOT_DISCR.ATT.DROPOUT
   ).to(cfg.DEVICE)
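
The self-attention mentioned above pools the discriminator's GRU hidden states with learned, frame-dependent weights rather than keeping only the last hidden state. A minimal sketch of such attention pooling (layer sizes are illustrative assumptions, not the repository's exact module):

# illustrative self-attention pooling over GRU hidden states (not the repository code)
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, hidden_size=1024, attention_size=1024):
        super().__init__()
        # small MLP that scores each frame's hidden state
        self.score = nn.Sequential(
            nn.Linear(hidden_size, attention_size), nn.Tanh(),
            nn.Linear(attention_size, 1),
        )

    def forward(self, hidden):                                # hidden: (B, T, hidden_size)
        weights = torch.softmax(self.score(hidden), dim=1)    # (B, T, 1), sums to 1 over time
        return (weights * hidden).sum(dim=1)                  # (B, hidden_size) pooled representation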

dis_motion_optimizer = get_optimizer(
       model=motion_discriminator,
       optim_type=cfg.TRAIN.MOT_DISCR.OPTIM,
       lr=cfg.TRAIN.MOT_DISCR.LR,
       weight_decay=cfg.TRAIN.MOT_DISCR.WD,
       momentum=cfg.TRAIN.MOT_DISCR.MOMENTUM
   )

# initializing lr_schedulers

motion_lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
       dis_motion_optimizer,
       mode='min',
       factor=0.1,
       patience=cfg.TRAIN.LR_PATIENCE,
       verbose=True,
   )
lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
       gen_optimizer,
       mode='min',
       factor=0.1,
       patience=cfg.TRAIN.LR_PATIENCE,
       verbose=True,
   )

# Training the model – the generator and the motion discriminator are trained together in an adversarial loop

Trainer(
       data_loaders=data_loaders,
       generator=generator,
       motion_discriminator=motion_discriminator,
       criterion=loss,
       dis_motion_optimizer=dis_motion_optimizer,
       dis_motion_update_steps=cfg.TRAIN.MOT_DISCR.UPDATE_STEPS,
       gen_optimizer=gen_optimizer,
       start_epoch=cfg.TRAIN.START_EPOCH,
       end_epoch=cfg.TRAIN.END_EPOCH,
       device=cfg.DEVICE,
       writer=writer,
       debug=cfg.DEBUG,
       logdir=cfg.LOGDIR,
       lr_scheduler=lr_scheduler,
       motion_lr_scheduler=motion_lr_scheduler,
       resume=cfg.TRAIN.RESUME,
       num_iters_per_epoch=cfg.TRAIN.NUM_ITERS_PER_EPOCH,
       debug_freq=cfg.DEBUG_FREQ,
   ).fit()
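
Inside Trainer.fit(), generator and motion-discriminator updates alternate every iteration. A highly simplified sketch of one adversarial step (function signatures and the batch layout are assumptions; the actual logic lives in lib/core/trainer.py):

# highly simplified sketch of one adversarial training step (not the repository implementation)
def train_step(batch, real_motion, generator, motion_discriminator,
               criterion, gen_optimizer, dis_motion_optimizer):
    # 1) generator step: predict SMPL sequences, minimise supervised + adversarial losses
    pred_motion = generator(batch['video'])                   # (B, T, pose/shape params)
    g_loss = criterion(pred_motion, batch, motion_discriminator)
    gen_optimizer.zero_grad()
    g_loss.backward()
    gen_optimizer.step()

    # 2) discriminator step: push real AMASS sequences towards 1 and generated ones towards 0
    d_real = motion_discriminator(real_motion)
    d_fake = motion_discriminator(pred_motion.detach())
    d_loss = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()   # least-squares GAN objective
    dis_motion_optimizer.zero_grad()
    d_loss.backward()
    dis_motion_optimizer.step()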

Benchmark results 

The paper reports the results achieved by state-of-the-art models on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.

In those tables, VIBE (direct comp.) is trained on the same video datasets as the other methods, while VIBE uses extra data from the 3DPW training set. VIBE outperforms all of them.

Limitations – VIBE fails under heavy occlusion, during fast motion, and when multiple people occlude each other.

End Notes

3D pose estimation is necessary for understanding human behaviour. VIBE combines several techniques to achieve its state-of-the-art results. In future work we can expect supervision of single-frame methods by fine-tuning the HMR features, experiments extended to optical flow, and work on the multi-person and occlusion problems. With transformers now prominent, the authors also plan to explore them to push performance further.

Video Demonstration – https://youtu.be/rIr-nX63dUA

Explained talk by the authors – https://twimlai.com/thats-a-vibe-ml-for-human-pose-and-shape-estimation-with-nikos-athanasiou-muhammed-kocabas-michael-black/

Code Demo – https://colab.research.google.com/drive/1dFfwxZ52MN86FA6uFNypMEdFShd2euQA



Jayita Bhattacharyya

Machine learning and data science enthusiast. Eager to learn new technology advances. A self-taught techie who loves to do cool stuff using technology for fun and worthwhile.
