Last updated December 24, 2020

Guide To VIBE: Video Inference for 3D Human Body Pose and Shape Estimation

VIBE (Video Inference for 3D Human Body Pose and Shape Estimation) uses CNNs, RNNs (GRUs) and GANs along with a self-attention layer to achieve its state-of-the-art results.

Published on December 24, 2020
by Jayita Bhattacharyya

Pose estimation is an increasingly active research area. Most progress so far has been based on 2D keypoint annotations of the human body. Most solutions target a single image or 2D motion, and far fewer address 3D motion, which poses more challenges. The primary challenge is the scarcity of ground-truth 3D annotations for training. Existing 3D motion methods are not yet satisfactory and suffer from several drawbacks. They are also mostly frame-based, which increases error rates.

In February 2020 (later updated in April), PhD students Muhammed Kocabas and Nikos Athanasiou, together with director Michael J. Black of the Max Planck Institute for Intelligent Systems, presented their CVPR paper “VIBE: Video Inference for Human Body Pose and Shape Estimation”. VIBE uses CNNs, RNNs (GRUs) and GANs along with a self-attention layer to achieve its state-of-the-art results. A monocular video is processed as sequences of frames. Training uses both 2D keypoint annotations and AMASS (Archive of Motion Capture as Surface Shapes), a dataset of 3D human motion with shapes and poses that is not paired with the training videos. The model is evaluated on the 3DPW and MPI-INF-3DHP datasets and sets new benchmark results.

Shown above is a state-of-the-art video pose estimation approach failing to produce accurate 3D body poses. To address such limitations, VIBE uses a large-scale motion-capture dataset to train a motion discriminator adversarially. VIBE produces realistic and accurate pose and shape, beating previous methods on standard benchmarks. The GIF below shows results achieved by VIBE.

VIBE uses a CNN to extract features from each frame. These features are fed to a recurrent neural network, the temporal encoder, which models the sequential nature of human motion. A regressor then predicts the body parameters for the whole input sequence. Together these parts form the generator (G). The AMASS dataset supplies realistic 3D human motion for adversarial training of a motion discriminator (Dm). The motion discriminator takes in predicted pose sequences along with pose sequences sampled from AMASS. It tries to tell fake motion from real by outputting a real/fake probability for each input sequence, which pushes the generator toward realistic motion. The output is a sequence of pose and shape parameters in the standard SMPL body model format.
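To make the adversarial signal described above concrete, here is a minimal sketch in plain Python. It assumes a least-squares (LSGAN-style) objective, which is one common way to train a discriminator that outputs a score between 0 and 1. The function names are hypothetical and the scores are made-up numbers, so treat it as an illustration, not as the repository's loss code.

```python
# Least-squares adversarial losses for a motion discriminator (illustrative sketch).
# d_real / d_fake are discriminator scores in [0, 1], one score per motion sequence.

def mean(values):
    return sum(values) / len(values)

def discriminator_loss(d_real, d_fake):
    """Push scores for real (mocap) sequences toward 1 and generated ones toward 0."""
    real_term = mean([(score - 1.0) ** 2 for score in d_real])
    fake_term = mean([score ** 2 for score in d_fake])
    return real_term + fake_term

def generator_adversarial_loss(d_fake):
    """Small when the discriminator scores generated sequences close to 1 (real)."""
    return mean([(score - 1.0) ** 2 for score in d_fake])

# A discriminator that separates real from fake well has a low loss ...
confident = discriminator_loss(d_real=[0.9, 1.0], d_fake=[0.1, 0.0])
# ... while one that cannot tell them apart (0.5 everywhere) has a higher one.
fooled = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident, fooled)
```

The two losses pull in opposite directions: the discriminator is rewarded for separating AMASS motion from predictions, and the generator is rewarded when its predictions are scored as real.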

Code Snippet

Source Code – https://github.com/mkocabas/VIBE

The code is implemented in PyTorch. Below is an annotated excerpt from the train.py file.

# importing libraries

import torch
import pprint
import random
import numpy as np
from torch.utils.tensorboard import SummaryWriter
from lib.core.loss import VIBELoss
from lib.core.trainer import Trainer
from lib.core.config import parse_args
from lib.utils.utils import prepare_output_dir
from lib.models import VIBE, MotionDiscriminator
from lib.dataset.loaders import get_data_loaders
from lib.utils.utils import create_logger, get_optimizer

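# Configuration and logging – cfg and writer are used throughout the rest of the script

cfg, cfg_file = parse_args()
cfg = prepare_output_dir(cfg, cfg_file)
writer = SummaryWriter(log_dir=cfg.LOGDIR)

# Dataloaders

data_loaders = get_data_loaders(cfg)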

# compiling loss

loss = VIBELoss(
       e_loss_weight=cfg.LOSS.KP_2D_W,
       e_3d_loss_weight=cfg.LOSS.KP_3D_W,
       e_pose_loss_weight=cfg.LOSS.POSE_W,
       e_shape_loss_weight=cfg.LOSS.SHAPE_W,
       d_motion_loss_weight=cfg.LOSS.D_MOTION_LOSS_W,
   )
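VIBELoss receives one weight per loss term. A plausible reading is that the generator's total loss is a weighted sum of the individual terms. The sketch below shows only that arithmetic. The helper name, the dictionary keys and all the numbers are hypothetical, and the internals of VIBELoss are not shown in this excerpt.

```python
def total_generator_loss(terms, weights):
    """Weighted sum of per-term losses; both dicts must have the same keys."""
    unmatched = set(terms) ^ set(weights)
    if unmatched:
        raise KeyError(f"unmatched loss terms: {sorted(unmatched)}")
    return sum(weights[name] * terms[name] for name in terms)

# Hypothetical values, chosen only to make the arithmetic easy to follow.
terms = {"kp_2d": 0.20, "kp_3d": 0.10, "pose": 0.05, "shape": 0.01, "adv": 0.30}
weights = {"kp_2d": 1.0, "kp_3d": 2.0, "pose": 4.0, "shape": 10.0, "adv": 0.5}

total = total_generator_loss(terms, weights)
print(total)
```

Raising a weight makes the optimizer care more about that term, which is why the weights are exposed as configuration values instead of being hard-coded.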

# Initializing networks – the CNN is ResNet-50; the sequence length is T = 16 (the best of the values tried) with a minibatch size of 32; the temporal encoder is a 2-layer GRU with a hidden size of 1024; the regressor has two fully-connected layers with 1024 neurons each, followed by a final layer

generator = VIBE(
       n_layers=cfg.MODEL.TGRU.NUM_LAYERS,
       batch_size=cfg.TRAIN.BATCH_SIZE,
       seqlen=cfg.DATASET.SEQLEN,
       hidden_size=cfg.MODEL.TGRU.HIDDEN_SIZE,
       pretrained=cfg.TRAIN.PRETRAINED_REGRESSOR,
       add_linear=cfg.MODEL.TGRU.ADD_LINEAR,
       bidirectional=cfg.MODEL.TGRU.BIDIRECTIONAL,
       use_residual=cfg.MODEL.TGRU.RESIDUAL,
   ).to(cfg.DEVICE)
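The generator's regressor outputs SMPL parameters for every frame. SMPL describes a body with 24 joint rotations (3 axis-angle values each, so 72 pose values) and 10 shape coefficients. The sketch below assumes a per-frame vector that also carries 3 camera values, laid out as camera, pose, shape. That ordering and the helper name are assumptions made for illustration. Under that layout, the 69 body-pose values that remain after dropping the global orientation match the input_size=69 passed to MotionDiscriminator in this script.

```python
NUM_JOINTS = 24                  # SMPL joints, including the root (global orientation)
CAM_DIM = 3                      # assumed camera parameters per frame
POSE_DIM = NUM_JOINTS * 3        # one axis-angle rotation per joint -> 72
SHAPE_DIM = 10                   # SMPL shape coefficients (betas)
THETA_DIM = CAM_DIM + POSE_DIM + SHAPE_DIM  # 85 values per frame

def split_theta(theta):
    """Split one per-frame parameter vector into named parts (assumed layout)."""
    if len(theta) != THETA_DIM:
        raise ValueError(f"expected {THETA_DIM} values, got {len(theta)}")
    cam = theta[:CAM_DIM]
    pose = theta[CAM_DIM:CAM_DIM + POSE_DIM]
    shape = theta[CAM_DIM + POSE_DIM:]
    return {
        "cam": cam,
        "global_orient": pose[:3],   # rotation of the root joint
        "body_pose": pose[3:],       # the remaining 23 joints -> 69 values
        "shape": shape,
    }

parts = split_theta(list(range(THETA_DIM)))
print({name: len(values) for name, values in parts.items()})
```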

# initializing optimizers – Adam optimizer with a learning rate of 5 × 10^-5 for G and 1 × 10^-4 for Dm

gen_optimizer = get_optimizer(
       model=generator,
       optim_type=cfg.TRAIN.GEN_OPTIM,
       lr=cfg.TRAIN.GEN_LR,
       weight_decay=cfg.TRAIN.GEN_WD,
       momentum=cfg.TRAIN.GEN_MOMENTUM,)

# initializing discriminator – a stack of GRU layers followed by self-attention, which amplifies the contribution of distinctive frames. The attention weights are learned by 2 MLP layers of 1024 neurons each with tanh activation.

motion_discriminator = MotionDiscriminator(
       rnn_size=cfg.TRAIN.MOT_DISCR.HIDDEN_SIZE,
       input_size=69,
       num_layers=cfg.TRAIN.MOT_DISCR.NUM_LAYERS,
       output_size=1,
       feature_pool=cfg.TRAIN.MOT_DISCR.FEATURE_POOL,
       attention_size=None if cfg.TRAIN.MOT_DISCR.FEATURE_POOL !='attention' else cfg.TRAIN.MOT_DISCR.ATT.SIZE,
       attention_layers=None if cfg.TRAIN.MOT_DISCR.FEATURE_POOL !='attention' else cfg.TRAIN.MOT_DISCR.ATT.LAYERS,
       attention_dropout=None if cfg.TRAIN.MOT_DISCR.FEATURE_POOL !='attention' else cfg.TRAIN.MOT_DISCR.ATT.DROPOUT
   ).to(cfg.DEVICE)
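When feature_pool is set to 'attention', the per-frame hidden states of the GRU are combined into one vector using learned weights, so distinctive frames count for more. Below is a minimal sketch of that pooling step in plain Python. To stay short it scores each frame with a dot product against a hypothetical score_vector, where the real module uses a small learned network, and the frame values are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def attention_pool(hidden_states, score_vector):
    """Collapse T per-frame vectors into one vector via attention weights.

    hidden_states: list of T equal-length vectors (one per frame).
    score_vector:  hypothetical stand-in for the learned scoring network.
    """
    scores = [sum(h * w for h, w in zip(state, score_vector))
              for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(weight * state[d] for weight, state in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights

# Three frames with 2-dimensional features; the middle frame stands out.
frames = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
pooled, weights = attention_pool(frames, score_vector=[1.0, 1.0])
print(weights, pooled)
```

The weights sum to 1, so the pooled vector is a weighted average of the frames, dominated here by the middle one.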

dis_motion_optimizer = get_optimizer(
       model=motion_discriminator,
       optim_type=cfg.TRAIN.MOT_DISCR.OPTIM,
       lr=cfg.TRAIN.MOT_DISCR.LR,
       weight_decay=cfg.TRAIN.MOT_DISCR.WD,
       momentum=cfg.TRAIN.MOT_DISCR.MOMENTUM
   )

# initializing lr_schedulers

motion_lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
       dis_motion_optimizer,
       mode='min',
       factor=0.1,
       patience=cfg.TRAIN.LR_PATIENCE,
       verbose=True,
   )

lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
       gen_optimizer,
       mode='min',
       factor=0.1,
       patience=cfg.TRAIN.LR_PATIENCE,
       verbose=True,
   )
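ReduceLROnPlateau lowers the learning rate when the monitored metric stops improving. The plain-Python function below is a simplified re-implementation of that rule for mode='min', written only to show the effect of factor and patience. It leaves out the scheduler's threshold, cooldown and minimum learning rate options, and the loss values are made up.

```python
def reduce_on_plateau(losses, lr, factor=0.1, patience=2):
    """Return the learning rate in effect after each epoch (simplified rule)."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss          # improvement: remember it and reset the counter
            bad_epochs = 0
        else:
            bad_epochs += 1      # no improvement this epoch
        if bad_epochs > patience:
            lr *= factor         # plateau lasted longer than `patience` epochs
            bad_epochs = 0
        history.append(lr)
    return history

# The loss stalls at 0.8; the third stalled epoch triggers a tenfold reduction.
history = reduce_on_plateau([1.0, 0.8, 0.8, 0.8, 0.8, 0.7], lr=1e-4, patience=2)
print(history)
```

With factor=0.1, as in the script, each plateau cuts the learning rate by a factor of ten.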

# Training model – the generator and the motion discriminator are trained together in an adversarial fashion.

Trainer(
       data_loaders=data_loaders,
       generator=generator,
       motion_discriminator=motion_discriminator,
       criterion=loss,
       dis_motion_optimizer=dis_motion_optimizer,
       dis_motion_update_steps=cfg.TRAIN.MOT_DISCR.UPDATE_STEPS,
       gen_optimizer=gen_optimizer,
       start_epoch=cfg.TRAIN.START_EPOCH,
       end_epoch=cfg.TRAIN.END_EPOCH,
       device=cfg.DEVICE,
       writer=writer,
       debug=cfg.DEBUG,
       logdir=cfg.LOGDIR,
       lr_scheduler=lr_scheduler,
       motion_lr_scheduler=motion_lr_scheduler,
       resume=cfg.TRAIN.RESUME,
       num_iters_per_epoch=cfg.TRAIN.NUM_ITERS_PER_EPOCH,
       debug_freq=cfg.DEBUG_FREQ,
   ).fit()

Benchmark results

Below are the results achieved by state-of-the-art models on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.

The table is taken from the paper. VIBE (direct comp.) is trained on the same video datasets as the other methods, while VIBE is additionally trained on the 3DPW training set. VIBE outperforms the other methods.

Limitations – VIBE fails under heavy occlusion, fast motion, and multi-person occlusion.

End Notes

3D pose estimation is necessary for understanding human behaviour. VIBE combines several techniques to achieve state-of-the-art results. Future work may include using video to supervise single-frame methods by fine-tuning the HMR features, extending the experiments to optical flow, and resolving the multi-person and occlusion problems. In the era of transformers, the authors also plan to explore them to improve performance further.

Video Demonstration – https://youtu.be/rIr-nX63dUA

Explained talk by the authors – https://twimlai.com/thats-a-vibe-ml-for-human-pose-and-shape-estimation-with-nikos-athanasiou-muhammed-kocabas-michael-black/

Code Demo – https://colab.research.google.com/drive/1dFfwxZ52MN86FA6uFNypMEdFShd2euQA


Jayita Bhattacharyya

Machine learning and data science enthusiast. Eager to learn new technology advances. A self-taught techie who loves to do cool stuff using technology for fun and worthwhile.
