Human pose estimation is an increasingly active research area. Most progress so far has relied on 2D keypoint annotations of the human body, and most solutions target a single image or 2D motion. Far fewer address 3D motion, which is more challenging, mainly because ground-truth 3D annotations for training are scarce. The existing 3D motion methods are not yet satisfactory and suffer from several drawbacks. Many of them are also frame-based, which increases error rates.
In February 2020 (later updated in April), PhD students Muhammed Kocabas and Nikos Athanasiou, together with director Michael J. Black of the Max Planck Institute for Intelligent Systems, presented their CVPR paper “VIBE: Video Inference for Human Body Pose and Shape Estimation”. VIBE combines CNNs, RNNs (GRUs) and GANs with a self-attention layer to achieve state-of-the-art results. A monocular video is processed as sequences of frames. Training uses both 2D keypoint annotated data and the AMASS (Archive of Motion Capture as Surface Shapes) dataset, which provides unpaired 3D human motion containing shapes and poses. The model is tested on the 3DPW and MPI-INF-3DHP datasets and sets new benchmark results.
Shown above is a state-of-the-art video pose estimation approach failing to produce accurate 3D body poses. To address such limitations, VIBE uses a large-scale motion-capture dataset to train a motion discriminator adversarially. VIBE produces realistic and accurate pose and shape, beating previous methods on standard benchmarks. The GIF below shows results achieved by VIBE.
VIBE uses a CNN to extract image features from each frame. These features are fed to a recurrent neural network, the temporal encoder, which models the sequential nature of human motion. A regressor then predicts the body parameters for the whole input sequence. Together, these parts form the generator (G). The AMASS dataset supplies realistic 3D human motion for adversarial training, which is used to build a motion discriminator (DM). The motion discriminator takes both the predicted pose sequences and pose sequences sampled from AMASS. It tries to tell fake motions from real ones by outputting a real/fake probability for each input sequence, which pushes the generator towards realistic motion. The output of the method is a sequence of pose and shape parameters in the standard SMPL body model format.
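The generator described above can be sketched in a few lines of PyTorch. This is a minimal illustration and not the repository's implementation: the class name `TinyGenerator`, the 2048-dimensional feature size, the plain MLP regressor and the 3 + 72 + 10 output layout (camera, pose, shape) are all assumptions made for the example, and random tensors stand in for CNN features.

```python
import torch
import torch.nn as nn


class TinyGenerator(nn.Module):
    """Illustrative stand-in for G: GRU temporal encoder plus a regressor."""

    def __init__(self, feat_dim=2048, hidden_size=1024, n_layers=2, out_dim=85):
        super().__init__()
        # Temporal encoder: a GRU over per-frame CNN features.
        self.gru = nn.GRU(feat_dim, hidden_size, num_layers=n_layers, batch_first=True)
        # Regressor: two 1024-unit layers followed by a final output layer.
        self.regressor = nn.Sequential(
            nn.Linear(hidden_size, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, out_dim),
        )

    def forward(self, feats):
        # feats: (batch, seq_len, feat_dim)
        hidden, _ = self.gru(feats)        # (batch, seq_len, hidden_size)
        params = self.regressor(hidden)    # (batch, seq_len, out_dim)
        # Assumed layout per frame: 3 camera, 72 pose and 10 shape values.
        cam, pose, shape = params.split([3, 72, 10], dim=-1)
        return cam, pose, shape


feats = torch.randn(2, 16, 2048)  # random stand-in for CNN features, T = 16
cam, pose, shape = TinyGenerator()(feats)
print(cam.shape, pose.shape, shape.shape)
```

The point of the sketch is the data flow: one feature vector per frame goes in, and one set of body parameters per frame comes out, with the GRU carrying information across frames.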
Source Code – https://github.com/mkocabas/VIBE
The code is implemented in PyTorch. Below is a walkthrough of the train.py file.
# importing libraries
import torch
import pprint
import random
import numpy as np
from torch.utils.tensorboard import SummaryWriter
from lib.core.loss import VIBELoss
from lib.core.trainer import Trainer
from lib.core.config import parse_args
from lib.utils.utils import prepare_output_dir
from lib.models import VIBE, MotionDiscriminator
from lib.dataset.loaders import get_data_loaders
from lib.utils.utils import create_logger, get_optimizer
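# building data loaders (cfg is the parsed configuration, presumably obtained via parse_args; that step is not shown in this excerpt)
data_loaders = get_data_loaders(cfg)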
# compiling loss
loss = VIBELoss(
    e_loss_weight=cfg.LOSS.KP_2D_W,
    e_3d_loss_weight=cfg.LOSS.KP_3D_W,
    e_pose_loss_weight=cfg.LOSS.POSE_W,
    e_shape_loss_weight=cfg.LOSS.SHAPE_W,
    d_motion_loss_weight=cfg.LOSS.D_MOTION_LOSS_W,
)
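The weights passed to VIBELoss suggest that the total generator loss is a weighted sum of the individual terms (2D keypoints, 3D keypoints, pose, shape and the motion-discriminator term). A minimal sketch of that idea follows. The function `combine_losses`, the dictionary keys and all numbers are hypothetical and chosen only for illustration; they are not the repository's values or internals.

```python
def combine_losses(terms, weights):
    """Return the weighted sum of named loss terms."""
    missing = set(terms) - set(weights)
    if missing:
        raise KeyError(f"no weight configured for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical weights and per-batch loss values, for illustration only.
weights = {"kp_2d": 2.0, "kp_3d": 1.0, "pose": 0.5, "shape": 0.25, "d_motion": 0.5}
terms = {"kp_2d": 0.5, "kp_3d": 1.0, "pose": 2.0, "shape": 4.0, "d_motion": 4.0}

total = combine_losses(terms, weights)
print(total)  # 6.0
```

Raising one weight makes the optimizer care more about that term relative to the others, which is why these values are exposed in the configuration.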
# Initializing networks – the CNN is ResNet-50; the sequence length is T = 16 (the best of the values tried) with a mini-batch size of 32; the temporal encoder is a 2-layer GRU with a hidden size of 1024; the regressor has two fully-connected layers with 1024 neurons each, followed by a final layer
generator = VIBE(
    n_layers=cfg.MODEL.TGRU.NUM_LAYERS,
    batch_size=cfg.TRAIN.BATCH_SIZE,
    seqlen=cfg.DATASET.SEQLEN,
    hidden_size=cfg.MODEL.TGRU.HIDDEN_SIZE,
    pretrained=cfg.TRAIN.PRETRAINED_REGRESSOR,
    add_linear=cfg.MODEL.TGRU.ADD_LINEAR,
    bidirectional=cfg.MODEL.TGRU.BIDIRECTIONAL,
    use_residual=cfg.MODEL.TGRU.RESIDUAL,
).to(cfg.DEVICE)
# initializing optimizers – Adam optimizer with learning rates of 5 × 10^-5 and 1 × 10^-4 for G and DM respectively
gen_optimizer = get_optimizer(
    model=generator,
    optim_type=cfg.TRAIN.GEN_OPTIM,
    lr=cfg.TRAIN.GEN_LR,
    weight_decay=cfg.TRAIN.GEN_WD,
    momentum=cfg.TRAIN.GEN_MOMENTUM,
)
# initializing discriminator – a stack of GRUs followed by self-attention to amplify distinctive frames; the self-attention uses 2 MLP layers with 1024 neurons each and tanh activation
motion_discriminator = MotionDiscriminator(
    rnn_size=cfg.TRAIN.MOT_DISCR.HIDDEN_SIZE,
    input_size=69,
    num_layers=cfg.TRAIN.MOT_DISCR.NUM_LAYERS,
    output_size=1,
    feature_pool=cfg.TRAIN.MOT_DISCR.FEATURE_POOL,
    attention_size=None if cfg.TRAIN.MOT_DISCR.FEATURE_POOL != 'attention' else cfg.TRAIN.MOT_DISCR.ATT.SIZE,
    attention_layers=None if cfg.TRAIN.MOT_DISCR.FEATURE_POOL != 'attention' else cfg.TRAIN.MOT_DISCR.ATT.LAYERS,
    attention_dropout=None if cfg.TRAIN.MOT_DISCR.FEATURE_POOL != 'attention' else cfg.TRAIN.MOT_DISCR.ATT.DROPOUT,
).to(cfg.DEVICE)

# optimizer for the motion discriminator
dis_motion_optimizer = get_optimizer(
    model=motion_discriminator,
    optim_type=cfg.TRAIN.MOT_DISCR.OPTIM,
    lr=cfg.TRAIN.MOT_DISCR.LR,
    weight_decay=cfg.TRAIN.MOT_DISCR.WD,
    momentum=cfg.TRAIN.MOT_DISCR.MOMENTUM,
)
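To make the "GRUs plus self-attention" idea more concrete, here is a small sketch of attention pooling over GRU hidden states. It is an approximation written for this article, not the repository's MotionDiscriminator: the class names `AttentionPool` and `TinyMotionDiscriminator`, the reduced hidden size and the exact layer layout are assumptions. Only the input size of 69 is taken from the call above.

```python
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Scores every frame, then returns the score-weighted sum over time."""

    def __init__(self, hidden_size, attention_size):
        super().__init__()
        # Two tanh MLP layers followed by a projection to one score per frame.
        self.score = nn.Sequential(
            nn.Linear(hidden_size, attention_size), nn.Tanh(),
            nn.Linear(attention_size, attention_size), nn.Tanh(),
            nn.Linear(attention_size, 1),
        )

    def forward(self, hidden):
        # hidden: (batch, seq_len, hidden_size)
        scores = self.score(hidden).squeeze(-1)               # (batch, seq_len)
        weights = torch.softmax(scores, dim=1)                # sums to 1 over time
        pooled = (weights.unsqueeze(-1) * hidden).sum(dim=1)  # (batch, hidden_size)
        return pooled, weights


class TinyMotionDiscriminator(nn.Module):
    """GRU over a pose sequence, attention pooling, one score per sequence."""

    def __init__(self, input_size=69, hidden_size=128, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers=num_layers, batch_first=True)
        self.pool = AttentionPool(hidden_size, hidden_size)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, poses):
        hidden, _ = self.gru(poses)
        pooled, weights = self.pool(hidden)
        return self.fc(pooled), weights


poses = torch.randn(4, 16, 69)  # random stand-in for 4 pose sequences, T = 16
score, weights = TinyMotionDiscriminator()(poses)
print(score.shape, weights.shape)
```

Frames that receive a larger attention weight contribute more to the pooled representation, which is how distinctive frames get amplified before the real/fake decision.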
# initializing lr_schedulers
motion_lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    dis_motion_optimizer,
    mode='min',
    factor=0.1,
    patience=cfg.TRAIN.LR_PATIENCE,
    verbose=True,
)

lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    gen_optimizer,
    mode='min',
    factor=0.1,
    patience=cfg.TRAIN.LR_PATIENCE,
    verbose=True,
)
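For readers unfamiliar with ReduceLROnPlateau, the toy run below shows what the scheduler does with `mode='min'` and `factor=0.1`. The dummy parameter, the patience of 2 and the loss values are made up for the demonstration; the real patience comes from cfg.TRAIN.LR_PATIENCE.

```python
import torch

# A single dummy parameter is enough to build an optimizer for the demo.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=5e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=2,
)

# Made-up validation losses: the metric stops improving after the second value.
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9]
lrs = []
for loss in losses:
    scheduler.step(loss)  # the scheduler watches the monitored metric
    lrs.append(optimizer.param_groups[0]['lr'])

print(lrs)
```

The learning rate stays at its initial value while the metric improves, and it is multiplied by 0.1 once the number of consecutive non-improving checks exceeds the patience, which here happens at the fifth call.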
# Training model – the generator and the motion discriminator are trained together in an adversarial setup (writer is assumed to be a SummaryWriter created earlier; that step is not shown in this excerpt)
Trainer(
    data_loaders=data_loaders,
    generator=generator,
    motion_discriminator=motion_discriminator,
    criterion=loss,
    dis_motion_optimizer=dis_motion_optimizer,
    dis_motion_update_steps=cfg.TRAIN.MOT_DISCR.UPDATE_STEPS,
    gen_optimizer=gen_optimizer,
    start_epoch=cfg.TRAIN.START_EPOCH,
    end_epoch=cfg.TRAIN.END_EPOCH,
    device=cfg.DEVICE,
    writer=writer,
    debug=cfg.DEBUG,
    logdir=cfg.LOGDIR,
    lr_scheduler=lr_scheduler,
    motion_lr_scheduler=motion_lr_scheduler,
    resume=cfg.TRAIN.RESUME,
    num_iters_per_epoch=cfg.TRAIN.NUM_ITERS_PER_EPOCH,
    debug_freq=cfg.DEBUG_FREQ,
).fit()
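The adversarial part of this training can be illustrated with a pair of loss functions. The sketch below assumes a least-squares (LSGAN-style) objective, where real sequences are pushed towards a score of 1 and generated ones towards 0. The function names and the example scores are hypothetical, and the sketch does not claim to reproduce what the repository's Trainer or VIBELoss compute internally.

```python
import torch


def discriminator_loss(real_scores, fake_scores):
    # Push scores for real (motion-capture) sequences towards 1
    # and scores for generated sequences towards 0.
    return ((real_scores - 1) ** 2).mean() + (fake_scores ** 2).mean()


def generator_adv_loss(fake_scores):
    # The generator is rewarded when its sequences are scored as real (1).
    return ((fake_scores - 1) ** 2).mean()


# Made-up discriminator outputs for two real and two generated sequences.
real_scores = torch.tensor([0.9, 0.8])
fake_scores = torch.tensor([0.2, 0.1])

d_loss = discriminator_loss(real_scores, fake_scores)
g_loss = generator_adv_loss(fake_scores)
print(d_loss.item(), g_loss.item())
```

In an alternating scheme, the discriminator is updated to lower `d_loss` while the generator is updated to lower `g_loss` alongside its other loss terms, so each network's progress makes the other's task harder.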
Below are the results achieved by SOTA models on 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
This table is taken from the paper. VIBE (direct comp.) is trained on the same video datasets as the other methods, while VIBE is trained with extra data from the 3DPW training set. VIBE outperforms all the other methods listed.
Limitations – VIBE fails under heavy occlusion, fast motion, and multi-person occlusion.
3D pose estimation is necessary to understand human behaviour. VIBE combines several techniques to achieve state-of-the-art results. In future releases, we can expect supervision of single-frame methods by fine-tuning the HMR features, experiments extended to optical flow, and work on the multi-person and occlusion problems. In the era of transformers, the authors also plan to explore them to improve performance further.
Video Demonstration – https://youtu.be/rIr-nX63dUA
Explained talk by the authors – https://twimlai.com/thats-a-vibe-ml-for-human-pose-and-shape-estimation-with-nikos-athanasiou-muhammed-kocabas-michael-black/