Generating 3D pose meshes from monocular images is a computer vision problem whose solution could automate a tedious and time-consuming part of visual effects (VFX) work. Modelling objects with long and complex kinematic chains, such as the human body, is labour-intensive because the VFX artist has to go frame by frame to rotoscope different sections of the kinematic chain.
Existing approaches for automating these tasks fall under two broad paradigms: optimization-based and regression-based. Optimization-based approaches directly fit the models to 2D data and produce accurate mesh-image alignments but are slow and sensitive to the initialization. Regression-based approaches directly map raw pixels to model parameters to create parametric models in a feed-forward manner via neural networks.
Regression-based models are sensitive to minor deviations in parameters, which often lead to misalignment between the generated meshes and the image evidence. In their paper, “3D Human Pose and Shape Regression with Pyramidal Mesh Alignment Feedback Loop”, Hongwen Zhang, Yating Tian, et al. proposed PyMAF, a new feedback loop that utilizes a feature pyramid to rectify the parameters explicitly based on mesh-image alignment.
Architecture & Approach
Feature Pyramid for Human Mesh Regression
The PyMAF image encoder produces a pyramid of spatial features that provide information about the human pose in the image at different scale levels. This allows the subsequent deep regressor to leverage multi-scale alignment contexts. The point-wise features extracted by the encoder then go through a multi-layer perceptron for dimensionality reduction and are concatenated together to form a feature vector. The pose parameters are represented as relative rotations along kinematic chains and are thus sensitive to minor parameter errors. To deal with the resulting misalignments, the parameter regressor uses 2D supervision on the 2D key-points projected from the estimated mesh, and additional 3D supervision on 3D joints and model parameters when ground truth 3D labels are available.
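The point-wise sampling step described above can be illustrated with a small NumPy sketch. This is not the PyMAF implementation: the weak-perspective camera (`scale`, `trans`), the single-layer ReLU stand-in for the MLP, the random weights, and all function names are assumptions made for illustration. It only shows the general idea of projecting mesh vertices onto a feature map, sampling a feature at each projected point, reducing its dimension, and concatenating the results.

```python
import numpy as np

def project_weak_perspective(vertices, scale, trans):
    """Project (N, 3) vertices to (N, 2) pixel coordinates (assumed camera model)."""
    return scale * vertices[:, :2] + trans

def bilinear_sample(feat_map, points):
    """Sample a (H, W, C) feature map at (N, 2) continuous (x, y) pixel coordinates."""
    h, w, _ = feat_map.shape
    x = np.clip(points[:, 0], 0, w - 1)
    y = np.clip(points[:, 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = (1 - wx) * feat_map[y0, x0] + wx * feat_map[y0, x1]
    bottom = (1 - wx) * feat_map[y1, x0] + wx * feat_map[y1, x1]
    return (1 - wy) * top + wy * bottom

def mesh_aligned_feature(feat_map, vertices, scale, trans, weight, bias):
    """Sample one feature per projected vertex, reduce it, and concatenate."""
    points = project_weak_perspective(vertices, scale, trans)
    point_feats = bilinear_sample(feat_map, points)          # (N, C)
    reduced = np.maximum(point_feats @ weight + bias, 0.0)   # (N, C_small), one-layer stand-in for the MLP
    return reduced.reshape(-1)                               # (N * C_small,)

# Toy inputs: random "features" and "vertices", untrained random weights.
rng = np.random.default_rng(0)
feat_map = rng.normal(size=(16, 16, 8))
vertices = rng.uniform(-1.0, 1.0, size=(5, 3))
weight = rng.normal(size=(8, 3))
bias = np.zeros(3)
feature_vector = mesh_aligned_feature(
    feat_map, vertices, scale=7.5, trans=np.array([7.5, 7.5]),
    weight=weight, bias=bias,
)
print(feature_vector.shape)  # (15,): 5 points x 3 reduced channels
```

In a trained model the feature maps and MLP weights are learned; here they are random, so only the shapes and the data flow are meaningful.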
Mesh Alignment Feedback Loop
Regressing mesh parameters in a single pass is challenging; to overcome this limitation, existing approaches have employed an Iterative Error Feedback (IEF) loop to update parameters iteratively. Although this approach reduces parameter errors, it uses the same global features for every parameter update. These global features lack fine-grained information and do not respond to updated predictions. PyMAF introduces a new Mesh Alignment Feedback (MAF) loop that leverages mesh-aligned features. In contrast to uniformly sampled grid features or global features, the mesh-aligned features provide alignment details of the current estimation, which are more useful for parameter optimization.
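To make the difference from a fixed global feature concrete, here is a toy sketch of a feedback loop in which features are re-sampled at the current estimate on every iteration, moving from a coarse map to a fine one. Everything in it is an assumption made so the loop runs without training: the "mesh" is four 2D points with a 2D translation `theta` as the only parameter, the feature maps are hand-crafted displacement fields rather than learned features, and the "regressor" is a plain average. It is not how PyMAF computes its updates.

```python
import numpy as np

IMAGE_SIZE = 32.0

def make_toy_pyramid(target, resolutions=(8, 16, 32)):
    """Coarse-to-fine maps; each cell stores (target - cell centre) as a 2-channel feature."""
    pyramid = []
    for r in resolutions:
        cell = IMAGE_SIZE / r
        centres = (np.arange(r) + 0.5) * cell
        cx, cy = np.meshgrid(centres, centres)                 # cx[i, j] is the x of column j
        pyramid.append(target - np.stack([cx, cy], axis=-1))   # shape (r, r, 2)
    return pyramid

def sample_at_mesh(feat_map, points):
    """Nearest-cell lookup of the feature under each (x, y) mesh point."""
    r = feat_map.shape[0]
    idx = np.clip(np.floor(points * r / IMAGE_SIZE).astype(int), 0, r - 1)
    return feat_map[idx[:, 1], idx[:, 0]]                      # row index is y, column index is x

def feedback_loop(pyramid, template, theta):
    """Update theta once per pyramid level using features sampled at the current mesh."""
    history = [theta.copy()]
    for feat_map in pyramid:
        feats = sample_at_mesh(feat_map, template + theta)     # depends on the current estimate
        theta = theta + feats.mean(axis=0)                     # stand-in for a learned regressor
        history.append(theta.copy())
    return theta, history

target = np.array([20.3, 12.7])
template = np.array([[-2.0, -2.0], [2.0, -2.0], [2.0, 2.0], [-2.0, 2.0]])
theta, history = feedback_loop(make_toy_pyramid(target), template, np.array([5.0, 5.0]))
errors = [float(np.abs(h - target).max()) for h in history]
print(errors)  # the error shrinks after each level in this toy setup
```

Because the features are looked up at the current mesh position, each update sees where the estimate currently sits; a loop that reused one global feature vector would feed the regressor the same image evidence at every step.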
Auxiliary Pixel-wise Supervision
Spatial features can easily be affected by noise in images, as can be seen in the image above. To tackle noise caused by occlusion and illumination differences, PyMAF utilizes an auxiliary pixel-wise loss on the spatial features at the last level. This auxiliary supervision provides mesh-image association cues that help the image encoder preserve the most relevant information in the spatial feature maps.
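As a rough illustration of what a pixel-wise auxiliary loss can look like, the sketch below computes a per-pixel cross-entropy between predicted class scores and a label map, then adds it to a main loss. The choice of a per-pixel classification target (for example, a body-part label per pixel), the function names, and the `aux_weight` parameter are assumptions for illustration; the article does not spell out the exact form of PyMAF's auxiliary loss.

```python
import numpy as np

def pixelwise_cross_entropy(logits, labels):
    """Mean per-pixel cross-entropy.

    logits: (H, W, K) class scores predicted from the last-level feature map.
    labels: (H, W) integer class index per pixel, in [0, K).
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)      # numerically stable softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return float(-log_probs[rows, cols, labels].mean())

def total_loss(param_loss, aux_loss, aux_weight=1.0):
    """Combine the main regression loss with the auxiliary term (weight is hypothetical)."""
    return param_loss + aux_weight * aux_loss

# Toy example: 4x4 map, 3 classes, random labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=(4, 4))
uninformed = np.zeros((4, 4, 3))              # uniform scores for every pixel
confident = 10.0 * np.eye(3)[labels]          # high score on the correct class
loss_uninformed = pixelwise_cross_entropy(uninformed, labels)
loss_confident = pixelwise_cross_entropy(confident, labels)
print(loss_uninformed > loss_confident)
```

The auxiliary term is only used during training; at inference time the regressor consumes the feature maps directly.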
Creating Human Pose Meshes From Monocular Images Using PyMAF
The following code has been taken from the official demo Colab notebook available here.
- Clone the PyMAF GitHub repository and navigate into the repository directory.
!git clone https://github.com/HongwenZhang/PyMAF.git
%cd PyMAF
- Install PyTorch, torchvision, and the remaining requirements.
!pip3 install -U https://download.pytorch.org/whl/cu100/torch-1.1.0-cp37-cp37m-linux_x86_64.whl
!pip3 install -U https://download.pytorch.org/whl/cu100/torchvision-0.3.0-cp37-cp37m-linux_x86_64.whl
!pip install -r requirements.txt
- Run the demo.py script to generate the 3D mesh for your video; make sure to replace ./sample_video.mp4 with the path to your video file.
!CUDA_VISIBLE_DEVICES=0 python3 demo.py --checkpoint=data/pretrained_model/PyMAF_model_checkpoint.pt --vid_file ./sample_video.mp4
This article went through PyMAF, a regression-based approach for 3D human pose and mesh recovery. PyMAF introduces a new mesh alignment feedback loop that leverages different scales of spatial information obtained from a feature pyramid. Model parameters are refined by the feedback loop based on the alignment status of the currently estimated meshes. In addition, an auxiliary supervision task is imposed on the spatial feature maps during the training of the regressor. This pixel-wise supervision makes the regressor less susceptible to noise in the images and improves the reliability of the mesh-aligned features. PyMAF was evaluated on both indoor and in-the-wild datasets, and it consistently improved mesh-image alignment over previous regression-based methods.
All images, except the output, have been taken from the PyMAF paper.