Python Guide to Neural Body: Converting 2D images to 3D

Neural Body employs sparse cameras to capture the poses of a dynamic human body and renders integrated, high-quality 3D views and scenes.

Novel view synthesis finds interesting applications in movie production, sports broadcasting and telepresence. It is the process of generating a 3D view, or sometimes an entire 3D scene, from available 2D images captured under different poses, orientations and illuminations. View synthesis of the human body is one of the most challenging problems, especially when the body is in motion. Present view synthesis methods employ either image-based rendering or implicit neural representation to develop the 3D view.

However, the major hindrance in these view synthesis approaches is hardware complexity. View synthesis requires either a dense array of cameras to capture the object from different views and orientations or a few high-definition depth sensors. Such hardware requirements make the system expensive, or impossible to set up due to spatial constraints and strict configuration requirements. The dense camera requirement can be relaxed by employing fewer cameras or sensors, but the reduced number of cameras causes sparsity in view continuity. This makes representation learning of views ill-posed and results in poor view rendering. An approach to novel view synthesis with a limited number of cameras or sensors has therefore become a necessity.



To this end, Sida Peng, Yuanqing Zhang, Qing Shuai, Hujun Bao and Xiaowei Zhou of Zhejiang University, Yinghao Xu of The Chinese University of Hong Kong, and Qianqian Wang of Cornell University introduced a powerful approach named Neural Body that employs sparse cameras to capture the poses of a dynamic human body and renders high-quality 3D views as well as 3D scenes of the original human body.

Neural Body performs 3D reconstruction and Novel view synthesis from a sparse multi-view video captured with limited RGB cameras (Source).

This approach assumes that the implicit neural representations learnt at different video frames share the same set of structured latent codes, anchored to the vertices of a deformable mesh. The sparse captures can thus be integrated into a continuous 3D view representation. The deformable mesh can be deformed to any possible human position based on the input pose. Neural Body synthesizes photorealistic novel views of a human performer in complex motions and under varying illumination from sparse multi-view video frames. Moreover, this framework needs no pre-trained networks to learn the representations.
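The shared set of structured latent codes can be pictured as a small learnable embedding table with one code per mesh vertex: the same codes are reused in every frame, and only the vertex positions move with the pose. The following PyTorch sketch is purely illustrative; the 16-dimensional code size and the use of `nn.Embedding` are assumptions, not the paper's exact implementation (SMPL meshes do have 6,890 vertices).

```python
import torch
import torch.nn as nn

# One learnable latent code per SMPL vertex, shared across all frames.
NUM_SMPL_VERTICES = 6890
CODE_DIM = 16  # illustrative size, not the paper's exact value

latent_codes = nn.Embedding(NUM_SMPL_VERTICES, CODE_DIM)

# For a given frame, posed vertices would come from the SMPL model (N, 3);
# each vertex keeps its own code no matter where the pose moved it.
posed_vertices = torch.rand(NUM_SMPL_VERTICES, 3)  # placeholder pose
codes = latent_codes(torch.arange(NUM_SMPL_VERTICES))  # (6890, 16)
```

Because the embedding table is independent of the frame, the codes learnt from one pose directly transfer to any other pose of the same performer.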

Neural Body generates different implicit 3D representations of a human body based on the input poses from a common structured latent code anchored to a deformable mesh (Source).

While training the Neural Body framework, the structured latent codes are fed into a sparse convolutional network (SparseConvNet) that outputs a latent code volume, giving a 3D representation of the space around the body. The latent code for any query 3D point is then obtained by trilinear interpolation of the neighbouring points in the latent code volume. Once the latent code is obtained for a query point, it is fed into feed-forward networks for colour and density regression.
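The two steps above can be sketched in PyTorch: `F.grid_sample` on a 5-D tensor performs exactly the trilinear interpolation described, and two small feed-forward heads regress density and colour. All shapes, layer sizes and the dense (rather than sparse) code volume here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CODE_DIM = 16

# A dense stand-in for the latent code volume produced by SparseConvNet:
# (batch, channels, depth, height, width).
code_volume = torch.randn(1, CODE_DIM, 32, 32, 32)

def query_latent_code(points):
    """Trilinearly interpolate latent codes at 3D points in [-1, 1]^3.

    points: (N, 3) -> (N, CODE_DIM)
    """
    grid = points.view(1, 1, 1, -1, 3)           # (1, 1, 1, N, 3)
    codes = F.grid_sample(code_volume, grid,     # 'bilinear' on a 5-D input
                          mode='bilinear',       # is trilinear interpolation
                          align_corners=True)    # -> (1, C, 1, 1, N)
    return codes.view(CODE_DIM, -1).t()          # (N, C)

# Simple feed-forward heads for density and view-dependent RGB colour.
density_head = nn.Sequential(nn.Linear(CODE_DIM, 64), nn.ReLU(),
                             nn.Linear(64, 1))
color_head = nn.Sequential(nn.Linear(CODE_DIM + 3, 64), nn.ReLU(),
                           nn.Linear(64, 3))

points = torch.rand(1024, 3) * 2 - 1             # query points in [-1, 1]^3
view_dirs = F.normalize(torch.randn(1024, 3), dim=-1)

codes = query_latent_code(points)                # (1024, 16)
sigma = density_head(codes)                      # (1024, 1) densities
rgb = torch.sigmoid(color_head(torch.cat([codes, view_dirs], dim=-1)))
```

A volume renderer would then integrate `sigma` and `rgb` along each camera ray to produce the final pixel colours.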

The deformable mesh is built by attaching the structured latent codes to its vertices. For this, the well-known SMPL (Skinned Multi-Person Linear) model is employed, which is governed by shape parameters and pose parameters. By anchoring the latent codes to this SMPL model, a dynamic mesh of the human body is obtained. This design enables quick inference for 3D reconstruction and novel view synthesis.
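The mechanism SMPL uses to deform the mesh under the pose parameters is linear blend skinning: each vertex moves as a weighted blend of per-joint rigid transforms. The toy sketch below (five vertices, three joints instead of SMPL's 6,890 vertices and 24 joints) shows only this core idea, not the full SMPL model with its shape blend shapes.

```python
import torch

# Minimal linear blend skinning sketch with toy sizes.
V, J = 5, 3                                   # vertices, joints
rest_vertices = torch.rand(V, 3)
skin_weights = torch.softmax(torch.rand(V, J), dim=-1)   # rows sum to 1
joint_transforms = torch.eye(4).expand(J, 4, 4).clone()  # identity pose

# Blend the 4x4 joint transforms per vertex, then apply them to the
# rest-pose vertices in homogeneous coordinates.
blended = torch.einsum('vj,jab->vab', skin_weights, joint_transforms)
homo = torch.cat([rest_vertices, torch.ones(V, 1)], dim=-1)   # (V, 4)
posed = torch.einsum('vab,vb->va', blended, homo)[:, :3]      # (V, 3)
```

With identity joint transforms the blended transform is also the identity (the weights sum to one), so `posed` equals `rest_vertices`; rotating a joint's transform would drag its weighted vertices along, and the latent codes anchored to those vertices move with them.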

Neural Body on Novel view synthesis and 3D reconstruction (Source)

Python Implementation

Neural Body requires Python 3.6+, CUDA 10.0, PyTorch 1.4.0 and a GPU runtime. The following commands install PyTorch 1.4.0 compatible with CUDA 10.0.

!pip install torch==1.4.0+cu100 -f https://download.pytorch.org/whl/torch_stable.html


The following command downloads source code to the local machine.

!git clone https://github.com/zju3dv/neuralbody.git


Verify the downloaded contents by exploring the directory.

!ls neuralbody


Change the current directory to /content/neuralbody/ by providing the line-magic command.

%cd neuralbody/

Download the Anaconda-3 installer using the following command, if the local machine does not have a conda environment. The installer version shown here is an example; pick the latest release from the Anaconda archive.

 !wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh

Install the downloaded Anaconda-3 package using the following command; the file name must match the installer you downloaded.

 !bash Anaconda3-2020.02-Linux-x86_64.sh -b -p ~/anaconda3
Enable the conda directory to run further commands,

 %cd /content/neuralbody/
 !export PATH=~/anaconda3/bin:$PATH
 !exec bash 

and create and activate the NeuralBody conda environment by providing the following commands.

 conda create -n neuralbody python=3.7
 conda activate neuralbody 

Install the dependencies using the following command.

!pip install -r requirements.txt

Install spconv library and build its wheels using the following commands.

 git clone --recursive https://github.com/traveller59/spconv.git
 cd spconv
 git checkout abf0acf30f5526ea93e687e3f424f62d9cd8313a
 export CUDA_HOME="/usr/local/cuda-10.0"
 python setup.py bdist_wheel
 cd dist
 pip install spconv-1.2.1-cp36-cp36m-linux_x86_64.whl  # wheel name in dist/ depends on your Python version

Download the datasets from the official data page to the directory /content/neuralbody. It should be noted that the datasets exceed 30 GB. Once downloaded, the datasets can be prepared using the following commands.

 ROOT=/content/neuralbody
 cd $ROOT/data
 ln -s /content/neuralbody/people_snapshot people_snapshot
 # OR
 ln -s /content/neuralbody/zju_mocap zju_mocap 

Download the pre-trained models from the official models page to a newly created data/ directory, then enable one of the models and run it using the following commands.

 python run.py --type visualize --cfg_file configs/snapshot_f3c_demo.yaml exp_name female3c
 python run.py --type visualize --cfg_file configs/snapshot_f3c_perform.yaml exp_name female3c
 python run.py --type visualize --cfg_file configs/snapshot_f3c_mesh.yaml exp_name female3c train.num_workers 0
 # start training
 python train_net.py --cfg_file configs/snapshot_f3c.yaml exp_name female3c resume False
 # distributed training based on GPU availability
 python -m torch.distributed.launch --nproc_per_node=4 train_net.py --cfg_file configs/snapshot_f3c.yaml exp_name female3c resume False gpus "0, 1, 2, 3" distributed True 

It should be noted that training may take several hours depending on memory availability and device configuration.

Performance of Neural Body

The Neural Body framework is trained on complex human motions such as twirling, Taichi, arm swings, warmups, punching and kicking. The complex human motions are captured by a multi-camera system of 21 synchronized cameras. Inputs from 4 evenly distributed cameras are chosen for training and the rest for testing. Training and testing of Neural Body and recent state-of-the-art methods, namely NeRF (Neural Radiance Fields), NV (Neural Volumes), COLMAP, DVR (Differentiable Volumetric Rendering), People-Snapshot and PIFuHD, are carried out under identical conditions. Neural Body outperforms all of these models on the PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index Measure) scales.
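For reference, PSNR measures how close a rendered view is to the ground-truth image on a log scale, with higher values meaning less error. A minimal NumPy sketch of the metric (assuming images normalised to [0, 1]; the toy images are illustrative, not the paper's data):

```python
import numpy as np

def psnr(rendered, ground_truth, max_val=1.0):
    """Peak Signal-to-Noise Ratio in decibels; higher is better."""
    mse = np.mean((rendered - ground_truth) ** 2)
    if mse == 0:
        return float('inf')  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

gt = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)   # uniform error of 0.1 everywhere
print(round(psnr(pred, gt), 2))  # mse = 0.01 -> 10 * log10(1/0.01) = 20.0
```

SSIM complements PSNR by comparing local luminance, contrast and structure rather than raw pixel error; implementations are available, for example, in scikit-image.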

Qualitative comparison of Neural Body with other models in Novel view synthesis (Source).
Qualitative comparison of Neural Body with People-Snapshot on 3D reconstruction from monocular videos (Source).


Rajkumar Lakshmanamoorthy
