Stable View Synthesis (SVS) achieves state-of-the-art performance in photorealistic view synthesis of 3D scenes, outperforming earlier approaches both qualitatively and quantitatively. It was developed by Gernot Riegler and Vladlen Koltun of Intel Labs and described in a recent research paper. Photorealistic view synthesis is the task of rendering a new viewpoint of a subject by learning from real images of that subject captured from different positions and orientations with identical camera settings.
Photorealistic view synthesis can help in space exploration and in other settings where real photography is hardly possible. Stable View Synthesis builds a representation of a scene and lets one view that scene from almost any viewpoint; the rendered views can be played back as a sequence of images. The input can be a short video recorded by moving the camera around the subject while keeping the subject in focus.
Stable View Synthesis, SVS for short, first runs structure-from-motion (SfM) on the input images to estimate the pose (position and orientation) and the camera settings of each image. These poses are then used in multi-view stereo to generate a dense 3D point cloud, and the points are meshed into a 3D geometric scaffold of the scene. In parallel, a convolutional autoencoder network encodes the input images into feature tensors.
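The role of the estimated poses can be illustrated with a minimal pinhole-camera sketch that projects 3D scaffold points into one source image. The function name, the intrinsics matrix `K` and the convention `x_cam = R @ x_world + t` are assumptions chosen for illustration; the SVS code base may store poses differently.

```python
import numpy as np

def project_points(points_world, K, R, t):
    """Project Nx3 world points into pixel coordinates.

    Assumes a pinhole camera with x_cam = R @ x_world + t
    (a common SfM convention, used here only for illustration).
    """
    points_cam = points_world @ R.T + t          # world frame -> camera frame
    depths = points_cam[:, 2]                    # distance along the optical axis
    pixels_h = points_cam @ K.T                  # apply the intrinsics
    pixels = pixels_h[:, :2] / pixels_h[:, 2:3]  # perspective divide
    return pixels, depths

# Hypothetical camera: 500 px focal length, principal point at (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)    # camera axes aligned with the world axes
t = np.zeros(3)  # camera placed at the world origin

points = np.array([[0.0, 0.0, 2.0],   # on the optical axis
                   [1.0, 0.0, 2.0]])  # one unit to the right
pixels, depths = project_points(points, K, R, t)
print(pixels)
```

The point on the optical axis lands on the principal point, and the second point is shifted horizontally in proportion to the focal length divided by its depth.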
To synthesise a new view, each pixel of the target view is traced as a ray onto the geometric scaffold, and the surface point it hits is located in the many original images that observe it. Each of these images contributes a feature vector, taken from its feature map along the ray that sees the point. SVS applies on-surface aggregation with a differentiable set network to this set of feature vectors to produce the feature vector of the target ray.
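The idea of aggregating a variable-sized set of per-view features at one surface point can be sketched as follows. This is not the learned set network used by SVS: a fixed, direction-weighted mean stands in for it, and the function name, argument layout and weighting rule are assumptions made for illustration.

```python
import numpy as np

def aggregate_on_surface(features, source_dirs, target_dir, eps=1e-8):
    """Fuse K per-view feature vectors for a single surface point.

    features:    (K, C) feature vectors sampled from K source images
    source_dirs: (K, 3) unit vectors from the surface point to each source camera
    target_dir:  (3,)   unit vector from the surface point to the target camera
    """
    # Weight each source view by how closely its viewing direction matches
    # the target direction; views from the opposite side get zero weight.
    weights = np.clip(source_dirs @ target_dir, 0.0, None)
    # A weighted mean does not depend on the order or the number of views.
    return (weights[:, None] * features).sum(axis=0) / (weights.sum() + eps)

features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [5.0, 5.0]])
source_dirs = np.array([[0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0],
                        [0.0, 0.0, -1.0]])  # third view looks from behind
target_dir = np.array([0.0, 0.0, 1.0])

fused = aggregate_on_surface(features, source_dirs, target_dir)
# Reordering the source views leaves the result unchanged.
fused_shuffled = aggregate_on_surface(features[::-1], source_dirs[::-1], target_dir)
print(fused)
```

A learned set network replaces the fixed weights with trainable functions but keeps the same property: the output must not depend on the order in which the source views are listed.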
To render the output image, a depth map of the geometric scaffold is first computed from the target camera pose and other details. This depth map defines how far each target pixel must be unprojected to reach the geometric scaffold. The output-view-dependent feature vectors generated at these points are assembled into a feature tensor, and the already-trained convolutional neural network transforms this tensor into the output image.
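The unprojection step can be sketched with NumPy as below. The function name, the intrinsics matrix `K` and the pose convention `x_cam = R @ x_world + t` are illustrative assumptions, and depth is assumed to be measured along the optical axis.

```python
import numpy as np

def unproject_depth(depth, K, R, t):
    """Lift every pixel of an HxW depth map to a 3D world point.

    Assumes x_cam = R @ x_world + t and depth along the optical axis.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel grid, each (H, W)
    pixels_h = np.stack([u, v, np.ones_like(u)], axis=-1)
    pixels_h = pixels_h.reshape(-1, 3).astype(float)
    rays_cam = pixels_h @ np.linalg.inv(K).T         # K^-1 [u, v, 1]^T per pixel
    points_cam = rays_cam * depth.reshape(-1, 1)     # scale each ray by its depth
    points_world = (points_cam - t) @ R              # R^T (x_cam - t) per point
    return points_world.reshape(h, w, 3)

# Hypothetical 3x3 image: focal length 100 px, principal point at (1, 1).
K = np.array([[100.0, 0.0, 1.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 3), 2.0)  # a flat surface two units in front of the camera
points = unproject_depth(depth, K, np.eye(3), np.zeros(3))
print(points[1, 1], points[1, 2])
```

The result is indexed as `points[row, column]`, so `points[1, 1]` is the 3D point seen by the centre pixel.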
A few sampled images in a sequence capturing a playground scene from the Tanks and Temples dataset are shown below.
Coding Stable View Synthesis in Python
To install Stable View Synthesis and its dependencies on your local machine, run the following commands. Note that Stable View Synthesis can be trained and run only on a CUDA GPU, so users working in notebook environments should enable a CUDA GPU runtime before installing and training the system.
    %%bash
    # install necessary libraries
    sudo apt-add-repository -y universe
    sudo apt-get install -y libeigen3-dev
    pip install torchvision
    pip install torch-scatter
    pip install torch-sparse
    pip install torch-geometric
    pip install open3d
    pip install opencv-python
    pip install ninja
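Because SVS needs a CUDA GPU, it can save time to confirm that one is visible before building anything. The helper below is a minimal sketch that only checks whether the `nvidia-smi` tool lists a device; the function name is hypothetical, and the presence of `nvidia-smi` is used as a proxy for a working driver. Once PyTorch is installed, `torch.cuda.is_available()` is the more direct check.

```python
import shutil
import subprocess

def cuda_gpu_visible():
    """Return True if nvidia-smi is on the PATH and lists at least one GPU."""
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return False  # no NVIDIA driver tools found
    # "nvidia-smi -L" prints one line per detected GPU.
    result = subprocess.run([smi, "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout

gpu_ok = cuda_gpu_visible()
print("CUDA GPU visible:", gpu_ok)
```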
To obtain the necessary source files, clone the GitHub repository and update its submodules.
    %%bash
    git clone https://github.com/intel-isl/StableViewSynthesis.git
    cd StableViewSynthesis
    git submodule update --init --recursive --remote
Build the bundled extensions.
    %%bash
    cd StableViewSynthesis/ext/preprocess
    cmake -DCMAKE_BUILD_TYPE=Release .
    make
    cd ../mytorch
    python setup.py build_ext --inplace
Change to the experiments directory and run the evaluation with the following command. It loads the pretrained model and evaluates it on four sampled sequences from the Tanks and Temples dataset.
    %%bash
    cd StableViewSynthesis/experiments
    python exp.py --net resunet3.16_penone.dirs.avg.seq+9+1+unet+5+2+16.single+mlpdir+mean+3+64+16 --cmd eval --iter last --eval-dsets tat-subseq
The whole model can also be retrained completely with the following command.
    %%bash
    # each %%bash cell starts in the notebook's working directory
    cd StableViewSynthesis/experiments
    python exp.py --net resunet3.16_penone.dirs.avg.seq+9+1+unet+5+2+16.single+mlpdir+mean+3+64+16 --cmd retrain
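The evaluation and retraining commands share the same long `--net` string, so it can be convenient to assemble both from a single constant. The sketch below only builds argument lists from the flags already shown above; the helper name `exp_command` is hypothetical and not part of the repository.

```python
import shlex

# Network identifier copied verbatim from the commands above.
NET = "resunet3.16_penone.dirs.avg.seq+9+1+unet+5+2+16.single+mlpdir+mean+3+64+16"

def exp_command(cmd, extra=()):
    """Build the argument list for a call to experiments/exp.py."""
    return ["python", "exp.py", "--net", NET, "--cmd", cmd, *extra]

eval_cmd = exp_command("eval", ["--iter", "last", "--eval-dsets", "tat-subseq"])
retrain_cmd = exp_command("retrain")

# Print the commands as they would be typed in a shell.
print(shlex.join(eval_cmd))
print(shlex.join(retrain_cmd))
```

Either list can then be passed to `subprocess.run(..., cwd="StableViewSynthesis/experiments", check=True)` from a Python cell instead of a `%%bash` cell.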
Stable View Synthesis outperforms, both qualitatively and quantitatively, well-regarded approaches such as Free View Synthesis (FVS), Local Light Field Fusion (LLFF), Neural Radiance Fields (NeRF), the improved NeRF variant NeRF++, Extreme View Synthesis (EVS), and Neural Point-Based Graphics (NPBG).
Note: The illustrations in this article are taken from the Tanks and Temples dataset, the FVS dataset, and the original research paper.
Some useful references:
Official code repository on GitHub