NeX is a new scene representation based on the multiplane image (MPI) that models view-dependent effects by performing basis expansion on the pixel representation. Rather than storing static colour values as in a traditional MPI, NeX represents each colour as a function of the viewing angle and approximates this function with a linear combination of learnable spherical basis functions. Moreover, it uses a hybrid parameter modeling strategy that captures high-frequency details in an explicit structure within an implicit MPI modeling framework. This helps recover fine details that are difficult for a neural network to model and produces sharper results in fewer training iterations. NeX also introduces a new dataset, Shiny, designed to test the limits of view-dependent modeling with significantly more challenging effects such as rainbow reflections on a CD and refraction through a test tube.
Approach & Architecture
A multiplane image (MPI) is a 3D scene representation consisting of a collection of D planar images, each with dimensions H × W × 4, where the last dimension holds RGB values and an alpha transparency value. The planes are scaled and placed equidistantly either in depth space (for bounded close-up objects) or in inverse depth space (for scenes that extend out to infinity) along a reference viewing frustum.
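For intuition, the plane placement can be sketched in a few lines of NumPy; the near/far bounds and plane count below are illustrative values, not NeX's defaults:

```python
import numpy as np

def plane_depths(near, far, D, inverse=True):
    """Place D planes equidistantly in depth or inverse depth (disparity)."""
    if inverse:
        # Uniform in 1/z: planes are dense near the camera and sparse
        # toward infinity, which suits unbounded forward-facing scenes.
        return 1.0 / np.linspace(1.0 / near, 1.0 / far, D)
    # Uniform in z: suitable for bounded close-up objects.
    return np.linspace(near, far, D)

depths = plane_depths(near=1.0, far=100.0, D=5)
```

Note how the inverse-depth spacing concentrates planes close to the camera, where parallax changes fastest.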
An RGBα MPI can be rendered in any target view by first warping all its planes to the target view via a homography that relates the reference and target views, and then applying the composite operator. Let c_i ∈ R^(H×W×3) and α_i ∈ R^(H×W×1) be the RGB and alpha "images" of the i-th plane respectively, ordered from back to front, and let A = {α_1, α_2, …, α_D} and C = {c_1, c_2, …, c_D} be the sets of these images. The MPI is rendered in a new view as

Î = O(W(C), W(A)),

where W is the homography warping function and O is the composite operator:

O(C, A) = Σ_{i=1}^{D} c_i α_i ∏_{j=i+1}^{D} (1 − α_j).
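As a minimal NumPy sketch (not the authors' implementation), the composite operator can be applied back to front with a running "over" blend, which is algebraically equivalent to the summation form:

```python
import numpy as np

def composite(C, A):
    """Over-composite MPI planes ordered back to front.

    C: (D, H, W, 3) RGB images, A: (D, H, W, 1) alpha images.
    Computes sum_i c_i * a_i * prod_{j>i} (1 - a_j) incrementally.
    """
    out = np.zeros(C.shape[1:])
    for c, a in zip(C, A):             # back to front
        out = out * (1.0 - a) + c * a  # running form of the sum
    return out

# Two planes: an opaque red background and a half-transparent green foreground.
C = np.array([np.full((2, 2, 3), [1.0, 0.0, 0.0]),
              np.full((2, 2, 3), [0.0, 1.0, 0.0])])
A = np.array([np.ones((2, 2, 1)), np.full((2, 2, 1), 0.5)])
img = composite(C, A)   # each pixel blends to [0.5, 0.5, 0.0]
```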
One main limitation of multiplane images is that they can only model Lambertian surfaces, i.e., surfaces whose colours appear constant regardless of the viewing angle. In real-world scenes, many objects are non-Lambertian, such as a CD, a glass table, or a metal spoon. These objects exhibit view-dependent effects such as reflection and refraction. Reconstructing them with an MPI makes them appear unrealistically dull, without reflections, or can even fail completely due to the violation of the brightness constancy assumption that correspondence matching and 3D reconstruction rely on.
To allow for view-dependent modeling in NeX, the pixel color representation is modified by parameterizing each color value as a function of the viewing direction v = (v_x, v_y, v_z). This results in a mapping function C(v): R^3 → R^3 for every pixel. However, storing this mapping explicitly is limiting and does not generalize to new, unobserved angles. Regressing the color directly from v (and the pixel location) with a neural network is possible but inefficient for real-time rendering. The key idea behind NeX is to approximate this function with a linear combination of learnable basis functions {H_n(v): R^3 → R} over the spherical domain described by vector v:

C^p(v) = k^p_0 + Σ_{n=1}^{N} k^p_n H_n(v).
Here k^p_n ∈ R^3 are the RGB coefficients, or reflectance parameters, of pixel p for the N global basis functions. There are several ways to define a suitable set of basis functions: the spherical harmonics basis is one common choice used heavily in computer graphics, and Fourier or Taylor bases can also be used.
However, these "fixed" basis functions have one shortcoming: the number of basis functions required to capture high-frequency changes within a narrow viewing angle can be very high. This in turn requires more reflectance parameters, which makes both learning and rendering them more difficult. With learnable basis functions, NeX outperforms variants that use these alternative bases with the same number of coefficients.
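The basis expansion can be illustrated with a short NumPy sketch. Here `H` is a fixed stand-in for the learnable basis (in NeX it is produced by an MLP), and the coefficients are random placeholders rather than learned values:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8                       # number of global basis functions

# Per-pixel reflectance parameters: base colour k0 plus N RGB coefficients k_n.
k0 = rng.random(3)          # stored explicitly in NeX
k = rng.random((N, 3))      # predicted by the per-pixel MLP in NeX

def H(v):
    """Stand-in for the learnable basis: returns N scalars for direction v.

    NeX learns these functions with a network; fixed cosine lobes of the
    elevation angle are used here purely for illustration.
    """
    v = np.asarray(v) / np.linalg.norm(v)
    freqs = np.arange(1, N + 1)
    return np.cos(freqs * np.arccos(np.clip(v[2], -1.0, 1.0)))

def pixel_color(v):
    # C(v) = k0 + sum_n k_n * H_n(v)
    return k0 + H(v) @ k

c_front = pixel_color([0.0, 0.0, 1.0])  # colour seen head-on
c_side = pixel_color([0.5, 0.0, 0.5])   # colour from an oblique angle
```

Because the coefficients weight direction-dependent basis values, the same pixel yields different colours from different viewpoints, which is exactly the view-dependent behaviour a plain MPI cannot express.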
NeX uses two separate MLPs: one for predicting per-pixel parameters given the pixel location, and the other for predicting all global basis functions given the viewing angle. The second network ensures that the prediction of the basis functions, which are global, is not a function of the pixel location. The first MLP is modeled as F_θ with parameters θ:

F_θ : (x) → (α, k_1, k_2, …, k_N),

where x = (x, y, d) contains the location information of pixel (x, y) on plane d. The second network is modeled as G_φ with parameters φ:

G_φ : (v) → (H_1(v), H_2(v), …, H_N(v)),

where v is the normalized viewing direction.
Fine details are lost when using a traditional MLP to model the k_n, or "coefficient images". In view-synthesis problems, these fine details tend to come from the surface texture itself and not necessarily from complex scene geometry. NeX uses positional encoding to regress these images, which helps to an extent but still produces blurry results. Amid experimentation, the authors stumbled upon a simple fix: storing the first coefficient k_0, or "base color," explicitly reduced the network's burden of compressing and reproducing detail and led to sharper results in fewer iterations. With this implicit-explicit modeling strategy, NeX predicts every parameter with MLPs except k_0, which is optimized explicitly as a learnable parameter with a total variation regularizer.
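The positional encoding mentioned above is the standard sinusoidal mapping applied to low-dimensional inputs such as x = (x, y, d); the helper below is an illustrative sketch, with the function name and `L` parameter chosen here rather than taken from the NeX codebase:

```python
import numpy as np

def positional_encoding(x, L):
    """Map each coordinate to (sin, cos) pairs at L octave frequencies.

    Higher L lets an MLP fit higher-frequency detail from smooth
    low-dimensional inputs like a pixel location.
    """
    x = np.atleast_1d(x)
    out = []
    for l in range(L):
        out.append(np.sin(2.0 ** l * np.pi * x))
        out.append(np.cos(2.0 ** l * np.pi * x))
    return np.concatenate(out)

# 3 input coordinates -> 3 * 2 * 4 = 24 encoded features
enc = positional_encoding([0.25, -0.5, 0.1], L=4)
```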
Real-time View Synthesis using NeX
Requirements
- Install COLMAP and lpips. FFmpeg and other Python dependencies are already installed in Colab.

```
!pip install lpips
!apt install colmap
```
- Clone the NeX GitHub repository and navigate into the newly created `nex-code` directory. In Colab, `!cd` runs in its own subshell and does not persist, so use the `%cd` magic to actually change directories.

```
!git clone https://github.com/nex-mpi/nex-code
%cd nex-code
```
- Select a scene, make running directories and download the selected dataset from OneDrive.
You can also use your own images but you’ll need at least 12 images in order for NeX to work. In addition to that, downscaling the images to 400-pixel width is recommended for fast upload and training.
```python
scene_urls = {
    'cake': 'https://vistec-my.sharepoint.com/:u:/g/personal/pakkapon_p_s19_vistec_ac_th/ESg8LNsTqmtFmKO-9X4dUsUBVgfw_TbuAheVAEKnsiouug?download=1',
    'crest': 'https://vistec-my.sharepoint.com/:u:/g/personal/pakkapon_p_s19_vistec_ac_th/EYqAlbiZqO1GsiAg-HgEi34B3cBL3tuaFQxg5fyrV5Prew?download=1',
    'giants': 'https://vistec-my.sharepoint.com/:u:/g/personal/pakkapon_p_s19_vistec_ac_th/EUx6wPzSVRtMhpinHKF9ArcBE_4c98xxJLAGSCaM54MiJQ?download=1',
    'room': 'https://vistec-my.sharepoint.com/:u:/g/personal/pakkapon_p_s19_vistec_ac_th/ERVHMv2NeOtKgFLGRJ22jgMBdo3BqCQIfd27MFgLvNOW5w?download=1',
    'seasoning': 'https://vistec-my.sharepoint.com/:u:/g/personal/pakkapon_p_s19_vistec_ac_th/EedXEIqliIZGk-6fxd-cb9cBsUjidu9G5du1TIYOF5FOyQ?download=1',
    'sushi': 'https://vistec-my.sharepoint.com/:u:/g/personal/pakkapon_p_s19_vistec_ac_th/EZZA-3nyCBVLtIra5yMZzC0BFx3f4wqg1cm8rKzTAt2x0g?download=1',
}
scene = "room"
onedrive_dataset = scene_urls[scene]

# make directories for running
!mkdir -p data/demo
!mkdir -p runs

# download and extract the dataset
get_ipython().system_raw('wget -O data/demo/data.zip {}'.format(onedrive_dataset))
get_ipython().system_raw('unzip -o -d data/demo/ data/demo/data.zip')
get_ipython().system_raw('rm data/demo/data.zip')
```
- Set parameters for training.
```python
epochs = 40
image_width = 400

import math
pos_level = math.ceil(math.log(image_width) / math.log(2))
num_offset = int(image_width / 5.0)
web_width = 4096 if image_width <= 400 else 16000
```
- Train NeX on the downloaded images.

```
!python train.py -scene data/demo -model_dir demo -layers 12 -sublayers 6 -epochs $epochs -offset $num_offset -tb_toc 1 -hidden 128 -pos_level $pos_level -depth_level 7 -tb_saveimage 2 -num_workers 2 -llff_width $image_width -web_width=$web_width
```
Training takes around 10 minutes for the preset images and around 20 minutes for your own images.
- Display the generated video.
```python
from IPython.display import HTML
from base64 import b64encode

video_path = "runs/video_output/demo/video.mp4"
mp4 = open(video_path, "rb").read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML(f"""
<video width=400 controls playsinline autoplay muted loop>
  <source src="{data_url}" type="video/mp4">
</video>
""")
```
Last Epoch (Endnote)
This article discussed NeX, a new approach to novel view synthesis that combines a multiplane image (MPI) with neural basis expansion. Although NeX is effective at capturing and reproducing complex view-dependent effects, it is based on the MPI and inherits its limitations. When viewed from an angle too far from the center, "stack of cards" artifacts appear that expose the individual MPI planes. NeX also cannot fully reproduce the hardest scenes in the Shiny dataset, which include effects like light sparkles, extremely sharp highlights, and refraction through test tubes.
References
To learn more about NeX, refer to the following resources:
Want to learn more about view-synthesis? Check out our guide to Intel’s Stable View Synthesis.