Guide to Apple’s Equivariant Neural Rendering: Generating 3D scenes with no 3D supervision

Equivariant Neural Rendering

Apple has introduced a new method for generating 3D scenes called Equivariant Neural Rendering. The method requires no 3D supervision; it learns equivariant scene representations from images and their relative positions alone. The Equivariant Neural Rendering model (Research Paper) was first presented virtually at the International Conference on Machine Learning (ICML) in July 2020 by Emilien Dupont, Miguel Angel Bautista, Alex Colburn, Aditya Sankar, Carlos Guestrin, Josh Susskind and Qi Shan.

Before going further into the details, let’s have a look at some terminology.


Rendering

Rendering is the process of generating a 2D image from a 3D scene. An example of it is shown below:

Inverse Rendering

Inverse rendering is a process of generating a 3D scene from an image or set of images. An example of it is shown below.

Neural Rendering

Neural rendering means parameterizing the rendering function with a neural network. In the example below, the 3D mesh on the left is converted to the image on the right.

Proposed Method

The proposed method learns an inverse renderer and a forward neural renderer directly from images, with no 3D supervision. The model can infer a scene representation from a single image and render novel views of it in real time. It is also not limited to simple scenes with no background or complex lighting effects; it can handle more complex scenes as well. An example result of the model is shown below:

The images on the right are generated from the single image on the left, giving a 3D view with the help of the neural renderer.

Representation of 3D scenes

Let us consider an image x ∈ X = ℝ^(c × h × w), where c, h and w are the number of channels, height and width of the image respectively, and a scene representation z ∈ Z. We define a render function g : Z → X, mapping scene representations to images, and an inverse render function f : X → Z, mapping images to scenes.

Next, we distinguish between two kinds of scene representation: explicit and implicit.

Explicit Representation – These are interpretable and can be rendered with a fixed rendering function. For example, the 3D mesh bunny (on the left) below can be rendered with any standard rendering function.

Implicit Representation – These representations are abstract and are rendered with learned models. For example, the sphere (on the right) stands for a 3D tensor, which is an abstract representation of the scene. It is rendered through the camera shown, which can be a neural network.

Scene Transformations

The main idea of Equivariant Neural Rendering is that the scene representation need not be explicit, as long as it transforms like a real scene. Because nothing is assumed about the representation itself, the model can handle complex scenes and visual effects. For example, in the figure below, figure I shows a scene representation drawn as a mesh (it could equally be an implicit representation). Applying a rotation to it produces figure II, and rendering figure II produces figure IV. The same result can be reached along the other path: rendering figure I produces figure III, and applying the transformation to that image again gives figure IV. In other words, transforming a scene and then rendering it is equal to rendering the scene and then transforming the resulting image.
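The commuting relationship described above can be checked on a toy example. The sketch below is purely illustrative and is not the paper's model: the "scene" is a small 2D grid, the transformation is a 90-degree rotation, and the hypothetical `render` function is a pointwise threshold, chosen only because it commutes with rotation.

```python
def rotate90(grid):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def render(scene):
    """Toy 'renderer': threshold every cell. It acts cell by cell,
    so it commutes with any rearrangement of cells, including rotation."""
    return [[1 if value > 0.5 else 0 for value in row] for row in scene]


scene = [
    [0.9, 0.1, 0.2],
    [0.8, 0.7, 0.3],
    [0.0, 0.6, 0.4],
]

# Path 1: transform the scene, then render it.
transform_then_render = render(rotate90(scene))
# Path 2: render the scene, then transform the image.
render_then_transform = rotate90(render(scene))

# Equivariance means both paths give the same image.
print(transform_then_render == render_then_transform)
```

A learned neural renderer does not have this property automatically, which is why the method has to enforce it during training.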

Equivariant scene representation

From the above scene transformation, we can conclude that rendering is equivariant with respect to transformations in space. This brings us to the definition of an equivariant scene representation: a scene representation that, when coupled with a suitable rendering function, obeys these transformation properties. Equivariance provides a strong inductive bias for representation learning in a 3D environment. Hence, rather than imposing an explicit form on the scene representation, the model learns representations that obey these transformations.
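In symbols, with g the render function defined earlier, and writing T^Z and T^X (notation assumed here for illustration) for the same transformation acting on scene representations and on images respectively, the equivariance property can be stated as:

```latex
g\left(T^{Z} z\right) = T^{X} g(z) \qquad \text{for all } z \in Z
```

The left-hand side transforms the scene and then renders it; the right-hand side renders first and then transforms the image.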

Model Architecture

The architecture is divided into two parts. The inverse renderer takes an RGB image and outputs a scene representation, while the forward renderer does the opposite: it takes a scene representation and outputs an image.

Data Representation

The data required to train the above architecture takes the form of tuples. The first two elements of each tuple are two different images of the same scene, and the last element is the transformation that takes the first view to the second. For example:

Here, T1 represents the transformation that occurred in scene space to match the two views together.
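As a rough illustration of this layout, the sketch below builds one such tuple in plain Python. The `TrainingPair` name, its field names and the tiny placeholder images are hypothetical and are not taken from the official code; the transformation is stored here as a 3x3 rotation matrix, which is only one possible encoding.

```python
from typing import List, NamedTuple

Image = List[List[float]]   # placeholder for an image (rows of pixel values)
Matrix = List[List[float]]  # 3x3 rotation matrix stored as nested lists


class TrainingPair(NamedTuple):
    """Two views of the same scene plus the transformation between them."""
    image_1: Image
    image_2: Image
    transformation: Matrix  # takes the scene as seen in image_1 to image_2


# Rotation by 90 degrees about the z axis, used as an example T1.
t1 = [
    [0.0, -1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]

pair = TrainingPair(
    image_1=[[0.0, 1.0], [0.0, 0.0]],
    image_2=[[0.0, 0.0], [0.0, 1.0]],
    transformation=t1,
)

# A NamedTuple unpacks like an ordinary tuple: (image, image, transformation).
image_1, image_2, transformation = pair
print(len(pair))
```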

Training the model

The figure below illustrates the training procedure. Two images of the same object in different orientations are converted to scene representations with the help of the inverse render function. Since the object is the same, the two scene representations should differ only by the transformation between the views. We therefore apply that transformation to the scene representation of image 1, and its inverse to the scene representation of image 2. The neural renderer should then decode the transformed scenes into the other image, i.e., the transformed scene from image 1 should render as image 2 and vice versa. The model is trained by minimizing the difference between each rendered image and the image expected after rotating the scene.
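The training objective described above can be sketched as follows. This is a toy, pure-Python illustration under stated assumptions: scenes and images are lists of 2D points, the transformation is a planar rotation, and `inverse_render` and `render` are hypothetical stand-ins (identity functions here) for the two neural networks. A real implementation would use learned networks and an image loss on tensors.

```python
import math


def rotate(points, angle_deg):
    """Rotate a list of (x, y) points about the origin by angle_deg degrees."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def l1_distance(a, b):
    """Sum of absolute coordinate differences between two point lists."""
    return sum(abs(ax - bx) + abs(ay - by) for (ax, ay), (bx, by) in zip(a, b))


def render_loss(inverse_render, render, image_1, image_2, angle_deg):
    """Swap-and-render loss for one pair; angle_deg takes view 1 to view 2."""
    scene_1 = inverse_render(image_1)
    scene_2 = inverse_render(image_2)
    # Transform scene 1 forward and scene 2 backward, then render both.
    predicted_2 = render(rotate(scene_1, angle_deg))
    predicted_1 = render(rotate(scene_2, -angle_deg))
    return l1_distance(predicted_2, image_2) + l1_distance(predicted_1, image_1)


def identity(x):
    """Stand-in for both networks in this toy setting."""
    return x


view_1 = [(1.0, 0.0), (0.0, 2.0)]
view_2 = rotate(view_1, 30.0)  # a second view consistent with a 30-degree turn

consistent = render_loss(identity, identity, view_1, view_2, 30.0)
inconsistent = render_loss(identity, identity, view_1, view_2, 90.0)
print(consistent, inconsistent)
```

The loss is close to zero only when the transformation supplied with the pair matches the actual change between the two views, which is the signal that pushes the representation towards equivariance.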

Datasets used to train the model

Demo – Generating 3D view scene using Equivariant Neural Rendering Pre-trained Model 

In this section, we will generate a 3D view from a single image by using a pre-trained model of Equivariant Neural Rendering. The steps are as follows:

  1. Clone the repository and change the working directory of the colab notebook.
!git clone
import os
  1. Now, import the required libraries and packages and create a helper function to plot images stored as a torch.Tensor.
#import the required functions and library
import matplotlib.pyplot as plt
%matplotlib inline
import imageio
import torch
import torchvision
#select the device 
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#helper function to plot the image from torch.tensor
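def plot_img_tensor(img, nrow=4):
    """Helper function to plot image tensors.

    Args:
        img (torch.Tensor): Image or batch of images of shape
            (batch_size, channels, height, width).
        nrow (int): Number of images displayed in each row of the grid.
    """
    # Arrange the batch into a single grid image
    img_grid = torchvision.utils.make_grid(img, nrow=nrow)
    # Move channels last (H, W, C), as matplotlib expects
    plt.imshow(img_grid.cpu().numpy().transpose(1, 2, 0))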
  1. Next, load the pre-trained chair model. The code snippet for this is given below:
from models.neural_renderer import load_model
# Load trained chairs model
model = load_model('trained-models/').to(device)
  1. Convert the loaded image to a tensor, then infer its scene representation with the inverse renderer. The code snippet is given below:
from torchvision.transforms import ToTensor
# `img` is assumed to hold the source image, already loaded as an array (e.g. with imageio)
# Convert image to tensor and add batch dimension
img_source = ToTensor()(img)
img_source = img_source.unsqueeze(0).to(device)
# Infer scene representation
scene = model.inverse_render(img_source)
# Print scene shape
print("Scene shape: {}".format(scene.shape))
  1. Rotate the inferred scene representation with a rotation matrix and render the rotated scene to obtain a novel view.
# Initialize a rotation matrix
rotation_matrix = torch.Tensor(
    [[[ 0.4198, -0.3450, -0.8395],
      [-0.2159,  0.8605, -0.4615],
      [ 0.8816,  0.3749,  0.2867]]]
).to(device)
# Rotate scene by rotation matrix
rotated_scene = model.rotate(scene, rotation_matrix)
# Render rotated scene
rendered = model.render(rotated_scene)
  1. A raw rotation matrix can feel a little abstract, so we can instead describe the view in terms of camera angles. First, specify the azimuth and elevation of the camera for the source image:
# As a rotation matrix can feel a little abstract, we can also reason in terms of 
# camera azimuth and elevation. The initial coordinate at which the source image
# is observed is given by the following azimuth and elevation. Note that these
# are not necessary to generate novel views (as shown above), we just use them 
# for convenience to generate rotation matrices
azimuth_source = torch.Tensor([42.561195]).to(device)
elevation_source = torch.Tensor([23.039995]).to(device)
  1. Define the azimuth and elevation shifts away from the source view, and add them to the source angles to obtain the camera specification for the target image.
# You can set these to any value you like!
# Positive (negative) values correspond to moving camera to the right (left)
azimuth_shift = torch.Tensor([180.]).to(device)  
# Positive (negative) values correspond to moving camera up (down)
elevation_shift = torch.Tensor([20.]).to(device)
azimuth_target = azimuth_source + azimuth_shift
elevation_target = elevation_source + elevation_shift
  1. Now, rotate the original scene to match the target camera angle and render it.
# Rotate scene to match target camera angle
rotated_scene = model.rotate_source_to_target(
    scene, azimuth_source, elevation_source, azimuth_target, elevation_target
)
# Render rotated scene
rendered = model.render(rotated_scene)

This gives you the desired view at particular azimuthal shift and elevation shift.
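To give some intuition for how camera angles relate to a rotation matrix, here is a minimal pure-Python sketch. The axis convention (azimuth about the vertical y axis, elevation about the x axis) and all function names are assumptions made for illustration; the repository's own helpers may use a different convention, so this is not a drop-in replacement for `model.rotate_source_to_target`.

```python
import math


def rotation_y(angle_deg):
    """Rotation about the y axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rotation_x(angle_deg):
    """Rotation about the x axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(m):
    """Transpose a matrix stored as nested lists."""
    return [list(row) for row in zip(*m)]


def camera_rotation(azimuth, elevation):
    """Camera orientation for the given angles (assumed convention)."""
    return matmul(rotation_y(azimuth), rotation_x(elevation))


def source_to_target_rotation(az_src, el_src, az_shift, el_shift):
    """Rotation taking the source camera orientation to the shifted target."""
    source = camera_rotation(az_src, el_src)
    target = camera_rotation(az_src + az_shift, el_src + el_shift)
    # The inverse of a rotation matrix is its transpose.
    return matmul(target, transpose(source))


# Source angles and shifts taken from the steps above.
matrix = source_to_target_rotation(42.561195, 23.039995, 180.0, 20.0)
for row in matrix:
    print([round(value, 4) for value in row])
```

With zero shifts the result is the identity matrix, which matches the expectation that an unchanged camera should leave the scene unrotated.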

  1. Now, to generate several novel views, define lists of azimuth and elevation shifts.
# We can also generate several novel views of the same object
azimuth_shifts = torch.Tensor([20., -50., 120., 180., -90., 50.]).to(device)
elevation_shifts = torch.Tensor([10., -30., 40., -70., 10., 30.]).to(device)

Then pass these values, together with the model, to the generate_novel_views helper to get all the novel views. The code is given below.

# This function expects a single image as input, so remove batch dimension
views = generate_novel_views(model, img_source[0], azimuth_source, elevation_source,
                             azimuth_shifts, elevation_shifts)
plot_img_tensor(views.detach(), nrow=2)

This will produce all the views (one for each pair of shifts listed above).

  1. Now, convert the above novel views into an animation. The code for this is available here and the output is shown below.

You can check the full demo here.

Advantages of Equivariant Neural Rendering

  • The model makes no assumptions about the scene representation or the rendering process, so it can model backgrounds, reflections and other visual effects.
  • It has very few requirements: no 3D supervision is needed, and training relies purely on posed 2D images.
  • Fast – inference at 45 fps on V100 GPU.

Limitations of Equivariant Neural Rendering

  • High-frequency texture and details can be difficult to capture.
  • Model can fail on unusual objects or objects having thin structures as shown below.
  • Training requires a lot of memory and is therefore slow.


The above discussion covers a method for generating 3D scenes by learning scene representations from 2D images. This method is called Equivariant Neural Rendering. The pre-trained models provided can infer a scene representation and render novel views of the scene from a single image. The model has been tested on various datasets such as chairs, cars, mugs and mountains.

Note: All images mentioned in this article are taken from official documents except the output of the code. Links are at the end.

Official code, docs and tutorial are available at:

Aishwarya Verma
A data science enthusiast and a post-graduate in Big Data Analytics. Creative and organized with an analytical bent of mind.
