Andreas Geiger and Michael Niemeyer from the Max Planck Institute for Intelligent Systems and the University of Tübingen have won the best paper award at CVPR 2021 (Conference on Computer Vision and Pattern Recognition).
Their paper, ‘GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields’, explores how to generate new images while controlling what appears in them: the objects, their positions and orientations, the background, and so on. Using a modified GAN architecture, the researchers can even move objects in the image without affecting the background or the other objects.
Generative adversarial networks (GANs) were introduced by Ian J. Goodfellow and colleagues in 2014. A GAN consists of two neural networks, a generator and a discriminator. The generator creates fake data samples (images in this case) to deceive the discriminator, while the discriminator tries to tell real samples from fake ones. The two networks compete with each other throughout training, and the training steps are repeated until neither network can improve much further at its task.
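The competition between the two networks can be made concrete with their loss functions. The following is a minimal sketch, not code from the paper: it assumes the discriminator outputs a raw score (logit) per image and uses the common non-saturating form of the generator loss. The function names are illustrative.

```python
import math

def sigmoid(x):
    """Map a raw discriminator score (logit) to a probability of 'real'."""
    return 1.0 / (1.0 + math.exp(-x))

def discriminator_loss(real_logits, fake_logits):
    """Binary cross-entropy: real images should score 1, generated images 0."""
    real_term = sum(-math.log(sigmoid(x)) for x in real_logits) / len(real_logits)
    fake_term = sum(-math.log(1.0 - sigmoid(x)) for x in fake_logits) / len(fake_logits)
    return real_term + fake_term

def generator_loss(fake_logits):
    """Non-saturating loss: low when the discriminator scores fakes as real."""
    return sum(-math.log(sigmoid(x)) for x in fake_logits) / len(fake_logits)
```

In a training loop, the discriminator is updated to lower `discriminator_loss` and the generator to lower `generator_loss`, in alternation. When every logit is zero the discriminator is guessing, and its loss equals 2 ln 2.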
Present GAN-based systems typically can’t separate objects from one another, so modifying one object affects the others as well. Moreover, traditional GANs operate purely in the 2D image domain. GIRAFFE addresses both problems by building a 3D scene representation into the generator.
“We present GIRAFFE, a novel method for controllable image synthesis. Our key idea is to incorporate a compositional 3D scene representation into the generative model. By representing scenes as compositional generative neural feature fields, we disentangle individual objects from the background as well as their shape and appearance without explicit supervision. Combining this with a neural renderer yields fast and controllable image synthesis,” as per the paper.
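The compositional idea in the quote can be illustrated with a small numerical sketch. It assumes that each object (and the background) provides a density and a feature vector at every 3D point, that densities are summed, and that features are averaged with density weights. The ray renderer uses standard volume-rendering alpha compositing. Function names, array shapes and the `eps` term are illustrative choices rather than the authors' implementation, and the neural renderer that turns the feature image into an RGB image is omitted.

```python
import numpy as np

def composite(sigmas, feats, eps=1e-8):
    """Combine N entities evaluated at the same P points.

    sigmas: (N, P) non-negative densities; feats: (N, P, C) feature vectors.
    Returns the summed density (P,) and the density-weighted mean feature (P, C).
    """
    sigma_total = sigmas.sum(axis=0)
    weighted = (sigmas[..., None] * feats).sum(axis=0)
    # eps avoids division by zero in empty space (an assumption of this sketch)
    return sigma_total, weighted / (sigma_total[..., None] + eps)

def render_ray(sigma, feat, deltas):
    """Alpha-composite S samples along one ray, ordered near to far.

    sigma: (S,) densities; feat: (S, C) features; deltas: (S,) sample spacings.
    """
    alpha = 1.0 - np.exp(-sigma * deltas)  # opacity of each sample
    # transmittance: fraction of the ray that reaches each sample unoccluded
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    return (weights[:, None] * feat).sum(axis=0)
```

Because each entity contributes its own density and feature, moving one object only changes that object's inputs to `composite`, which is what lets the other objects and the background stay fixed.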
The method has four main components:
- Individual objects are modelled as neural feature fields.
- The additive property of feature fields is exploited to composite scenes from multiple individual objects.
- For rendering, the researchers use an efficient combination of volume and neural rendering techniques.
- The model is trained from raw image collections.
The experimental setup is as follows:
- Datasets: The researchers report results on the commonly used single-object datasets Chairs, Cats, CelebA and CelebA-HQ. The first consists of synthetic renderings of Photoshape chairs, and the others are image collections of cat and human faces. The data complexity of these datasets is limited, as the background is purely white or takes up only a small part of the image. They further report results on the more challenging single-object, real-world datasets CompCars, LSUN Churches and FFHQ.
- Baselines: The method is compared against the voxel-based PlatonicGAN, BlockGAN and HoloGAN, and the radiance-field-based GRAF. The researchers further compare against HoloGAN w/o 3D Conv, a variant proposed for higher resolutions, and additionally report a ResNet-based 2D GAN for reference.
- Metrics: Image quality is quantified with the Fréchet Inception Distance (FID), calculated using 20,000 real and fake samples.
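For readers unfamiliar with the FID metric mentioned above, the sketch below shows the underlying Fréchet distance between two sets of feature vectors. In the real metric the features come from a pretrained Inception network. Here they are simply passed in as arrays, and the function name is illustrative. The trace of the matrix square root is computed from eigenvalues, which is one of several possible ways to evaluate it.

```python
import numpy as np

def frechet_distance(feats_real, feats_fake):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_real, feats_fake: arrays of shape (num_samples, feature_dim).
    """
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # tr(sqrt(cov_r @ cov_f)) equals the sum of square roots of the eigenvalues;
    # small negative or imaginary parts are numerical noise and are discarded
    eigvals = np.linalg.eigvals(cov_r @ cov_f)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_f) - 2.0 * trace_sqrt)
```

Lower values mean the generated images are statistically closer to the real ones. Two identical feature sets give a distance of approximately zero.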
Limitations & scope
Dataset Bias: The method struggles to disentangle factors of variation if there is an inherent bias in the data. In the CelebA-HQ dataset, the eyes and hair predominantly point towards the camera, regardless of the face rotation. When the object is rotated, the eyes and hair in the generated images do not stay fixed but are adjusted to match the dataset bias.
Object Transformation Distributions: The study shows disentanglement failures, e.g. for Churches, where the background contains a church, or for CompCars, where the foreground contains background elements. The researchers attribute these to mismatches between the assumed uniform distributions over camera poses and object-level transformations and the real distributions.
“In the future, we plan to investigate how the distributions over object-level transformations and camera poses can be learned from data. Further, incorporating supervision which is easy to obtain, e.g. predicted object masks, is a promising approach to scale to more complex, multi-object scenes,” the researchers concluded.