Last updated January 13, 2021

Hands-On Guide To Nvidia Imaginaire: Image & Video translation GAN Library


Published on January 6, 2021

by Mohit Maithani

Nvidia recently released Imaginaire, a PyTorch-based GAN (Generative Adversarial Network) library that brings together the implementations of several image and video synthesis methods developed by Nvidia. Let’s look at Nvidia’s research and at how to run each of these implementations.

Preface
- Nvidia Research labs
- GAN(Generative Adversarial Network)
Imaginaire
Imaginaire Models
- 1. Image-to-Image translation
- 2. Video-to-Video translation
Installing Imaginaire
1. pix2pixHD
- Implementation
  - Training
2. SPADE
3. UNIT(Unsupervised image-to-image Translation)
4. MUNIT(Multimodal Unsupervised image-to-image Translation)
- Implementation
5. FUNIT(Few-Shot Unsupervised image-to-image Translation)
- Implementation
  - Training
  - Download sample test data by running
6. COCO-FUNIT(Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder)
- Implementation
7. vid2vid(Video-to-Video Synthesis)
- Train vid2vid on the Cityscapes dataset
- Wrapping Up

Preface

Nvidia Research labs

Nvidia Research has been a great success for Nvidia over the years. The lab has partnered with leading Fortune companies to combine the power of AI with Nvidia’s powerful chipsets, and its research areas keep growing over time. It works extensively on GAN techniques.

GAN(Generative Adversarial Network)

“A generative adversarial network (GAN) is a class of machine learning frameworks designed by Ian Goodfellow and his colleagues in 2014. Two neural networks contest with each other in a game (in the form of a zero-sum game, where one agent’s gain is another agent’s loss).”
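To make the zero-sum idea concrete, the sketch below evaluates the original GAN value function on a handful of made-up discriminator outputs. The function name `gan_value` and the sample probabilities are assumptions chosen for illustration; they are not part of Imaginaire.

```python
import math


def gan_value(d_real, d_fake):
    """Minimax value V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator probabilities for real samples, each in (0, 1)
    d_fake: discriminator probabilities for generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical discriminator outputs, for illustration only.
confused = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
confident = gan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])

# The discriminator's payoff is V and the generator's payoff is -V,
# so the two payoffs always sum to zero.
for name, value in [("confused", confused), ("confident", confident)]:
    print(f"{name}: discriminator={value:.3f}, generator={-value:.3f}")
```

A discriminator that separates real from generated samples well drives the value up, which is exactly what the generator is trying to prevent.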

Some of the fields in which Nvidia Research works are as follows:

Nvidia also hosts an AI podcast in which its researchers share the thoughts and inspiration behind their work. You can listen to it here

Imaginaire

Imaginaire is a PyTorch-based Generative Adversarial Network (GAN) library that integrates optimized implementations of multiple image and video synthesis projects developed by Nvidia. It is released under the Nvidia software license.

Imaginaire Models

Imaginaire includes many supervised and unsupervised image-to-image and video-to-video translation models. All the models are pretrained on an Nvidia DGX-1 machine with 8 32GB V100 GPUs using PyTorch Docker v20.03. Let’s discuss them one by one:

1. Image-to-Image translation

Image-to-Image Translation. Image-to-image translation is a class… | by Yongfu Hao | Towards Data Science — https://towardsdatascience.com/image-to-image-translation-69c10c18f6ff

Image-to-image translation is a class of vision and graphics problems in which the goal is to learn the mapping between an input image and an output image. Some of its application areas are style transfer, object transfiguration, and photo enhancement.

Nvidia Imaginaire contains 6 algorithms that support image-to-image translation:

pix2pixHD
SPADE
UNIT
MUNIT
FUNIT
COCO-FUNIT

2. Video-to-Video translation

Video-to-video translation is similar to image-to-image translation, but here the input is a video, which is processed frame by frame. The video-to-video translation models in the Imaginaire library are as follows:

vid2vid
fs-vid2vid
wc-vid2vid

Installing Imaginaire

Imaginaire is tested on Ubuntu 16.04. The prerequisites for running the library are Anaconda3, CUDA 10.2, and cuDNN. Installation is as simple as for most other repositories, but first install the additional package needed to follow the Imaginaire coding practices:

!pip install flake8
!flake8 --install-hook git
!git config --bool flake8.strict true
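Before building from source, it can help to confirm that the prerequisites are visible from the current environment. The snippet below is a minimal sketch using only the Python standard library. The helper name `find_tools` is hypothetical, and the command names are assumptions (`conda` for Anaconda3, `nvcc` for the CUDA toolkit); it does not check versions.

```python
import shutil
import sys


def find_tools(names):
    """Map each command name to its location on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


# Command names are assumptions: conda for Anaconda3, nvcc for the CUDA toolkit.
tools = find_tools(["git", "conda", "nvcc"])

print(f"python {sys.version_info.major}.{sys.version_info.minor}")
for name, path in tools.items():
    print(f"{name}: {path or 'not found'}")
```

A missing entry only means the command is not on `PATH`; the tool may still be installed elsewhere.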

Now install Imaginaire from the source

! git clone https://github.com/nvlabs/imaginaire
## changing directory to inside the imaginaire folder
% cd imaginaire
## install using scripts
! bash scripts/install.sh
! bash scripts/test_training.sh

To install Imaginaire on Windows, follow the steps here

Let’s look at the model implementations in the Imaginaire library one by one:

1. pix2pixHD

pix2pixHD is a high-resolution image synthesis method that supports semantic manipulation with conditional Generative Adversarial Networks (GANs). It was initially implemented by Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, and Andrew Tao of the Nvidia Research team.

According to the official pix2pixHD research paper, “High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs”:

(a) It can synthesize a 2048 × 1024 image from a semantic label map (shown in the upper-left corner of panel (a)).
(b) It can change labels in the original label map to create new scenes, like replacing trees with buildings.
(c) It also allows a user to edit the appearance of individual objects in the scene, e.g. changing the color of a car or the texture of a road.
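A semantic label map assigns a class ID to every pixel, so the edit described in (b) comes down to rewriting IDs before the map is passed to the generator. The sketch below shows that step on a tiny map stored as nested lists. The helper `replace_label` and the class IDs are hypothetical and only illustrate the idea; real datasets define their own IDs.

```python
def replace_label(label_map, old_id, new_id):
    """Return a copy of a 2D label map with every old_id replaced by new_id."""
    return [[new_id if pixel == old_id else pixel for pixel in row]
            for row in label_map]


# Hypothetical class IDs for illustration; real datasets define their own.
ROAD, TREE, BUILDING = 0, 1, 2

label_map = [
    [TREE, TREE, BUILDING],
    [ROAD, ROAD, ROAD],
]

# Replace every tree pixel with a building pixel; the original map is untouched.
edited = replace_label(label_map, old_id=TREE, new_id=BUILDING)
print(edited)
```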

pix2pixHD is trained on an NVIDIA DGX1 with 8 V100 16GB GPUs, which still takes about 10 hours. Don’t worry: you can download the PyTorch model pretrained on the Cityscapes dataset from here

Implementation

pix2pixHD expects a structured dataset folder. Before training, download the Cityscapes dataset from https://www.cityscapes-dataset.com/

Then extract the images, segmentation masks, and object instance masks, and organize them based on the following data structure.

You can check out the previous pix2pixHD repo for extended details here

Training

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/pix2pixhd/cityscapes/ampO1.yaml
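The same launch pattern recurs for every model in this guide, with `--nproc_per_node` set to the number of GPUs. As a minimal sketch, the hypothetical helper below (`build_train_command` is not part of Imaginaire) assembles that command from the flags used in this article, so the GPU count and config path can be changed in one place. It only builds the argument list; nothing is launched.

```python
import shlex


def build_train_command(config, num_gpus=8, logdir=None):
    """Assemble the torch.distributed.launch call used for training."""
    cmd = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}",
        "train.py",
        "--config", config,
    ]
    if logdir is not None:
        cmd += ["--logdir", logdir]
    return cmd


cmd = build_train_command(
    "configs/projects/pix2pixhd/cityscapes/ampO1.yaml", num_gpus=4)
print(shlex.join(cmd))
# To launch from the repository root: subprocess.run(cmd, check=True)
```

Whether a given config trains well on fewer than 8 GPUs is not covered here; the helper only changes how many processes are started.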

Testing

python scripts/download_test_data.py --model_name pix2pixhd
python inference.py --single_gpu \
--config configs/projects/pix2pixhd/cityscapes/ampO1.yaml \
--output_dir projects/pix2pixhd/output/cityscapes

The results are stored in the projects/pix2pixhd/output/cityscapes folder.

2. SPADE

SPADE is a semantic image synthesis method that was previously released by Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu, and it is now integrated into the Imaginaire library. The SPADE model is trained using an NVIDIA DGX1 with 8 V100 32GB GPUs, and training took about 2-3 weeks. Its official research paper, “Semantic Image Synthesis with Spatially-Adaptive Normalization”, describes a one-of-a-kind image synthesis technique that was admired by many researchers. To reproduce its output, you have to download and preprocess the dataset and then train the model.

Implementation

Let’s see how we can reproduce the results:

Download the COCO training and validation images, then extract the images, segmentation masks, and object boundaries for the edge maps. Organize them based on the data structure below.

Build lmdbs

for f in train val; do
python scripts/build_lmdb.py \
--config configs/projects/spade/cocostuff/base128_bs4.yaml \
--data_root dataset/cocostuff_raw/${f} \
--output_root dataset/cocostuff/${f} \
--overwrite \
--paired
done

Train:

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/spade/cocostuff/base128_bs4.yaml \
--logdir logs/projects/spade/cocostuff/base128_bs4.yaml

Test:

python scripts/download_test_data.py --model_name spade
python -m torch.distributed.launch --nproc_per_node=1 inference.py \
--config configs/projects/spade/cocostuff/base128_bs4.yaml \
--output_dir projects/spade/output/cocostuff

For real-time output, you can visit the Nvidia AI Playground, where you can experience the outputs virtually: https://www.nvidia.com/en-us/research/ai-playground/
The previous implementation of SPADE is available here: https://github.com/NVlabs/SPADE
SPADE Research paper: https://arxiv.org/abs/1903.07291

3. UNIT(Unsupervised image-to-image Translation)

This is an improved version of the previous UNIT implementation: https://github.com/mingyuliutw/UNIT. It supports one-to-one mapping between two visual domains. Some of the major differences in this version are:

Uses spectral normalization in the generator and discriminator.
Uses the two-timescale update rule (TTUR).
Uses hinge loss instead of least-squares loss.
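To illustrate the last item in the list above, here is a minimal pure-Python sketch of the standard hinge and least-squares GAN losses on raw discriminator scores. These are the textbook formulations, not code taken from Imaginaire, and the function names and sample scores are assumptions made for the example.

```python
def hinge_d_loss(real_scores, fake_scores):
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    real = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real + fake


def hinge_g_loss(fake_scores):
    """Generator hinge loss: raise the discriminator's score on generated samples."""
    return -sum(fake_scores) / len(fake_scores)


def least_squares_d_loss(real_scores, fake_scores):
    """Least-squares discriminator loss with targets 1 (real) and 0 (fake)."""
    real = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real + fake


# Made-up raw discriminator scores, for illustration only.
real_scores = [2.0, 0.5]
fake_scores = [-2.0, 0.0]

print(hinge_d_loss(real_scores, fake_scores))
print(hinge_g_loss(fake_scores))
print(least_squares_d_loss(real_scores, fake_scores))
```

With the hinge loss, samples that are already classified with a margin (the scores 2.0 and -2.0 here) contribute nothing, whereas the least-squares loss keeps penalizing them for overshooting its targets.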

UNIT is trained using an NVIDIA DGX1 with 8 V100 32GB GPUs. You can train with fewer GPUs or a smaller batch size, but training stability and image quality may suffer.

Download a small dataset for training

python scripts/download_test_data.py --model_name unit

Arrange the dataset into the following data structure format:


Translating images

python -m torch.distributed.launch --nproc_per_node=1 inference.py \
--config configs/projects/unit/winter2summer/base48_bs1.yaml \
--output_dir projects/unit/output/winter2summer

Outputs are saved in projects/unit/output/winter2summer.

4. MUNIT(Multimodal Unsupervised image-to-image Translation)

This is an improved implementation of MUNIT. Some of the improvements are:

Use hinge loss.
Use spectral normalization in the generator and the discriminator.
Use the two-timescale update rule (TTUR) with the discriminator learning rate 0.0004.
Use a global residual discriminator for datasets that don’t require pixel-wise correspondence (e.g., animal faces).
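Spectral normalization, mentioned in the list above, rescales a layer's weight matrix by its largest singular value, which is usually estimated with power iteration. The sketch below shows the idea in pure Python on a small matrix. The function names are assumptions for illustration, and it assumes a non-zero matrix; in PyTorch the same technique is available as `torch.nn.utils.spectral_norm`.

```python
import math


def _matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def _norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def largest_singular_value(weight, iterations=50):
    """Estimate the largest singular value of a non-zero matrix by power iteration."""
    transposed = [list(col) for col in zip(*weight)]
    v = [1.0] * len(weight[0])
    for _ in range(iterations):
        u = _matvec(weight, v)       # u = W v
        v = _matvec(transposed, u)   # v = W^T u
        length = _norm(v)
        v = [x / length for x in v]
    return _norm(_matvec(weight, v))


def spectrally_normalize(weight, iterations=50):
    """Divide the matrix by its estimated largest singular value."""
    sigma = largest_singular_value(weight, iterations)
    return [[x / sigma for x in row] for row in weight]


weight = [[3.0, 0.0], [0.0, 1.0]]
sigma = largest_singular_value(weight)
normalized = spectrally_normalize(weight)
print(sigma, normalized)
```

After normalization the largest singular value is 1, which bounds how much the layer can stretch its input.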

Implementation

We use the dog and cat images in the animal faces dataset (AFHQ). The dataset is available here. Download and extract the data:

--output_dir projects/munit/output/afhq_dog2cat

Previous implementation: https://github.com/NVlabs/MUNIT

Official research paper: https://arxiv.org/abs/1804.04732

5. FUNIT(Few-Shot Unsupervised image-to-image Translation)

The FUNIT framework aims at mapping an image of a source class to an analogous image of an unseen target class by leveraging a few target class images that are made available at test time.
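One simple way to combine a few target class images, and the one the FUNIT paper describes, is to encode each image into a latent vector and average the vectors into a single class code. The sketch below shows only that averaging step on made-up vectors. The helper name `mean_class_code` and the 3-dimensional codes are assumptions; the encoder itself is not shown.

```python
def mean_class_code(codes):
    """Average K per-image latent vectors into one class code."""
    k = len(codes)
    return [sum(values) / k for values in zip(*codes)]


# Hypothetical 3-dimensional latent codes for K=2 target class images.
codes = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]

class_code = mean_class_code(codes)
print(class_code)
```

Because the result is an average, the same function works unchanged for any number of target images K.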

Implementation

Download the dataset and untar the files.

python scripts/download_dataset.py --dataset animal_faces

Build the lmdbs:

for f in train train_all val; do
python scripts/build_lmdb.py \
--config configs/projects/funit/animal_faces/base64_bs8_class119.yaml \
--data_root dataset/animal_faces_raw/${f} \
--output_root dataset/animal_faces/${f} \
--overwrite
done

Training

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/funit/animal_faces/base64_bs8_class119.yaml \
--logdir logs/projects/funit/animal_faces/base64_bs8_class119.yaml

Download the sample test data and run inference:

python scripts/download_test_data.py --model_name funit
python inference.py --single_gpu \
--config configs/projects/funit/animal_faces/base64_bs8_class149.yaml \
--output_dir projects/funit/output/animal_faces


6. COCO-FUNIT(Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder)

COCO-FUNIT was published by Kuniaki Saito, Kate Saenko, and Ming-Yu Liu of Boston University and Nvidia. It is used for generating a photorealistic translation of the input content image in an unseen domain.

It computes the style embedding of the images by leveraging a new module called the constant style bias. The COCO-FUNIT model is effective in addressing the content loss problem. For code and pretrained models, you can also check out the old repository: https://nvlabs.github.io/COCO-FUNIT/

Implementation

Download the dataset; the raw images will be saved in the projects/coco_funit/data/training folder. The library also comes with a copy of the Animal Faces dataset, which you can use for a quick experiment.

Build the lmdbs

for f in train train_all val; do
python -m imaginaire.tools.build_lmdb \
--config configs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml \
--data_root projects/coco_funit/data/raw/training/animal_faces/${f} \
--output_root projects/coco_funit/data/lmdb/training/animal_faces/${f} \
--overwrite
done

Training

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml \
--logdir logs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml

Inference

The output results are stored in projects/coco_funit/output/animal_faces

#download test dataset
python scripts/download_test_data.py --model_name coco_funit
python inference.py --single_gpu \
--config configs/projects/coco_funit/animal_faces/base64_bs8_class149.yaml \
--output_dir projects/coco_funit/output/animal_faces


7. vid2vid(Video-to-Video Synthesis)

vid2vid is a video translation model used for turning semantic label maps into realistic videos, synthesizing objects, or generating human motions from poses. It is trained using an NVIDIA DGX1 with 8 V100 32GB GPUs.

Train vid2vid on the Cityscapes dataset

First, download and rearrange the dataset into the following data structure format:

cityscapes
└───images
    └───seq0001
        └───00001.png
        └───00002.png
└───seg_maps
    └───seq0001
        └───00001.png
        └───00002.png
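Before building the LMDB, it can be worth checking that every frame has a matching segmentation map. The sketch below assumes, based on the layout above, that a frame and its map share the same sequence folder and file name. The helper `find_unpaired_frames` is hypothetical and not part of Imaginaire; the demo builds a throwaway folder so the example runs on its own.

```python
import tempfile
from pathlib import Path


def find_unpaired_frames(root):
    """Return frames under images/ that have no counterpart under seg_maps/."""
    root = Path(root)
    images = root / "images"
    seg_maps = root / "seg_maps"
    missing = []
    for frame in sorted(images.glob("*/*.png")):
        relative = frame.relative_to(images)
        if not (seg_maps / relative).is_file():
            missing.append(relative.as_posix())
    return missing


# Demo on a temporary folder: two frames, but only one segmentation map.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "cityscapes"
    (root / "images" / "seq0001").mkdir(parents=True)
    (root / "seg_maps" / "seq0001").mkdir(parents=True)
    for name in ("00001.png", "00002.png"):
        (root / "images" / "seq0001" / name).touch()
    (root / "seg_maps" / "seq0001" / "00001.png").touch()

    missing = find_unpaired_frames(root)
    print(missing)
```

On a real dataset, call `find_unpaired_frames` with the path to your own `cityscapes` folder; an empty list means every frame is paired.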

Preprocess the data into LMDB format

python scripts/build_lmdb.py --config configs/projects/vid2vid/cityscapes/ampO1.yaml --data_root [PATH_TO_DATA] --output_root datasets/cityscapes/lmdb/[train | val] --paired

Train

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/vid2vid/cityscapes/ampO1.yaml

Inference

python ./scripts/download_test_data.py --model_name vid2vid

python inference.py --single_gpu \
--config configs/projects/vid2vid/cityscapes/ampO1.yaml \
--output_dir projects/vid2vid/output/cityscapes

Wrapping Up

Imaginaire is indeed a multi-purpose library, with functionality ranging from image processing to video translation and generative style transfer. We have seen an introduction and results for the different models (image-to-image translation and video-to-video translation). There are two more video translation models that we didn’t discuss in this article:

fs-vid2vid (a subject-agnostic mapping that converts a semantic video and an example image to a photorealistic video)
wc-vid2vid (improves vid2vid on long-term consistency)

Learn more about the Imaginaire library here
