
Hands-On Guide To Nvidia Imaginaire: Image & Video translation GAN Library

Mohit Maithani

Nvidia Research recently released Imaginaire, a PyTorch-based GAN (Generative Adversarial Network) library that integrates the implementations of several image and video synthesis methods developed by NVIDIA into a single package. Let's look at Nvidia's research and walk through the implementations.


Nvidia Research labs


Nvidia Research Labs has been a great success for Nvidia over the years. The labs have partnered with top Fortune companies to leverage the power of AI alongside Nvidia's extraordinarily powerful chipsets, and their research scope keeps growing, with extensive work on GAN techniques.

GAN(Generative Adversarial Network)

A generative adversarial network (GAN) is a class of machine learning frameworks designed by Ian Goodfellow and his colleagues in 2014. Two neural networks contest with each other in a game (in the form of a zero-sum game, where one agent's gain is another agent's loss).
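To make this zero-sum game concrete, below is a minimal PyTorch sketch of one GAN training step. The toy generator, discriminator, and the sizes used here are purely illustrative and are not taken from Imaginaire.

import torch
import torch.nn as nn

# Toy networks: a generator that maps 64-d noise to a 784-d "image" and a
# discriminator that scores samples as real or fake.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, 784)       # placeholder batch of real samples
noise = torch.randn(32, 64)

# Discriminator step: push scores for real samples toward 1 and fakes toward 0.
fake = G(noise).detach()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: one agent's gain is the other's loss, so the generator is
# rewarded exactly when the discriminator is fooled into calling its fakes real.
g_loss = bce(D(G(noise)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()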

Nvidia Research works across a wide range of AI fields.

Nvidia also hosts an AI podcast with its researchers to share the thinking and inspiration behind their work; you can listen to it here.


Imaginaire is a PyTorch-based Generative Adversarial Network (GAN) library that integrates optimized implementations of multiple image and video synthesis projects developed by Nvidia into one package. It is released under the Nvidia software license.

Imaginaire Models


Imaginaire bundles many supervised and unsupervised image-to-image and video-to-video translation models. All the models are pretrained on an Nvidia DGX-1 machine with eight 32GB V100 GPUs using the PyTorch docker v20.03 container. Let's discuss them one by one:

1. Image-to-Image translation


Image-to-image translation is a class of vision and graphics problems where the goal of the algorithm is to learn the mapping between an input image and an output image. Some of the areas of image-to-image translation are style transfer, object transfiguration, and photo enhancement.

Nvidia Imaginaire contains six algorithms that support image-to-image translation:

  1. pix2pixHD
  2. SPADE
  3. UNIT
  4. MUNIT
  5. FUNIT
  6. COCO-FUNIT

2. Video-to-Video translation


Video-to-video translation is similar to image-to-image translation, but here the input is a video and frames are processed one by one while keeping the output temporally consistent (see the sketch after the list below). The video-to-video translation models trained in the Imaginaire library are as follows:

  1. vid2vid
  2. fs-vid2vid
  3. wc-vid2vid
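As a rough illustration of the frame-by-frame idea, here is a small PyTorch-style sketch; the generator and the number of past frames are hypothetical placeholders, not the actual vid2vid interface.

import torch

def translate_video(frames, generator, num_past=2):
    # Toy frame-recurrent loop: each output frame is conditioned on the current
    # input frame (e.g. a semantic label map) plus previously generated frames,
    # which is what keeps the video temporally consistent instead of flickering.
    b, c, h, w = frames[0].shape
    past = [torch.zeros(b, 3, h, w) for _ in range(num_past)]  # warm-up history
    outputs = []
    for frame in frames:
        cond = torch.cat([frame] + past, dim=1)  # stack input and history along channels
        out = generator(cond)                    # hypothetical generator taking the stack
        outputs.append(out)
        past = past[1:] + [out]                  # slide the history window forward
    return outputs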

Installing Imaginaire

Imaginaire is tested on Ubuntu 16.04, and its prerequisites are Anaconda3, CUDA 10.2, and cuDNN. Installation is as simple as for most repositories, but first install flake8 and set up its git hook, which the Imaginaire contribution practices expect:

!pip install flake8
!flake8 --install-hook git
!git config --bool flake8.strict true

Now install Imaginaire from source:

## clone the Imaginaire repository
! git clone https://github.com/NVlabs/imaginaire
## changing directory to inside the imaginaire folder
% cd imaginaire
## install dependencies and run a quick check using the provided scripts
! bash scripts/install.sh
! bash scripts/test_training.sh

To install Imaginaire on Windows, follow the steps here.

Let's go through the different model implementations of the Imaginaire library one by one:

1. pix2pixHD


pix2pixHD is a high-resolution image synthesis method that supports semantic manipulation with conditional Generative Adversarial Networks (GANs). It was initially implemented by Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, and Andrew Tao of the Nvidia Research team.

According to the official pix2pixHD research paper, "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs":

  • (a) It can synthesize 2048 × 1024 images from semantic label maps.
  • (b) It can change labels in the original label map to create new scenes, like replacing trees with buildings.
  • (c) The framework also allows a user to edit the appearance of individual objects in the scene, e.g. changing the color of a car or the texture of a road.

pix2pixHD is trained on an NVIDIA DGX-1 with 8 V100 16GB GPUs, and training still takes about 10 hours. Don't worry, you can download the pretrained PyTorch model for the Cityscapes dataset from here.


pix2pixHD expects a structured dataset folder. Before training, download the Cityscapes dataset, extract the images, segmentation masks, and object instance masks, and organize them based on the following data structure.

pix2pixhd library data structure

You can check out the original pix2pixHD repo for extended details here.


Train the model on the Cityscapes dataset:

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/pix2pixhd/cityscapes/ampO1.yaml


Download the sample test data and run inference:

python scripts/download_test_data.py --model_name pix2pixhd
python inference.py --single_gpu \
--config configs/projects/pix2pixhd/cityscapes/ampO1.yaml \
--output_dir projects/pix2pixhd/output/cityscapes

The results are stored inside the projects/pix2pixhd/output/cityscapes folder.

pix2pixHD outputs


2. SPADE

SPADE outputs

SPADE is a semantic image synthesis method originally released by Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu, and it is now integrated into the Imaginaire library. The SPADE model is trained using an NVIDIA DGX-1 with 8 V100 32GB GPUs, and training took about 2-3 weeks. Its official research paper, "Semantic Image Synthesis with Spatially-Adaptive Normalization", introduced a one-of-a-kind image synthesis technique admired by many researchers. To rebuild the outputs you have to download and preprocess the dataset and then train the model.
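The core idea, spatially-adaptive normalization, is that the segmentation map itself predicts per-pixel scale and shift values that modulate the normalized activations, so the layout information is never washed out by normalization. The snippet below is a simplified sketch of that idea in PyTorch, not the exact layer shipped in Imaginaire.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSPADE(nn.Module):
    # Simplified spatially-adaptive normalization block (illustrative only).
    def __init__(self, channels, label_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)  # normalize without a learned affine
        self.shared = nn.Sequential(nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)  # per-pixel scale from the label map
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)   # per-pixel shift from the label map

    def forward(self, x, segmap):
        segmap = F.interpolate(segmap, size=x.shape[2:], mode='nearest')
        h = self.shared(segmap)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)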


Let’s see how we can rebuild the results:

Download the COCO training and validation images, extract the images, segmentation masks, and object boundaries for the edge maps, and organize them based on the data structure below.

SPADE library data structures

Build lmdbs

for f in train val; do
python scripts/build_lmdb.py \
--config configs/projects/spade/cocostuff/base128_bs4.yaml \
--data_root dataset/cocostuff_raw/${f} \
--output_root dataset/cocostuff/${f} \
--overwrite
done


Train the model:

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/spade/cocostuff/base128_bs4.yaml \
--logdir logs/projects/spade/cocostuff/base128_bs4.yaml


Download the sample test data and run inference:

python scripts/download_test_data.py --model_name spade
python -m torch.distributed.launch --nproc_per_node=1 inference.py \
--config configs/projects/spade/cocostuff/base128_bs4.yaml \
--output_dir projects/spade/output/cocostuff

3. UNIT(Unsupervised image-to-image Translation)

UNIT outputs

This is an improved version of the original UNIT implementation; it supports one-to-one mapping between two visual domains. Some of the major differences in this implementation are:

  • Uses spectral normalization in the generator and the discriminator.
  • Uses the two-time-scale update rule (TTUR).
  • Uses hinge loss instead of least-squares loss.

UNIT is trained using an NVIDIA DGX-1 with 8 V100 32GB GPUs. You can try training with fewer GPUs or a smaller batch size, but training stability and image quality may suffer.
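The three changes listed above (spectral normalization, TTUR, hinge loss) are standard GAN-stabilization techniques, and a minimal PyTorch sketch of them looks roughly like this; the layer sizes and learning rates here are illustrative, not Imaginaire's actual settings.

import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectral normalization wraps the weight layers of the generator and discriminator.
G = nn.Sequential(spectral_norm(nn.Conv2d(3, 64, 3, padding=1)), nn.ReLU(),
                  spectral_norm(nn.Conv2d(64, 3, 3, padding=1)), nn.Tanh())
D = nn.Sequential(spectral_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1)), nn.LeakyReLU(0.2),
                  nn.Conv2d(64, 1, 4, padding=1))  # patch-wise real/fake scores

# Hinge loss replaces the least-squares GAN loss.
def d_hinge_loss(real_scores, fake_scores):
    return torch.relu(1.0 - real_scores).mean() + torch.relu(1.0 + fake_scores).mean()

def g_hinge_loss(fake_scores):
    return -fake_scores.mean()

# Two-time-scale update rule (TTUR): the discriminator gets a larger learning
# rate than the generator so it stays ahead during training.
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.999))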

Download a small sample dataset:

python scripts/download_test_data.py --model_name unit

Arrange the dataset into the following data structure format:

UNIT data structure

Translate the images:

python -m torch.distributed.launch --nproc_per_node=1 inference.py \
--config configs/projects/unit/winter2summer/base48_bs1.yaml \
--output_dir projects/unit/output/winter2summer
UNIT image translation

Outputs are saved in projects/unit/output/winter2summer:

4. MUNIT(Multimodal Unsupervised image-to-image Translation)


This is an improved implementation of MUNIT with several enhancements, including:

  • Uses hinge loss.
  • Uses spectral normalization in the generator and the discriminator.
  • Uses the two-time-scale update rule (TTUR) with a discriminator learning rate of 0.0004.
  • Uses a global residual discriminator.
  • Doesn't require pixel-wise correspondence between domains (e.g., animal faces).
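The "multimodal" part of MUNIT comes from splitting an image into a content code and a style code, so one content image can be decoded with many sampled styles to produce diverse translations. Here is a rough sketch of that inference-time idea with hypothetical encoder/decoder placeholders (this is not the Imaginaire API):

import torch

def translate_multimodal(x, content_encoder, decoder, style_dim=8, num_outputs=3):
    # Toy MUNIT-style translation: keep the content of x and decode it with
    # several randomly sampled target-domain style codes, one output per style.
    content = content_encoder(x)                   # domain-invariant content code
    outputs = []
    for _ in range(num_outputs):
        style = torch.randn(x.size(0), style_dim)  # sample a random style code
        outputs.append(decoder(content, style))    # hypothetical decoder(content, style)
    return outputs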


We use the dog and cat images in the Animal Faces dataset (AFHQ). The dataset is available here. Download and extract the data:

--output_dir projects/munit/output/afhq_dog2cat


Previous implementation:

Official research paper:

5. FUNIT(Few-Shot Unsupervised image-to-image Translation)


The FUNIT framework aims at mapping an image of a source class to an analogous image of an unseen target class by leveraging a few target-class images that are made available at test time.
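In practice this means that, at test time, the style of the unseen class is estimated from just a handful of example images. The sketch below illustrates that few-shot step with hypothetical encoder/decoder placeholders (again, not the actual Imaginaire interface):

import torch

def few_shot_translate(content_image, target_examples, content_encoder, class_encoder, decoder):
    # Toy FUNIT-style few-shot translation: the class (style) code of the unseen
    # target class is the average of the codes of its few example images.
    content = content_encoder(content_image)
    codes = [class_encoder(img) for img in target_examples]  # one code per example image
    class_code = torch.stack(codes, dim=0).mean(dim=0)       # average over the few shots
    return decoder(content, class_code)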


Download the dataset and untar the files.

python scripts/download_dataset.py --dataset animal_faces

Build the lmdbs:

for f in train train_all val; do
python scripts/build_lmdb.py \
--config configs/projects/funit/animal_faces/base64_bs8_class119.yaml \
--data_root dataset/animal_faces_raw/${f} \
--output_root dataset/animal_faces/${f}
done


Train the model:

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/funit/animal_faces/base64_bs8_class119.yaml \
--logdir logs/projects/funit/animal_faces/base64_bs8_class119.yaml

Download the sample test data and run inference:

python scripts/download_test_data.py --model_name funit
python inference.py --single_gpu \
--config configs/projects/funit/animal_faces/base64_bs8_class149.yaml \
--output_dir projects/funit/output/animal_faces
FUNIT image-to-image translation

Learn more about FUNIT:

6. COCO-FUNIT(Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder)


COCO-FUNIT was published by Kuniaki Saito and Kate Saenko of Boston University together with Ming-Yu Liu of NVIDIA. It is used for generating a photorealistic translation of the input content image in an unseen domain.

It computes the style embedding of the images by leveraging a new module called the constant style bias. The COCO-FUNIT model is effective at addressing the content loss problem. For code and pretrained model references, you can also check out the original repository:



Download the dataset; the raw images will be saved in the projects/coco_funit/data/training folder. The library also ships with a copy of the Animal Faces dataset for quick experiments, so you can use that one.

Build the lmdbs

for f in train train_all val; do
python scripts/build_lmdb.py \
--config configs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml \
--data_root projects/coco_funit/data/raw/training/animal_faces/${f} \
--output_root projects/coco_funit/data/lmdb/training/animal_faces/${f}
done


Train the model:

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml \
--logdir logs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml


The output results are stored in projects/coco_funit/output/animal_faces.

# download the test dataset and run inference
python scripts/download_test_data.py --model_name coco_funit
python inference.py --single_gpu \
--config configs/projects/coco_funit/animal_faces/base64_bs8_class149.yaml \
--output_dir projects/coco_funit/output/animal_faces
COCO-FUNIT image translation using the Nvidia Imaginaire library

Read More:

7. vid2vid(Video-to-Video Synthesis)


vid2vid is a video translation method used for turning semantic label maps into realistic videos, synthesizing objects, or generating human motions from pose maps. It is trained using an NVIDIA DGX-1 with 8 V100 32GB GPUs.

Train vid2vid on the Cityscapes dataset

First, download the Cityscapes dataset and rearrange it into the data structure that vid2vid expects.

Preprocess the data into LMDB format

python scripts/build_lmdb.py \
--config configs/projects/vid2vid/cityscapes/ampO1.yaml \
--data_root [PATH_TO_DATA] \
--output_root datasets/cityscapes/lmdb/[train | val] \
--paired


Train the model:

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/vid2vid/cityscapes/ampO1.yaml


Download the sample test data:

python ./scripts/download_test_data.py --model_name vid2vid


Run inference:

python inference.py --single_gpu \
--config configs/projects/vid2vid/cityscapes/ampO1.yaml \
--output_dir projects/vid2vid/output/cityscapes

Wrapping Up

Imaginaire is indeed a multi-purpose library, with functionality spanning image processing, video translation, and generative style transfer. We have seen introductions and results for the different models (supervised and unsupervised image-to-image translation, and video-to-video translation). There are two more video translation models we didn't discuss in this article:

  • fs-vid2vid (a subject-agnostic mapping that converts a semantic video and an example image into a photorealistic video)
  • wc-vid2vid (improves vid2vid's long-term consistency)

Learn more about the Imaginaire library here.
