
Hands-On Guide To Nvidia Imaginaire: Image & Video Translation GAN Library


Recently, Nvidia labs launched a PyTorch-based GAN (Generative Adversarial Network) library, Imaginaire, which integrates the implementations of several image and video synthesis methods developed by NVIDIA into one package. Let's discuss Nvidia's research behind it and then work through the implementations.

Preface

Nvidia Research labs


Nvidia Research Labs has been a great success for Nvidia over the years. The group has formed partnerships with top Fortune companies to leverage the power of AI on Nvidia's extraordinarily powerful chipsets, its research areas keep growing over time, and it is working extensively on GAN techniques.

GAN(Generative Adversarial Network)

A generative adversarial network (GAN) is a class of machine learning frameworks designed by Ian Goodfellow and his colleagues in 2014. Two neural networks contest with each other in a game (in the form of a zero-sum game, where one agent's gain is another agent's loss).
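To make the zero-sum setup concrete, here is a minimal, generic PyTorch sketch of one adversarial training step (illustration only, not code from Imaginaire; the tiny fully connected networks and random data are stand-ins):

# Minimal GAN training step in PyTorch -- a generic sketch, not Imaginaire code.
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 784  # e.g. flattened 28x28 images

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),  # real/fake logit
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_images = torch.rand(32, image_dim)   # stand-in for a batch of real data
noise = torch.randn(32, latent_dim)

# Discriminator step: push real logits toward 1 and fake logits toward 0.
fake_images = generator(noise).detach()
loss_d = bce(discriminator(real_images), torch.ones(32, 1)) + \
         bce(discriminator(fake_images), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to fool the discriminator into predicting "real" for fakes.
loss_g = bce(discriminator(generator(noise)), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()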

Nvidia Research works across many fields, and Nvidia also hosts an AI podcast in which its researchers share the thoughts and inspiration behind their work; you can listen to the episodes here.

Imaginaire

Imaginaire is a PyTorch-based Generative Adversarial Network (GAN) library that integrates optimized implementations of multiple image and video synthesis projects developed by Nvidia into one package. It is released under the Nvidia Software License.

Imaginaire Models


Imaginaire bundles many supervised, unsupervised, image-to-image and video-to-video translation models into one library. All the models are pretrained on an Nvidia DGX-1 machine with eight 32GB V100 GPUs using the PyTorch Docker container v20.03. Let's discuss them one by one:

1. Image-to-Image translation


Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image. Typical applications of image-to-image translation include style transfer, object transfiguration, and photo enhancement.
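To make "learning a mapping between an input image and an output image" concrete, here is a tiny fully convolutional generator in PyTorch that maps an image tensor to an output of the same spatial size (a generic sketch for intuition; it is not one of Imaginaire's networks):

# A tiny encoder-decoder generator: input image in, same-size image out.
# Generic PyTorch sketch for illustration only -- not an Imaginaire model.
import torch
import torch.nn as nn

class TinyTranslator(nn.Module):
    def __init__(self, in_channels=3, out_channels=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_channels, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.rand(1, 3, 256, 256)   # e.g. a segmentation map or a photo
y = TinyTranslator()(x)          # translated image, same spatial size
print(y.shape)                   # torch.Size([1, 3, 256, 256])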

Nvidia Imaginaire contains six algorithms that support image-to-image translation:

  1. pix2pixHD
  2. SPADE
  3. UNIT
  4. MUNIT
  5. FUNIT
  6. COCO-FUNIT

2. Video-to-Video translation


Video-to-video translation is similar to image-to-image translation, but here the input is a video and the frames are processed one by one (a simplified sketch of this loop follows the model list below). The video-to-video translation models trained in the Imaginaire library are as follows:

  1. vid2vid
  2. fs-vid2vid
  3. wc-vid2vid
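As mentioned above, a video-to-video pipeline essentially wraps an image translator in a loop over frames. The sketch below shows that loop with a stand-in per-frame model (a simplified illustration; Imaginaire's vid2vid models additionally condition on previously generated frames for temporal consistency):

# Frame-by-frame translation of a video clip -- simplified illustration only.
# Real vid2vid models also condition on past frames to keep the output
# temporally consistent; this sketch translates each frame independently.
import torch
import torch.nn as nn

frame_model = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stand-in for a trained generator

video = torch.rand(30, 3, 256, 256)   # 30 RGB frames of a decoded clip

translated = []
with torch.no_grad():
    for frame in video:                          # iterate over the time dimension
        translated.append(frame_model(frame.unsqueeze(0)))
translated = torch.cat(translated, dim=0)        # back to (T, C, H, W)
print(translated.shape)                          # torch.Size([30, 3, 256, 256])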

Installing Imaginaire

Imaginaire is tested on the Ubuntu 16.04 operating system, and the prerequisites needed to run the library are Anaconda3, CUDA 10.2, and cuDNN. Installation is as simple as for most repositories, but first install the additional package required to follow the Imaginaire coding practices:

!pip install flake8
!flake8 --install-hook git
!git config --bool flake8.strict true

Now install Imaginaire from source:

! git clone https://github.com/nvlabs/imaginaire
## changing directory to inside the imaginaire folder
% cd imaginaire
## install using scripts
! bash scripts/install.sh
! bash scripts/test_training.sh

To install Imaginaire on Windows, follow the steps here.

Let's go through the different model implementations of the Imaginaire library one by one:

1. pix2pixHD


pix2pixHD is a high-resolution image synthesis framework that supports semantic manipulation with conditional Generative Adversarial Networks (GANs). It was initially implemented by Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, and Andrew Tao of the Nvidia Research team.

According to the official pix2pixHD research paper, "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs":

  • (a) It was able to synthesize 2048 × 1024 images from semantic label maps.
  • (b) It can change labels in the original label map to create new scenes, like replacing trees with buildings.
  • (c) The framework also allows a user to edit the appearance of individual objects in the scene, e.g. changing the colour of a car or the texture of a road.

pix2pixHD is trained on an NVIDIA DGX-1 with eight V100 16GB GPUs, which still takes about ten hours. Don't worry: you can download the PyTorch model pretrained on the Cityscapes dataset from here.

Implementation

pix2pixHD expects a structured dataset folder. Before training, download the Cityscapes dataset from https://www.cityscapes-dataset.com/, then extract the images, segmentation masks, and object instance masks, and organize them based on the following data structure.

[Image: pix2pixHD dataset folder structure]

You can check out the previous pix2pixHD repo for extended details here

Training

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/pix2pixhd/cityscapes/ampO1.yaml

Testing

python scripts/download_test_data.py --model_name pix2pixhd
python inference.py --single_gpu \
--config configs/projects/pix2pixhd/cityscapes/ampO1.yaml \
--output_dir projects/pix2pixhd/output/cityscapes

The results are stored inside the projects/pix2pixhd/output/cityscapes folder.


2. SPADE


SPADE is a semantic image synthesis framework originally released by Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu, and it is now integrated into the Imaginaire library. The SPADE model is trained using an NVIDIA DGX-1 with eight V100 32GB GPUs; training took about 2-3 weeks. Its official research paper, "Semantic Image Synthesis with Spatially-Adaptive Normalization", describes a one-of-its-kind image synthesis technique that was admired by many researchers. To rebuild the outputs you have to download and preprocess the dataset and then train the model.
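The "spatially-adaptive normalization" in the paper's title means that the segmentation map itself predicts per-pixel scale and bias terms that modulate the normalized activations. A minimal PyTorch sketch of such a layer (for intuition only; not Imaginaire's exact implementation) looks like this:

# Spatially-adaptive normalization (SPADE) -- minimal sketch for intuition,
# not the exact layer used in Imaginaire.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADENorm(nn.Module):
    def __init__(self, channels, label_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)   # parameter-free normalization
        self.shared = nn.Sequential(nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, segmap):
        # Resize the segmentation map to the feature resolution,
        # then predict per-pixel modulation parameters from it.
        segmap = F.interpolate(segmap, size=x.shape[2:], mode='nearest')
        h = self.shared(segmap)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)

x = torch.rand(2, 64, 32, 32)            # feature map inside the generator
segmap = torch.rand(2, 35, 128, 128)     # one-hot-style semantic label map
print(SPADENorm(64, 35)(x, segmap).shape)   # torch.Size([2, 64, 32, 32])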

Implementation

Let’s see how we can rebuild the results:

Download the COCO training and validation images, then extract the images, segmentation masks, and object boundaries for the edge maps. Organize them based on the below data structure.

[Image: SPADE dataset folder structure]

Build the lmdbs:

for f in train val; do
python scripts/build_lmdb.py \
--config configs/projects/spade/cocostuff/base128_bs4.yaml \
--data_root dataset/cocostuff_raw/${f} \
--output_root dataset/cocostuff/${f} \
--overwrite \
--paired
done

Train:

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/spade/cocostuff/base128_bs4.yaml \
--logdir logs/projects/spade/cocostuff/base128_bs4.yaml

Test:

python scripts/download_test_data.py --model_name spade
python -m torch.distributed.launch --nproc_per_node=1 inference.py \
--config configs/projects/spade/cocostuff/base128_bs4.yaml \
--output_dir projects/spade/output/cocostuff

3. UNIT(Unsupervised image-to-image Translation)


UNIT in Imaginaire is an improved version of the previous implementation (https://github.com/mingyuliutw/UNIT). It supports one-to-one mapping between two visual domains. Some of the major differences in this version are:

  • It uses spectral normalization in the generator and the discriminator.
  • It uses the two-time-scale update rule (TTUR).
  • It uses the hinge loss instead of the least-squares loss.

UNIT is trained using an NVIDIA DGX-1 with eight V100 32GB GPUs. You can try training with fewer GPUs or a smaller batch size, but training stability and image quality may suffer.
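The spectral normalization and hinge loss listed above can be expressed in a few lines of PyTorch; the snippet below is a generic sketch of those two techniques, not UNIT's actual network code:

# Spectral normalization and hinge GAN losses -- generic sketch of the two
# techniques mentioned above, not UNIT's actual network code.
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# A discriminator layer wrapped with spectral normalization for stability.
disc_layer = spectral_norm(nn.Conv2d(3, 1, 4, stride=2, padding=1))

def d_hinge_loss(real_logits, fake_logits):
    # Discriminator hinge loss: penalize real logits below +1 and fake logits above -1.
    return torch.relu(1.0 - real_logits).mean() + torch.relu(1.0 + fake_logits).mean()

def g_hinge_loss(fake_logits):
    # Generator hinge loss: push the discriminator's fake logits upward.
    return -fake_logits.mean()

real_logits = disc_layer(torch.rand(8, 3, 64, 64))   # patch-wise real/fake logits
fake_logits = disc_layer(torch.rand(8, 3, 64, 64))
print(d_hinge_loss(real_logits, fake_logits).item(), g_hinge_loss(fake_logits).item())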

Download a small dataset for training:

python scripts/download_test_data.py --model_name unit

Arrange the dataset into the following data structure format:

[Image: UNIT dataset folder structure]

Translating images 

python -m torch.distributed.launch --nproc_per_node=1 inference.py \
--config configs/projects/unit/winter2summer/base48_bs1.yaml \
--output_dir projects/unit/output/winter2summer

Outputs are saved in projects/unit/output/winter2summer.

4. MUNIT(Multimodal Unsupervised image-to-image Translation)


This is an improved implementation of MUNIT; many improvements have been made, some of which are:

  • Uses the hinge loss.
  • Uses spectral normalization in the generator and the discriminator.
  • Uses the two-timescale update rule (TTUR) with a discriminator learning rate of 0.0004 (see the sketch after this list).
  • Uses a global residual discriminator.
  • Doesn't require pixel-wise correspondence between domains (e.g., animal faces).
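The two-timescale update rule simply means the discriminator and generator are optimized with different learning rates. A minimal sketch follows; the 0.0004 discriminator rate comes from the list above, while the 0.0001 generator rate and the stand-in networks are assumptions for illustration:

# Two-timescale update rule (TTUR): the discriminator gets a larger learning
# rate than the generator. Generic sketch; the 0.0001 generator rate and the
# stand-in networks are assumptions, only the 0.0004 discriminator rate is quoted above.
import torch
import torch.nn as nn

generator = nn.Linear(16, 16)        # stand-ins for the real networks
discriminator = nn.Linear(16, 1)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4)
# Training then alternates opt_d.step() and opt_g.step() as in any GAN loop,
# but the discriminator effectively learns on a faster timescale.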

Implementation

We use the dog and cat images from the Animal Faces (AFHQ) dataset, which is available here. Download and extract the data; inference follows the same pattern as the other models, with the outputs saved in projects/munit/output/afhq_dog2cat.


Previous implementation: https://github.com/NVlabs/MUNIT

Official research paper: https://arxiv.org/abs/1804.04732

5. FUNIT(Few-Shot Unsupervised image-to-image Translation)


The FUNIT framework aims at mapping an image of a source class to an analogous image of an unseen target class by leveraging a few target-class images that are made available at test time.
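Conceptually, at test time FUNIT combines the content code of the source image with a class (style) code averaged over the K available target-class images. The sketch below illustrates only that data flow with dummy encoder and decoder modules; it is not FUNIT's real architecture:

# Conceptual few-shot translation flow: content code from the source image,
# class/style code averaged over K example images of the unseen target class.
# Dummy modules for illustration only -- not FUNIT's real architecture.
import torch
import torch.nn as nn

content_encoder = nn.Conv2d(3, 64, 3, padding=1)          # keeps spatial layout
class_encoder = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 64))
decoder = nn.Conv2d(64 + 64, 3, 3, padding=1)              # mixes content + style

source = torch.rand(1, 3, 128, 128)                        # image of a seen class
target_shots = torch.rand(5, 3, 128, 128)                  # K=5 images of an unseen class

content = content_encoder(source)                          # (1, 64, 128, 128)
style = class_encoder(target_shots).mean(dim=0)            # averaged style code, (64,)
style_map = style.view(1, 64, 1, 1).expand(1, 64, 128, 128)
translated = decoder(torch.cat([content, style_map], dim=1))
print(translated.shape)                                    # torch.Size([1, 3, 128, 128])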

Implementation

Download the dataset and untar the files.

python scripts/download_dataset.py --dataset animal_faces

Build the lmdbs:

for f in train train_all val; do
python scripts/build_lmdb.py \
--config  configs/projects/funit/animal_faces/base64_bs8_class119.yaml \
--data_root dataset/animal_faces_raw/${f} \
--output_root dataset/animal_faces/${f} \
--overwrite
done

Training

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/funit/animal_faces/base64_bs8_class119.yaml \
--logdir logs/projects/funit/animal_faces/base64_bs8_class119.yaml

Download sample test data and run inference:

python scripts/download_test_data.py --model_name funit
python inference.py --single_gpu \
--config configs/projects/funit/animal_faces/base64_bs8_class149.yaml \
--output_dir projects/funit/output/animal_faces


6. COCO-FUNIT(Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder)


COCO-FUNIT was published by Kuniaki Saito and Kate Saenko of Boston University together with Ming-Yu Liu; it is used for generating a photorealistic translation of an input content image into an unseen domain.

It computes the style embedding of the images by leveraging a new module called the constant style bias, and the COCO-FUNIT model shows effectiveness in addressing the content-loss problem. For code and pretrained model references, you can also check out the project page: https://nvlabs.github.io/COCO-FUNIT/
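As a rough illustration of the idea, the style code is conditioned on features of the content image as well as the style image, and a learned constant style bias is added on top. The toy sketch below captures that data flow only; it is not COCO-FUNIT's actual implementation:

# Toy data-flow sketch of a content-conditioned style code with a learned
# constant style bias. Illustration of the idea only, not COCO-FUNIT's code.
import torch
import torch.nn as nn

class ContentConditionedStyle(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.style_feat = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, dim))
        self.content_feat = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, dim))
        self.constant_style_bias = nn.Parameter(torch.zeros(dim))   # learned constant
        self.mix = nn.Linear(2 * dim, dim)

    def forward(self, style_img, content_img):
        s = self.style_feat(style_img)
        c = self.content_feat(content_img)
        return self.mix(torch.cat([s, c], dim=1)) + self.constant_style_bias

style_code = ContentConditionedStyle()(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))
print(style_code.shape)   # torch.Size([1, 64])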

Implementation

Download the dataset; the raw images will be saved in the project/coco_funit/data/training folder. The library comes with a copy of the Animal Faces dataset for quick experiments, so you can use that one.

Build the lmdbs:

for f in train train_all val; do
python -m imaginaire.tools.build_lmdb \
--config  configs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml \
--data_root projects/coco_funit/data/raw/training/animal_faces/${f} \
--output_root projects/coco_funit/data/lmdb/training/animal_faces/${f} \
--overwrite
done

Training

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml \
--logdir logs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml

Inference

The output results are stored in the projects/coco_funit/output/animal_faces folder.

#download test dataset
python scripts/download_test_data.py --model_name coco_funit
python inference.py --single_gpu \
--config configs/projects/coco_funit/animal_faces/base64_bs8_class149.yaml \
--output_dir projects/coco_funit/output/animal_faces


7. vid2vid(Video-to-Video Synthesis)


vid2vid is a video-to-video synthesis model used for turning semantic label maps into realistic videos, synthesizing objects, or generating human motions from poses. It is trained using an NVIDIA DGX-1 with eight V100 32GB GPUs.
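Unlike per-frame translation, vid2vid generates each frame conditioned on the current semantic label map and on previously generated frames, which is what keeps the output temporally coherent. The loop below sketches that dependency with a stand-in network (illustration only, not the actual vid2vid generator):

# Sequential video synthesis: each output frame is conditioned on the current
# semantic label map and on the previous generated frame. Stand-in network for
# illustration only -- not the actual vid2vid generator.
import torch
import torch.nn as nn

generator = nn.Conv2d(3 + 3, 3, kernel_size=3, padding=1)   # (segmap, prev frame) -> frame

seg_maps = torch.rand(10, 3, 128, 128)      # a short sequence of label maps
prev_frame = torch.zeros(1, 3, 128, 128)    # no history before the first frame

frames = []
with torch.no_grad():
    for seg in seg_maps:
        frame = generator(torch.cat([seg.unsqueeze(0), prev_frame], dim=1))
        frames.append(frame)
        prev_frame = frame                   # the new frame becomes the history
video = torch.cat(frames, dim=0)             # (T, C, H, W)
print(video.shape)                           # torch.Size([10, 3, 128, 128])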

Train vid2vid on the Cityscapes dataset

First, download and rearrange the dataset into the following data structure format:

cityscapes
└───images
    └───seq0001
        └───00001.png
        └───00002.png
└───seg_maps
    └───seq0001
        └───00001.png
        └───00002.png

Preprocess the data into LMDB format

python scripts/build_lmdb.py \
--config configs/projects/vid2vid/cityscapes/ampO1.yaml \
--data_root [PATH_TO_DATA] \
--output_root datasets/cityscapes/lmdb/[train | val] \
--paired

Train

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/vid2vid/cityscapes/ampO1.yaml

Inference

python ./scripts/download_test_data.py --model_name vid2vid

Then run inference:

python inference.py --single_gpu \
--config configs/projects/vid2vid/cityscapes/ampO1.yaml \
--output_dir projects/vid2vid/output/cityscapes

Wrapping Up

Indeed, Imaginaire is a multi-purpose library with lots of functionality, from image processing to video translation and generative style transfer. We have seen the introduction and results for the different models (supervised and unsupervised image-to-image translation, and video-to-video translation). There are two more video translation models we didn't discuss in this article:

  • fs-vid2vid (a subject-agnostic mapping that converts a semantic video and an example image into a photorealistic video)
  • wc-vid2vid (improves vid2vid on long-term consistency)

Learn more about the Imaginaire library here.
