Nvidia Research recently released Imaginaire, a PyTorch-based GAN (Generative Adversarial Network) library that brings optimized implementations of several of NVIDIA's image and video synthesis methods together in one place. Let's look at Nvidia's research in this area and walk through the library's implementations.
Table of contents
- Preface
- Imaginaire
- Imaginaire Models
- Installing Imaginaire
- 1. pix2pixHD
- 2. SPADE
- 3. UNIT (Unsupervised Image-to-Image Translation)
- 4. MUNIT (Multimodal Unsupervised Image-to-Image Translation)
- 5. FUNIT (Few-Shot Unsupervised Image-to-Image Translation)
- 6. COCO-FUNIT (Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder)
- 7. vid2vid (Video-to-Video Synthesis)
Preface
Nvidia Research labs
Nvidia Research has been a major success story for Nvidia over the years. The group has partnered with top Fortune companies to leverage the power of AI on Nvidia's extraordinarily powerful chipsets, and its research scope keeps growing. GAN techniques are one of its most active areas of work.
GAN(Generative Adversarial Network)
“A generative adversarial network (GAN) is a class of machine learning frameworks designed by Ian Goodfellow and his colleagues in 2014. Two neural networks contest with each other in a game (in the form of a zero-sum game, where one agent’s gain is another agent’s loss).”
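To make the zero-sum game concrete, here is a minimal, self-contained PyTorch sketch of the adversarial training loop on toy 2-D data. It is purely illustrative and is not code from Imaginaire.

import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2

# Generator maps random noise to fake samples; discriminator scores real vs. fake
generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, data_dim) * 0.5 + 2.0   # toy "real" data distribution
    fake = generator(torch.randn(64, latent_dim))

    # Discriminator step: push real samples toward label 1, generated ones toward 0
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator label the fakes as real
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()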
Some of the fields Nvidia Research is working in are as follows:
- COVID-19
- 3D Deep Learning
- Artificial Intelligence and Machine Learning
- Computational Photography and Imaging
- Computer Vision
- Human-Computer Interaction
- Medical
- Real-Time Rendering
- Robotics
- Virtual and Augmented Reality
Nvidia also hosts an AI podcast with its researchers to share the thoughts and inspiration behind their work; you can listen to it here.
Imaginaire
Imaginaire is a PyTorch-based Generative Adversarial Network (GAN) library that integrates optimized implementations of multiple image and video synthesis projects developed by Nvidia into one package. It is released under the Nvidia Software License.
Imaginaire Models
Imaginaire includes many supervised, unsupervised, image-to-image and video-to-video translation models. All the models are pretrained on an Nvidia DGX-1 machine with eight 32GB V100 GPUs using the PyTorch Docker image v20.03. Let's discuss them one by one:
1. Image-to-Image translation
Image-to-image translation covers vision and graphics problems where the goal is to learn the mapping between an input image and an output image. Typical applications include style transfer, object transfiguration, and photo enhancement.
Nvidia Imaginaire contains six algorithms that support image-to-image translation:
- pix2pixHD
- SPADE
- UNIT
- MUNIT
- FUNIT
- COCO-FUNIT
2. Video-to-Video translation
Video-to-video translation is similar to image-to-image translation, but here the input is a video and the model processes it frame by frame (a minimal sketch of this idea follows the list below). Some of the video-to-video translation models included in the Imaginaire library are as follows:
- vid2vid
- fs-vid2vid
- wc-vid2vid
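As a conceptual illustration of the frame-by-frame idea, the sketch below reads a video, runs a placeholder translate_frame function on every frame, and writes the result back out. It uses OpenCV, translate_frame is a hypothetical stand-in for a loaded model, and this is not the Imaginaire API; the actual vid2vid models also condition on previous frames to keep the output temporally consistent.

import cv2

def translate_frame(frame):
    # Hypothetical placeholder: a real model would synthesize an output
    # frame from the input frame (or its semantic label map) here.
    return frame

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("output.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    out.write(translate_frame(frame))

cap.release()
out.release()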
Installing Imaginaire
Imaginaire is tested on the Ubuntu 16.04 operating system, and the prerequisites needed to run the library are Anaconda3, CUDA 10.2, and cuDNN. Installation is as simple as for most other repositories, but first install the additional packages needed to follow the Imaginaire coding practices:
!pip install flake8
!flake8 --install-hook git
!git config --bool flake8.strict true
Now install Imaginaire from the source
! git clone https://github.com/nvlabs/imaginaire
## changing directory to inside the imaginaire folder
% cd imaginaire
## install using scripts
! bash scripts/install.sh
! bash scripts/test_training.sh
To install Imaginaire on Windows, follow the steps here.
Let's go through the different model implementations of the Imaginaire library one by one:
1. pix2pixHD
pix2pixHD is a high-resolution image synthesis method that supports semantic manipulation with conditional Generative Adversarial Networks (GANs). It was initially implemented by Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, and Andrew Tao of the Nvidia Research team.
According to the official pix2pixHD research paper, “High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs”:
- It was able to synthesize a 2048 × 1024 image from semantic label maps as shown in the above image (upper-left corner in (a))
- (b) It can change labels in the original label map to create new scenes, like replacing trees with buildings.
- (c) This framework also allows a user to edit the appearance of individual objects in the scene, e.g. changing the color of a car or the texture of a road.
pix2pixHD is trained on an NVIDIA DGX-1 with 8 V100 16GB GPUs, which still takes about 10 hours. Don't worry, you can download the PyTorch model pretrained on the Cityscapes dataset from here.
Implementation
pix2pixHD expects a structured dataset folder. Before training, download the Cityscapes dataset from https://www.cityscapes-dataset.com/, then extract the images, segmentation masks, and object instance masks and organize them based on the following data structure.
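A plausible layout is sketched below; the folder names images, seg_maps, and instance_maps are assumptions borrowed from the conventions used elsewhere in Imaginaire (the vid2vid example later in this article uses the same images/seg_maps naming), so check the project documentation for the exact names.

cityscapes
└───images
    └───0001.png
    └───0002.png
└───seg_maps
    └───0001.png
    └───0002.png
└───instance_maps
    └───0001.png
    └───0002.png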
You can check out the previous pix2pixHD repo for extended details here
Training
python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/pix2pixhd/cityscapes/ampO1.yaml
Testing
python scripts/download_test_data.py --model_name pix2pixhd
python inference.py --single_gpu \
--config configs/projects/pix2pixhd/cityscapes/ampO1.yaml \
--output_dir projects/pix2pixhd/output/cityscapes
The results are stored inside the projects/pix2pixhd/output/cityscapes folder.
2. SPADE
SPADE is a semantic image synthesis method originally released by Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu, and now integrated into the Imaginaire library. The SPADE model is trained using an NVIDIA DGX-1 with 8 V100 32GB GPUs, and training took about 2-3 weeks. Its official research paper, “Semantic Image Synthesis with Spatially-Adaptive Normalization”, describes a one-of-a-kind image synthesis technique admired by many researchers. To reproduce the results, you have to download and preprocess the dataset and then train the model.
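The core idea of spatially-adaptive normalization is that the segmentation map itself predicts the per-pixel scale and shift applied to the normalized activations. Below is a rough, simplified PyTorch sketch of one such block, written from the paper's description; it is an illustration of the idea, not Imaginaire's actual module, and the layer sizes are arbitrary.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADEBlock(nn.Module):
    """Simplified spatially-adaptive normalization block (illustrative only)."""
    def __init__(self, feature_channels, label_channels, hidden=128):
        super().__init__()
        # Parameter-free normalization of the incoming feature map
        self.norm = nn.BatchNorm2d(feature_channels, affine=False)
        # The segmentation map predicts per-pixel modulation parameters
        self.shared = nn.Sequential(nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, feature_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feature_channels, 3, padding=1)

    def forward(self, x, segmap):
        # Resize the label map to the feature resolution, then modulate
        segmap = F.interpolate(segmap, size=x.shape[2:], mode="nearest")
        h = self.shared(segmap)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)

# Toy usage: 35 label channels (one-hot classes), 64 feature channels at 32x32
block = SPADEBlock(feature_channels=64, label_channels=35)
out = block(torch.randn(2, 64, 32, 32), torch.randn(2, 35, 64, 64))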
Implementation
Let’s see how we can rebuild the results:
Download the COCO training images and COCO validation images, then extract the images, segmentation masks, and object boundaries for the edge maps. Organize them based on the data structure below.
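The exact folder names are defined in the Imaginaire documentation; a plausible layout (images, seg_maps, and edge_maps are assumed names), matching the dataset/cocostuff_raw/train and dataset/cocostuff_raw/val paths used by the lmdb command below, is:

dataset/cocostuff_raw
└───train
    └───images
    └───seg_maps
    └───edge_maps
└───val
    └───images
    └───seg_maps
    └───edge_maps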
Build lmdbs
for f in train val; do
python scripts/build_lmdb.py \
--config configs/projects/spade/cocostuff/base128_bs4.yaml \
--data_root dataset/cocostuff_raw/${f} \
--output_root dataset/cocostuff/${f} \
--overwrite \
--paired
done
Train:
python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/spade/cocostuff/base128_bs4.yaml \
--logdir logs/projects/spade/cocostuff/base128_bs4.yaml
Test:
python scripts/download_test_data.py --model_name spade
python -m torch.distributed.launch --nproc_per_node=1 inference.py \
--config configs/projects/spade/cocostuff/base128_bs4.yaml \
--output_dir projects/spade/output/cocostuff
- For real-time output, you can visit the Nvidia AI Playground, where you can experience the models interactively: https://www.nvidia.com/en-us/research/ai-playground/
- The previous implementation of SPADE is available here: https://github.com/NVlabs/SPADE
- SPADE Research paper: https://arxiv.org/abs/1903.07291
3. UNIT (Unsupervised Image-to-Image Translation)
This is an improved version of the previous UNIT implementation: https://github.com/mingyuliutw/UNIT. The library supports one-to-one mapping between two visual domains. Some of the major differences in this version are listed below (a short code sketch follows the list):
- Uses spectral normalization in the generator and the discriminator.
- Uses the two-time-scale update rule (TTUR).
- Uses hinge loss instead of least-squares loss.
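The sketch below illustrates roughly what these three ingredients look like in PyTorch. It is a simplified, toy illustration, not Imaginaire's implementation; the generator learning rate and the Adam betas are assumptions.

import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectral normalization wrapped around the layers of both networks (toy MLPs here)
generator = nn.Sequential(
    spectral_norm(nn.Linear(16, 64)), nn.ReLU(),
    spectral_norm(nn.Linear(64, 128)),
)
discriminator = nn.Sequential(
    spectral_norm(nn.Linear(128, 64)), nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(64, 1)),
)

# TTUR: the discriminator is updated with a larger learning rate than the generator
# (the 0.0004 discriminator rate is quoted in the MUNIT notes below; the 0.0001
# generator rate is an assumption)
opt_g = torch.optim.Adam(generator.parameters(), lr=0.0001, betas=(0.0, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=0.0004, betas=(0.0, 0.999))

# Hinge loss instead of least-squares (LSGAN) loss
def d_hinge_loss(real_logits, fake_logits):
    return torch.relu(1.0 - real_logits).mean() + torch.relu(1.0 + fake_logits).mean()

def g_hinge_loss(fake_logits):
    return -fake_logits.mean()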
UNIT is trained using an NVIDIA DGX-1 with 8 V100 32GB GPUs. You can try training with fewer GPUs or a smaller batch size, but training stability and image quality may suffer.
Download a small dataset for training
python scripts/download_test_data.py --model_name unit
Arrange the dataset into the following data structure format:
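For an unpaired two-domain dataset like winter-to-summer, a plausible layout is sketched below; the images_a/images_b naming for the two domains is an assumption, so check the Imaginaire documentation for the exact folder names.

dataset/winter2summer
└───train
    └───images_a
    └───images_b
└───val
    └───images_a
    └───images_b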
Translating images
python -m torch.distributed.launch --nproc_per_node=1 inference.py \
--config configs/projects/unit/winter2summer/base48_bs1.yaml \
--output_dir projects/unit/output/winter2summer
Outputs are saved in projects/unit/output/winter2summer:
4. MUNIT (Multimodal Unsupervised Image-to-Image Translation)
This is an improved implementation of MUNIT; among the improvements are:
- Use hinge loss.
- Use spectral normalization in the generator and the discriminator.
- Use the two-timescale update rule (TTUR) with the discriminator learning rate 0.0004.
- Use a global residual discriminator.
- Does not require pixel-wise correspondence between the two domains (e.g., animal faces).
Implementation
We use the dog and cat images in the Animal Faces dataset (AFHQ), which is available here. Download and extract the data, then run inference with the MUNIT config. The outputs are saved in projects/munit/output/afhq_dog2cat (the path passed to --output_dir).
Previous implementation: https://github.com/NVlabs/MUNIT
Official research paper: https://arxiv.org/abs/1804.04732
5. FUNIT (Few-Shot Unsupervised Image-to-Image Translation)
“The FUNIT framework aims at mapping an image of a source class to an analogous image of an unseen target class by leveraging a few target class images that are made available at test time.”
Implementation
Download the dataset and untar the files.
python scripts/download_dataset.py --dataset animal_faces
Build the lmdbs:
for f in train train_all val; do
python scripts/build_lmdb.py \
--config configs/projects/funit/animal_faces/base64_bs8_class119.yaml \
--data_root dataset/animal_faces_raw/${f} \
--output_root dataset/animal_faces/${f} \
--overwrite
done
Training
python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/funit/animal_faces/base64_bs8_class119.yaml \
--logdir logs/projects/funit/animal_faces/base64_bs8_class119.yaml
Download sample test data by running
python scripts/download_test_data.py --model_name funit
python inference.py --single_gpu \
--config configs/projects/funit/animal_faces/base64_bs8_class149.yaml \
--output_dir projects/funit/output/animal_faces
Learn more about FUNIT:
- FUNIT Project Page
- Research Paper: Few-Shot Unsupervised Image-to-Image Translation
- https://github.com/NVlabs/funit
6. COCO-FUNIT (Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder)
COCO-FUNIT was published by Kuniaki Saito and Kate Saenko of Boston University together with Ming-Yu Liu of Nvidia. It generates a photorealistic translation of an input content image into an unseen domain.
It computes the style embedding of the example images by leveraging a new module called the constant style bias, and the model is effective at addressing the content loss problem. For code and pretrained models, you can also check out the project page: https://nvlabs.github.io/COCO-FUNIT/
Implementation
Download the dataset; the raw images are saved under projects/coco_funit/data/raw/training (the path used by the lmdb command below). The library comes with a copy of the Animal Faces dataset for quick experiments, so you can use that one.
Build the lmdbs
for f in train train_all val; do
python -m imaginaire.tools.build_lmdb \
--config configs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml \
--data_root projects/coco_funit/data/raw/training/animal_faces/${f} \
--output_root projects/coco_funit/data/lmdb/training/animal_faces/${f} \
--overwrite
done
Training
python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml \
--logdir logs/projects/coco_funit/animal_faces/base64_bs8_class119.yaml
Inference
The output results are stored in /projects/coco_funit/output/animal_faces
# download test dataset
python scripts/download_test_data.py --model_name coco_funit
python inference.py --single_gpu \
--config configs/projects/coco_funit/animal_faces/base64_bs8_class149.yaml \
--output_dir projects/coco_funit/output/animal_faces
Read More:
- COCO-FUNIT Project Page
- COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder
7. vid2vid (Video-to-Video Synthesis)
vid2vid is a video translation method used for turning semantic label maps into realistic videos, synthesizing objects, or generating human motion from poses. It is trained using an NVIDIA DGX-1 with 8 V100 32GB GPUs.
Train vid2vid on the Cityscapes dataset
First, download and rearrange the dataset into the following data structure format:
cityscapes
└───images
    └───seq0001
        └───00001.png
        └───00002.png
└───seg_maps
    └───seq0001
        └───00001.png
        └───00002.png
Preprocess the data into LMDB format
python scripts/build_lmdb.py \
--config configs/projects/vid2vid/cityscapes/ampO1.yaml \
--data_root [PATH_TO_DATA] \
--output_root datasets/cityscapes/lmdb/[train | val] \
--paired
Train
python -m torch.distributed.launch --nproc_per_node=8 train.py \
--config configs/projects/vid2vid/cityscapes/ampO1.yaml
Inference
Download the sample test data:
python ./scripts/download_test_data.py --model_name vid2vid
Then run inference:
python inference.py --single_gpu \
--config configs/projects/vid2vid/cityscapes/ampO1.yaml \
--output_dir projects/vid2vid/output/cityscapes
Wrapping Up
Imaginaire is indeed a multi-purpose library, with functionality ranging from image processing to video translation and generative style transfer. We have seen an introduction and results for the different models (supervised and unsupervised image-to-image translation, and video-to-video translation). There are two more video translation models we did not discuss in this article:
- fs-vid2vid (a subject-agnostic mapping that converts a semantic video and an example image into a photorealistic video)
- wc-vid2vid (improves vid2vid's long-term consistency)
Learn more about the Imaginaire library here.