Recently, a team of researchers from UC Berkeley and Adobe Research proposed a new machine learning model, the Swapping Autoencoder, for image manipulation. The key idea of this research is to encode an image into two independent components and then enforce that any swapped combination maps to a realistic image.
Deep generative models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) have gained much traction among researchers over the years. According to the researchers, deep generative models have become a popular technique for producing realistic images from randomly sampled data. However, such models face various challenges when used for controllable manipulation of existing images.
Behind Swapping Autoencoder
Swapping Autoencoder is a new autoencoder-based machine learning model in which a single image is encoded into two distinct latent codes, called the structure code and the texture code. The two codes are specifically designed to represent structure and texture in a disentangled manner.
The architecture of Swapping Autoencoder consists of autoencoding at the top and a swapping operation at the bottom. The structure code is a tensor with spatial dimensions, while the texture code is a 2048-dimensional vector.
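To make the two-code idea concrete, here is a minimal PyTorch sketch of an encoder with a structure branch that keeps spatial dimensions and a texture branch that pools them into a 2048-dimensional vector. The layer counts, channel widths and module names are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of a two-branch encoder (illustrative only; layer counts and
# channel sizes are assumptions, not the paper's exact design).
import torch
import torch.nn as nn

class TwoBranchEncoder(nn.Module):
    def __init__(self, texture_dim=2048):
        super().__init__()
        # Shared convolutional backbone that downsamples the input image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        # Structure branch: keeps a spatial grid, so layout information survives.
        self.to_structure = nn.Conv2d(256, 8, 1)
        # Texture branch: pools away spatial information into a single vector.
        self.to_texture = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, texture_dim)
        )

    def forward(self, img):
        h = self.backbone(img)
        structure = self.to_structure(h)   # spatial tensor, e.g. (B, 8, 32, 32) for 256px input
        texture = self.to_texture(h)       # flat vector, (B, 2048)
        return structure, texture

enc = TwoBranchEncoder()
s, t = enc(torch.randn(2, 3, 256, 256))
print(s.shape, t.shape)  # torch.Size([2, 8, 32, 32]) torch.Size([2, 2048])
```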
The researchers stated that the method is trained with an encoder, which means obtaining the latent codes for a new input image is straightforward rather than cumbersome. This allowed the researchers to manipulate real input images in several ways, such as texture swapping, latent code vector arithmetic, and local and global editing.
During the training phase, the researchers swapped these two codes between pairs of images and enforced that the resulting hybrid images look realistic. Further, to encourage a meaningful disentanglement, the researchers introduced a co-occurrence patch discriminator, which enforces that images sharing the same texture code have the same low-level patch distribution.
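The swapping step can be sketched roughly as follows. The code below assumes a training batch split into pairs, with placeholder `encoder`, `generator`, `discriminator` and `patch_discriminator` networks and a hypothetical `random_patches` helper; it mirrors the idea of the objective, not the paper's released implementation.

```python
# Simplified sketch of one swapping + discriminator step (assumed training
# loop; all network objects are stand-ins, not the authors' code).
import torch
import torch.nn.functional as F

def random_patches(imgs, patch_size=64, n_patches=4):
    # Hypothetical helper: crop a few random patches per image.
    b, _, h, w = imgs.shape
    patches = []
    for _ in range(n_patches):
        y = torch.randint(0, h - patch_size + 1, (1,)).item()
        x = torch.randint(0, w - patch_size + 1, (1,)).item()
        patches.append(imgs[:, :, y:y + patch_size, x:x + patch_size])
    return torch.cat(patches, dim=0)

def swapping_step(encoder, generator, discriminator, patch_discriminator, imgs):
    # Split the batch into pairs: first half supplies structure, second half texture.
    n = imgs.size(0) // 2
    structure, texture = encoder(imgs)
    hybrid = generator(structure[:n], texture[n:])   # swapped combination

    # 1) Realism: the hybrid should fool an ordinary image discriminator.
    realism_loss = F.softplus(-discriminator(hybrid)).mean()

    # 2) Co-occurrence: random patches of the hybrid should match the patch
    #    statistics of the image that supplied the texture code.
    hybrid_patches = random_patches(hybrid)
    reference_patches = random_patches(imgs[n:])
    cooccur_loss = F.softplus(-patch_discriminator(hybrid_patches, reference_patches)).mean()

    return realism_loss + cooccur_loss
```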
Using a perceptual study, the researchers validated that the structure code learns to correspond accurately to the layout or structure of the scene, while the texture code captures properties of the overall appearance, including style.
Dataset Used
To evaluate the results, the researchers used multiple datasets, including LSUN Churches and Bedrooms, Animal Faces HQ (AFHQ) and Flickr-Faces-HQ (FFHQ), all at a resolution of 256px except FFHQ, which is used at 1024px.
In addition, the researchers introduced new datasets: Portrait2FFHQ, which combines 17k portrait paintings from wikiart.org with FFHQ at 256px; Flickr Mountain, 0.5M mountain images from flickr.com; and Waterfall, 90k waterfall images at 256px.
Benefits of This Model
The Swapping Autoencoder learns to disentangle texture from structure for image editing tasks. According to the researchers, the proposed method can efficiently embed a given image into a factored latent space and generate hybrid images by swapping latent codes. Experiments on various datasets showed that the model produces better results, and is substantially more efficient, than recent generative models.
Wrapping Up
According to the researchers, the encoder in the architecture allows the model to swap styles in real time, roughly four orders of magnitude faster than previous unconditional models such as StyleGAN.
The researchers stated that this machine learning model is designed specifically for image manipulation rather than random sampling. The main motivation is to use image swapping as a pretext task for learning an embedding space useful for image manipulation.
Swapping Autoencoder demonstrates three practical applications:
- Synthesising new image hybrids from given example images
- Smooth manipulation of attributes or domain transfer of a given photo by traversing latent “directions” (see the sketch after this list)
- Local manipulation capability
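As an illustration of the second application, the hedged sketch below shifts a texture code along a fixed latent direction and decodes each step. The `encoder` and `generator` objects are placeholders, and how the `direction` vector is obtained (for example, via PCA over texture codes) is an assumption, not the released code.

```python
# Hedged sketch of attribute editing by traversing a latent "direction"
# (hypothetical helper; network objects and the direction vector are assumed).
import torch

def edit_along_direction(encoder, generator, img, direction,
                         strengths=(-2, -1, 0, 1, 2)):
    structure, texture = encoder(img.unsqueeze(0))
    edits = []
    for s in strengths:
        shifted = texture + s * direction      # move the texture code along the direction
        edits.append(generator(structure, shifted))
    return torch.cat(edits, dim=0)             # one edited image per strength
```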