Computer vision is a broad area of deep learning with a wide range of applications, and image generation is one of its most intriguing. Image generation itself covers a large collection of tasks, and in a few of them models can outperform humans. Most image generation tasks also apply to videos, since a video is a sequence of images.

A few popular Image Generation tasks are:

- Image-to-Image translation (e.g. grayscale image to colour image)
- Text-to-Image translation
- Super-resolution
- Photo-to-Cartoon/Emoji translation
- Image inpainting
- Image dataset generation
- Medical Image generation
- Realistic photo generation
- Semantic-to-Photo translation
- Image blending
- Deepfake video generation
- 2D-to-3D image translation

A single deep generative model can often perform more than one of these tasks with a few configuration changes. Well-known image generative models include the Variational Autoencoder (VAE) and the Generative Adversarial Network (GAN), along with their numerous variants.

This article discusses the concepts behind image generation and walks through a code implementation of a Variational Autoencoder with a practical example using TensorFlow Keras. TensorFlow is one of the most widely used frameworks for deep learning. Keras is a high-level deep learning API built on top of TensorFlow.

The following articles may fulfil the prerequisites by giving an understanding of deep learning and computer vision.

- Getting Started With Deep Learning Using TensorFlow Keras
- Getting Started With Computer Vision Using TensorFlow Keras

## How does Image Generation work?

Autoencoder-based generative models such as the VAE and its variants are built from two components: an encoder and a decoder. (A GAN is organised differently, pairing a generator with a discriminator, but its generator plays the same role as the decoder described here.) An encoder is a deep neural network that transforms the high-dimensional input image into a low-dimensional latent vector representation. A decoder is a deep neural network that transforms the low-dimensional latent vector back into a high-dimensional representation, which is the generated image. The encoder and decoder alone make up the traditional Autoencoder (AE). The Variational Autoencoder (VAE) modifies the AE architecture to improve its image generation capabilities. Its encoder maps the input image to the parameters of a Gaussian distribution, namely a mean vector and a variance vector. A sampler then draws a latent vector from the Gaussian defined by that mean and variance. Finally, the decoder generates the synthetic image from this latent vector.
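The sampling step is commonly implemented with the reparameterization trick: instead of sampling the latent vector directly, draw standard normal noise and then shift and scale it with the encoder's outputs. The following is a minimal sketch in plain Python rather than TensorFlow. `sample_latent` is a hypothetical helper, the numbers are made up, and the sketch assumes the encoder emits the log of the variance rather than the variance itself.

```python
import math
import random


def sample_latent(mean, logvar, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, 1), element-wise.

    Assumes `logvar` holds log(sigma^2), so sigma = exp(0.5 * logvar).
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, logvar)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
mean = [0.5, -1.0]      # illustrative encoder outputs for one image
logvar = [0.0, -2.0]
z = sample_latent(mean, logvar, rng)
print(z)  # a 2-D latent vector scattered around `mean`
```

Because the randomness lives entirely in the noise term, the output stays differentiable with respect to the mean and log-variance, which is what lets the encoder be trained by back-propagation.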

Because the encoder compresses a high-dimensional input image into a low-dimensional representation, the decoder must learn to reconstruct a high-dimensional image from only the key features. During training, the model compares the generated image with the input image, calculates the loss and back-propagates it to update the network's weights. Once the model is trained, the encoder is no longer needed: at inference time, the decoder alone generates images from sampled latent vectors. Since the decoder generates the images, it is also called the generator.

## Create the Environment

Create the necessary Python environment by importing the required frameworks, libraries and modules.

```python
import numpy as np
import matplotlib.pyplot as plt  # used for the plots later in the article
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
```

## Load an Image Dataset

We use the Fashion MNIST dataset that ships with Keras Datasets.

```python
fashion_data = keras.datasets.fashion_mnist.load_data()
(x_train, y_train), (x_val, y_val) = fashion_data
x_train.shape, x_val.shape
```

Output: `((60000, 28, 28), (10000, 28, 28))`

There are 60000 images in the train set and 10000 images in the validation set. Each image is a grayscale (1 channel) image of shape 28 by 28. Image generation using a VAE follows a self-supervised approach: each input image serves as its own target, so no class labels are needed. Therefore, we may delete `y_train` and `y_val` to save memory.

```python
del y_train, y_val
```

Visualize an example from the downloaded image data to get a better insight.

```python
plt.imshow(x_train[10])
plt.colorbar()
plt.show()
```


The pixel values range from 0 to 255, so we need to scale them to [0, 1]. Further, convolutional layers expect each image to be three-dimensional (height, width, channels), whereas the available images are two-dimensional. Because this example does not evaluate on held-out data, we can also merge the available training and validation sets to get a larger training set.

```python
# merge the two datasets
data = tf.concat([x_train, x_val], axis=0)
# add a channel axis: (28, 28) -> (28, 28, 1)
data = tf.expand_dims(data, -1)
# scale the images to [0, 1]
data = tf.cast(data, tf.float32)
data = data / 255.0
```

## Build the VAE Architecture

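First, define a sampling layer that draws the latent vector `z` from the mean and log-variance produced by the encoder.

```python
class Sampling(layers.Layer):
    """Draws z = mean + sigma * eps, where eps ~ N(0, I)."""

    def call(self, inputs):
        mean, logvar = inputs
        batch = tf.shape(mean)[0]
        dim = tf.shape(mean)[1]
        # tf.random.normal replaces keras.backend.random_normal,
        # which is not available in newer Keras releases
        eps = tf.random.normal(shape=(batch, dim))
        # logvar is log(sigma^2), so sigma = exp(0.5 * logvar)
        return mean + tf.exp(0.5 * logvar) * eps
```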

Build an encoder that takes an image as input and outputs the mean, the log-variance, and a latent vector sampled from them.

```python
encoder_inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, 3, activation="relu", strides=2, padding="same")(encoder_inputs)
x = layers.Conv2D(64, 3, activation="relu", strides=2, padding="same")(x)
x = layers.Flatten()(x)
x = layers.Dense(16, activation="relu")(x)
mean = layers.Dense(2, name="z_mean")(x)
logvar = layers.Dense(2, name="z_log_var")(x)
z = Sampling()([mean, logvar])
encoder = keras.Model(encoder_inputs, [mean, logvar, z], name="encoder")
encoder.summary()
```


Plotting the model is a good way to verify the shapes and the flow of data.

```python
keras.utils.plot_model(encoder, show_shapes=True, dpi=64)
```

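The shapes and parameter counts reported for the encoder can also be checked by hand. The sketch below uses plain Python and the standard formulas for convolution and dense layers; the helper names are hypothetical, and it assumes that a strided convolution with `padding="same"` outputs `ceil(size / stride)` along each spatial axis.

```python
import math


def conv_same_out(size, stride):
    # With padding="same", a strided convolution outputs ceil(size / stride).
    return math.ceil(size / stride)


def conv_params(kernel, in_ch, out_ch):
    # kernel * kernel weights per (input, output) channel pair, plus one bias per filter
    return kernel * kernel * in_ch * out_ch + out_ch


def dense_params(in_units, out_units):
    # one weight per (input, output) pair, plus one bias per output unit
    return in_units * out_units + out_units


h1 = conv_same_out(28, 2)   # after Conv2D(32, 3, strides=2)
h2 = conv_same_out(h1, 2)   # after Conv2D(64, 3, strides=2)
flat = h2 * h2 * 64         # after Flatten()

encoder_params = (
    conv_params(3, 1, 32)
    + conv_params(3, 32, 64)
    + dense_params(flat, 16)
    + 2 * dense_params(16, 2)   # z_mean and z_log_var heads
)
print(h1, h2, flat, encoder_params)  # 14 7 3136 69076
```

The `Sampling` layer contributes no trainable parameters, so the total here should match the trainable parameter count in the encoder summary.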

Build a decoder that takes a latent vector of the kind produced by the encoder, applies transposed convolutions, and produces a synthetic image of size 28 by 28.

```python
latent_inputs = keras.Input(shape=(2,))
x = layers.Dense(7 * 7 * 64, activation="relu")(latent_inputs)
# form 7 by 7 feature map
x = layers.Reshape((7, 7, 64))(x)
# form 14 by 14 feature map
x = layers.Conv2DTranspose(64, 3, activation="relu", strides=2, padding="same")(x)
# form 28 by 28 feature map
x = layers.Conv2DTranspose(32, 3, activation="relu", strides=2, padding="same")(x)
# form the sigmoid output - single image
decoder_outputs = layers.Conv2DTranspose(1, 3, activation="sigmoid", padding="same")(x)
decoder = keras.Model(latent_inputs, decoder_outputs, name="decoder")
decoder.summary()
```


Plot the decoder to get a better understanding.

```python
keras.utils.plot_model(decoder, show_shapes=True, dpi=64)
```

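The decoder mirrors the encoder's downsampling, and the spatial sizes can be traced with a few lines of plain Python. This is a sketch that assumes a transposed convolution with `padding="same"` multiplies the spatial size by its stride; `conv_transpose_same_out` is a hypothetical helper.

```python
def conv_transpose_same_out(size, stride):
    # With padding="same", a transposed convolution outputs size * stride.
    return size * stride


size = 7            # spatial size after Reshape((7, 7, 64))
trace = [size]
for stride in (2, 2, 1):   # strides of the three Conv2DTranspose layers
    size = conv_transpose_same_out(size, stride)
    trace.append(size)

print(trace)  # [7, 14, 28, 28]
```

The last layer uses the default stride of 1, so it keeps the 28 by 28 size and only reduces the channel count to 1.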

Let's formulate the training procedure by customizing the losses and metrics as required by the original research paper. The total loss is the sum of two terms: a reconstruction loss, which is the binary cross-entropy between the original input image and the reconstructed (generated) image, and a KL divergence loss that keeps the encoder's Gaussian close to a standard normal distribution.
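Written out, the per-image objective is commonly expressed as follows. The notation is an assumption introduced here for illustration: $x_p$ is the value of pixel $p$ in the input image, $\hat{x}_p$ the corresponding pixel of the reconstruction, $\mu_j$ and $\sigma_j^2$ the encoder's mean and variance for latent dimension $j$, and $d$ the latent size (2 in this article).

$$
\mathcal{L}(x) =
\underbrace{-\sum_{p}\left[x_p \log \hat{x}_p + (1 - x_p)\log\left(1 - \hat{x}_p\right)\right]}_{\text{reconstruction loss (binary cross-entropy)}}
\;+\;
\underbrace{-\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)}_{\text{KL divergence from } \mathcal{N}(0, I)}
$$

In the training code, `logvar` stands for $\log\sigma_j^2$, so $\sigma_j^2$ is computed as `tf.exp(logvar)`, and the loss is averaged over the images in a batch.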

## Training the Model

```python
class VAE(keras.Model):
    def __init__(self, encoder, decoder, **kwargs):
        super(VAE, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
        self.reconstruction_loss_tracker = keras.metrics.Mean(
            name="reconstruction_loss"
        )
        self.kl_loss_tracker = keras.metrics.Mean(name="kl_loss")

    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reconstruction_loss_tracker,
            self.kl_loss_tracker,
        ]

    def train_step(self, data):
        with tf.GradientTape() as tape:
            mean, logvar, z = self.encoder(data)
            reconstruction = self.decoder(z)
            # per-pixel binary cross-entropy, summed over each image
            reconstruction_loss = tf.reduce_mean(
                tf.reduce_sum(
                    keras.losses.binary_crossentropy(data, reconstruction), axis=(1, 2)
                )
            )
            # KL divergence between the encoder's Gaussian and N(0, I)
            kl_loss = -0.5 * (1 + logvar - tf.square(mean) - tf.exp(logvar))
            kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
            total_loss = reconstruction_loss + kl_loss
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        return {
            "loss": self.total_loss_tracker.result(),
            "reconstruction_loss": self.reconstruction_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }
```
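To build intuition for the `kl_loss` term, the same expression can be evaluated for a single latent vector in plain Python. This is an illustrative sketch with a hypothetical helper name and made-up inputs, not part of the training code.

```python
import math


def kl_to_standard_normal(mean, logvar):
    """KL divergence of N(mean, exp(logvar)) from N(0, I) for one latent vector."""
    return sum(
        -0.5 * (1.0 + lv - m * m - math.exp(lv))
        for m, lv in zip(mean, logvar)
    )


# Zero when the encoder's Gaussian already equals the standard normal.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))    # 0.0
# Grows as the mean moves away from the origin: 0.5 * (1 + 1).
print(kl_to_standard_normal([1.0, -1.0], [0.0, 0.0]))   # 1.0
# Also grows when the variance collapses towards zero.
print(kl_to_standard_normal([0.0, 0.0], [-4.0, -4.0]))
```

The term therefore penalises encodings that drift far from the origin or become overly narrow, which keeps the latent space densely populated around the region later sampled for generation.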

We have built our model and defined the losses and metrics required to train it. We can compile the model with Adam optimizer and train it over 30 epochs with a batch size of 128.

```python
vae = VAE(encoder, decoder)
vae.compile(optimizer=keras.optimizers.Adam())
history = vae.fit(data, epochs=30, batch_size=128)
```


## Sample Image Generation

The model is now trained and ready to generate images that look close to the original images. To generate them, we feed latent vectors to the decoder; here we sweep a regular grid of points in the two-dimensional latent space.

```python
def plot_latent_space(vae, n=16, figsize=8):
    # display an n*n 2D manifold of fashion items
    digit_size = 28
    scale = 1.0
    figure = np.zeros((digit_size * n, digit_size * n))
    # linearly spaced coordinates corresponding to the 2D plot
    # of fashion classes in the latent space
    grid_x = np.linspace(-scale, scale, n)
    grid_y = np.linspace(-scale, scale, n)[::-1]

    for i, yi in enumerate(grid_y):
        for j, xi in enumerate(grid_x):
            # each grid point is one 2D latent vector z = (z[0], z[1])
            z_sample = np.array([[xi, yi]])
            x_decoded = vae.decoder.predict(z_sample)
            digit = x_decoded[0].reshape(digit_size, digit_size)
            figure[
                i * digit_size : (i + 1) * digit_size,
                j * digit_size : (j + 1) * digit_size,
            ] = digit

    plt.figure(figsize=(figsize, figsize))
    start_range = digit_size // 2
    end_range = n * digit_size + start_range
    pixel_range = np.arange(start_range, end_range, digit_size)
    sample_range_x = np.round(grid_x, 1)
    sample_range_y = np.round(grid_y, 1)
    plt.xticks(pixel_range, sample_range_x)
    plt.yticks(pixel_range, sample_range_y)
    # the axes are the two latent dimensions, not the mean and variance
    plt.xlabel("z[0]")
    plt.ylabel("z[1]")
    plt.imshow(figure, cmap="jet")
    plt.show()


plot_latent_space(vae)
```


We can interpret the generated grid as follows. Each image is decoded from one point (z[0], z[1]) in the two-dimensional latent space. Holding z[1] fixed and varying z[0] produces different images, and so does varying z[1] against a fixed z[0]. Thus, the generated image is controlled by where in the latent space we sample.
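Besides sweeping a regular grid, latent vectors can be drawn at random from a standard normal distribution, which is the distribution the KL term pulls the encodings towards. The sketch below only produces the latent vectors, using plain Python; `sample_prior` is a hypothetical helper, and the decoding step is shown as a comment because it needs the trained model from this article.

```python
import random


def sample_prior(n, latent_dim=2, seed=None):
    """Return n latent vectors drawn from a standard normal prior."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(n)]


latents = sample_prior(n=5, seed=42)
for z in latents:
    print([round(v, 3) for v in z])

# With the trained model, the vectors could then be decoded into images:
#   images = vae.decoder.predict(np.array(latents))  # shape (5, 28, 28, 1)
```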

## Performance Analysis of VAE

Plotting losses will give a better understanding of training performance.

```python
loss = history.history['loss']
# plot loss from the 4th epoch onwards
index = np.arange(3, 30)
plt.plot(index, loss[3:], 'o-r')
plt.xticks(np.arange(3, 30, 2))
plt.xlabel('Epochs')
plt.ylabel('Total Loss')
plt.show()
```


The loss keeps decreasing up to the end of the 30th epoch, which suggests that training for more epochs may yield better performance.
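One simple way to judge whether further epochs are worthwhile is to measure how much the loss dropped over the last few epochs. The helper below is a hypothetical sketch, and the loss values are made up for illustration; in this article's setting the list would be `history.history['loss']`.

```python
def relative_improvement(losses, window=5):
    """Fractional drop in loss over the last `window` epochs."""
    if len(losses) <= window:
        raise ValueError("need more than `window` epochs of history")
    earlier, latest = losses[-window - 1], losses[-1]
    return (earlier - latest) / abs(earlier)


# Made-up loss values for illustration only.
example_losses = [300.0, 280.0, 270.0, 265.0, 262.0, 260.0, 259.0, 258.5]
gain = relative_improvement(example_losses, window=3)
print(f"{gain:.4f}")
```

A value that is still clearly above zero indicates that the loss has not plateaued yet. Which threshold counts as a plateau is a judgment call that depends on the task.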

This notebook contains the above code implementation.

## Wrapping Up

This article discussed image generation, its various applications, and the well-known generative models. In particular, we explored the Variational Autoencoder (VAE) architecture, built it with TensorFlow Keras, trained it on Fashion MNIST data, and generated images by sampling the latent space. Interested readers may try this implementation with different image data or with deeper encoder and decoder architectures (i.e., with more convolution layers and transpose convolution layers, respectively).