Getting Started With Image Generation Using TensorFlow Keras


Computer vision is a broad area of deep learning with a wide range of applications, and image generation is one of its most intriguing. Image generation itself covers a large collection of tasks, and on a few of them models can outperform humans. Most image generation tasks also apply to videos, since a video is a sequence of images.

A few popular Image Generation tasks are:

  1. Image-to-Image translation (e.g. grayscale image to colour image)
  2. Text-to-Image translation
  3. Super-resolution
  4. Photo-to-Cartoon/Emoji translation
  5. Image inpainting
  6. Image dataset generation
  7. Medical Image generation
  8. Realistic photo generation
  9. Semantic-to-Photo translation
  10. Image blending
  11. Deepfake video generation 
  12. 2D-to-3D image translation

A single deep generative model can often perform more than one of these tasks with a few configuration changes. Well-known image generative models include the Variational Autoencoder (VAE) and the Generative Adversarial Network (GAN), along with their numerous variants.

This article discusses the concepts behind image generation and implements a Variational Autoencoder with a practical example using TensorFlow Keras. TensorFlow is one of the most widely used frameworks for deep learning. Keras is a high-level API, built on top of TensorFlow, that is designed specifically for deep learning.

The following articles may fulfil the prerequisites by giving an understanding of deep learning and computer vision.

  1. Getting Started With Deep Learning Using TensorFlow Keras
  2. Getting Started With Computer Vision Using TensorFlow Keras

How does Image Generation work?

Autoencoder-style generative models such as the VAE are built from two networks, an encoder and a decoder (in a GAN, the generator plays the role of the decoder). An encoder is a deep neural network that transforms the high-dimensional input image into a low-dimensional latent vector representation. A decoder is a deep neural network that transforms the low-dimensional latent vector representation back into a high-dimensional representation, which is called the generated image. An encoder and a decoder alone make up the traditional Autoencoder (AE).

The Variational Autoencoder (VAE) modifies the AE architecture to improve its image generation capabilities. The encoder encodes the input image into a Gaussian distribution described by a mean vector and a variance vector. A sampler draws a latent vector from the distribution defined by this mean and variance. The decoder then generates the synthetic image from this latent vector.

An Overview of the VAE Architecture

Since the encoder compresses a high-dimensional input image into a low-dimensional representation, the decoder is trained to regenerate a high-dimensional image from that compact representation. During training, the model compares the generated image with the input image, calculates the loss, and back-propagates it to update the network's weights. Once the model is trained, the encoder is no longer needed for generation: at inference time, latent vectors are sampled directly and fed to the decoder, which generates the images. Since the decoder is the part that generates the images, it is also called the generator.
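The data flow described above can be sketched with plain NumPy. This is only an illustration under stated assumptions: the weights are random rather than trained, the encoder and decoder are single linear maps instead of convolutional networks, and the names `encode`, `sample`, `decode`, `IMAGE_DIM` and `LATENT_DIM` are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

IMAGE_DIM = 28 * 28   # a flattened 28x28 grayscale image
LATENT_DIM = 2        # size of the latent vector (assumed for illustration)

# Random, untrained weights standing in for the real networks.
W_mean = rng.normal(scale=0.01, size=(IMAGE_DIM, LATENT_DIM))
W_logvar = rng.normal(scale=0.01, size=(IMAGE_DIM, LATENT_DIM))
W_dec = rng.normal(scale=0.01, size=(LATENT_DIM, IMAGE_DIM))


def encode(images):
    """Map flattened images to the mean and log-variance of a Gaussian."""
    return images @ W_mean, images @ W_logvar


def sample(mean, logvar):
    """Draw a latent vector from N(mean, exp(logvar))."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * logvar) * eps


def decode(z):
    """Map latent vectors back to pixel intensities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))


# Training-time path: image -> encoder -> sampler -> decoder.
images = rng.random((4, IMAGE_DIM))
mean, logvar = encode(images)
reconstruction = decode(sample(mean, logvar))

# Inference-time path: only the decoder is needed.
generated = decode(rng.standard_normal((4, LATENT_DIM)))

print(mean.shape, reconstruction.shape, generated.shape)  # (4, 2) (4, 784) (4, 784)
```

The point of the sketch is the asymmetry between the two paths: training uses all three pieces, while generation only calls `decode` on sampled latent vectors.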

Create the Environment 

Create the necessary Python environment by importing the required frameworks, libraries and modules.

 import numpy as np
 import matplotlib.pyplot as plt  # used later for plotting
 import tensorflow as tf
 from tensorflow import keras
 from tensorflow.keras import layers

Load an Image Dataset

We use the Fashion MNIST dataset, which ships with Keras Datasets.

 fashion_data = keras.datasets.fashion_mnist.load_data()
 (x_train,y_train),(x_val,y_val)= fashion_data 
 x_train.shape, x_val.shape 


Image Generation task - Fashion MNIST

There are 60,000 images in the training set and 10,000 images in the validation set. Each image is a grayscale (single-channel) image of 28 by 28 pixels. Image generation with a VAE follows a self-supervised approach: the input image itself serves as the target, so the class labels are not needed. We may therefore delete the y_train and y_val data to save memory.

del y_train, y_val

Visualize an example from the downloaded image data to get a better insight.
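A minimal sketch of such an inspection is given below. It is an assumption about what the visualization might look like, not the original code: `summarize_image` and `show_image` are hypothetical helper names, and a synthetic array stands in for `x_train[0]` so that the snippet runs on its own.

```python
import numpy as np


def summarize_image(image):
    """Return shape, dtype, and pixel range of a single image array."""
    return {
        "shape": image.shape,
        "dtype": str(image.dtype),
        "min": int(image.min()),
        "max": int(image.max()),
    }


def show_image(image):
    """Display one grayscale image (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    plt.imshow(image, cmap="gray")
    plt.colorbar()
    plt.axis("off")
    plt.show()


# Stand-in for x_train[0]: a synthetic 28x28 image with 8-bit pixel values.
example = (np.arange(28 * 28) % 256).astype(np.uint8).reshape(28, 28)
print(summarize_image(example))
# With the real data loaded: summarize_image(x_train[0]); show_image(x_train[0])
```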



It can be observed that the pixel values range from 0 to 255, so we need to scale them to the range [0, 1]. Further, convolutional layers expect each image to be three-dimensional (height, width, channels), whereas the available images are two-dimensional. Since this example uses no labels and does not evaluate on held-out data, we can merge the available training and validation sets to get a relatively large dataset for training.

 # Merge two datasets
 data = tf.concat([x_train, x_val], axis=0)
 # images from 2D to 3D
 data = tf.expand_dims(data, -1)
 # scale the images to [0,1]
 data = tf.cast(data, tf.float32)
 data = data / 255.0 

Build the VAE Architecture

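First, define a sampling layer that draws a latent vector from the Gaussian described by the encoder's mean and log-variance outputs.

 class Sampling(layers.Layer):
     """Draws a latent vector z from N(mean, exp(logvar))."""
     def call(self, inputs):
         mean, logvar = inputs
         batch = tf.shape(mean)[0]
         dim = tf.shape(mean)[1]
         # standard normal noise, one value per latent dimension
         eps = tf.random.normal(shape=(batch, dim))
         # exp(0.5 * logvar) is the standard deviation
         return mean + tf.exp(0.5 * logvar) * eps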
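The expression `mean + exp(0.5 * logvar) * eps` is commonly known as the reparameterization trick: the randomness comes only from `eps`, so the sample stays a differentiable function of the mean and log-variance. The NumPy sketch below is not part of the model; it only checks numerically that the expression yields samples with the requested mean and variance. The function name `reparameterize` and the example values are assumptions made for this illustration.

```python
import numpy as np


def reparameterize(mean, logvar, rng):
    """Sample z ~ N(mean, exp(logvar)) as a deterministic function of noise."""
    eps = rng.standard_normal(mean.shape)   # noise from N(0, 1)
    std = np.exp(0.5 * logvar)              # log-variance -> standard deviation
    return mean + std * eps


rng = np.random.default_rng(42)

# 100,000 copies of the same 2-D Gaussian: mean (1, -2), variance (0.25, 4).
mean = np.tile([1.0, -2.0], (100_000, 1))
logvar = np.log(np.tile([0.25, 4.0], (100_000, 1)))

z = reparameterize(mean, logvar, rng)
print(z.mean(axis=0))  # close to [1, -2]
print(z.var(axis=0))   # close to [0.25, 4]
```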

Build an encoder that takes an image as input and outputs the mean, the log-variance, and a latent vector sampled from them.

 encoder_inputs = keras.Input(shape=(28, 28, 1))
 x = layers.Conv2D(32, 3, activation="relu", strides=2, padding="same")(encoder_inputs)
 x = layers.Conv2D(64, 3, activation="relu", strides=2, padding="same")(x)
 x = layers.Flatten()(x)
 x = layers.Dense(16, activation="relu")(x)
 mean = layers.Dense(2, name="z_mean")(x)
 logvar = layers.Dense(2, name="z_log_var")(x)
 z = Sampling()([mean, logvar])
 encoder = keras.Model(encoder_inputs, [mean, logvar, z], name="encoder")
 encoder.summary()


Encoder summary

Plotting the model is a convenient way to verify the layer shapes and the data flow.

keras.utils.plot_model(encoder, show_shapes=True, dpi=64)


VAE encoder
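The encoder's shapes and parameter count can also be worked out by hand. The sketch below uses the standard rules for `padding="same"` convolutions (output size is the input size divided by the stride, rounded up) and for counting weights and biases; the helper names are hypothetical and the layer sizes are taken from the encoder code above.

```python
import math


def conv_same_output(size, stride):
    """Spatial output size of a convolution with padding="same"."""
    return math.ceil(size / stride)


def conv_params(kernel, in_channels, filters):
    """Weights plus biases of a 2-D convolution with a square kernel."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weights plus biases of a fully connected layer."""
    return in_units * out_units + out_units


size = 28
size = conv_same_output(size, stride=2)   # after Conv2D(32, 3, strides=2)
size = conv_same_output(size, stride=2)   # after Conv2D(64, 3, strides=2)
flat = size * size * 64                   # after Flatten

total = (
    conv_params(3, 1, 32)
    + conv_params(3, 32, 64)
    + dense_params(flat, 16)
    + 2 * dense_params(16, 2)             # z_mean and z_log_var
)
print(size, flat, total)  # 7 3136 69076
```

Most of the parameters sit in the first dense layer, which connects the 3136 flattened features to 16 units.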

Build a decoder that takes a latent vector as input, applies transposed convolutions, and produces a synthetic image of size 28 by 28.

 latent_inputs = keras.Input(shape=(2,))
 x = layers.Dense(7 * 7 * 64, activation="relu")(latent_inputs)
 # form 7 by 7 feature map
 x = layers.Reshape((7, 7, 64))(x)
 # form 14 by 14 feature map
 x = layers.Conv2DTranspose(64, 3, activation="relu", strides=2, padding="same")(x)
 # form 28 by 28 feature map
 x = layers.Conv2DTranspose(32, 3, activation="relu", strides=2, padding="same")(x)
 # form the sigmoid output - single image
 decoder_outputs = layers.Conv2DTranspose(1, 3, activation="sigmoid", padding="same")(x)
 decoder = keras.Model(latent_inputs, decoder_outputs, name="decoder")
 decoder.summary()


decoder summary

Plot the decoder to get a better understanding.

keras.utils.plot_model(decoder, show_shapes=True, dpi=64)



VAE decoder
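The upsampling path of the decoder can be traced with simple arithmetic: a transposed convolution with `padding="same"` multiplies the spatial size by its stride. The short sketch below follows the decoder layers in order; the helper name is hypothetical.

```python
def conv_transpose_same_output(size, stride):
    """Spatial output size of a transposed convolution with padding="same"."""
    return size * stride


sizes = [7]                                             # after Reshape((7, 7, 64))
sizes.append(conv_transpose_same_output(sizes[-1], 2))  # Conv2DTranspose(64, 3, strides=2)
sizes.append(conv_transpose_same_output(sizes[-1], 2))  # Conv2DTranspose(32, 3, strides=2)
sizes.append(conv_transpose_same_output(sizes[-1], 1))  # Conv2DTranspose(1, 3), default stride 1
print(sizes)  # [7, 14, 28, 28]
```

The final layer keeps the 28 by 28 size and reduces the channels to one, so the output has the same shape as a Fashion MNIST image.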

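Let's formulate the training procedure by customizing the losses and metrics, following the original research paper. The total loss is the sum of two terms: the reconstruction loss, which is the binary cross-entropy between the original input image and the reconstructed (generated) image, and the Kullback-Leibler (KL) divergence loss, which keeps the encoded Gaussian close to a standard normal distribution.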
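A VAE of this kind is typically trained on a reconstruction term plus a KL divergence term between the encoded Gaussian and a standard normal prior. The NumPy sketch below computes both on a toy batch. It is an illustration under assumptions, not the training code: the function names and the toy values are made up, and the reconstruction term sums over all pixels, which matches summing a per-pixel binary cross-entropy over height and width for single-channel images.

```python
import numpy as np


def reconstruction_loss(x, x_hat, eps=1e-7):
    """Binary cross-entropy summed over pixels, averaged over the batch."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)          # avoid log(0)
    bce = -(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
    return bce.reshape(len(x), -1).sum(axis=1).mean()


def kl_loss(mean, logvar):
    """KL divergence from N(mean, exp(logvar)) to N(0, 1), batch average."""
    kl = -0.5 * (1.0 + logvar - np.square(mean) - np.exp(logvar))
    return kl.sum(axis=1).mean()


# Toy batch: two 2x2 "images" and a 2-D latent space.
x = np.array([[[0.0, 1.0], [1.0, 0.0]], [[1.0, 1.0], [0.0, 0.0]]])
x_hat = np.full_like(x, 0.5)                        # an uninformative reconstruction
mean = np.array([[1.0, 0.0], [1.0, 0.0]])
logvar = np.zeros((2, 2))

print(reconstruction_loss(x, x_hat))  # 4 * ln(2), about 2.7726
print(kl_loss(mean, logvar))          # 0.5
```

The KL term vanishes only when the encoded mean is zero and the log-variance is zero, that is, when the encoded distribution equals the standard normal prior.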

Training the Model

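 class VAE(keras.Model):
     def __init__(self, encoder, decoder, **kwargs):
         super(VAE, self).__init__(**kwargs)
         self.encoder = encoder
         self.decoder = decoder
         # running averages of each loss, reported after every batch
         self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
         self.reconstruction_loss_tracker = keras.metrics.Mean(
             name="reconstruction_loss"
         )
         self.kl_loss_tracker = keras.metrics.Mean(name="kl_loss")

     @property
     def metrics(self):
         # listing the trackers here lets Keras reset them every epoch
         return [
             self.total_loss_tracker,
             self.reconstruction_loss_tracker,
             self.kl_loss_tracker,
         ]

     def train_step(self, data):
         with tf.GradientTape() as tape:
             mean, logvar, z = self.encoder(data)
             reconstruction = self.decoder(z)
             # per-pixel cross-entropy, summed over each image, averaged over the batch
             reconstruction_loss = tf.reduce_mean(
                 tf.reduce_sum(
                     keras.losses.binary_crossentropy(data, reconstruction), axis=(1, 2)
                 )
             )
             # KL divergence between N(mean, exp(logvar)) and N(0, 1)
             kl_loss = -0.5 * (1 + logvar - tf.square(mean) - tf.exp(logvar))
             kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
             total_loss = reconstruction_loss + kl_loss
         grads = tape.gradient(total_loss, self.trainable_weights)
         self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
         self.total_loss_tracker.update_state(total_loss)
         self.reconstruction_loss_tracker.update_state(reconstruction_loss)
         self.kl_loss_tracker.update_state(kl_loss)
         return {
             "loss": self.total_loss_tracker.result(),
             "reconstruction_loss": self.reconstruction_loss_tracker.result(),
             "kl_loss": self.kl_loss_tracker.result(),
         }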

We have built our model and defined the losses and metrics required to train it. We can now compile the model with the Adam optimizer and train it for 30 epochs with a batch size of 128.

 vae = VAE(encoder, decoder)
 vae.compile(optimizer=keras.optimizers.Adam())
 history = vae.fit(data, epochs=30, batch_size=128)
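As a quick sanity check on the training log, the number of batches per epoch follows from the dataset size and the batch size. The arithmetic below assumes the merged dataset of 70,000 images described earlier.

```python
import math

num_images = 60_000 + 10_000   # merged training and validation sets
batch_size = 128

steps_per_epoch = math.ceil(num_images / batch_size)
last_batch = num_images - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch)  # 547 112
```

Each epoch therefore runs 547 steps, the last of which contains only 112 images.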

A portion of the output:

VAE training

Sample Image Generation

The model is now trained on the input data and is ready to generate images that look close to the originals. To generate images, we pick points in the two-dimensional latent space and pass them to the decoder. The function below decodes a regular grid of such points.

 def plot_latent_space(vae, n=16, figsize=8):
     # display an n*n 2D manifold of fashion images
     digit_size = 28
     scale = 1.0
     figure = np.zeros((digit_size * n, digit_size * n))
     # linearly spaced coordinates corresponding to the 2D plot
     # of image classes in the latent space
     grid_x = np.linspace(-scale, scale, n)
     grid_y = np.linspace(-scale, scale, n)[::-1]
     for i, yi in enumerate(grid_y):
         for j, xi in enumerate(grid_x):
             z_sample = np.array([[xi, yi]])
             x_decoded = vae.decoder.predict(z_sample)
             digit = x_decoded[0].reshape(digit_size, digit_size)
             # place the decoded image in its cell of the grid
             figure[
                 i * digit_size : (i + 1) * digit_size,
                 j * digit_size : (j + 1) * digit_size,
             ] = digit
     plt.figure(figsize=(figsize, figsize))
     start_range = digit_size // 2
     end_range = n * digit_size + start_range
     pixel_range = np.arange(start_range, end_range, digit_size)
     sample_range_x = np.round(grid_x, 1)
     sample_range_y = np.round(grid_y, 1)
     plt.xticks(pixel_range, sample_range_x)
     plt.yticks(pixel_range, sample_range_y)
     # the two axes are the two coordinates of the latent vector z
     plt.xlabel("z[0]")
     plt.ylabel("z[1]")
     plt.imshow(figure, cmap="jet")
     plt.show()

 plot_latent_space(vae)


image generation results

We can interpret the above generation as follows. Each cell of the grid is the image decoded from one point in the two-dimensional latent space. Moving along one axis while holding the other coordinate fixed changes a single latent coordinate and gradually changes the generated image. Thus, image generation is controlled by the point sampled from the latent space.
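Besides a regular grid, new images can be produced by decoding latent vectors drawn at random from the standard normal prior. The sketch below shows the idea with a stand-in decoder so that it runs without a trained model; `generate_images` and `toy_decoder` are hypothetical names introduced only for this illustration.

```python
import numpy as np


def generate_images(decode, num_images, latent_dim=2, seed=0):
    """Decode latent vectors drawn from the standard normal prior."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((num_images, latent_dim))
    return z, decode(z)


def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: returns 28x28x1 arrays."""
    brightness = 1.0 / (1.0 + np.exp(-z[:, 0]))          # depends on z[0] only
    return np.ones((len(z), 28, 28, 1)) * brightness[:, None, None, None]


z, images = generate_images(toy_decoder, num_images=5)
print(z.shape, images.shape)  # (5, 2) (5, 28, 28, 1)
```

With the trained model from this article, the stand-in could be replaced by a callable that wraps the real decoder, for example `lambda z: vae.decoder.predict(z)`.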

Performance Analysis of VAE

Plotting losses will give a better understanding of training performance.

 loss = history.history['loss']
 # plot loss from the 4th epoch onwards
 index = np.arange(3, 30)
 plt.plot(index, loss[3:], 'o-r')
 plt.xticks(np.arange(3, 30, 2))
 plt.xlabel('Epoch')
 plt.ylabel('Total Loss')
 plt.show()


VAE performance

The loss is still decreasing at the end of the 30th epoch, which suggests that training for more epochs would likely improve performance.
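One simple way to judge whether further training is worthwhile is to measure how much the loss dropped over the last few epochs. The helper below is a rough heuristic, not an established criterion; its name, the window size, and the loss values are all made up for illustration.

```python
def relative_improvement(losses, window=5):
    """Fractional drop in loss over the last `window` epochs."""
    if len(losses) <= window:
        raise ValueError("need more epochs than the window size")
    old, new = losses[-window - 1], losses[-1]
    return (old - new) / old


# Made-up loss values for illustration only; use history.history["loss"] instead.
example_losses = [300.0, 280.0, 270.0, 265.0, 262.0, 260.0, 259.0, 258.5]
print(round(relative_improvement(example_losses, window=5), 4))  # 0.0426
```

A value that is still clearly above zero indicates that the loss has not plateaued yet.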

This notebook contains the above code implementation.

Wrapping Up

This article discussed image generation, its various applications, and well-known generative models. In particular, we explored the Variational Autoencoder (VAE) architecture, built it with TensorFlow Keras, trained it on Fashion MNIST data, and generated images by decoding points sampled from the latent space. Interested readers may try this implementation with different image data or with deeper encoder and decoder architectures (i.e., with more convolution layers and transpose convolution layers, respectively).

Copyright Analytics India Magazine Pvt Ltd
