In machine learning, we see applications of autoencoders in various places, largely in unsupervised learning. There are various types of autoencoders, designed for different types of data. The main motive behind an autoencoder is to reconstruct its input sample. Data masking is a procedure of hiding sensitive information by showing a fake copy for security purposes. Combining autoencoders and data masking, we can build a different kind of autoencoder that can be named a masked autoencoder. In this article, we will discuss masked autoencoders in detail and try to understand how they work. The major points to be discussed in this article are listed below.
Table of Contents
- What is an Autoencoder?
- What is Data Masking?
- Masked Autoencoder (MAE)
- Encoder
- Decoder
Let’s start the discussion by understanding autoencoders in brief.
What is an Autoencoder?

In deep learning, autoencoders are a family of neural networks mainly used in the fields of NLP and computer vision. An autoencoder learns efficient codings of unlabeled data in an unsupervised manner, and the learned encoding is validated by regenerating the input from it. The encoder is a neural network that learns a compressed representation of a set of information. Dimensionality reduction is a typical unsupervised task where, using an autoencoder, we can train the network to ignore insignificant features of the data.
The below image can be a representation of the basic architecture of an autoencoder.

As we can see in the above representation, an autoencoder comprises two parts:-
- Encoder: This component maps the input into code or we can consider the code as a lower-dimensional representation of the input.
- Decoder: This component maps the code produced by the encoder back to the input space, reconstructing the input.
However, we could also regenerate input samples with a simple copying task; the main motive behind using an autoencoder is to preserve only the useful information of the data in the copy. Traditional autoencoders were designed to perform dimensionality reduction for feature learning. An autoencoder is trained to minimize the reconstruction error, which serves as its loss function.
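To make the idea concrete, below is a minimal sketch of an autoencoder in PyTorch. The layer sizes and the 784-dimensional input (a flattened 28×28 image) are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder: maps the input to a lower-dimensional code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )
        # Decoder: reconstructs the input from the code
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code)

model = AutoEncoder()
x = torch.randn(16, 784)                 # a dummy batch of flattened images
recon = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction error as the loss
```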
There are several popular variants of autoencoders:-
- Regularized Autoencoder: This family of autoencoders varies in how the hidden layer is constrained; based on that, we can categorize them as follows:
- Sparse Autoencoder (SAE): This type of autoencoder has more hidden units than input units, but a sparsity constraint allows only a small number of them to be active at a time.
- Denoising Autoencoder (DAE): This type of autoencoder achieves a good representation of the input by changing the reconstruction criterion: the input is corrupted and the network is trained to reconstruct the original, clean input (a minimal sketch follows this list).
- Contractive Autoencoder: These autoencoders add a penalty on the sensitivity of the latent representation to the input, which forces the learned features to be robust to small perturbations of the input.
- Variational Autoencoder (VAE): In this type of autoencoder, the latent representation is a probability distribution over the latent space instead of a fixed vector.
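To illustrate the denoising variant, the sketch below reuses the AutoEncoder class from the earlier snippet; the Gaussian noise level of 0.3 is an arbitrary assumption. The input is corrupted, but the loss is computed against the clean input.

```python
import torch
import torch.nn as nn

model = AutoEncoder()                    # the class sketched earlier
x = torch.randn(16, 784)                 # clean inputs
noisy_x = x + 0.3 * torch.randn_like(x)  # corrupt with Gaussian noise
recon = model(noisy_x)
loss = nn.functional.mse_loss(recon, x)  # target is the *clean* input
loss.backward()
```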
In this article, we are going to discuss the masked autoencoder, but before that, we need to know what data masking is, which we will cover in the next section.
What is Data Masking?
We can say that data masking is a technique through which we hide part of the information in the data from the model, so that the model is trained only on the portion of the data that is most relevant to the specified task.
Let’s understand this by taking an example of image data where we have images of animals and the task for the network is to classify them. For a human, it is quite easy to differentiate between the image of a cat and a dog just by looking at the animal’s eyes. The same capability can be added to a network if we train it to classify images by looking at only a few attributes; making the network learn from less information also makes it more robust and resilient. For this, we mask part of the information in the image data so that the network can be trained more effectively with a smaller amount of data. To classify an image, the network then does not need to go through every pixel in it.
Data masking is simply a procedure that hides part of the information in the data. The question that arises here is: why do we need data masking? There are many uses of data masking in different scenarios, but in ours, we use it to train the model more accurately and precisely on the data. It also helps reduce the size of the training data and the model, because data with fewer attributes requires less memory and less compute to be modelled. The below image can be considered an example of data masking, or image masking.

In the above image, we can see that by placing blank patches on the image we have masked some of the information present in it. We can also see that the positions of the patches differ each time, which reflects the type of masking applied: centred masking, block-wise masking, or random masking.
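The sketch below shows how random and centred masking could be implemented on a raw image array; the image size, patch size, and masking ratio are assumptions for illustration.

```python
import numpy as np

def random_mask(img, patch=16, ratio=0.75, seed=0):
    """Blank out a random fraction of non-overlapping patches."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    rows, cols = img.shape[0] // patch, img.shape[1] // patch
    idx = rng.choice(rows * cols, size=int(rows * cols * ratio), replace=False)
    for i in idx:
        r, c = divmod(i, cols)
        out[r*patch:(r+1)*patch, c*patch:(c+1)*patch] = 0
    return out

def center_mask(img, frac=0.5):
    """Blank out a central square covering `frac` of each side."""
    out = img.copy()
    h, w = img.shape[:2]
    dh, dw = int(h * frac) // 2, int(w * frac) // 2
    out[h//2-dh:h//2+dh, w//2-dw:w//2+dw] = 0
    return out

img = np.ones((224, 224, 3), dtype=np.float32)  # dummy image
masked = random_mask(img)
```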
Now that we have seen how an image can be masked using different strategies like block-wise masking and random masking, let’s move toward the masked autoencoder, which will help us build a better understanding of how masking is combined with an autoencoder.
Masked Autoencoder (MAE)
In the above sections, we have seen what an autoencoder is and what masking is; from these points, we can understand that the motive of this article is to build an understanding of the masked autoencoder, which can be thought of as the result of using masked data with an autoencoder.
In many examples, we can find that autoencoders have worked well in the field of computer vision, especially because the image space is continuous, but they have been less successful in NLP. The reason is that language data is in most cases discrete. We often pretend that text is not discrete by representing words as vectors in a word embedding, but the interpolation between words is still not as straightforward as it is between pixels, and this creates problems when we try to model the data in practice; the discreteness makes the reconstruction loss hard to define for an autoencoder.
However, with the introduction of the transformer, this problem has been overcome for NLP data. Take the example of BERT, which starts by corrupting a portion of the input tokens and then learns to predict the missing values. By classifying over a vocabulary of roughly 30,000 tokens, BERT can predict the exact word hidden behind each mask token. This strategy of BERT is very similar to a masked autoencoder.
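As a quick demonstration, the Hugging Face transformers library exposes BERT’s masked-token prediction directly (the example sentence is arbitrary):

```python
from transformers import pipeline

# load a pretrained BERT behind the fill-mask pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The cat sat on the [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```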
For computer vision data, we can perform the same procedure using the masked autoencoder. Encoding and decoding data with an autoencoder has the advantage of self-supervision: we don’t need labelled data, because we are just reconstructing the input. Another advantage is that the autoencoder can learn a general representation of the modality.
In self-supervision, the model is not pushed to forget everything that is irrelevant to a label, as is the case in supervised learning, but instead needs to remember the essence of the data itself. Doing this with NLP data is considerably easier than with image data. Performing masked autoencoding on language data and vision data differs in the following ways:
- Computer vision tasks have mostly relied on CNNs, and CNNs rely on the regular grid structure of images; it is not easy to introduce mask tokens or positional information into a convolutional network without hand-designed modifications.
- Vision and language data are very different in terms of information density. A single word carries a lot of meaning with it, whereas individual pixels carry very little information.
- Decoding vision and language data also differs. In language, a low-capacity MLP in the decoder is enough to predict a small number of tokens with high semantic information. For computer vision data, we need a decoder that can regress many pixels while maintaining contextual harmony between them.
From the above, we can say that masking around 15–20% of the words in a text is sufficient for modelling, whereas masking 20% of an image removes almost no information, because neighbouring pixels are highly redundant; images therefore need a much higher masking ratio.
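As a back-of-the-envelope calculation under standard ViT assumptions (a 224×224 image split into 16×16 patches, both values assumed):

```python
n_patches = (224 // 16) ** 2           # 14 * 14 = 196 patches per image
visible = int(n_patches * (1 - 0.75))  # at a 75% masking ratio
print(n_patches, visible)              # 196 49
```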
In the above section, we have seen that using BERT we can perform masking and autoencoding and get good results, but with image data, it is difficult to do the same because of the differences listed above.
Encoder
When discussing masking, we said that masking can be considered an approach to removing information. This means we hide some information in the data from the network; instead of merely hiding the information, we can also delete it.
Let’s take an example of an image and divide it into non-overlapping patches, as in the following image.


Now, as in the ViT (Vision Transformer) setting, we can treat each of these patches as a vector. After this, we can randomly mask the patches; let’s say that 75% of them are masked and removed from the list of patches.

Now we can use a transformer like ViT as an encoder to encode the remaining patches (the masking step is sketched below). Please note that actually removing the masked patches, rather than just hiding them, is necessary to keep the encoder’s input sequence from getting long.
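The sketch below shows one way to implement this random masking on a batch of patch vectors, along the lines of the shuffle-by-noise trick used in the official MAE implementation; the batch size, patch count, and vector dimension are assumptions.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patch vectors, independently per image."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                   # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)  # patches with low scores are kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    # binary mask in the original patch order: 0 = kept, 1 = masked
    mask = torch.ones(B, N)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return kept, mask, ids_keep, ids_restore

patches = torch.randn(4, 196, 768)  # e.g. 14x14 = 196 patches, 768 dims each
kept, mask, ids_keep, ids_restore = random_masking(patches)
print(kept.shape)                   # torch.Size([4, 49, 768])
```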
By increasing the number of parameters in the feed-forward network of the transformer, we can increase the capacity of the encoder. So that the encoder is not confused about where each remaining patch came from, we attach positional embeddings to the patches.

Removing the patches from the image allows an encoder with a small sequence length to process the input with fewer data points and encode the visible patches into latent representations.

The output of the encoder transformer has the same number of tokens as its input. Because we are using a transformer here, after training, it can capture the semantics of the image in a better way.
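Here is a sketch of such an encoder: it linearly embeds the visible patches, adds the positional embeddings of their original locations, and runs a standard transformer encoder over them. The class name MAEEncoder and all sizes are illustrative assumptions; it consumes the outputs of the masking sketch above.

```python
import torch
import torch.nn as nn

class MAEEncoder(nn.Module):
    def __init__(self, patch_dim=768, dim=512, depth=4, heads=8, n_patches=196):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, kept_patches, ids_keep):
        B = kept_patches.size(0)
        x = self.embed(kept_patches)
        # add each visible patch's embedding for its *original* position
        pos = self.pos.expand(B, -1, -1)
        x = x + torch.gather(pos, 1, ids_keep.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        return self.encoder(x)

encoder = MAEEncoder()
latent = encoder(kept, ids_keep)  # kept/ids_keep from the masking sketch above
print(latent.shape)               # torch.Size([4, 49, 512])
```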
Decoder
In the decoding part, we insert tokens for the masked patches back in alongside the latent representations. To reconstruct the input image, we use a decoder that is also a transformer and therefore follows the transformer architecture, meaning there is one output vector token for each input vector token.

From the set of latent representations of the visible patches and the mask tokens, the transformer decoder computes the original image patches. A linear projection maps the last activations into pixel space so that the output of the model is an image. Using the input and the output, we can compute the MSE, which serves as the reconstruction loss.
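A matching sketch of the decoder: append one shared, learnable mask token for every removed patch, restore the original patch order, decode with a small transformer, and linearly project back to pixel space. The class name MAEDecoder and the sizes are assumptions; as in the MAE paper, the loss is computed only on the masked patches.

```python
import torch
import torch.nn as nn

class MAEDecoder(nn.Module):
    def __init__(self, dim=512, patch_dim=768, depth=2, heads=8, n_patches=196):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, depth)
        self.to_pixels = nn.Linear(dim, patch_dim)  # one output vector per token

    def forward(self, latent, ids_restore):
        B, n_keep, D = latent.shape
        n_masked = ids_restore.size(1) - n_keep
        # one shared mask token stands in for every removed patch
        mask_tokens = self.mask_token.expand(B, n_masked, -1)
        x = torch.cat([latent, mask_tokens], dim=1)
        # unshuffle back to the original patch order, then add positions
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
        x = x + self.pos
        return self.to_pixels(self.decoder(x))

decoder = MAEDecoder()
recon = decoder(latent, ids_restore)               # from the sketches above
# MSE computed only on the masked patches, as in the paper
per_patch = ((recon - patches) ** 2).mean(dim=-1)  # one error value per patch
loss = (per_patch * mask).sum() / mask.sum()
```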
The below image is a representation of the basic structure of the masked autoencoder.

A similar method is applied in the work Masked Autoencoders Are Scalable Vision Learners. The model and the instructions to follow for implementation can be found here.
Final Words
In this article, we have discussed an introduction to autoencoders in deep learning and to data masking. Along with this, we have seen the method that can be followed to build a masked autoencoder using transformers like ViT in the field of computer vision.