Is MLP Better Than CNN & Transformers For Computer Vision?

Published on May 14, 2021

by Amit Raja Naik

Earlier this month, Google researchers released MLP-Mixer, a new computer vision architecture based exclusively on multi-layer perceptrons (MLPs). The MLP-Mixer code is now available on GitHub.

MLPs are widely used for machine learning problems such as classification and regression, particularly on tabular datasets. In computer vision, however, convolutional neural networks (CNNs) and, more recently, attention-based networks (transformers) have been the go-to models.

In a recent paper, Google has introduced MLP-Mixer.

“While convolutions and attention are both sufficient for good performance, neither of them are necessary,” reads the paper, MLP-Mixer: An all-MLP Architecture for Vision, co-authored by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Thomas Unterthiner, Lucas Beyer, Xiaohua Zhai, Jessica Yung, Daniel Keysers, Mario Lucic, Jakob Uszkoreit and Alexey Dosovitskiy.

Interestingly, when trained on large datasets, the new model achieves results similar to state-of-the-art models while running at roughly 3x the speed of a vision transformer. “When trained on large datasets, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models,” claimed Google AI.

Architecture for computer vision

MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (‘mixing’ the per-location features), and one with MLPs applied across patches (‘mixing’ spatial information).

The image below depicts the macro-structure of Mixer, which consists of per-patch linear embeddings, Mixer layers and a classifier head. Each Mixer layer contains one token-mixing MLP and one channel-mixing MLP, each consisting of two fully connected layers and a GELU nonlinearity. Other components include skip-connections, layer norm on the channels, dropout, and a linear classifier head.

Source: (arXiv.org)

The model accepts a sequence of linearly projected image patches (also called tokens), arranged as a ‘patches × channels’ table, as input and maintains this dimensionality throughout. Mixer uses two types of MLP layers: channel-mixing MLPs and token-mixing MLPs.
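The input step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the helper names (patchify, embed_patches), the 224×224 image, the patch size of 16 and the 512 channels are all assumptions chosen for the example.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(image, projection, patch_size):
    """Project every patch with the same matrix into a 'patches x channels' table."""
    return patchify(image, patch_size) @ projection

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))      # stand-in for a real image
patch_size, channels = 16, 512              # illustrative values (assumption)
projection = rng.normal(size=(patch_size * patch_size * 3, channels)) * 0.02

table = embed_patches(image, projection, patch_size)
print(table.shape)  # (196, 512): 14 x 14 patches, 512 channels each
```

Each row of the resulting table is one token and each column is one channel; every later layer keeps this shape.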

Channel-mixing MLPs allow communication between different channels; they operate on each token independently and take individual rows of the table as inputs. Similarly, token-mixing MLPs allow communication between different spatial locations (tokens); they operate on each channel independently and take individual columns of the table as inputs. The two types of layers are interleaved to enable interaction across both input dimensions.
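A single Mixer layer along these lines might look like the NumPy sketch below. It is an illustration under stated assumptions, not Google's code: dropout and the learnable scale and shift of layer norm are left out, GELU uses the common tanh approximation, and the function names and hidden widths (256 and 2048) are picked for the example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # normalise each token over its channels (learnable scale/shift omitted)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x, w1, b1, w2, b2):
    # two fully connected layers with a GELU in between, acting on the last axis
    return gelu(x @ w1 + b1) @ w2 + b2

def mixer_layer(x, token_params, channel_params):
    """x is the (patches, channels) table; the output has the same shape."""
    # Token mixing: transpose so the MLP runs along the patch axis.
    # The same weights are shared by every channel (column).
    y = layer_norm(x).T                       # (channels, patches)
    x = x + mlp(y, *token_params).T           # skip connection
    # Channel mixing: the MLP runs along the channel axis,
    # independently for every token (row).
    x = x + mlp(layer_norm(x), *channel_params)
    return x

def init_mlp(rng, dim, hidden):
    # small random weights, zero biases (illustrative initialisation)
    return (rng.normal(size=(dim, hidden)) * 0.02, np.zeros(hidden),
            rng.normal(size=(hidden, dim)) * 0.02, np.zeros(dim))

rng = np.random.default_rng(0)
num_patches, channels = 196, 512
token_params = init_mlp(rng, num_patches, 256)    # hidden width is an assumption
channel_params = init_mlp(rng, channels, 2048)    # hidden width is an assumption

table = rng.normal(size=(num_patches, channels))
out = mixer_layer(table, token_params, channel_params)
print(out.shape)  # (196, 512)
```

The only difference between the two mixing steps is the transpose: the same kind of MLP is applied once along the columns and once along the rows of the table.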

MLP-Mixer vs CNN vs vision transformers

“In the extreme situation, our architecture can be seen as a unique CNN, which uses (1×1) convolutions for channel mixing, and single-channel depth-wise convolutions for token mixing. However, the converse is not true as CNNs are not special cases of Mixer,” explained Google AI.
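The channel-mixing half of that comparison is easy to check numerically. The sketch below, with made-up sizes and a deliberately naive conv1x1 helper written for this example, shows that a 1×1 convolution over a grid of patches gives the same result as one dense layer applied to each row of the ‘patches × channels’ table. It does not cover the token-mixing half of the statement.

```python
import numpy as np

def conv1x1(feature_map, kernel):
    """Naive 1x1 convolution. feature_map: (H, W, C_in), kernel: (C_in, C_out)."""
    h, w, _ = feature_map.shape
    out = np.empty((h, w, kernel.shape[1]))
    for i in range(h):
        for j in range(w):
            # a 1x1 kernel only sees the channels at one location
            out[i, j] = feature_map[i, j] @ kernel
    return out

rng = np.random.default_rng(0)
grid, c_in, c_out = 14, 8, 16                 # illustrative sizes (assumption)
feature_map = rng.normal(size=(grid, grid, c_in))
kernel = rng.normal(size=(c_in, c_out))

# The same weights used as a dense layer on the flattened table
table = feature_map.reshape(grid * grid, c_in)
dense_out = table @ kernel

conv_out = conv1x1(feature_map, kernel).reshape(grid * grid, c_out)
same = np.allclose(conv_out, dense_out)
print(same)  # True
```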

A convolution is also more complex than the plain matrix multiplication in MLPs, as it requires either a costly additional reduction to matrix multiplication or a specialised implementation.

Today, image processing networks typically mix features at a given location, mix features between different locations, or do both. For instance, CNNs perform both kinds of mixing with convolutions of different kernel sizes and pooling, while vision transformers perform them with self-attention.

MLP-Mixer, on the other hand, attempts to do both in a more ‘separate’ fashion and only using MLPs. The advantage of only using MLPs — essentially matrix multiplication — is the simplicity of the architecture and the computational speed.

In addition, the computational complexity of MLP-Mixer is linear in the number of input patches, unlike vision transformers, whose complexity is quadratic. The model also uses skip connections and regularisation.
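A rough operation count makes the difference concrete. The sketch below counts multiply-accumulates for the token-mixing MLP and for the two S×S matrix products in self-attention, where S is the number of patches. It assumes the token-mixing hidden width is fixed independently of S, and it ignores biases, projections and softmax; the function names and sizes are illustrative.

```python
def token_mixing_macs(num_patches, channels, hidden_width):
    # two dense layers (S -> hidden -> S), applied to each of the C channels
    return 2 * num_patches * hidden_width * channels

def attention_macs(num_patches, channels):
    # query-key scores plus the weighted sum over values, each S * S * C
    return 2 * num_patches * num_patches * channels

channels, hidden_width = 512, 256   # illustrative values (assumption)
for s in (196, 392, 784):
    print(s, token_mixing_macs(s, channels, hidden_width), attention_macs(s, channels))
```

Doubling the number of patches doubles the first count but quadruples the second.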

The advantages of MLP-Mixer include:

All layers are identical in size
Each layer consists of two MLP blocks
Each layer takes an input of the same size
All image patches are projected linearly with the same projection matrix

Outcome

MLP-Mixer is faster than comparable models. For instance, the reported throughput of Mixer is around 105 images/sec/core, compared to 32 for the vision transformer.

“Hopefully, these results spark further research beyond the realms of well-established models based on convolutions and self-attention transformers,” concluded Google AI.
