Is MLP Better Than CNN & Transformers For Computer Vision?

Earlier this month, Google researchers released a new algorithm called MLP-Mixer, an architecture based exclusively on multi-layered perceptrons (MLPs) for computer vision. The MLP-Mixer code is now available on GitHub.

MLPs are typically used for machine learning problems involving tabular datasets, classification and regression. In image processing, however, researchers and developers have long relied on convolutional neural networks (CNNs) and attention-based networks (transformers) rather than MLPs.

In a recent paper, Google has introduced MLP-Mixer.

“While convolutions and attention are both sufficient for good performance, neither of them are necessary,” reads the paper, MLP-Mixer: An all-MLP Architecture for Vision, co-authored by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Thomas Unterthiner, Lucas Beyer, Xiaohua Zhai, Jessica Yung, Daniel Keysers, Mario Lucic, Jakob Uszkoreit and Alexey Dosovitskiy. 

Interestingly, the new model achieves results comparable to state-of-the-art models trained on large datasets, at almost 3x the speed. “When trained on large datasets, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models,” claimed Google AI. 

Architecture for computer vision 

MLP-Mixer contains two types of layers — one with MLPs applied independently to image patches (‘mixing’ the per-location features), and one with MLPs applied across patches (‘mixing’ spatial information).

The image below depicts the macro-structure of Mixer with Mixer layers, per-patch linear embeddings and a classifier head. Mixer layers contain one channel-mixing MLP and one token-mixing MLP, each consisting of two fully connected layers and a GELU nonlinearity. Other components include skip-connections, layer norm on the channels, dropout, and linear classifier head. 

Source: arXiv.org

The model accepts a sequence of linearly projected image patches as input and maintains this dimensionality throughout. Internally, Mixer uses two types of MLP layers: channel-mixing MLPs and token-mixing MLPs. 
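To make the input format concrete, here is a minimal NumPy sketch of the per-patch linear embedding step. The image size, patch size and hidden dimension are illustrative choices, not the paper’s actual configurations:

```python
import numpy as np

# Hypothetical sizes, not the paper's configs: a 32x32 RGB image,
# 8x8 patches, hidden (channel) dimension C = 16.
H = W = 32
P = 8                      # patch size
C = 16                     # hidden dimension
S = (H // P) * (W // P)    # number of patches (tokens): 16

rng = np.random.default_rng(0)
image = rng.normal(size=(H, W, 3))

# Cut the image into non-overlapping P x P patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(S, P * P * 3)

# The same projection matrix is applied to every patch, producing the
# S x C "table" of tokens that the Mixer layers then operate on.
W_embed = rng.normal(size=(P * P * 3, C))
tokens = patches @ W_embed        # shape (S, C) = (16, 16)
```

The resulting table of shape (patches × channels) is what the subsequent Mixer layers mix along rows and columns.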

Channel-mixing MLPs allow communication between different channels; they operate on each token independently and take individual rows of the table as inputs. Similarly, token-mixing MLPs allow communication between different spatial locations (tokens); they operate on each channel independently and take individual columns of the table as inputs. The two layer types are interleaved to enable interaction across both input dimensions. 
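A single Mixer layer, as described above, can be sketched in NumPy as follows. This is an illustrative re-implementation under assumed sizes, not the authors’ code; dropout is omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Layer norm over the channel (last) dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, w2):
    # Two fully connected layers with a GELU in between.
    return gelu(x @ w1) @ w2

def mixer_layer(X, token_w1, token_w2, channel_w1, channel_w2):
    # X: (S, C) table of S tokens with C channels.
    # Token mixing: transpose so the MLP acts on columns
    # (each channel independently), plus a skip connection.
    Y = X + mlp(layer_norm(X).T, token_w1, token_w2).T
    # Channel mixing: the MLP acts on rows (each token independently).
    return Y + mlp(layer_norm(Y), channel_w1, channel_w2)

S, C, D_s, D_c = 16, 32, 64, 128   # illustrative sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(S, C))
out = mixer_layer(
    X,
    rng.normal(size=(S, D_s)) * 0.02, rng.normal(size=(D_s, S)) * 0.02,
    rng.normal(size=(C, D_c)) * 0.02, rng.normal(size=(D_c, C)) * 0.02,
)
# out has the same (S, C) shape as the input, so layers can be stacked.
```

Note how the transpose is the only thing distinguishing token mixing from channel mixing: both are plain MLPs applied to the same table along different axes.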

MLP-Mixer vs CNN vs vision transformers

“In the extreme situation, our architecture can be seen as a unique CNN, which uses (1×1) convolutions for channel mixing, and single-channel depth-wise convolutions for token mixing. However, the converse is not true as CNNs are not special cases of Mixer,” explained Google AI. 
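The (1×1)-convolution equivalence mentioned in the quote is easy to verify numerically: applying a 1×1 convolution across an image is the same as multiplying each pixel’s channel vector by one shared matrix. A small sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C_in, C_out = 4, 4, 3, 5
x = rng.normal(size=(H, W, C_in))
kernel = rng.normal(size=(C_in, C_out))   # a 1x1 convolution kernel

# 1x1 convolution: slide the kernel over every spatial position.
conv = np.empty((H, W, C_out))
for i in range(H):
    for j in range(W):
        conv[i, j] = x[i, j] @ kernel

# Channel-mixing view: flatten spatial dims, one matrix multiply.
matmul = (x.reshape(-1, C_in) @ kernel).reshape(H, W, C_out)

same = np.allclose(conv, matmul)   # True: the two views coincide
```

This is exactly the sense in which Mixer’s channel-mixing MLP corresponds to a 1×1 convolution.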

Convolution is more complex than the plain matrix multiplication used in MLPs, as it requires either an additional, costly reduction to matrix multiplication or a specialised implementation. 

Today, image processing networks typically mix features at a given location, mix features between different locations, or both. In CNNs, both kinds of mixing happen through convolution kernels and pooling, while vision transformers perform them with self-attention. 

MLP-Mixer, on the other hand, performs the two kinds of mixing in clearly separated steps, using only MLPs. The advantage of using only MLPs — essentially matrix multiplications — is the simplicity of the architecture and its computational speed. 

Moreover, the computational complexity of MLP-Mixer is linear in the number of input patches, unlike vision transformers, whose complexity is quadratic. The model also uses skip connections and regularisation.
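The scaling difference can be illustrated with rough operation counts. The cost formulas below are back-of-the-envelope sketches under assumed fixed hidden dimensions, not measurements from the paper:

```python
def attention_cost(S, C):
    # Self-attention mixes every token with every other token:
    # roughly O(S^2 * C) operations for S tokens of width C.
    return S * S * C

def token_mixing_cost(S, C, D_s):
    # Token-mixing MLP: two matmuls through a hidden width D_s,
    # applied per channel -- roughly O(S * C * D_s), linear in S.
    return 2 * C * S * D_s

# Doubling the number of patches doubles the Mixer cost but
# quadruples the attention cost.
small = (attention_cost(196, 512), token_mixing_cost(196, 512, 256))
large = (attention_cost(392, 512), token_mixing_cost(392, 512, 256))
```

With a finer patch grid (more tokens), the quadratic attention term quickly dominates, which is why Mixer’s linear scaling matters at high resolution.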

The advantages of MLP-Mixer include:

  • Layers of identical size 
  • Two MLP blocks per layer 
  • Inputs of the same size at every layer
  • All image patches projected linearly with the same projection matrix

Outcome 

MLP-Mixer is faster than comparable models. For instance, the throughput of Mixer is around 105 images/sec/core, compared to 32 for the vision transformer.

“Hopefully, these results spark further research beyond the realms of well-established models based on convolutions and self-attention transformers,” concluded Google AI.


Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.
