Earlier this month, Google researchers released a new algorithm called MLP-Mixer, an architecture based exclusively on multi-layered perceptrons (MLPs) for computer vision. The MLP-Mixer code is now available on GitHub.
MLPs are used to solve machine learning problems involving tabular data, classification and regression. Alongside convolutional neural networks (CNNs) and attention-based networks (transformers), researchers and developers also use MLPs in image processing.
In a recent paper, Google has introduced MLP-Mixer.
“While convolutions and attention are both sufficient for good performance, neither of them are necessary,” reads the paper, MLP-Mixer: An all-MLP Architecture for Vision, co-authored by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Thomas Unterthiner, Lucas Beyer, Xiaohua Zhai, Jessica Yung, Daniel Keysers, Mario Lucic, Jakob Uszkoreit and Alexey Dosovitskiy.
Interestingly, the new model achieves results comparable to state-of-the-art models trained on large datasets, at almost 3x the speed. “When trained on large datasets, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models,” claimed Google AI.
Architecture for computer vision
MLP-Mixer contains two types of layers — one with MLPs applied independently to image patches (‘mixing’ the per-location features), and one with MLPs applied across patches (‘mixing’ spatial information).
The image below depicts the macro-structure of Mixer with Mixer layers, per-patch linear embeddings and a classifier head. Mixer layers contain one channel-mixing MLP and one token-mixing MLP, each consisting of two fully connected layers and a GELU nonlinearity. Other components include skip-connections, layer norm on the channels, dropout, and linear classifier head.
The model accepts a sequence of linearly projected image patches (tokens) as input and maintains this dimensionality throughout. Mixer uses two types of MLP layers: channel-mixing MLPs and token-mixing MLPs.
Channel-mixing MLPs allow communication between different channels; they operate on each token independently and take individual rows of the table as inputs. Similarly, token-mixing MLPs allow communication between different spatial locations (tokens); they operate on each channel independently and take individual columns of the table as inputs. These two layer types are interleaved to enable interaction across both input dimensions.
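The Mixer layer described above can be sketched in a few lines of NumPy. The weight shapes, hidden widths and toy dimensions below are hypothetical, chosen only for illustration; the real model stacks many such layers with learned weights.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalise over the channel (last) dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU nonlinearity.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, w2):
    # Two fully connected layers with a GELU in between.
    return gelu(x @ w1) @ w2

def mixer_layer(x, token_w1, token_w2, channel_w1, channel_w2):
    # x: (S, C) table — S patches (rows) by C channels (columns).
    # Token mixing: MLP over the columns (each channel independently),
    # with a skip-connection.
    y = x + mlp(layer_norm(x).T, token_w1, token_w2).T
    # Channel mixing: MLP over the rows (each token independently),
    # with a skip-connection.
    return y + mlp(layer_norm(y), channel_w1, channel_w2)

# Toy dimensions (hypothetical): S=16 patches, C=8 channels,
# hidden widths of 32 for both MLPs.
rng = np.random.default_rng(0)
S, C, Ds, Dc = 16, 8, 32, 32
x = rng.standard_normal((S, C))
out = mixer_layer(x,
                  rng.standard_normal((S, Ds)), rng.standard_normal((Ds, S)),
                  rng.standard_normal((C, Dc)), rng.standard_normal((Dc, C)))
print(out.shape)  # (16, 8) — the layer preserves the input dimensionality
```

Note how the token-mixing MLP simply transposes the table so that the same two-layer MLP machinery can mix along either dimension.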
MLP-Mixer vs CNN vs vision transformers
“In the extreme situation, our architecture can be seen as a unique CNN, which uses (1×1) convolutions for channel mixing, and single-channel depth-wise convolutions for token mixing. However, the converse is not true as CNNs are not special cases of Mixer,” explained Google AI.
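The 1×1-convolution view of channel mixing is easy to verify numerically: a fully connected layer applied independently at every spatial location is exactly a 1×1 convolution. The sketch below uses hypothetical toy shapes.

```python
import numpy as np

rng = rng = np.random.default_rng(1)
H, W, C_in, C_out = 4, 4, 3, 5
x = rng.standard_normal((H, W, C_in))   # a tiny feature map
w = rng.standard_normal((C_in, C_out))  # weights shared across locations

# A dense layer applied independently at every spatial location
# (channel mixing) ...
dense = (x.reshape(-1, C_in) @ w).reshape(H, W, C_out)

# ... gives exactly the same result as a 1x1 convolution.
conv1x1 = np.einsum('hwc,cd->hwd', x, w)

print(np.allclose(dense, conv1x1))  # True
```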
Convolution is more complex than the plain matrix multiplication in MLPs, as it requires an additional costly reduction to matrix multiplication and/or a specialised implementation.
Today, image processing networks typically mix features at a given location, mix features between different locations, or both. In CNNs, both kinds of mixing are performed with convolutions, kernels and pooling, while vision transformers perform them with self-attention.
MLP-Mixer, on the other hand, attempts to do both in a more ‘separate’ fashion and only using MLPs. The advantage of only using MLPs — essentially matrix multiplication — is the simplicity of the architecture and the computational speed.
Moreover, the computational complexity of MLP-Mixer is linear in the number of input patches, unlike vision transformers, whose complexity is quadratic. The model also uses skip-connections and regularisation.
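The linear-versus-quadratic scaling can be made concrete with some rough multiply counts. The formulas and widths below are an illustrative back-of-the-envelope sketch, not the paper's exact cost accounting:

```python
def mixer_token_mix_cost(S, C, D_s):
    # Token-mixing MLP: two dense layers over the S patches,
    # applied once per channel. Linear in S for a fixed hidden width D_s.
    return C * (S * D_s + D_s * S)

def self_attention_cost(S, C):
    # Dot-product attention scores plus the weighted sum alone
    # involve S x S terms: quadratic in S.
    return 2 * S * S * C

# Going from a 14x14 patch grid (S=196) to a 28x28 grid (S=784),
# the token-mixing cost grows 4x while self-attention grows 16x.
for S in (196, 784):
    print(S, mixer_token_mix_cost(S, C=768, D_s=384), self_attention_cost(S, C=768))
```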
The advantages of MLP-Mixer include:
- All layers are identical in size
- Each layer consists of two MLP blocks
- Each layer takes inputs of the same size
- All image patches are linearly projected with the same projection matrix
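The last point — every patch sharing one projection matrix — can be sketched as follows, with hypothetical image and patch sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
img = rng.standard_normal((32, 32, 3))  # toy 32x32 RGB image
P, C = 8, 16                            # patch size and embedding width (hypothetical)

# Split the image into non-overlapping P x P patches and flatten each one.
patches = img.reshape(32 // P, P, 32 // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)  # (num_patches, patch_dim)

# Every patch is projected with the SAME linear projection matrix W.
W = rng.standard_normal((P * P * 3, C))
tokens = patches @ W
print(tokens.shape)  # (16, 16): 16 patches, each embedded into 16 channels
```

The resulting (patches × channels) table is exactly the input that the Mixer layers then operate on.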
MLP-Mixer is also faster than comparable models. For instance, Mixer's throughput is around 105 images/sec/core, compared to 32 for the vision transformer.
“Hopefully, these results spark further research beyond the realms of well-established models based on convolutions and self-attention transformers,” concluded Google AI.
Amit Raja Naik is a senior writer at Analytics India Magazine, where he dives deep into the latest technology innovations. He is also a professional bass player.