Traditionally, Convolutional Neural Networks (CNNs) have been the preferred choice for computer vision tasks. CNNs are composed of layers of artificial neurons, each of which computes a weighted sum of its inputs and passes the result on as an activation value. In computer vision applications, CNNs take pixel values as input and output visual features.
The arrival of AlexNet in 2012 was a defining moment for the CNN movement: its results on ImageNet image classification made CNN-based architectures the leading approach in computer vision. CNNs cemented their position as the de facto model for computer vision with the introduction of VGGNet, ResNe(X)t, MobileNet, EfficientNet, and RegNet. These architectures focused on aspects like accuracy, efficiency, and scalability.
Transformers, introduced in 2017, revolutionised natural language processing. Today, we have large language models that can perform tasks such as writing and coding. In the last few years, the scope of Transformer applications has widened, especially in the computer vision domain.
To that end, the arrival of Google’s Vision Transformer (ViT) has been a huge turning point. ViT applies the standard Transformer architecture from NLP to images: it splits an image into patches, treats the patches as a sequence, and predicts class labels, learning image structure on its own. Since then, many Transformer architectures have been introduced for computer vision. Gradually, Transformers started catching up with CNNs on computer vision tasks.
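To make the “image as a sequence” idea concrete, here is a minimal sketch in plain Python. The helper name `image_to_patch_sequence` and the tiny 4x4 image are illustrative assumptions, not code from the ViT paper; a real model would also project each patch to an embedding and add position information.

```python
# Minimal sketch: turn an image into a sequence of flattened patches,
# so it can be fed to a sequence model like a sentence of tokens.
# The function name and sizes are hypothetical, chosen for illustration.

def image_to_patch_sequence(image, patch_size):
    """Split a 2D image (list of rows) into non-overlapping square patches.

    Patches are returned in row-major order, each flattened to a 1D list.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    sequence = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            sequence.append(patch)
    return sequence

# A 4x4 single-channel "image" with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = image_to_patch_sequence(image, patch_size=2)

print(len(tokens))  # 4 patches, i.e. a sequence of length 4
print(tokens[0])    # [0, 1, 4, 5], the top-left 2x2 patch
```

Each inner list plays the role of one “word” in the sequence the Transformer reads.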
Researchers from Facebook AI Research (FAIR) and UC Berkeley have introduced a revamped family of CNNs, ConvNeXt, in a paper titled “A ConvNet for the 2020s”.
ConvNets and computer vision
ConvNets offer several built-in inductive biases that make them well suited to computer vision tasks; translation equivariance, a desirable property for object detection, is a good case in point. Further, when ConvNets are used in a sliding-window manner, the computations are shared, making the whole operation highly efficient. In the 2010s, the introduction of region-based detectors helped establish ConvNets as the cornerstone of visual recognition systems.
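Translation equivariance is easy to see in one dimension. The sketch below is a toy illustration in plain Python, assuming a hand-picked edge-detecting kernel and signal (neither comes from the paper): the same kernel weights are reused at every position, so moving the input pattern simply moves the response.

```python
# Toy sketch of translation equivariance with a sliding window.
# The kernel and signals are illustrative assumptions.

def correlate_1d(signal, kernel):
    """Slide the kernel over the signal ('valid' positions only) and
    return the weighted sum computed at each position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [1, 0, -1]                   # a simple edge detector
signal = [0, 0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 0, 0]    # same pattern, two steps to the right

out = correlate_1d(signal, kernel)
out_shifted = correlate_1d(shifted, kernel)

print(out)          # [-1, -1, 1, 1, 0, 0]
print(out_shifted)  # [0, 0, -1, -1, 1, 1]
```

The response to the shifted input is the original response moved two positions to the right, which is the property that makes the same learned filter useful wherever an object appears.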
With the entry of Transformers into computer vision, the popularity of ConvNets has taken a hit. For example, Google’s ViT scales better than traditional ConvNet models. That said, the major challenge of working with ViTs is their global attention design, whose complexity is quadratic in the input size. The problem compounds with higher-resolution inputs.
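A back-of-the-envelope calculation shows why resolution hurts. The sketch below assumes a patch size of 16 pixels and counts only token pairs, ignoring constants such as embedding width and number of heads; the function name is hypothetical.

```python
# Rough sketch: how the number of token pairs in global self-attention
# grows with image resolution. Patch size 16 is an assumption.

def attention_pairs(image_size, patch_size):
    """Return (number of tokens, number of token pairs) for a square image."""
    tokens = (image_size // patch_size) ** 2
    return tokens, tokens * tokens

for size in (224, 448, 896):
    tokens, pairs = attention_pairs(size, 16)
    print(size, tokens, pairs)
# 224 196 38416
# 448 784 614656
# 896 3136 9834496
```

Doubling the side length of the image quadruples the number of tokens and multiplies the number of attention pairs by 16.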
Taking the cue, researchers devised ‘hierarchical Transformers’ with a hybrid approach: the sliding-window strategy of ConvNets is brought into Transformers, as in the Swin Transformer. The popularity of the Swin Transformer suggests that Transformers have not rendered the ideas behind ConvNets obsolete.
ConvNeXts
A direct implementation of sliding-window self-attention is expensive. Alternatives such as cyclic shifting are cheaper, but they make the system more sophisticated in design.
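The idea behind cyclic shifting can be sketched on a tiny grid. This is an illustrative toy in plain Python, with hypothetical helper names, not the Swin Transformer implementation: the feature map is rolled so that an ordinary, non-overlapping window partition ends up covering cells that previously sat in different windows.

```python
# Toy sketch of cyclic shifting for windowed attention.
# Helper names and the 4x4 grid are assumptions for illustration.

def cyclic_shift(grid, shift):
    """Roll a 2D grid up and to the left by `shift` cells, wrapping around."""
    rolled_rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rolled_rows]

def windows(grid, size):
    """Partition a 2D grid into non-overlapping size x size windows."""
    return [[grid[top + r][left + c]
             for r in range(size) for c in range(size)]
            for top in range(0, len(grid), size)
            for left in range(0, len(grid[0]), size)]

grid = [[r * 4 + c for c in range(4)] for r in range(4)]

regular = windows(grid, 2)
shifted = windows(cyclic_shift(grid, 1), 2)

print(regular[0])  # [0, 1, 4, 5]
print(shifted[0])  # [5, 6, 9, 10], cells from four different regular windows
print(shifted[3])  # [15, 12, 3, 0], cells that only meet because of wrap-around
```

The last window shows where the extra sophistication comes from: wrap-around puts cells together that are not spatial neighbours, so an implementation has to handle such pairs separately, typically by masking them.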
The major reason why people are opting for hierarchical Transformers over ConvNets is the former’s superior scaling behaviour, with multi-head self-attention being the key component.
FAIR and UC Berkeley scientists investigated the differences between ConvNets and Transformers to identify the confounding variables when comparing network performance. According to the team, the objective of the study was to “bridge the gap between the pre-ViT and post-ViT eras for ConvNets, as well as to test the limits of what a pure ConvNet can achieve.”
The team has proposed a family of pure ConvNets called ConvNeXt. The researchers first trained a standard ResNet with an improved training procedure, and then gradually ‘modernised’ the architecture towards the design of a hierarchical vision Transformer. The new class of models was then evaluated on several vision tasks: ImageNet classification, object detection on COCO, and semantic segmentation on ADE20K.
The team found that ConvNeXts compete well with Transformers in terms of accuracy, robustness, and scalability. Furthermore, they retain the efficiency of standard ConvNets and are easier to implement because they are fully convolutional in both training and testing.
Read the full paper here.
Yann LeCun on ConvNeXts
ConvNets were introduced in the 1980s by Yann LeCun, a Turing Award winner who currently works as VP and chief AI scientist at Meta (Facebook). He built on the work of Japanese scientist Kunihiko Fukushima, who invented the neocognitron, a basic image-recognition neural network. The early version of the ConvNet, called LeNet after LeCun, could recognise handwritten digits.
Commenting on the new research on his personal LinkedIn profile, LeCun said that what works for ConvNeXts is “larger kernels, layer norm, fat layer inside residual blocks, one stage of non-linearity per residual block, separate downsampling layers…”.
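Most of the ingredients LeCun lists can be seen in a single residual block. The NumPy sketch below is a simplified illustration, not the authors’ code: the 7x7 depthwise kernel, the 4x channel expansion, and the GELU non-linearity are assumptions based on the published ConvNeXt design, while biases, the learnable scale and shift of layer norm, and other training details are left out. The function name `convnext_style_block` is hypothetical.

```python
import numpy as np

def depthwise_conv7x7(x, weight):
    """x: (H, W, C), weight: (7, 7, C). Each channel is filtered by its own
    7x7 kernel (the 'larger kernel'); zero padding keeps H and W unchanged."""
    h, w, _ = x.shape
    padded = np.pad(x, ((3, 3), (3, 3), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(7):
        for dx in range(7):
            out += padded[dy:dy + h, dx:dx + w, :] * weight[dy, dx, :]
    return out

def layer_norm(x, eps=1e-6):
    """Normalise over the channel axis (no learnable scale or shift here)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def convnext_style_block(x, dw_weight, w_expand, w_project):
    y = depthwise_conv7x7(x, dw_weight)  # larger kernel, spatial mixing
    y = layer_norm(y)                    # layer norm instead of batch norm
    y = y @ w_expand                     # "fat" layer: C -> 4C channels
    y = gelu(y)                          # the single non-linearity in the block
    y = y @ w_project                    # back to C channels
    return x + y                         # residual connection

# Random weights, purely to exercise the block.
rng = np.random.default_rng(0)
channels = 8
x = rng.standard_normal((14, 14, channels))
dw_weight = rng.standard_normal((7, 7, channels)) * 0.1
w_expand = rng.standard_normal((channels, 4 * channels)) * 0.1
w_project = rng.standard_normal((4 * channels, channels)) * 0.1

out = convnext_style_block(x, dw_weight, w_expand, w_project)
print(out.shape)  # (14, 14, 8)
```

The block keeps the spatial size and channel count unchanged, which fits with the downsampling being handled by separate layers between stages, as in LeCun’s list.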
Interestingly, although ConvNets are his own innovation, LeCun revealed in his post that his preferred architecture is the Detection Transformer (DETR).