A neural network is a software system patterned loosely on the way neurons operate in the human brain. Traditional fully connected networks, however, are not ideal for image processing: images must be fed to them at reduced resolution or broken into pieces. Convolutional Neural Networks (CNNs) are a deep learning architecture widely used for image-based learning in computational systems. A CNN comprises one or more convolution layers and is used across computer vision and image processing, including classification and segmentation problems. Convolution itself is a mathematical operation in which an input I and an argument, the kernel K, are combined to produce an output that expresses how the shape of one is modified by the other.
The convolutional layer is the core building block of a CNN, as it performs one of the network's most important functions: feature detection. The kernel K comprises a set of learnable filters; it is small spatially compared to the image but extends through the full depth of the input. The kernel acts as a feature detector on the image I, producing multiple feature maps that help identify or classify the image. Multiple such detectors handle feature identification problems like edge detection and recognizing different shapes. Rather than searching for a feature across the entire image at once, it is often more effective to look at small portions of the image at a time, which is exactly what a sliding kernel does, as the sketch below illustrates.
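To make this concrete, here is a minimal PyTorch sketch of a single kernel sliding over an image to produce a feature map. The Sobel-style edge kernel is a hand-picked illustration; in a real CNN the kernel values are learned during training:

import torch
import torch.nn.functional as F

image = torch.rand(1, 1, 28, 28)   # (batch, channels, height, width), a dummy grayscale image

# A fixed 3x3 vertical-edge (Sobel-style) kernel standing in for a learned filter
kernel = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).reshape(1, 1, 3, 3)

feature_map = F.conv2d(image, kernel, padding=1)  # convolve: one feature map per kernel
print(feature_map.shape)                          # torch.Size([1, 1, 28, 28])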
Convolutional neural networks have shown clear advantages in image classification and have achieved excellent results in many visual object recognition tasks, largely because their network structure can extract multi-level features from images. In the image classification problem, for example, an input image is assigned one label from a fixed set of categories. Despite its simplicity, this problem is central to computer vision and has many practical applications, such as labeling skin cancer images, or detecting natural disasters such as floods, volcanoes, and severe droughts and assessing the damage they cause from high-resolution images.
The performance and accuracy of image classification algorithms depend on the features fed to the neural network. Progress in machine-learning-based image classification has therefore relied heavily on selecting the essential features of the images that make up the database. Obtaining these features by hand is a tedious task that increases both complexity and computational cost.
Convolutional neural networks dominated visual recognition for years, but modeling in visual recognition has recently been revolutionized by vision transformers (ViTs). ViTs have shown the great potential of self-attention-based models on ImageNet classification, yet without extra training data their performance is still inferior to that of the latest SOTA CNNs.
Unlike CNNs, which aggregate and transform features with local, dense convolutional kernels, ViTs directly model the long-range dependencies between local patches, also known as tokens, through a self-attention mechanism that has greater flexibility in modeling visual content. A major factor preventing ViTs from outperforming CNNs is their low efficacy in encoding fine-level features and contexts into token representations, which are critical for compelling visual recognition performance. Fine-level information can be encoded by tokenizing the image into smaller patches, but the resulting longer token sequence sharply increases the cost of self-attention, which grows quadratically with sequence length, as the sketch below shows.
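A quick back-of-the-envelope sketch (plain Python; the patch sizes are illustrative) shows why finer tokenization is expensive for plain self-attention:

image_size = 224

for patch in (16, 8):   # a typical ViT patch size vs a fine-level patch size
    tokens = (image_size // patch) ** 2
    # self-attention scores every token against every other token
    print(f"{patch}x{patch} patches -> {tokens} tokens, "
          f"{tokens ** 2:,} pairwise attention scores per head")

# 16x16 patches -> 196 tokens, 38,416 pairwise attention scores per head
# 8x8 patches -> 784 tokens, 614,656 pairwise attention scores per head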
What is VOLO?
VOLO is a simple yet powerful model architecture for visual recognition. It achieves fine-level token representation encoding and global information aggregation through a two-stage architectural design. Given an input image of size 224 × 224, VOLO tokenizes it into small 8 × 8 patches (224 / 8 = 28, giving a 28 × 28 token grid) and employs multiple Outlookers to encode token representations at this fine level. The resulting token representations are more expressive and significantly improve model performance on image classification problems.
The architecture has two separate stages. The first consists of a repeated stack of Outlookers that generate fine-level token representations; the second deploys a sequence of transformer blocks to aggregate global information. At the beginning of each stage, a patch embedding module maps the input to token representations of the designed shape. Each Outlooker pairs an outlook attention layer for spatial information encoding with a multi-layer perceptron (MLP) for inter-channel information interaction and exchange, roughly as sketched below.
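A hypothetical sketch of an Outlooker block under this description (module and argument names are illustrative, not the official code; nn.Identity stands in for the outlook attention layer covered in the next section):

import torch.nn as nn

class Outlooker(nn.Module):
    """Sketch: outlook attention for spatial encoding + MLP for channel mixing."""
    def __init__(self, dim, mlp_ratio=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.Identity()            # stand-in for the outlook attention layer
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(            # inter-channel information exchange
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                    # x: (B, H, W, C) fine-level token map
        x = x + self.attn(self.norm1(x))     # spatial information encoding
        x = x + self.mlp(self.norm2(x))      # channel information exchange
        return x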
For a local K × K window in the image, the outlook attention matrix can be generated directly from the centre token with a linear layer followed by a reshape operation (highlighted by the green dashed box in the paper's architecture figure). For K = 3, the linear layer emits K⁴ = 81 values per token, which are reshaped into a 9 × 9 attention matrix. The attention weights generated from the centre token then act on the neighboring tokens and on the centre token itself; these operations are named outlook attention.

Outlook attention is simple, efficient, and easy to implement. Its main advantages are:
1) The features present at each spatial location are representative enough to generate the attention weights for locally aggregating the neighboring features.
2) Using dense and local spatial aggregation, it can encode fine-level information efficiently.
A sample outlook attention implementation, adapted from the pseudocode in the VOLO paper (the concrete sizes are illustrative), might look like this:
import torch
import torch.nn as nn

# Illustrative sizes: H, W = token-map height/width, K = kernel size, C = channels
H = W = 28
K = 3
C = 192
padding = 1

# initialization (these would live in a module's __init__)
v_pj = nn.Linear(C, C)                          # value projection
attn = nn.Linear(C, K ** 4)                     # K*K weights for each of the K*K positions
unfold = nn.Unfold(K, padding=padding)          # gather K x K neighborhoods
fold = nn.Fold(output_size=(H, W), kernel_size=K, padding=padding)

def outlook_attention(x):                       # x: input tensor (H, W, C)
    v = v_pj(x).permute(2, 0, 1).unsqueeze(0)   # (1, C, H, W)
    # Eqn. (3): embed the set of neighbors around every location
    v = unfold(v).reshape(C, K * K, H * W).permute(2, 1, 0)   # (H*W, K*K, C)
    a = attn(x).reshape(H * W, K * K, K * K)    # attention from the centre token alone
    # Eqn. (4): weighted average over each local window
    a = a.softmax(dim=-1)
    x = torch.matmul(a, v).permute(2, 1, 0).reshape(1, C * K * K, H * W)
    # Eqn. (5): fold overlapping windows back into a dense (H, W) map
    x = fold(x).squeeze(0).permute(1, 2, 0)     # back to (H, W, C)
    return x
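As a quick shape-level sanity check of the sketch above (the projections are randomly initialized here, so only the shapes are meaningful):

x = torch.randn(H, W, C)     # a dummy 28 x 28 token map
out = outlook_attention(x)
print(out.shape)             # torch.Size([28, 28, 192])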
Getting Started with Code Implementation of VOLO
This article implements a basic visual recognition task using VOLO, where the model automatically labels an input image. The following is based on the official implementation from the VOLO creators, whose GitHub repository can be found here.
First Steps
First, we will install the basic libraries required to create our model: two PyTorch image-model libraries, timm and tlt. Run the following:
!pip install timm tlt  # installing image model libraries
!git clone https://github.com/sail-sg/volo.git  # cloning the VOLO repository
!wget https://github.com/sail-sg/volo/releases/download/volo_1/d1_224_84.2.pth.tar  # downloading pretrained weights
Installing the Dependencies
Next, we install the remaining dependencies, import the model functions, and load a pretrained set of weights:
# loading the dependencies
import models
from PIL import Image
from tlt.utils import load_pretrained_weights
from timm.data import create_transform

model = models.volo_d1(img_size=224)
load_pretrained_weights(model=model, checkpoint_path='../d1_224_84.2.pth.tar')
model.eval()

transform = create_transform(input_size=224, crop_pct=model.default_cfg['crop_pct'])

Setting the path to our input image to be classified:

image = Image.open('/content/b0ebe8d60592c03cac61bf1e28373705.jpg')
input_image = transform(image).unsqueeze(0)
This will be our input image:

Predicting the Label
Loading the ImageNet classes for image classification:
imagenet_classes = {
    0: 'tench, Tinca tinca',
    1: 'goldfish, Carassius auratus',
    2: 'great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias',
    3: 'tiger shark, Galeocerdo cuvieri',
    4: 'hammerhead, hammerhead shark',
    5: 'electric ray, crampfish, numbfish, torpedo',
    6: 'stingray',
    7: 'rooster',
    8: 'hen',
    9: 'ostrich, Struthio camelus',
    10: 'brambling, Fringilla montifringilla',
    11: 'goldfinch, Carduelis carduelis',
    12: 'house finch, linnet, Carpodacus mexicanus',
    13: 'junco, snowbird',
    14: 'indigo bunting, indigo finch, indigo bird, Passerina cyanea',
    15: 'robin, American robin, Turdus migratorius',
    # ... (classes 16-993 omitted here for brevity)
    994: 'stinkhorn, carrion fungus',
    995: 'earthstar',
    996: 'hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa',
    997: 'bolete',
    998: 'ear, spike, capitulum',
    999: 'toilet tissue, toilet paper, bathroom tissue',
}
Making a label prediction for our input image:
pred = model(input_image)
print(f'Prediction: {imagenet_classes[int(pred.argmax())]}.')
image  # display the input image in the notebook
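If you also want to inspect the runner-up classes, here is a small hypothetical extension using torch.topk:

import torch

probs = torch.softmax(pred, dim=-1)
top5 = torch.topk(probs, k=5)
for p, idx in zip(top5.values[0], top5.indices[0]):
    print(f'{imagenet_classes[int(idx)]}: {p.item():.3f}')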
Output:
Prediction: tiger, Panthera tigris.

EndNotes
This article has explored the VOLO model and how it improves on traditional visual recognition methods. We also got a basic hands-on feel for building an image recognizer and predicted the label of an input image. You can apply the VOLO model to more complex problems and fine-tune it for other tasks as well. The above implementation is available as a Colab notebook, which can be accessed from the link here.
Happy Learning!