Guide To VOLO: Vision Outlooker For Visual Recognition

VOLO is a simple yet powerful attention-based model architecture for visual recognition that encodes fine-level token representations.


A neural network is a software system loosely modeled on the way neurons operate in the human brain. Plain fully connected networks are not well suited to image processing: they typically need images to be reduced in resolution or split into pieces before they can be fed in. Convolutional Neural Networks (CNNs) are a deep learning architecture widely used for image-based learning. A CNN usually stacks several convolution layers and is applied across computer vision and image processing, including classification and segmentation problems. Convolution is a mathematical operation in which an input I and a kernel K are combined to produce an output that expresses how the shape of one is modified by the other.

The convolutional layer is the core building block of a CNN because it performs feature detection, one of the most important functions of the network. The kernel K comprises a set of learnable filters; each is spatially small compared to the image but extends through the full depth of the input. Applied to the image I, the kernel acts as a feature detector and produces feature maps that help identify or classify the image. Multiple such detectors handle tasks like edge detection and identifying different shapes. Rather than searching for a feature across the entire image at once, it is more effective to look at small portions of the image.
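
To make the feature-detector idea concrete, here is a minimal NumPy sketch of sliding a small kernel over a grayscale image. The helper name conv2d_valid, the toy 6 × 6 image, and the choice of a Sobel kernel are illustrative assumptions, not part of any library API. Like most deep learning frameworks, the sketch computes cross-correlation, that is, it does not flip the kernel.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the window and the kernel, then sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark left half (0), bright right half (1)
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Sobel kernel that responds to vertical edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

feature_map = conv2d_valid(image, sobel_x)
print(feature_map.shape)  # (4, 4)
print(feature_map[0])     # [0. 4. 4. 0.]
```

The response is non-zero only in the columns where the 3 × 3 window straddles the dark-to-bright boundary, which is exactly what an edge detector should report.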


Convolutional neural networks have shown clear advantages in image classification. They also achieve excellent results in many visual object recognition tasks, mainly because their network structure can extract multi-level features from images. In the image classification problem, an input image is assigned one label from a fixed set of categories. This problem is central to computer vision and, despite its simplicity, has many practical applications, such as labeling skin cancer images, or detecting natural disasters like floods, volcanic eruptions, and severe droughts and assessing the damage they cause from high-resolution images.

The performance and accuracy of image classification algorithms depend on the features fed to the neural network. Progress in machine-learning-based image classification therefore relies heavily on selecting the essential features of the images that make up the dataset. Obtaining these features has become a tedious task that increases complexity and computational cost.

Convolutional neural networks have dominated visual recognition for years. More recently, Vision Transformers (ViTs) have revolutionized modeling in visual recognition and shown the great potential of self-attention-based models in ImageNet classification. However, when no extra training data is provided, their performance is still inferior to that of the latest state-of-the-art (SOTA) CNNs.

Unlike CNNs, which aggregate and transform features with local and dense convolutional kernels, ViTs directly model the long-range dependencies of local patches, also known as tokens, through the self-attention mechanism, which offers greater flexibility in modeling visual content. A major factor preventing ViTs from outperforming CNNs is their low efficacy in encoding fine-level features and contexts into token representations, which are critical for compelling visual recognition performance. Fine-level information can be encoded into tokens through finer-grained image tokenization, but this produces a longer token sequence, and the cost of self-attention in ViTs grows quadratically with the sequence length.
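
The cost of finer tokenization can be estimated with simple arithmetic. The sketch below, built around a hypothetical helper named token_stats, counts the tokens obtained by splitting a square image into non-overlapping patches and the number of entries in the full self-attention matrix. The 224 × 224 input and the 16 × 16 versus 8 × 8 patch sizes are example values chosen for illustration.

```python
def token_stats(image_size, patch_size):
    """Return (number of tokens, entries in the N x N self-attention matrix)."""
    tokens_per_side = image_size // patch_size
    num_tokens = tokens_per_side ** 2
    return num_tokens, num_tokens ** 2

coarse = token_stats(224, 16)  # 14 x 14 grid of tokens
fine = token_stats(224, 8)     # 28 x 28 grid of tokens

print(coarse)                  # (196, 38416)
print(fine)                    # (784, 614656)
print(fine[1] // coarse[1])    # 16
```

Halving the patch size quadruples the number of tokens and makes the attention matrix 16 times larger.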

What is VOLO?

VOLO is a simple yet powerful model architecture for visual recognition. It achieves fine-level token representation encoding and global information aggregation through a two-stage architectural design. Specifically, given an input image of size 224 × 224, VOLO tokenizes the image into small patches, for example 8 × 8, and employs multiple Outlookers to encode token representations at a fine level, for example on a 28 × 28 grid. The resulting token representations are more expressive and significantly improve model performance in image classification.
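
A common way to split an image into non-overlapping patch tokens is a strided convolution. The following PyTorch sketch is an illustration of that idea and not the official VOLO patch embedding module; the class name PatchEmbed and the embedding dimension of 192 are assumptions made for the example. It shows how a 224 × 224 input with 8 × 8 patches yields a 28 × 28 grid of tokens.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Map an image to a grid of patch tokens with one strided convolution."""
    def __init__(self, patch_size=8, in_chans=3, embed_dim=192):
        super().__init__()
        # kernel_size == stride, so every patch is embedded exactly once
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)               # (B, embed_dim, H / ps, W / ps)
        return x.permute(0, 2, 3, 1)   # (B, H / ps, W / ps, embed_dim)

image = torch.randn(1, 3, 224, 224)    # random stand-in for a real image
tokens = PatchEmbed()(image)
print(tokens.shape)                    # torch.Size([1, 28, 28, 192])
```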

The model has a two-stage architecture. The first stage consists of a stack of Outlookers that generate fine-level token representations. The second stage deploys a sequence of transformer blocks to aggregate global information. At the beginning of each stage, a patch embedding module maps the input to token representations of the designed shape. Each Outlooker consists of an outlook attention layer for spatial information encoding and a multi-layer perceptron (MLP) for inter-channel information interaction.
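
The layout of a single block, spatial mixing followed by a per-token MLP, can be sketched as below. This is a simplified illustration under stated assumptions rather than the official code: the class name Block, the pre-norm residual wiring, the MLP ratio of 3, and the use of nn.Identity as a placeholder for the attention layer are all choices made for the example. In an Outlooker the token mixer would be outlook attention, and in a transformer block it would be self-attention.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm residual block: token mixing, then a per-token MLP."""
    def __init__(self, dim, token_mixer, mlp_ratio=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = token_mixer   # spatial information encoding
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(        # inter-channel information interaction
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        x = x + self.token_mixer(self.norm1(x))  # residual around the mixer
        x = x + self.mlp(self.norm2(x))          # residual around the MLP
        return x

# nn.Identity() is only a placeholder for the attention layer, used to check shapes
block = Block(dim=192, token_mixer=nn.Identity())
tokens = torch.randn(1, 28, 28, 192)     # (batch, height, width, channels)
out = block(tokens)
print(out.shape)                         # torch.Size([1, 28, 28, 192])
```

Because both sub-layers are wrapped in residual connections, the block preserves the token grid shape, so blocks can be stacked freely within a stage.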

The outlook attention matrix for a local window of size K × K can be generated simply from the centre token with a linear layer followed by a reshape operation, as highlighted by the green dashed box in the architecture figure. Because the attention weights are generated from the centre token of the window and act on the neighbouring tokens and on the centre token itself, the operation is called outlook attention.

Outlook attention is simple, efficient, and easy to implement. Its main advantages are:

1) The features present at each spatial location are representative enough to generate the attention weights for locally aggregating the neighboring features.

2) Using dense and local spatial aggregation, it can encode fine-level information efficiently.

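A single-head sketch of outlook attention in PyTorch might look like this:

 import torch
 import torch.nn as nn

 class OutlookAttention(nn.Module):
     # H: height, W: width, C: channels, K: odd kernel size; x: input tensor (H, W, C)
     def __init__(self, C, H, W, K=3):
         super().__init__()
         self.C, self.H, self.W, self.K = C, H, W, K
         self.v_pj = nn.Linear(C, C)
         self.attn = nn.Linear(C, K ** 4)
         self.unfold = nn.Unfold(K, padding=K // 2)
         self.fold = nn.Fold(output_size=(H, W), kernel_size=K, padding=K // 2)

     def forward(self, x):
         C, H, W, K = self.C, self.H, self.W, self.K
         v = self.v_pj(x).permute(2, 0, 1).unsqueeze(0)  # (1, C, H, W)
         # Eqn. (3), embedding set of neighbors
         v = self.unfold(v).reshape(C, K * K, H * W).permute(2, 1, 0)  # (H*W, K*K, C)
         a = self.attn(x).reshape(H * W, K * K, K * K)
         # Eqn. (4), weighted average
         a = a.softmax(dim=-1)
         x = torch.matmul(a, v).permute(2, 1, 0).reshape(1, C * K * K, H * W)
         # Eqn. (5), fold sums the overlapping windows back onto the grid
         x = self.fold(x).squeeze(0).permute(1, 2, 0)  # (H, W, C)
         return x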
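
The equation numbers in the code comments refer to the VOLO paper. As a best-effort transcription of the paper's notation (the exact symbols should be checked against the paper), with V the value projection of the input, \hat{A}_{i,j} the K² × K² attention matrix produced from the token at position (i, j), and \Delta_{i,j} the K × K window centred at (i, j), the three steps read:

```latex
\begin{align}
V_{\Delta_{i,j}} &= \{ V_{i+p-\lfloor K/2 \rfloor,\; j+q-\lfloor K/2 \rfloor} \}, \quad 0 \le p, q < K \tag{3} \\
Y_{\Delta_{i,j}} &= \mathrm{MatMul}\big(\mathrm{Softmax}(\hat{A}_{i,j}),\, V_{\Delta_{i,j}}\big) \tag{4} \\
\tilde{Y}_{i,j} &= \sum_{0 \le m, n < K} Y^{i,j}_{\Delta_{i+m-\lfloor K/2 \rfloor,\; j+n-\lfloor K/2 \rfloor}} \tag{5}
\end{align}
```

Equation (3) gathers the values in each window, Equation (4) applies the attention weights generated from the centre token, and Equation (5) sums the contributions that overlapping windows make to the same position.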

Getting Started with Code Implementation of VOLO

This article implements basic visual recognition with VOLO: given an input image, the model automatically predicts its label. The code follows the official implementation published by the VOLO authors on GitHub.

First Steps

First, we install the libraries required to create our model: two PyTorch image model libraries, timm and tlt. We then clone the VOLO repository and download the pretrained weights. Run the following commands:

 !pip install timm tlt  # install the image model libraries
 !git clone  # clone the VOLO repository
 !wget  sg/volo/releases/download/volo_1/d1_224_84.2.pth.tar

Installing the Dependencies

Next, we import the model functions and load the pretrained weights with the following code:

 # Load the dependencies (assumes the working directory is the cloned VOLO repository)
 import models
 from PIL import Image
 from tlt.utils import load_pretrained_weights
 from timm.data import create_transform

 # Build VOLO-D1 and load the pretrained weights
 model = models.volo_d1(img_size=224)
 load_pretrained_weights(model=model, checkpoint_path='../d1_224_84.2.pth.tar')
 model.eval()  # inference mode: disables dropout
 transform = create_transform(input_size=224, crop_pct=model.default_cfg['crop_pct'])

 # Open the input image to be classified and preprocess it
 image = Image.open('/content/b0ebe8d60592c03cac61bf1e28373705.jpg')
 input_image = transform(image).unsqueeze(0)  # add a batch dimension

This will be our input image:

Predicting the Label

Load the ImageNet class labels used for image classification (the dictionary is abbreviated here):

 imagenet_classes = {0: 'tench, Tinca tinca',
  1: 'goldfish, Carassius auratus',
  2: 'great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias',
  3: 'tiger shark, Galeocerdo cuvieri',
  4: 'hammerhead, hammerhead shark',
  5: 'electric ray, crampfish, numbfish, torpedo',
  6: 'stingray',
  7: 'rooster',
  8: 'hen',
  9: 'ostrich, Struthio camelus',
  10: 'brambling, Fringilla montifringilla',
  11: 'goldfinch, Carduelis carduelis',
  12: 'house finch, linnet, Carpodacus mexicanus',
  13: 'junco, snowbird',
  14: 'indigo bunting, indigo finch, indigo bird, Passerina cyanea',
  15: 'robin, American robin, Turdus migratorius',
  # ... entries 16 to 993 omitted for brevity; the full dictionary has 1000 entries ...
  994: 'stinkhorn, carrion fungus',
  995: 'earthstar',
  996: 'hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa',
  997: 'bolete',
  998: 'ear, spike, capitulum',
  999: 'toilet tissue, toilet paper, bathroom tissue'} 

Make a label prediction for our input image:

 pred = model(input_image)
 print(f'Prediction: {imagenet_classes[int(pred.argmax())]}.')

Output:

 Prediction: tiger, Panthera tigris. 
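
The model returns a vector of raw scores (logits), one per class, so argmax reports only the single best label. As a hedged illustration, the sketch below converts logits to probabilities with softmax and lists the three most likely classes. The logits and the five-entry label dictionary are made-up stand-ins, not output from VOLO.

```python
import torch

# Hypothetical logits for a 5-class problem (batch of one image)
logits = torch.tensor([[0.5, 2.0, -1.0, 3.0, 0.0]])
labels = {0: 'class_a', 1: 'class_b', 2: 'class_c', 3: 'class_d', 4: 'class_e'}

probs = logits.softmax(dim=-1)               # probabilities that sum to 1
top_probs, top_idx = probs.topk(3, dim=-1)   # three highest-scoring classes

for p, i in zip(top_probs[0], top_idx[0]):
    print(f'{labels[int(i)]}: {float(p):.3f}')
```

With the real model, the same two lines applied to pred in place of logits would give a ranked list instead of a single label, which is useful when the top prediction is uncertain.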


This article explored the VOLO model and how its outlook attention addresses the weakness of ViTs in encoding fine-level features. We also got a basic hands-on feel for building an image recognizer and predicted the label of an input image. You can apply the VOLO model to more complex problems and extend it to other vision tasks as well. The above implementation is also available as a Colab notebook.

Happy Learning!


Victor Dey
Victor is an aspiring Data Scientist & is a Master of Science in Data Science & Big Data Analytics. He is a Researcher, a Data Science Influencer and also an Ex-University Football Player. A keen learner of new developments in Data Science and Artificial Intelligence, he is committed to growing the Data Science community.
