RepVGG: Can You Make Simple Architectures Great Again?

Complicated is better; at least, that seems to be the case for ConvNets. With Inception, ResNets, DenseNets, and even automatic/manual architecture search, the field has shifted away from simple stacked architectures toward more complicated designs. Although these complex multi-branch architectures achieve higher accuracies, they have significant drawbacks as well:

  • The multi-branch design of these architectures is hard to implement and hard to tweak for a given task.
  • More complex architectures incur substantially higher memory access costs; this reduces memory utilization and slows down inference.
  • Complex model topology restricts the application of optimization techniques like channel pruning and filter pruning.

In January 2021, Xiaohan Ding, Xiangyu Zhang, et al. published a paper describing a simple ConvNet architecture that combines the accuracy of multi-branch topologies with the simplicity of the VGG topology: “RepVGG: Making VGG-style ConvNets Great Again”.

One of the reasons multi-branch topologies perform better is that the branches effectively make the model an ensemble of numerous shallower networks. RepVGG retains this benefit while keeping inference simple by decoupling the training-time multi-branch topology from the inference-time architecture using structural re-parameterization.

Difference between RepVGG training and inference time architecture

“Notably, this paper is not merely a demonstration that plain models can converge reasonably well and does not intend to train extremely deep ConvNets like ResNets. Rather, we aim to build a simple model with reasonable depth and favourable accuracy-speed trade-off, which can be simply implemented with the most common components (e.g., regular conv and BN) and simple algebra.”

Model Reparameterization

Every network structure has an associated set of parameters, e.g., a conv layer is represented by a fourth-order tensor. If the parameters of a particular structure can be transformed into the parameters of a different structure, we can replace the former with the latter. In RepVGG, the identity and 1 x 1 branches of the training-time architecture are re-parameterized to create a simpler inference-time architecture consisting of only 3 x 3 conv and ReLU layers.

One RepVGG block consists of a 3 x 3 conv, a 1 x 1 conv, an identity mapping, and three bias vectors corresponding to each of these. The identity mapping can be viewed as a 1 x 1 conv with an identity matrix as the kernel; after this transformation we have one 3 x 3 kernel, two 1 x 1 kernels, and three bias vectors. The 1 x 1 kernels are then transformed into 3 x 3 kernels by zero-padding, and we obtain the final conv block by adding all the 3 x 3 kernels together and summing the three bias vectors.
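The kernel algebra described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repo's implementation, and it assumes batch norm has already been folded into each branch's kernel and bias:

```python
import numpy as np

def pad_1x1_to_3x3(k1x1):
    # Zero-pad a (out, in, 1, 1) kernel to (out, in, 3, 3),
    # placing the 1x1 weight at the centre tap.
    out_c, in_c = k1x1.shape[:2]
    k3x3 = np.zeros((out_c, in_c, 3, 3), dtype=k1x1.dtype)
    k3x3[:, :, 1, 1] = k1x1[:, :, 0, 0]
    return k3x3

def identity_as_3x3(channels, dtype=np.float32):
    # The identity mapping viewed as a 1x1 conv whose kernel is the
    # identity matrix, then padded to 3x3 in the same way.
    k = np.zeros((channels, channels, 3, 3), dtype=dtype)
    for c in range(channels):
        k[c, c, 1, 1] = 1.0
    return k

# Toy block with 2 input and 2 output channels.
k3 = np.random.randn(2, 2, 3, 3).astype(np.float32)   # 3x3 branch
k1 = np.random.randn(2, 2, 1, 1).astype(np.float32)   # 1x1 branch
b3, b1, bid = np.random.randn(3, 2).astype(np.float32)  # three biases

# The single inference-time conv: sum of the three 3x3 kernels and biases.
fused_kernel = k3 + pad_1x1_to_3x3(k1) + identity_as_3x3(2)
fused_bias = b3 + b1 + bid
```

Because convolution is linear in its kernel, applying `fused_kernel` once is equivalent to running the three branches and summing their outputs.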

Architecture 

Much like its namesake VGG, RepVGG uses 3 x 3 convs extensively, but it forgoes max pooling for the sake of simplicity. These 3 x 3 layers are organized into five stages, and the first layer of every stage downsamples with a stride of 2. To decrease computational complexity and inference time, the two most time-consuming stages are kept shallow: the first stage deals with high-resolution inputs and the last stage has more channels, so both have more parameters per layer and are restricted to one layer each. And much like the ResNets it is built to compete with, RepVGG places the majority of its layers in the penultimate stage.

There are two variants of RepVGG: RepVGG-A has 1, 2, 4, 14, 1 layers in its five stages, and the deeper RepVGG-B has 1, 4, 6, 16, 1 layers.

RepVGG architecture

The layer width of the different stages is scaled using two multipliers: a for the first four stages and b for the last stage. b is usually set greater than a to increase the quality of the features passed down by the final stage. Every stage apart from the first can be scaled up; the first stage is only ever scaled down, to avoid large convolution operations on the high-resolution input feature map.
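As a concrete illustration of this scaling rule, the following sketch computes per-stage channel counts. It assumes the base widths 64, 64, 128, 256, 512 and the RepVGG-A0 multipliers a = 0.75, b = 2.5 from the paper's configuration table:

```python
def stage_widths(a, b):
    # Channel count per stage: the first stage is only scaled down
    # (capped at 64), the last stage uses the larger multiplier b.
    return [min(64, int(64 * a)),
            int(64 * a),
            int(128 * a),
            int(256 * a),
            int(512 * b)]

# RepVGG-A0 uses a = 0.75, b = 2.5
print(stage_widths(0.75, 2.5))  # → [48, 48, 96, 192, 1280]
```

Note how with a > 1 the first stage stays at 64 channels while every other stage grows.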

How well does RepVGG hold up?

To compare RepVGG with state-of-the-art models like ResNets, EfficientNets, and ResNeXt a plethora of RepVGG instances were created with varying layer widths to match the complexities of the baseline models. 

RepVGG architectures of different widths

Accuracy vs speed plot of RepVGG and baseline models

RepVGG beats its peers not only in accuracy but also in speed. What’s even more noteworthy is that the RepVGG-B3 models reached accuracies above 80%; according to the paper, this is the first time plain models have achieved this feat.

Using RepVGG 


  1. Download or clone the repo

!git clone https://github.com/DingXiaoH/RepVGG

  2. Choose the network variant (A or B) and width, get the corresponding pre-trained model from here, or train it yourself.
python train.py -a RepVGG-A0 --multiprocessing-distributed --world-size 1 --rank 0 --workers 32 [folder with train and val folders]
  3. Convert the training-time model to the inference-time model and use it. Only do this conversion after you are done with any further training and fine-tuning.

Converting the model in the command line:

 python convert.py RepVGG-A0-train.pth RepVGG-A0-deploy.pth -a RepVGG-A0

Then load the converted weights:

 import torch
 from repvgg import create_RepVGG_A0
 deploy_model = create_RepVGG_A0(deploy=True)
 deploy_model.load_state_dict(torch.load('RepVGG-A0-deploy.pth'))
 … 

Converting the model in code:

 import torch
 from repvgg import repvgg_model_convert, create_RepVGG_A0
 train_model = create_RepVGG_A0(deploy=False)
 train_model.load_state_dict(torch.load('RepVGG-A0-train.pth'))
 deploy_model = repvgg_model_convert(train_model, create_RepVGG_A0, save_path='repvgg_deploy.pth')
 … 

If you want to use RepVGG as a component of another model, using the whole_model_convert method in repvgg.py is recommended.

 import torch
 from repvgg import whole_model_convert, create_RepVGG_A0
 train_backbone = create_RepVGG_A0(deploy=False)
 train_backbone.load_state_dict(torch.load('RepVGG-A0-train.pth'))
 train_pspnet = build_pspnet(backbone=train_backbone)
 train(train_pspnet)
 deploy_backbone = create_RepVGG_A0(deploy=True)
 deploy_pspnet = build_pspnet(backbone=deploy_backbone)
 whole_model_convert(train_pspnet, deploy_pspnet)
 test(deploy_pspnet)

Endnote

On the one hand, we have architectures like GPT-3 redefining what we can achieve using deep learning. On the other hand, we have complicated architectures for basic problems (like image classification) that are slow, hard to customize, and excessively computation-intensive. We need more plain, easy-to-customize solutions like RepVGG to help narrow the gap between the performance of resource-intensive models and that of simpler, more accessible ones.
