Tech Behind Google’s New CNN, EfficientNetV2

EfficientNetV2 can train up to 11x faster than prior models, while being up to 6.8x smaller in parameter size.

Recently, Google introduced a family of convolutional networks known as EfficientNetV2. According to its developers, the EfficientNetV2 models significantly outperform previous models on the ImageNet and CIFAR/Cars/Flowers datasets.

There are many existing techniques to improve training efficiency. For example, ResNet-RS improves training efficiency by optimising the scaling hyperparameters, and Vision Transformers improve training efficiency on large-scale datasets by using Transformer blocks. However, these techniques often come with expensive overhead in parameter size. This is why Google released this new family of convolutional networks.

EfficientNetV2 vs EfficientNet

EfficientNetV2 is the successor of EfficientNet. Introduced in 2019, EfficientNet is a family of models optimised for FLOPs and parameter efficiency. It leverages neural architecture search to find the baseline EfficientNet-B0 model, which offers a better trade-off between accuracy and FLOPs.
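EfficientNet's compound scaling ties network depth, width, and input resolution to a single coefficient. As a rough sketch: the coefficients α=1.2, β=1.1, γ=1.15 are the values reported in the original EfficientNet paper, while the baseline depth/width/resolution numbers below are purely illustrative:

```python
# Sketch of EfficientNet-style compound scaling.
# alpha, beta, gamma scale depth, width, and resolution respectively;
# phi is the single compound coefficient that controls overall capacity.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi, base_depth=18, base_width=32, base_resolution=224):
    """Scale depth, width, and resolution together with one coefficient phi."""
    depth = round(base_depth * ALPHA ** phi)
    width = round(base_width * BETA ** phi)
    resolution = round(base_resolution * GAMMA ** phi)
    return depth, width, resolution

print(compound_scale(0))  # baseline network: (18, 32, 224)
print(compound_scale(1))  # one compound step up: (22, 35, 258)
```

Scaling all three dimensions through one knob is what lets the family grow from B0 to larger variants without a separate search for each size.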

EfficientNetV2 overcomes some of the training bottlenecks in EfficientNet, such as:

  • Training with enormous image sizes is slow: EfficientNet's large image sizes result in significant memory usage. Since the total memory on GPUs and TPUs is fixed, the researchers had to train the EfficientNet models with smaller batch sizes, which slows down training.
  • Depthwise convolutions are slow in early layers: Another training bottleneck of EfficientNet comes from its extensive use of depthwise convolutions. Depthwise convolutions have fewer parameters and FLOPs than regular convolutions, but they often cannot fully utilise modern accelerators.
  • Equally scaling up every stage is sub-optimal: EfficientNet scales up all stages equally using a simple compound scaling rule. However, these stages do not contribute equally to training speed and parameter efficiency.
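The parameter gap behind the second bullet is easy to see with a quick count. A depthwise separable layer (a depthwise k×k convolution followed by a 1×1 pointwise convolution) uses far fewer parameters than a standard convolution, which is exactly why it can under-utilise hardware that thrives on dense compute (a back-of-envelope sketch, biases ignored):

```python
def conv_params(k, c_in, c_out):
    """Parameters of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k convolution followed by a 1x1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

# Example: a 3x3 layer with 64 input and 128 output channels.
print(conv_params(3, 64, 128))                 # 73728
print(depthwise_separable_params(3, 64, 128))  # 8768
```

Roughly an 8x reduction in parameters here, but each depthwise channel is processed independently, which limits the arithmetic intensity accelerators can exploit.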

Based on these observations, the researchers designed a search space enriched with additional ops such as Fused-MBConv, and applied training-aware NAS and scaling to jointly optimise model accuracy, training speed, and parameter size. EfficientNets also aggressively scale up image size, leading to large memory consumption and slow training; to address this, the researchers slightly modified the scaling rule and capped the maximum image size at a smaller value.
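The Fused-MBConv op replaces MBConv's 1×1 expansion convolution and depthwise convolution with a single regular 3×3 convolution. A rough parameter count (a sketch, with an illustrative expansion ratio of 4 and squeeze-excitation omitted) shows the trade-off: Fused-MBConv uses more parameters but maps much better onto modern accelerators in the early stages, which is why the searched architecture mixes both block types:

```python
def mbconv_params(c, expand=4, k=3):
    """MBConv: 1x1 expand + depthwise k x k + 1x1 project (SE omitted)."""
    e = c * expand
    return c * e + k * k * e + e * c

def fused_mbconv_params(c, expand=4, k=3):
    """Fused-MBConv: regular k x k expand conv + 1x1 project."""
    e = c * expand
    return k * k * c * e + e * c

# Example: a block with 64 channels.
print(mbconv_params(64))        # 35072
print(fused_mbconv_params(64))  # 163840
```

Fused-MBConv costs more parameters and FLOPs, but the single dense convolution keeps the hardware busy, so it is faster in early, small-channel stages; the training-aware search decides stage by stage which variant wins.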

Tech behind EfficientNetV2

The sizes of deep learning models and training datasets are growing rapidly, making training efficiency increasingly important. For instance, the GPT-3 model, with its unprecedented model and training-data sizes, demonstrates impressive few-shot learning. However, it requires weeks of training with thousands of GPUs, making it difficult to retrain or improve the model.

The researchers used a combination of training-aware neural architecture search (NAS) and scaling to optimise the training speed and parameter efficiency to develop this model. 


  • The researchers have introduced EfficientNetV2, a new family of smaller and faster models. EfficientNetV2 model outperformed previous models in training speed and parameter efficiency.
  • The researchers have proposed an improved method of progressive learning, which adaptively adjusts regularisation and image size. The researchers also showed that it speeds up training and simultaneously improves accuracy.
  • The researchers have demonstrated that the new model achieves up to 11x faster training speed and up to 6.8x better parameter efficiency on the ImageNet, CIFAR, Cars, and Flowers datasets.
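The adaptive schedule described in the second point can be sketched as a simple interpolation: start training with small images and weak regularisation, and grow both together as training progresses. The ranges below are illustrative placeholders, not the paper's exact values, and dropout stands in for the full set of regularisers (dropout, RandAugment, mixup):

```python
def progressive_schedule(step, total_steps,
                         size_range=(128, 300),
                         dropout_range=(0.1, 0.3)):
    """Linearly grow image size and regularisation strength together."""
    t = step / max(total_steps - 1, 1)
    size = round(size_range[0] + t * (size_range[1] - size_range[0]))
    dropout = dropout_range[0] + t * (dropout_range[1] - dropout_range[0])
    return size, dropout

# Early training: small images, weak regularisation.
print(progressive_schedule(0, 100))   # (128, 0.1)
# Late training: full-size images, strong regularisation.
print(progressive_schedule(99, 100))  # (300, ~0.3)
```

The intuition is that small images carry less information, so they need less regularisation; matching the two avoids the accuracy drop that plain progressive resizing causes.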

Wrapping up

EfficientNets use NAS to construct a baseline network and use "compound scaling" to increase the capacity of the network without adding more parameters. Training can be accelerated by progressively increasing the image size during training, but this leads to a drop in accuracy. To make up for this accuracy drop, the researchers proposed an improved method of progressive learning, which adaptively adjusts regularisation along with image size. By pretraining on the same ImageNet21k dataset, their EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while training 5x-11x faster using the same computing resources.

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
