Recently, Google introduced a family of convolutional networks known as EfficientNetV2. According to its developers, the EfficientNetV2 model significantly outperformed previous models on ImageNet and CIFAR/Cars/Flowers datasets.
Many techniques already exist to improve training efficiency. For example, ResNet-RS improves training efficiency by optimising the scaling hyperparameters, and Vision Transformers improve training efficiency on large-scale datasets by using Transformer blocks. However, these techniques often come with an expensive overhead in parameter size. This is why Google released this new family of convolutional networks.
EfficientNetV2 vs EfficientNet
EfficientNetV2 is the successor to EfficientNet. Introduced in 2019, EfficientNet is a family of models optimised for FLOPs and parameter efficiency. It uses neural architecture search to find the baseline EfficientNet-B0 model, which offers a better trade-off between accuracy and FLOPs.
EfficientNetV2 overcomes some of the training bottlenecks in EfficientNet, such as:
- Training with enormous image sizes is slow: The large image sizes used by EfficientNet result in significant memory usage. As the total memory on a GPU or TPU is fixed, the researchers had to train the EfficientNet models with a smaller batch size, which slows down training.
- Depthwise convolutions are slow in early layers: Another training bottleneck of EfficientNet comes from the extensive depthwise convolutions. Depthwise convolutions have fewer parameters and FLOPs than regular convolutions, but they often cannot fully utilise modern accelerators.
- Equally scaling up every stage is sub-optimal: EfficientNet scales up all stages equally using a simple compound scaling rule. However, these stages do not contribute equally to training speed and parameter efficiency.
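To make the second bottleneck concrete, here is a minimal Python sketch that counts weights for a regular convolution and for a depthwise convolution followed by a 1x1 pointwise convolution. The function names and the example layer (3x3 kernel, 64 input and output channels) are illustrative assumptions, not figures from the EfficientNetV2 work, and bias terms are ignored.

```python
def regular_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def depthwise_conv_params(kernel: int, channels: int) -> int:
    """Weights in a depthwise k x k convolution: one filter per channel."""
    return kernel * kernel * channels


def pointwise_conv_params(c_in: int, c_out: int) -> int:
    """Weights in a 1 x 1 convolution that mixes channels."""
    return c_in * c_out


# Illustrative layer: 3x3 kernel, 64 channels in and out (assumed values).
regular = regular_conv_params(3, 64, 64)
separable = depthwise_conv_params(3, 64) + pointwise_conv_params(64, 64)

print(f"regular 3x3 conv:        {regular} weights")
print(f"depthwise + 1x1 conv:    {separable} weights")
print(f"reduction factor:        {regular / separable:.1f}x")
```

A lower weight count does not translate directly into faster training: as the bullet above notes, depthwise convolutions often cannot fully utilise modern accelerators.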
Based on these observations, the researchers designed a search space enriched with additional ops such as Fused-MBConv, and applied training-aware NAS and scaling to jointly optimise model accuracy, training speed, and parameter size. EfficientNets also scale up image size aggressively, which leads to large memory consumption and slow training. To address this, the researchers slightly modified the scaling rule and restricted the maximum image size to a smaller value.
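A rough sketch of compound scaling with a capped image size might look like the following. The coefficients 1.2, 1.1, and 1.15 are the depth, width, and resolution constants reported for the original EfficientNet. The base resolution of 224, the cap of 480, and the names `compound_scale` and `ScaledConfig` are assumptions made for illustration, and the sketch does not reproduce the exact modified rule used for EfficientNetV2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScaledConfig:
    depth_mult: float   # multiplier on the number of layers
    width_mult: float   # multiplier on the number of channels
    resolution: int     # input image side length in pixels


def compound_scale(
    phi: int,
    base_resolution: int = 224,           # assumed baseline image size
    alpha: float = 1.2,                   # depth coefficient (EfficientNet)
    beta: float = 1.1,                    # width coefficient (EfficientNet)
    gamma: float = 1.15,                  # resolution coefficient (EfficientNet)
    max_resolution: Optional[int] = None  # optional cap on image size
) -> ScaledConfig:
    """Scale depth, width, and resolution together with one exponent phi."""
    resolution = int(round(base_resolution * gamma ** phi))
    if max_resolution is not None:
        # Restrict image size to limit memory use and training time.
        resolution = min(resolution, max_resolution)
    return ScaledConfig(alpha ** phi, beta ** phi, resolution)


for phi in range(0, 8):
    uncapped = compound_scale(phi)
    capped = compound_scale(phi, max_resolution=480)  # 480 is an assumed cap
    print(phi, uncapped.resolution, capped.resolution)
```

With the cap in place, larger values of `phi` still grow depth and width, but the image size stops increasing once it reaches the limit.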
Tech behind EfficientNetV2
Deep learning models and training datasets keep getting larger, so training efficiency plays an increasingly important role. For instance, GPT-3, with its unprecedented model and training data sizes, demonstrates few-shot learning. However, it requires weeks of training with thousands of GPUs, making it difficult to retrain or improve the model.
To develop EfficientNetV2, the researchers combined training-aware neural architecture search (NAS) and scaling to optimise training speed and parameter efficiency.
Contributions
- The researchers have introduced EfficientNetV2, a new family of smaller and faster models. According to the researchers, the EfficientNetV2 models outperform previous models in training speed and parameter efficiency.
- The researchers have proposed an improved method of progressive learning, which adaptively adjusts regularisation and image size. The researchers also showed that it speeds up training and simultaneously improves accuracy.
- The researchers have demonstrated that the new models achieve up to 11x faster training speed and up to 6.8x better parameter efficiency on the ImageNet, CIFAR, Cars, and Flowers datasets.
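The progressive learning idea in the second contribution can be sketched as a schedule that raises image size and regularisation strength together over the course of training. This is a minimal illustration that assumes linear interpolation across a fixed number of stages. The function name `progressive_schedule`, the four stages, and the example ranges (image size 128 to 300, dropout 0.1 to 0.3) are assumptions chosen for the example rather than the researchers' exact settings.

```python
from typing import List, Tuple


def progressive_schedule(
    num_stages: int,
    size_start: int,
    size_end: int,
    reg_start: float,
    reg_end: float,
) -> List[Tuple[int, float]]:
    """Return (image_size, regularisation) for each training stage.

    Early stages use small images with weak regularisation; later stages
    use larger images with stronger regularisation.
    """
    schedule = []
    for stage in range(num_stages):
        # Fraction of the way through training, from 0.0 to 1.0.
        t = stage / (num_stages - 1) if num_stages > 1 else 1.0
        size = int(round(size_start + (size_end - size_start) * t))
        reg = reg_start + (reg_end - reg_start) * t
        schedule.append((size, reg))
    return schedule


# Illustrative values: image size 128 -> 300, dropout rate 0.1 -> 0.3.
for stage, (size, dropout) in enumerate(progressive_schedule(4, 128, 300, 0.1, 0.3)):
    print(f"stage {stage}: image size {size}, dropout {dropout:.2f}")
```

In a real training loop, each stage would resize the input pipeline and update the regularisation settings (for example dropout or data augmentation strength) before continuing from the current weights.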
Wrapping up
EfficientNets use NAS to construct a baseline network and use “compound scaling” to increase the capacity of the network by jointly scaling its depth, width, and image resolution. Training can be accelerated by progressively increasing the image size during training, but this leads to a drop in accuracy. To make up for this accuracy drop, the researchers proposed an improved method of progressive learning, which adaptively adjusts regularisation along with image size. According to the researchers, with pretraining on the same ImageNet21k dataset, EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while training 5x-11x faster using the same computing resources.