How To Build Smaller, Faster, Better Deep Learning Models

Google researcher Gaurav Menghani proposed a method to make ‘deep learning models smaller, faster, and better.
efficient deep learning models

Deep learning has broad applications in sentiment analysis, natural language understanding, computer vision, etc. The technology is growing at a breakneck speed on the back of rapid innovation. However, such innovations call for a higher number of parameters and resources. In other words, the model is as good as the metrics.

To that end, Google researcher Gaurav Menghani has published a paper on model efficiency. The survey covers the landscape of model efficiency from modelling techniques to hardware support. He proposed a method to make ‘deep learning models smaller, faster, and better’.


Menghani argues that while larger and more complicated models perform well on the tasks they are trained on, they may not show the same performance when applied to real-life situations.

Following are the challenges practitioners face while training and deploying models:

  • The cost of training and deploying large deep learning models is high. The large models are memory-intensive and leave a bigger carbon footprint.
  • A few deep learning applications need to run in real-time on IoT and smart devices. This calls for optimisation of models for specific devices.
  • Building training models with as little data as possible when the user data might be sensitive. 
  • Off the shelf models may not always be able to address the constraints of new applications.
  • Training and deployment of multiple models on the same infrastructure for different applications may exhaust available resources.

A mental model

Menghani presents a mental model complete with algorithms, techniques, and tools for efficient deep learning. It has five major areas: compression techniques, learning techniques, automation, efficient architectures, and infrastructure (hardware).

Mental model

Compression techniques: These algorithms and techniques optimise the model’s architecture, typically by compressing its layers. One of the most popular examples of compression technique is quantisation, where the weight of a layer is compressed by reducing its precision with minimum loss in quality.

Learning techniques: These algorithms are focused on training the model to make fewer prediction errors, require less data, and converge faster. It also gives scope for trimming the parameters for obtaining a smaller footprint or a more efficient model. One example of this is distillation that allows accuracy improvement of a smaller model by teaching it to mimic a larger one.

 Automation: It helps in improving the core metrics of a given model. Hyper-parameter optimisation (HPO) is an example of an automation tool where optimising hyper-parameters increases accuracy. Another technique is architecture search, where the model architecture is tuned and the search helps find a model that optimises both loss and accuracy.

Efficient Architectures: These are fundamental blocks designed from scratch. Such models are superior to the baseline methods such as connected, fully connected layers and RNNs.

Infrastructure: Building an efficient model also needs a strong foundation of infrastructure and tools. It includes model training frameworks such as TensorFlow, PyTorch, etc.

Guide to efficiency

Menghani said the ultimate goal is to build Pareto-optimal models. Pareto efficiency means the resources are allocated in the most economically efficient way. To build Pareto-optimal models, practitioners must achieve the best possible result in one dimension while holding the other constant. Typically, one of these dimensions is quality (metrics such as accuracy, precision, and recall) and the other one is footprint (metrics such as model size, latency, RAM)

He suggests two methods to achieve the Pareto-optimal model:

Shrink and improve footprint sensitive models: This is a useful strategy for practitioners who want to reduce the footprint of the model without compromising on the quality. Shrinking can be useful for on-device deployment and server-side optimisation. Ideally, it should be minimally lossy, but in a few cases, reducing capacity can be compensated by the improve phase.

Grow-improve and shrink for quality sensitive models: This strategy can be used when a practitioner wants to deploy better quality models while keeping the same footprint. Here the capacity is first added by growing the model and then improved via learning techniques, automation etc; post this, the model is shrunk back.

Read the full paper here.

Download our Mobile App

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at

Subscribe to our newsletter

Join our editors every weekday evening as they steer you through the most significant news of the day.
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.

Our Upcoming Events

15th June | Bangalore

Future Ready | Lead the AI Era Summit

15th June | Online

Building LLM powered applications using LangChain

17th June | Online

Mastering LangChain: A Hands-on Workshop for Building Generative AI Applications

20th June | Bangalore

Women in Data Science (WiDS) by Intuit India

Jun 23, 2023 | Bangalore

MachineCon 2023 India

26th June | Online

Accelerating inference for every workload with TensorRT

MachineCon 2023 USA

Jul 21, 2023 | New York

Cypher 2023

Oct 11-13, 2023 | Bangalore

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox