
What Is Weight Sharing In Deep Learning And Why Is It Important


Neural architecture search (NAS) deals with automatically selecting neural network architectures for specific learning problems. NAS is central to automating and democratising machine learning, but it is computationally expensive. The initial success of NAS was attributed partly to the weight-sharing method, which dramatically accelerated the evaluation of candidate architectures. So why is the weight-sharing method being criticised?

Brief Overview Of Weight Sharing

Traditionally, NAS methods were expensive because the search space is combinatorially large, requiring thousands of neural networks to be trained to completion. In 2018, the ENAS (Efficient Neural Architecture Search) paper introduced the idea of weight sharing, in which only one shared set of model parameters is trained for all architectures.

These shared weights are used to compute the validation losses of different architectures, which then serve as estimates of the validation losses those architectures would achieve if trained on their own. Since only one set of parameters has to be trained, weight sharing led to a massive speedup over earlier methods, reducing search time on CIFAR-10 from 2,000-20,000 GPU-hours to just 16.

The validation accuracies computed using shared weights correlated well enough with standalone accuracies to find good models cheaply. However, that correlation alone does not establish that weight sharing works well in general.
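To make the idea concrete, here is a minimal, hypothetical PyTorch sketch (the names SharedLayer, SuperNet and estimate_accuracy are illustrative, not from ENAS): every architecture is a choice of one candidate op per layer, and all architectures reuse the same shared parameters to produce a proxy validation accuracy.

```python
# A toy supernet: each layer holds several candidate ops, and every
# architecture (one op choice per layer) reuses the same shared parameters.
import torch
import torch.nn as nn

class SharedLayer(nn.Module):
    """One supernet layer: candidate ops whose weights are shared by all architectures."""
    def __init__(self, dim):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Linear(dim, dim),                             # candidate op 0
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),   # candidate op 1
            nn.Identity(),                                   # candidate op 2 (skip)
        ])

    def forward(self, x, choice):
        return self.ops[choice](x)

class SuperNet(nn.Module):
    def __init__(self, dim=32, depth=4, num_classes=10):
        super().__init__()
        self.layers = nn.ModuleList([SharedLayer(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, arch):
        # `arch` is a tuple of op indices, one per layer; every architecture
        # in the search space is served by this single set of parameters.
        for layer, choice in zip(self.layers, arch):
            x = layer(x, choice)
        return self.head(x)

@torch.no_grad()
def estimate_accuracy(supernet, arch, val_loader):
    """Proxy validation accuracy of `arch`, computed with the shared weights."""
    correct = total = 0
    for x, y in val_loader:
        correct += (supernet(x, arch).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```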

The method has come under scrutiny for performing poorly as a substitute for full model training, and has been alleged to be inconsistent with results on recent benchmarks.

Making A Case For Weight Sharing


The technique of sharing parameters among child models allowed efficient NAS to deliver strong empirical performance while using far fewer GPU-hours than existing automatic model design approaches, notably about 1,000x less expensive than standard neural architecture search.

The most popular implementation of shared weights as substitutes for standalone weights is the Random Search with Weight-Sharing (RS-WS) method, in which the shared parameters are optimised by taking gradient steps using architectures sampled uniformly at random from the search space.
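A minimal sketch of an RS-WS-style training loop, assuming the toy SuperNet from the earlier sketch: each minibatch is trained through an architecture drawn uniformly at random, so only the single set of shared parameters is ever updated.

```python
# RS-WS-style shared-weight training: every minibatch flows through an
# architecture sampled uniformly at random, and only the shared parameters
# of the supernet are updated.
import random
import torch
import torch.nn.functional as F

def train_shared_weights(supernet, train_loader, depth=4, num_ops=3,
                         epochs=10, lr=0.025):
    opt = torch.optim.SGD(supernet.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in train_loader:
            arch = tuple(random.randrange(num_ops) for _ in range(depth))
            opt.zero_grad()
            loss = F.cross_entropy(supernet(x, arch), y)
            loss.backward()
            opt.step()

# After training, one would sample candidate architectures, score each with
# estimate_accuracy() above, and fully retrain only the top-ranked ones.
```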

However, practitioners started to wonder whether sharing weights between models actually accelerates NAS.

In an attempt to address this issue and to make a case for the weight-sharing mechanism, researchers at CMU published a paper listing their findings. It states that most criticism of weight sharing centres on rank disorder, which occurs when the shared-weight performance of architectures does not correlate well with their standalone performance.

Rank disorder is a problem for methods that rely on shared-weight performance to rank architectures for evaluation, as it causes them to ignore networks that achieve high accuracy when their parameters are trained without sharing.

Figure (via CMU): illustration of the rank-disorder issue, with shared-weight performance on the right and individual weights trained from scratch on the left.
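One simple way to quantify rank disorder is to compute the Spearman rank correlation between shared-weight (proxy) accuracies and standalone accuracies. The sketch below is illustrative only; the accuracy values are placeholders, not results from the paper.

```python
# Illustrative rank-disorder check: compare the ranking induced by shared-weight
# (proxy) accuracies with the ranking from standalone training. A Spearman
# correlation near 1 means the proxy preserves the true ranking; a low or
# negative value signals rank disorder. The accuracies below are placeholders.
from scipy.stats import spearmanr

proxy_acc      = [0.71, 0.69, 0.74, 0.66, 0.72]   # estimated with shared weights
standalone_acc = [0.93, 0.94, 0.91, 0.90, 0.95]   # each model trained from scratch

rho, _ = spearmanr(proxy_acc, standalone_acc)
print(f"Spearman rank correlation: {rho:.2f}")
```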

To tackle this, the researchers present a unifying framework for designing and analysing gradient-based NAS methods that exploit the underlying problem structure to find high-performance architectures quickly. The geometry-aware framework, the researchers write, yields algorithms that:

  • enjoy faster convergence guarantees than existing gradient-based methods, and
  • achieve state-of-the-art accuracy on the latest NAS benchmarks in computer vision.
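At the core of such a geometry-aware approach is an exponentiated-gradient (mirror-descent) update on architecture parameters that live on a probability simplex. The sketch below illustrates that update rule only; it is not the authors' implementation, and the tensor shapes are assumptions.

```python
# Illustrative exponentiated-gradient step on architecture parameters.
# `theta` holds one probability distribution over candidate ops per edge;
# the multiplicative update keeps each row on the simplex and drives it
# toward a (near) one-hot choice.
import torch

def exponentiated_gradient_step(theta, grad, lr=0.1):
    """theta: (num_edges, num_ops) rows on the simplex; grad: dLoss/dtheta."""
    theta = theta * torch.exp(-lr * grad)           # multiplicative update
    return theta / theta.sum(dim=1, keepdim=True)   # re-normalise each row

# Usage (hypothetical): after a forward/backward pass of a supernet whose ops
# are mixed according to `theta`,
#   theta = exponentiated_gradient_step(theta.detach(), theta.grad)
```

Compared with taking ordinary gradient steps on softmax logits, this multiplicative update respects the simplex geometry, which is what the researchers credit for the faster convergence to sparse architecture weights.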

The results show that this new framework outperforms the previous best methods for both CIFAR and ImageNet, on both the DARTS search space and NAS-Bench-201.

Key Takeaways

According to the authors, the work on weight-sharing methods sets out to establish the following:

  • The success of weight-sharing methods should not be surprising, given the ML community's inclination towards non-convex optimisation of over-parameterised models.
  • Rank disorder should not be a concern, since obtaining high-quality architectures is a higher priority than ranking them.
  • The sometimes-poor performance of weight sharing is a result of optimisation issues that can be fixed while still using weight sharing.
  • To this end, a geometry-aware exponentiated algorithm (GAEA) is proposed that is applicable to many popular NAS methods and achieves state-of-the-art results across several settings.

Link to paper

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.