Last updated July 24, 2020

What Is Weight Sharing In Deep Learning And Why Is It Important


Published on July 25, 2020

by Ram Sagar

Neural architecture search (NAS) deals with the selection of neural models for specific learning problems. NAS promises to automate and democratise machine learning, but it is computationally expensive. The initial success of NAS was attributed partially to the weight-sharing method, which dramatically accelerated the evaluation of candidate architectures. But why is the weight-sharing method being criticised?

Brief Overview Of Weight Sharing

Traditionally, NAS methods were expensive because of the combinatorially large search space, which required training thousands of neural networks to completion. In 2018, the ENAS (Efficient NAS) paper introduced the idea of weight-sharing, in which only one shared set of model parameters is trained for all architectures.

These shared weights were used to compute the validation losses of different architectures, which were then used as estimates of the validation losses those architectures would reach after standalone training. Since only one set of parameters had to be trained, weight-sharing led to a massive speedup over earlier methods, reducing search time on CIFAR-10 from 2,000-20,000 GPU-hours to just 16.

The validation accuracies computed using shared weights correlated well enough with standalone accuracies to find good models cheaply. However, such a correlation, although sufficient, is not a necessary condition for weight-sharing to do well.

This method has come under scrutiny for being a poor substitute for full model training and is alleged to give inconsistent results on recent benchmarks.

Making A Case For Weight Sharing

The technique of sharing parameters among child models allowed Efficient NAS to deliver strong empirical performance whilst using far fewer GPU-hours than existing automatic model design approaches; notably, it was 1000x less expensive than standard Neural Architecture Search.

The most popular implementation of shared weights as substitutes for standalone weights is the Random Search with Weight-Sharing (RS-WS) method, in which the shared parameters are optimised by taking gradient steps using architectures sampled uniformly at random from the search space.
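The RS-WS loop described above can be sketched on a toy problem. This is a minimal illustration, not the published implementation: the two-edge "search space", the three candidate operations in `OPS`, the scalar weights and all hyperparameters are assumptions made up for the example.

```python
import random

# Hypothetical toy search space: every edge picks one of these operations.
OPS = [lambda x: x, lambda x: x * x, lambda x: 1.0]
NUM_EDGES = 2


def sample_architecture(rng):
    """Pick one operation index per edge, uniformly at random."""
    return tuple(rng.randrange(len(OPS)) for _ in range(NUM_EDGES))


def predict(shared, arch, x):
    """Child model output: only the shared weights selected by `arch` are used."""
    return sum(shared[e][k] * OPS[k](x) for e, k in enumerate(arch))


def mse(shared, arch, data):
    """Mean squared error of architecture `arch` evaluated with the shared weights."""
    return sum((predict(shared, arch, x) - y) ** 2 for x, y in data) / len(data)


def train_shared_weights(train, steps=3000, lr=0.05, seed=0):
    """Train one weight table by SGD on uniformly sampled architectures."""
    rng = random.Random(seed)
    shared = [[0.0] * len(OPS) for _ in range(NUM_EDGES)]
    for _ in range(steps):
        arch = sample_architecture(rng)
        x, y = rng.choice(train)
        err = predict(shared, arch, x) - y
        # Gradient step on the squared error, touching only the sampled weights.
        for e, k in enumerate(arch):
            shared[e][k] -= lr * 2.0 * err * OPS[k](x)
    return shared


def random_search(shared, val, num_samples=20, seed=1):
    """Rank randomly sampled architectures by their shared-weight validation loss."""
    rng = random.Random(seed)
    candidates = {sample_architecture(rng) for _ in range(num_samples)}
    return min(candidates, key=lambda arch: mse(shared, arch, val))


# Toy regression target y = x + x^2, split into training and validation points.
points = [(i / 10.0, i / 10.0 + (i / 10.0) ** 2) for i in range(-10, 11)]
train, val = points[::2], points[1::2]

shared = train_shared_weights(train)
best = random_search(shared, val)
print("selected architecture:", best, "shared-weight loss:", mse(shared, best, val))
```

No candidate architecture is ever trained on its own here: every one is scored with the same weight table, which is where the speedup comes from and also why the scores are only estimates.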

However, practitioners started to wonder whether sharing weights between models really accelerates NAS.

In an attempt to address this issue and to make a case for the weight sharing mechanism, the researchers at CMU published a work that lists their findings. The paper states that most criticism of weight sharing centres on a common issue: rank disorder. Rank disorder occurs when the shared-weight performance of architectures does not correlate well with their standalone performance.

Rank disorder is a problem for methods that rely on shared-weights performance to rank architectures for evaluation, as it can cause them to overlook networks that achieve high accuracy when their parameters are trained without sharing.
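One common way to quantify how well two rankings agree is a rank correlation such as Kendall's tau. The sketch below computes it for two lists of accuracies; the numbers are invented purely for illustration and do not come from the paper or from any benchmark.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall rank correlation (tau-a) between two equal-length score lists."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two lists of equal length with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1  # both scores order the pair the same way
        elif sign < 0:
            discordant += 1  # the two scores disagree on the pair
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Illustrative (made-up) accuracies for four architectures.
shared_acc = [0.60, 0.55, 0.58, 0.52]      # evaluated with shared weights
standalone_acc = [0.93, 0.94, 0.92, 0.95]  # trained from scratch

tau = kendall_tau(shared_acc, standalone_acc)
best_shared = max(range(len(shared_acc)), key=shared_acc.__getitem__)
best_standalone = max(range(len(standalone_acc)), key=standalone_acc.__getitem__)
print("Kendall tau:", tau)
print("best by shared weights:", best_shared, "best standalone:", best_standalone)
```

A tau close to 1 means the shared-weight ranking matches the standalone ranking, while a value near zero or below signals rank disorder. In this made-up case the architecture ranked last by shared weights is the best one when trained from scratch.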

The above picture illustrates rank-disorder issues where shared-weights are on the right, and individual weights trained from scratch are on the left.

To tackle this, the researchers present a unifying framework for designing and analysing gradient-based NAS methods that exploit the underlying problem structure to find high-performance architectures quickly. The geometry-aware framework, wrote the researchers, resulted in algorithms that:

enjoy faster convergence guarantees than existing gradient-based methods; and
achieve state-of-the-art accuracy on the latest NAS benchmarks in computer vision.

The results show that this new framework outperforms the previous best methods on both CIFAR and ImageNet, across both the DARTS search space and NAS-Bench-201.

Key Takeaways

According to the authors, this work on weight sharing methods tried to establish the following:

The success of weight-sharing methods should not be surprising given the ML community’s inclination towards non-convex optimisation of over-parameterised models.
The rank disorder should not be a concern since obtaining high-quality architectures is of higher priority than ranking them.
The sometimes-poor performance of weight-sharing is a result of optimisation issues that can be fixed while still using weight-sharing.
To this end, a geometry-aware exponentiated algorithm (GAEA) is proposed that is applicable to many popular NAS methods and achieves state-of-the-art results across several settings.
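To give a feel for what an "exponentiated" update looks like, here is a minimal sketch of the standard exponentiated-gradient step on a probability simplex. It assumes that the architecture parameters for one edge are a vector of mixture weights over candidate operations; the gradient values and learning rate are hypothetical, and this should not be read as the authors' GAEA implementation.

```python
import math


def exponentiated_gradient_step(theta, grad, lr):
    """One multiplicative update that keeps `theta` on the probability simplex."""
    unnormalised = [t * math.exp(-lr * g) for t, g in zip(theta, grad)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Hypothetical example: three candidate operations on one edge, starting uniform.
theta = [1.0 / 3.0] * 3
grad = [0.5, -1.0, 0.2]  # assumed loss gradients w.r.t. each mixture weight

for _ in range(5):
    theta = exponentiated_gradient_step(theta, grad, lr=1.0)

print("mixture weights after 5 steps:", theta)
```

Because the update is multiplicative, the weights stay positive and sum to one without any projection, and operations with consistently unfavourable gradients shrink towards zero quickly, leaving a mixture dominated by one operation.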

Link to paper
