Let’s Not Stop At Back-prop! Check Out 5 Alternatives To This Popular Deep Learning Technique 




Yoshua Bengio

Back-propagation is the procedure of repeatedly adjusting the weights of the connections in a neural network to minimize the difference between the actual output and the desired output. These weight adjustments cause the hidden units of the network to represent key features of the data.

Back-propagation is an ingenious idea, but it comes with its own set of disadvantages, such as vanishing or exploding gradients. In an attempt to avoid these problems, researchers have been exploring alternatives to back-propagation for years.



In this article, we take a quick look at a few of the top alternatives to back-prop:

1) Difference Target Propagation

Back-propagation relies on infinitesimal effects (partial derivatives) to perform credit assignment. This can become a serious issue when dealing with deeper and more non-linear functions. Motivated in part by the biological implausibility of back-propagation, a few approaches have been proposed that could play a similar credit-assignment role.

The main idea of target propagation is to compute targets, rather than gradients, at each layer. Like gradients, these targets are propagated backwards. Unlike previously proposed proxies for back-propagation, which rely on a backwards network with symmetric weights, target propagation relies on auto-encoders at each layer.

Target propagation replaces training signals based on partial derivatives with targets propagated through an auto-encoding feedback loop. Difference target propagation adds a linear correction for this imperfect inverse mapping, which is what makes target propagation work in practice. Experiments show that target propagation performs comparably to back-propagation on ordinary deep networks and denoising auto-encoders.
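As a rough illustration (not the paper's actual architecture or training loop), the target computation can be sketched with a toy two-layer network, where `V` stands in for the learned feedback weights of the per-layer auto-encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: h1 = f1(x), h2 = f2(h1)
W1 = rng.normal(scale=0.5, size=(4, 3))
W2 = rng.normal(scale=0.5, size=(2, 4))
# Feedback weights for the approximate inverse g (a learned decoder in the paper)
V = rng.normal(scale=0.5, size=(4, 2))

def f1(x):  return np.tanh(W1 @ x)
def f2(h1): return np.tanh(W2 @ h1)
def g(h2):  return np.tanh(V @ h2)

x = rng.normal(size=3)
y = np.array([1.0, -1.0])          # desired output

h1 = f1(x)
h2 = f2(h1)

# Top-layer target: a small step downhill on the output loss 0.5*||h2 - y||^2
lr = 0.5
h2_hat = h2 - lr * (h2 - y)

# Plain target prop would use g(h2_hat); difference target propagation
# adds the correction (h1 - g(h2)) for the imperfect inverse mapping
h1_hat = g(h2_hat) + (h1 - g(h2))

# W1 would then be trained locally to push f1(x) toward the target h1_hat
print(h1_hat)
```

Note that when the top-layer target equals the current activation, the correction term makes the lower-layer target collapse back to `h1`, so a layer that is already on target is left alone.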

Read the original paper here

2) HSIC (Hilbert-Schmidt independence criterion) Bottleneck: Deep Learning without back-propagation

The information bottleneck method itself is at least 20 years old; it was introduced by Naftali Tishby, Fernando C. Pereira, and William Bialek. Its main objective is to find the sweet spot between accuracy and complexity.

The approach here is to train the network by using an approximation of the information bottleneck instead of back-propagation.

In the next step, a substitute for the mutual information between hidden representations and labels is found and maximised, while the mutual dependency between hidden representations and the inputs is simultaneously minimised.

Thus, each hidden representation from the HSIC-trained network may contain different information obtained by optimizing the HSIC bottleneck objective at a particular scale. Then the aggregator sums the hidden representations to form an output representation.

An intuition for the HSIC approach here is provided by the fact that the series expansion of the exponential contains a weighted sum of all moments of the data, and two distributions are equal if and only if their moments are identical.
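As a sketch of the key quantity, the empirical HSIC between two batches of representations can be computed in a few lines of NumPy. The Gaussian kernel and the median-distance bandwidth heuristic are common choices assumed here, not details taken from the paper:

```python
import numpy as np

def gaussian_kernel(X):
    # RBF kernel with the median-distance heuristic for the bandwidth
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    sigma2 = np.median(d2[d2 > 0])
    return np.exp(-d2 / sigma2)

def hsic(X, Y):
    # Biased empirical HSIC: tr(K H L H) / (m - 1)^2, with H = I - (1/m) 11^T
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    K, L = gaussian_kernel(X), gaussian_kernel(Y)
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y_dep = X[:, :2] + 0.1 * rng.normal(size=(100, 2))   # strongly dependent on X
Y_ind = rng.normal(size=(100, 2))                    # independent of X
print(hsic(X, Y_dep), hsic(X, Y_ind))                # dependent pair scores higher
```

Training with the HSIC bottleneck then amounts to minimising HSIC between each hidden representation and the inputs while maximising HSIC between that representation and the labels.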

Read the original paper here.

3) Beyond Back-prop: Online Alternating Minimization with Auxiliary Variables

State-of-the-art methods rely on error back-propagation, which suffers from several well-known issues, such as vanishing and exploding gradients, inability to handle non-differentiable nonlinearities and to parallelize weight-updates across layers, and biological implausibility.

These limitations continue to motivate exploration of alternative training algorithms, including several recently proposed auxiliary-variable methods which break the complex nested objective function into local subproblems. 

However, those techniques are mainly offline (batch), which limits their applicability to extremely large datasets, as well as to online, continual or reinforcement learning. 

The main contribution of this work is a novel online (stochastic/mini-batch) alternating minimization (AM) approach for training deep neural networks, together with the first theoretical convergence guarantees for AM in stochastic settings and promising empirical results on a variety of architectures and datasets.

This work builds upon previously proposed offline methods that break the nested objective into easier-to-solve local subproblems via inserting auxiliary variables corresponding to activations in each layer. Such methods avoid gradient chain computation and potential issues associated with it, including vanishing gradients, lack of cross-layer parallelization, and difficulties handling non-differentiable nonlinearities.
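The auxiliary-variable idea can be sketched on a toy two-layer linear network. For clarity this is the offline (full-batch) flavour; the paper's contribution is the online, mini-batch version with convergence guarantees:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data with an exactly linear target
X = rng.normal(size=(50, 3))
Y = X @ rng.normal(size=(3, 2))

W1 = rng.normal(scale=0.1, size=(3, 4))   # input -> hidden
W2 = rng.normal(scale=0.1, size=(4, 2))   # hidden -> output
mu = 1.0                                  # coupling penalty on the auxiliary variables

def loss():
    # The original nested objective (no auxiliary variables)
    return np.mean((X @ W1 @ W2 - Y) ** 2)

before = loss()
for _ in range(50):
    # Step 1: update auxiliary activations A with the weights held fixed.
    # Closed-form minimiser of ||A W2 - Y||^2 + mu ||A - X W1||^2 over A.
    A = (Y @ W2.T + mu * X @ W1) @ np.linalg.inv(W2 @ W2.T + mu * np.eye(4))
    # Step 2: update each weight matrix against its local target (least squares);
    # no gradient chain ever crosses a layer boundary.
    W1 = np.linalg.lstsq(X, A, rcond=None)[0]
    W2 = np.linalg.lstsq(A, Y, rcond=None)[0]
after = loss()
print(before, after)   # the nested loss drops without any end-to-end gradient
```

Each alternating step solves an easy local subproblem exactly, which is why this family of methods sidesteps vanishing gradients and non-differentiable nonlinearities.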

Read the original paper here

4) Decoupled Neural Interfaces Using Synthetic Gradients


This work gives neural networks a way to learn to send messages between themselves in a decoupled, scalable manner, paving the way for multiple neural networks to communicate with each other and for improved long-term temporal dependency in recurrent networks.

These decoupled neural interfaces allow distributed training of networks, enhance the temporal dependency learnt with RNNs, and speed up hierarchical RNN systems.

This is achieved by using a model to approximate error gradients, rather than computing them explicitly with back-propagation. Requiring all modules in a network to wait for all other modules to execute and back-propagate gradients is overly time-consuming or even intractable. If we decouple the interfaces (the connections) between modules, every module can be updated independently and is not locked to the rest of the network, unlike in a back-propagated feed-forward network.

In this work, the researchers at DeepMind try to remove the reliance on back-propagation to get error gradients, and instead learn a parametric model which predicts what the gradients will be based only upon local information. These predicted gradients are called synthetic gradients.
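A minimal sketch of the idea, using a toy linear layer and a linear synthetic-gradient model (the paper uses small neural networks for the gradient model and trains everything asynchronously):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer h = W x, with downstream loss L = 0.5 * ||h - y||^2 and y = 0,
# so the true error gradient is simply dL/dh = h.
W = rng.normal(scale=0.5, size=(4, 3))
M = np.zeros((4, 5))                       # synthetic-gradient model, linear in [h; 1]

def synth_grad(h):
    # Predict dL/dh from local information only (no waiting for back-prop)
    return M @ np.append(h, 1.0)

alpha = 0.05
for step in range(3000):
    x = rng.normal(size=3)
    h = W @ x
    true_g = h                             # would normally arrive via back-prop
    # Train M to regress onto the true gradient (done asynchronously in the paper)
    err = synth_grad(h) - true_g
    M -= alpha * np.outer(err, np.append(h, 1.0))

# After training, the layer can update immediately from the synthetic gradient,
# e.g. W -= lr * np.outer(synth_grad(h), x), without waiting on downstream modules.
x = rng.normal(size=3)
h = W @ x
residual = np.max(np.abs(synth_grad(h) - h))   # synthetic vs. true gradient
print(residual)
```

The point of the demo is that the gradient model converges to a good local predictor, which is what lets modules update without being locked to the rest of the network.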

Read the original paper here

5) Training Neural Networks with Local Error Signals

An alternative approach is to train the network with layer-wise loss functions. In this paper, the authors demonstrate, for the first time, that layer-wise training can approach the state of the art on a variety of image datasets. They use single-layer sub-networks and two different supervised loss functions to generate local error signals for the hidden layers, and show that the combination of these losses helps with optimisation in the context of local learning.

Using local errors could be a step towards more biologically plausible deep learning, because the global error does not have to be transported back to the hidden layers. A completely back-prop-free variant outperforms previously reported results among methods aiming for higher biological plausibility.
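A rough sketch of local learning with per-layer error signals, using toy linear readouts and a squared loss (the paper's two supervised local losses differ from this, but the structure is the same: no gradient ever crosses a layer boundary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary task
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Each layer has its own weights W and its own local readout R
W1 = rng.normal(scale=0.1, size=(5, 8)); R1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=(8, 8)); R2 = np.zeros(8)

def local_step(H_in, W, R, lr=0.1):
    H = np.tanh(H_in @ W)                  # layer output
    e = H @ R - y                          # local error signal from the local readout
    dH = (e[:, None] * R[None, :]) * (1 - H**2)
    W -= lr * H_in.T @ dH / len(y)         # layer weights: local gradient only
    R -= lr * H.T @ e / len(y)             # local readout update
    return H                               # passed on WITHOUT any gradient flow

def local_loss(H_in, W, R):
    return np.mean((np.tanh(H_in @ W) @ R - y) ** 2)

before = local_loss(X, W1, R1)
for _ in range(300):
    H1 = local_step(X, W1, R1)             # layer 1 trains on its own loss
    local_step(H1, W2, R2)                 # layer 2 likewise, with a detached input
after = local_loss(X, W1, R1)
print(before, after)                       # the layer-1 local loss decreases
```

Because each layer only ever sees its own readout's error, the global error signal never has to travel back through the stack.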

Read the original paper here

Back-prop gave neural networks the ability to create new, useful features from the same data. Regardless of its drawbacks, its presence will be felt across many domains that use AI. Still, seeing how neural networks can remain good at what they do without the help of back-propagation is a welcome change.


