Recurrent Neural Networks (RNNs) have found widespread use across a variety of domains from language modeling and machine translation to speech recognition and recommendation systems.
RNNs are built on a recursive formula in which the new state is a function of the old state and the current input, which makes them well suited to time series data.
However, training these networks brings a few challenges.
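The recursive formula can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a plain tanh RNN; the names `rnn_step`, `W`, `V` and `b` and the layer sizes are hypothetical choices made for the example, not taken from any particular library.

```python
import numpy as np

def rnn_step(h_prev, x, W, V, b):
    """One vanilla RNN update: the new state depends on the old state and the input."""
    return np.tanh(W @ h_prev + V @ x + b)

# Illustrative sizes and random weights (assumptions for this sketch).
rng = np.random.default_rng(0)
hidden, inputs, steps = 4, 3, 5
W = rng.normal(scale=0.5, size=(hidden, hidden))  # state-to-state weights
V = rng.normal(scale=0.5, size=(hidden, inputs))  # input-to-state weights
b = np.zeros(hidden)

# Unroll the same function over a short random sequence.
h = np.zeros(hidden)
for x in rng.normal(size=(steps, inputs)):
    h = rnn_step(h, x, W, V, b)
```

The same weights are reused at every time step, which is what makes the network recurrent.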
The Need To Revisit And Revamp RNNs
Though RNNs have become immensely popular for NLP tasks, their reputation for succumbing to exploding and vanishing gradients has put them in the back seat.
The main difficulty arises because the error signal propagated by back-propagation through time (BPTT) grows or decays exponentially, a dilemma commonly referred to as the exploding or vanishing gradient problem.
The exploding gradients problem refers to the large increase in the norm of the gradient during training. Such events are caused by the explosion of the long term components, which can grow exponentially more than short term ones.
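The exponential growth or decay can be seen in a toy setting. The sketch below assumes a purely linear recurrence (the nonlinearity is ignored) whose weight matrix is a scaled orthogonal matrix, so each step of back-propagation multiplies the gradient norm by exactly `scale`. The function name `backprop_norm` and all sizes are hypothetical.

```python
import numpy as np

def backprop_norm(scale, steps=100, hidden=8, seed=0):
    """Norm of a unit gradient after `steps` steps of back-propagation
    through a linear recurrence h_t = W h_{t-1} with W = scale * Q."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(hidden, hidden)))  # random orthogonal matrix
    W = scale * Q
    grad = np.ones(hidden) / np.sqrt(hidden)  # unit-norm starting gradient
    for _ in range(steps):
        grad = W.T @ grad  # one step backwards in time
    return np.linalg.norm(grad)

exploding = backprop_norm(1.1)  # grows like 1.1 ** 100
vanishing = backprop_norm(0.9)  # shrinks like 0.9 ** 100
preserved = backprop_norm(1.0)  # norm stays at 1
```

Even a small deviation of the scale from 1 compounds over 100 time steps into a gradient that is thousands of times too large or vanishingly small.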
Modelling complex temporal dependencies in sequential data using RNNs, especially the long-term dependencies, remains an open challenge.
Gated variants of RNNs, such as long short-term memory (LSTM) networks and gated recurrent units (GRUs), were introduced to alleviate these issues.
Identity and orthogonal initialization is another proposed solution to the exploding or vanishing gradient problem in deep neural networks.
However, some of these approaches come with significant computational overhead and reportedly hinder the representational power of these models. Moreover, orthogonal weight matrices alone do not prevent exploding and vanishing gradients, due to the nonlinear nature of deep neural networks.
To address the drawbacks of recurrent neural networks, researchers at Google Brain exploit the connection between neural networks and ordinary differential equations (ODEs) to introduce AntisymmetricRNN. By exploiting the underlying differential equation, the architecture aims to capture long-term dependencies.
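A minimal sketch of such a cell follows, assuming the forward-Euler update described in the AntisymmetricRNN paper: the recurrent matrix is forced to be antisymmetric by using `W - W.T`, a small diffusion term `gamma` is subtracted from the diagonal, and the state moves by a step of size `eps`. The function name, the default values of `eps` and `gamma`, and the layer sizes are illustrative assumptions, not the authors' code.

```python
import numpy as np

def antisymmetric_rnn_step(h_prev, x, W, V, b, eps=0.01, gamma=0.01):
    """One Euler step of an antisymmetric recurrent cell (illustrative sketch)."""
    # W - W.T is antisymmetric for any W; gamma adds a small damping term.
    A = W - W.T - gamma * np.eye(W.shape[0])
    return h_prev + eps * np.tanh(A @ h_prev + V @ x + b)

# Illustrative sizes and random weights (assumptions for this sketch).
rng = np.random.default_rng(0)
hidden, inputs = 4, 3
W = rng.normal(size=(hidden, hidden))
V = rng.normal(size=(hidden, inputs))
b = np.zeros(hidden)

h = np.zeros(hidden)
for x in rng.normal(size=(10, inputs)):
    h = antisymmetric_rnn_step(h, x, W, V, b)
```

Because the antisymmetry comes from the parameterization itself, no extra constraint or projection step is needed during training.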
In numerical analysis, stability theory addresses the stability of solutions of ODEs under small perturbations of initial conditions.
An ODE solution is stable if the long-term behaviour of the system does not depend significantly on the initial conditions.
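For a linear system, a standard way to check this is to look at the real parts of the eigenvalues of the system matrix: positive real parts amplify perturbations, negative ones damp them. The sketch below, with hypothetical names and sizes, checks numerically that a real antisymmetric matrix has purely imaginary eigenvalues, and that subtracting a small multiple `gamma` of the identity shifts every real part to `-gamma`.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 6))
A = W - W.T  # antisymmetric by construction: A.T == -A

# Real parts of the eigenvalues are zero up to floating-point error.
eigvals = np.linalg.eigvals(A)
max_real = np.max(np.abs(eigvals.real))

# A small diffusion term shifts every eigenvalue left by gamma.
gamma = 0.01
shifted_real = np.linalg.eigvals(A - gamma * np.eye(6)).real
```

Eigenvalues sitting on or just left of the imaginary axis mean that perturbations are neither amplified nor quickly forgotten, which is the regime a network needs in order to carry information over long horizons.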
The performance of the proposed antisymmetric networks is evaluated on four image classification tasks with long-range dependencies.
The classification is done by feeding pixels of the images as a sequence to RNNs and sending the last hidden state of the RNNs into a fully-connected layer and a softmax function.
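The pipeline can be sketched as follows. This is an illustration only: it uses a plain tanh RNN as a stand-in for the recurrent cell, random untrained weights, and a small 8x8 image, all of which are assumptions made for the example. The helper names `softmax` and `classify_pixels` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_pixels(image, W, V, b, W_out, b_out):
    """Feed pixels one at a time, then classify from the last hidden state."""
    h = np.zeros(W.shape[0])
    for pixel in image.reshape(-1):         # flatten the image into a sequence
        h = np.tanh(W @ h + V * pixel + b)  # stand-in recurrent update
    logits = W_out @ h + b_out              # fully-connected layer
    return softmax(logits)

# Illustrative sizes and random weights (assumptions for this sketch).
rng = np.random.default_rng(0)
hidden, classes = 16, 10
W = rng.normal(scale=0.1, size=(hidden, hidden))
V = rng.normal(scale=0.1, size=hidden)      # one scalar input per time step
b = np.zeros(hidden)
W_out = rng.normal(scale=0.1, size=(classes, hidden))
b_out = np.zeros(classes)

image = rng.random((8, 8))
probs = classify_pixels(image, W, V, b, W_out, b_out)
```

Since only the final state reaches the classifier, information from the first pixels has to survive the entire sequence, which is why these tasks test long-range dependencies.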
The models are trained with cross-entropy loss, using stochastic gradient descent (SGD) with momentum and Adagrad as optimizers. In this work, the authors draw connections between RNNs and ordinary differential equation theory and design new recurrent architectures by discretizing ODEs.
This new view opens up possibilities to exploit the computational and theoretical success from dynamical systems to understand and improve the trainability of RNNs.
AntisymmetricRNN is a discretization of an ODE. Besides its appealing theoretical properties, the model performs competitively against strong recurrent baselines on a comprehensive set of benchmark tasks.
- Existing approaches to improving RNN trainability often incur significant computational overhead. In comparison, AntisymmetricRNN achieves the same goal by design.
- AntisymmetricRNN exhibits much more predictable dynamics.
- It outperforms regular LSTM models on tasks requiring long-term memory and matches their performance on tasks where short-term dependencies dominate, despite being much simpler.
By establishing a link between recurrent networks and ordinary differential equations, the authors believe that this work will inspire future research. For example, future work could explore other stable ordinary differential equations and numerical methods that might lead to novel and well-conditioned recurrent architectures.