In a neural network, modelled loosely after the human brain, the activation function is one of the most important components: applied to a neuron's weighted input, it produces the transformed output and decides how strongly the neuron fires and passes its signal to the next layer. The main job of an activation function is to introduce non-linearity into the network. Non-linear activation functions have been key in allowing CNNs to learn complex, high-dimensional functions. Additionally, the Rectified Linear Unit activation function, or ReLU, has played a huge role in alleviating the vanishing gradient problem. However, all popular activation functions increase monotonically and have a single zero, at the origin.
The famous XOR problem is that of training a neural network to learn the XOR gate function. Minsky and Papert first pointed out that a single neuron cannot learn the XOR function, since a single hyperplane (a line, in this case) cannot separate the output classes. This fundamental limitation of single neurons (or single-layer networks) led to pessimistic predictions about the future of neural network research and was responsible for a hiatus in the history of AI. In a recent paper, these limitations have been shown not to hold for certain oscillatory activation functions. The XOR problem is the task of learning the following dataset:

x1  x2  |  y
0   0   |  0
0   1   |  1
1   0   |  1
1   1   |  0
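The non-separability claim is easy to verify numerically. The sketch below (an illustration, not from the paper) brute-forces a grid of weights and biases for a single linear-threshold neuron and confirms that none of them reproduces XOR:

```python
import numpy as np

# XOR dataset: inputs and target labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Brute-force search over a grid of weights and biases for a single
# linear-threshold neuron: predict 1 when w.x + b > 0, else 0.
grid = np.linspace(-3.0, 3.0, 25)
separable = any(
    np.array_equal((X @ np.array([w1, w2]) + b > 0).astype(int), y)
    for w1 in grid for w2 in grid for b in grid
)
print(separable)  # False: no single hyperplane separates the XOR classes
```

No matter how fine the grid, the search must fail: any weights that fire on (0, 1) and (1, 0) but not on (0, 0) would necessarily also fire on (1, 1).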
The paper, ‘Biologically Inspired Oscillating Activation Functions Can Bridge the Performance Gap between Biological and Artificial Neurons’, proposes oscillating activation functions to overcome both the gradient flow problem and the XOR problem, essentially solving “classification problems with fewer neurons and reducing training time”. Analytics India Magazine spoke to Dr Matthew Mithra Noel, Dean of the School of Electrical Engineering at VIT, and Shubham Bharadwaj to explore the research further. Additionally, Praneet Dutta, an alumnus of the university, volunteered high-level guidance as an independent researcher on the proposal.
Search for activation functions better than ReLU
Dr Noel discussed ReLU and the need to search for better activation functions. “Neuronal layers with non-linear activation functions are essential in real-world applications of neural networks because the composition of any finite number of linear functions is equivalent to a single linear function. Hence an ANN composed of purely linear neurons is equivalent to a single linear layer network capable of learning only linear relationships and solving only linearly separable problems,” he explained. “Despite the critical importance of the nature of the activation function in determining the performance of neural networks, simple monotonic non-decreasing non-linear activation functions are universally used. We explored the effects of using non-monotonic and oscillatory non-linear activation functions in deep neural networks. In the past, sigmoidal s-shaped saturating activation functions were popular since these approximated the step or signum function while still being differentiable.
Moreover, the outputs of s-shaped saturating activations have the important property of being interpretable as a binary yes/no decision and hence are useful. However, deep ANNs composed purely of sigmoidal activation functions cannot be trained effectively due to the vanishing gradient problem that arises when saturating activation functions are used. The adoption of the non-saturating, non-sigmoidal Rectified Linear Unit (ReLU) activation function to alleviate the vanishing gradient problem is considered a milestone in the evolution of deep neural networks.”
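Dr Noel's point that a stack of purely linear layers collapses to a single linear layer can be checked directly. A minimal sketch with made-up weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two purely linear layers (no activation): y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2

# The same map collapses to one linear layer: W = W2 @ W1, b = W2 @ b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True: composing linear maps gives a linear map
```

This is why at least one non-linear activation is required before a network can represent anything beyond a linear function.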
Oscillatory and non-monotonic activation functions have been largely ignored in the past, possibly due to perceived biological implausibility. “Our research explores a variety of complex oscillatory activation functions. Oscillatory and non-monotonic activation functions might be advantageous in solving the vanishing gradient problem since these functions have nonzero derivatives throughout their domain except at isolated points,” Dr Noel stated.
“In our research, we discovered a variety of new activation functions that outperform all known activation functions on Imagenette, CIFAR-100, CIFAR-10 and MNIST datasets. In addition, these new activation functions appear to reduce network size and training time. For example, the XOR function that can only be learnt with a minimum of 3 sigmoidal or ReLU neurons can be learnt with a single GCU neuron.”
Overcoming the vanishing gradient problem with oscillatory activation functions
“By 2017, it was clear that if you replaced saturating sigmoidal activation functions with activation functions that do not saturate, like ReLU, the performance was significantly better,” said Dr Noel. “The only way you can train very deep neural nets is by replacing saturating sigmoidal activation functions with activation functions that saturate only partially.” This realisation led the team to question whether ReLU is really the best activation function and whether performance could be improved beyond it.
“Neural net learning works on the principle of gradient descent, and parameters are updated based on derivatives. So the vanishing gradient problem is a fundamental problem that all deep nets have to overcome,” Dr Noel continued. “The obvious solution was to improve on ReLU by exploring activation functions that never saturate, for any input value. The problem of vanishing and exploding gradients can be alleviated by using oscillatory activation functions whose derivatives neither go to zero nor to infinity, and which resemble conventional bipolar sigmoidal activations close to zero.”
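The gradient argument can be made concrete. The sigmoid derivative σ'(z) = σ(z)(1 − σ(z)) vanishes for large |z|, while the Growing Cosine Unit from the paper, C(z) = z·cos(z), has derivative cos(z) − z·sin(z), which keeps oscillating and is zero only at isolated points. A small sketch comparing derivative magnitudes:

```python
import numpy as np

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), which saturates to 0 for large |z|.
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def gcu_grad(z):
    # GCU(z) = z * cos(z)  =>  GCU'(z) = cos(z) - z * sin(z)
    return np.cos(z) - z * np.sin(z)

z = np.linspace(-20, 20, 2001)
# Sigmoid gradients collapse far from the origin; GCU gradients do not.
print(sigmoid_grad(20.0))                   # tiny (~2e-9): effectively vanished
print(np.mean(np.abs(gcu_grad(z)) > 0.1))   # most points retain a sizeable gradient
```

The oscillatory derivative is zero only at isolated points, so gradient flow is preserved throughout the domain rather than dying in saturation regions.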
Oscillating activation function: the XOR problem
Each neuron in the neural network makes a simple yes/no decision, a binary classification. Because all classical activation functions have only one zero, the decision boundary of a single neuron is a single hyperplane. Separating the classes in the XOR dataset, however, requires two hyperplanes, and producing two hyperplanes requires an activation function with two zeros. “Essentially, you need the activation function to be positive, then negative, and positive again,” Dr Noel explained. Data shows that for small input values, the output of biological neurons increases; for larger inputs the output saturates, and then “the output must decrease to another zero if a biological neuron is capable of learning the XOR function”, the paper explains. The proposed model captures this rise and fall with multiple zeros, yielding multiple hyperplanes as part of the decision boundary. This replaces the traditional two-layer network needed to learn the XOR function with a single neuron using an oscillatory activation such as the Growing Cosine Unit (GCU). In the paper, the researchers introduce many oscillating functions that solve the XOR problem with a single neuron.
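This can be illustrated with a single GCU neuron and hand-picked weights (w1 = w2 = b = π; an illustrative choice, not the trained weights from the paper). The pre-activations for the four XOR inputs land at π, 2π and 3π, where z·cos(z) alternates in sign exactly as XOR requires:

```python
import numpy as np

def gcu(z):
    # Growing Cosine Unit: C(z) = z * cos(z)
    return z * np.cos(z)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR targets

# Hand-picked parameters (illustrative): pre-activations become
# pi, 2*pi, 2*pi, 3*pi for the four inputs.
w = np.array([np.pi, np.pi])
b = np.pi

z = X @ w + b                     # [pi, 2pi, 2pi, 3pi]
pred = (gcu(z) > 0).astype(int)   # signs: -, +, +, -
print(pred.tolist())              # [0, 1, 1, 0]: XOR from one neuron
```

The activation crosses zero twice over the relevant input range, so the single neuron's decision boundary consists of two parallel hyperplanes, which is precisely what XOR needs.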
Discovery of single biological neurons capable of learning the XOR function
“The human brain demonstrates the highest level of general intelligence, a quality not found in any other animal,” explained Dr Noel. “If these XOR neurons in the human brain behave similarly to the models that we have proposed, independent of the biology, we are on the right track.” Indeed, biological neurons in the human brain can individually learn the XOR function. A 2020 study by Albert Gidon et al. identified new classes of neurons in the human cortex with the potential to allow single neurons to solve computational problems, including XOR, that have typically required multilayer neural networks. The study found that the activation of these neurons increases and then decreases, essentially oscillating, providing a biological counterpart to the theory proposed.
“If we are to bridge the gap between human and machine intelligence, then we must bridge the gap between biological neurons and artificial neurons,” Dr Noel concluded. The oscillatory activation functions were tested on different neural network model architectures, datasets, and benchmarks, resulting in at least one of the new activation functions outperforming the previous functions on all models evaluated.
Currently, the team is testing this solution on various practical problems with students at VIT. Use cases explored include cryptocurrency price estimation and image tasks on retinal scans. In the latter research, Shubham noted that among 52 combinations of activation functions tried for the convolutional and dense layers, the best-performing combinations involved oscillatory activation functions. Expanding on this, he explained, “The total number of combinations we are trying on a simple VGG network for the retinopathy task is around 676, and the potential of treating activation functions as a hyperparameter is huge. We have observed, from a sample experimental trial of 52 combinations of activation functions on the convolutional and dense layers, that the top five combinations by AUC score included three with oscillatory activation functions in the feature extraction layer. Furthermore, combinations with oscillating activations in the dense layer are also being explored, which is a different angle on how this new family of oscillating activation functions might be used.”