Neural Tangent Kernel (NTK): A New Tool For Understanding Machine Learning Training

The general consensus in the machine learning community is that a smaller model leads to a larger training error, while a bigger model results in a larger generalisation gap. That is why developers usually hunt for the sweet spot between training error and generalisation.

However, the best test error is often achieved by the largest model, which is counterintuitive.

Illustration by Belkin et al. (2018)

As one increases the model complexity past the point where the model can perfectly fit the training data (the interpolation regime), the test error continues to drop!

The inner training dynamics of neural networks have long been a mystery, and unlocking them would lead to a better understanding of their predictions.

To this end, a paper titled Neural Tangent Kernel (NTK) was presented at the prestigious NeurIPS conference last year and has been making noise ever since.

It also occupied a majority of the talks at the recently concluded Workshop on Theory of Deep Learning at the Institute for Advanced Study.

The paper's authors, Arthur Jacot and his colleagues at the Swiss Federal Institute of Technology Lausanne, introduced the NTK as a new tool to study ANNs: it describes the local dynamics of an ANN during gradient descent.

“This led to a new connection between ANN training and kernel methods,” wrote the authors in their paper.

In the infinite-width limit, an ANN can be described directly in function space by the limit of the NTK: an explicit, constant kernel that depends only on the network's depth, nonlinearity and parameter initialisation variance.

The practical significance of the NTK theory is that you can compute the kernel for infinite-width networks exactly and treat it as a new kernel machine.
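To get a feel for what this looks like in practice, here is a minimal sketch using the open-source neural-tangents library (built on JAX); the particular architecture, layer widths and toy data below are illustrative assumptions rather than code from the paper.

```python
# Sketch: exact infinite-width NTK of a small fully connected ReLU network,
# using the neural-tangents library (https://github.com/google/neural-tangents).
import jax.numpy as jnp
from jax import random
from neural_tangents import stax

# A 3-layer ReLU network; in the infinite-width limit its NTK is an explicit
# kernel fixed by depth, nonlinearity and initialisation variance.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1)
)

key = random.PRNGKey(0)
x_train = random.normal(key, (20, 10))   # 20 toy points with 10 features
x_test = random.normal(key, (5, 10))

# The infinite-width NTK between test and train points: a (5, 20) matrix.
k_test_train = kernel_fn(x_test, x_train, 'ntk')
print(k_test_train.shape)
```

The resulting matrix can then be plugged into ordinary kernel regression, so the "infinitely wide network" makes predictions without ever being trained by gradient descent.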

The whole idea behind the NTK is to draw a parallel between the convergence of neural networks and kernel methods, and then to define a new kernel that describes the generalisation of neural networks.

This establishes a connection between kernel methods and the large models that achieve the best test error, which in turn paves the way to assessing the performance of neural networks.

This is because, at initialisation, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, and can thus be likened to kernel methods.

Why Explore The Kernel Regime

Source: Rajat’s blog

A kernel method can be thought of as a trick for learning patterns or correlations in high-dimensional data without the need to learn a fixed set of parameters. This comes in handy when the parameterisation is complex or when the data is unlabelled. Kernel methods learn through instances, weighting training examples by similarity scores without needing to know the whole picture, which makes training computationally cheap.
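To make the "instances and similarity scores" idea concrete, here is a minimal kernel ridge regression sketch; the RBF similarity function, the ridge term and the toy sine data are assumptions chosen purely for illustration.

```python
# Sketch: kernel ridge regression with an RBF similarity kernel.
# Predictions are weighted combinations of training instances, where the
# weights come from pairwise similarity scores rather than learned features.
import jax.numpy as jnp
from jax import random

def rbf_kernel(x1, x2, lengthscale=1.0):
    # Pairwise squared distances turned into similarity scores in (0, 1].
    sq_dists = jnp.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-sq_dists / (2.0 * lengthscale ** 2))

key = random.PRNGKey(0)
x_train = random.uniform(key, (30, 1), minval=-3.0, maxval=3.0)
y_train = jnp.sin(x_train[:, 0])                      # toy regression target
x_test = jnp.linspace(-3.0, 3.0, 100)[:, None]

ridge = 1e-3                                          # small regulariser
K = rbf_kernel(x_train, x_train)
alpha = jnp.linalg.solve(K + ridge * jnp.eye(len(x_train)), y_train)

# Each prediction is a similarity-weighted sum over the training instances.
y_pred = rbf_kernel(x_test, x_train) @ alpha
print(y_pred.shape)   # (100,)
```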

Now, in the case of infinite-width networks, the neural tangent kernel (NTK) consists of the pairwise inner products between the feature maps of the data points at initialisation.
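Concretely, the "feature map" of a data point here is the gradient of the network output with respect to all the parameters, so each kernel entry is an inner product of two such gradients. The finite-width network and the JAX autodiff code below are a sketch of this empirical (finite-width) version, under assumed layer sizes and toy data.

```python
# Sketch: empirical (finite-width) NTK at initialisation,
# K[i, j] = < d f(x_i)/d theta , d f(x_j)/d theta >.
import jax
import jax.numpy as jnp
from jax import random
from jax.flatten_util import ravel_pytree

def init_params(key, sizes=(10, 64, 64, 1)):
    # Simple fully connected ReLU network with 1/sqrt(fan_in) weight scaling.
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        key, sub = random.split(key)
        w = random.normal(sub, (fan_in, fan_out)) / jnp.sqrt(fan_in)
        params.append((w, jnp.zeros(fan_out)))
    return params

def forward(params, x):
    for w, b in params[:-1]:
        x = jax.nn.relu(x @ w + b)
    w, b = params[-1]
    return (x @ w + b).squeeze(-1)     # scalar output per example

def flat_grad(params, x):
    # Gradient of the scalar output w.r.t. all parameters, flattened into one vector:
    # this is the "feature map" of the point x.
    g = jax.grad(lambda p: forward(p, x[None, :])[0])(params)
    return ravel_pytree(g)[0]

key = random.PRNGKey(0)
params = init_params(key)
x = random.normal(key, (8, 10))        # 8 toy data points, 10 features

feats = jax.vmap(lambda xi: flat_grad(params, xi))(x)   # one feature vector per point
ntk = feats @ feats.T                                   # pairwise inner products, shape (8, 8)
print(ntk.shape)
```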

And since the tangent kernel stays constant during training in this limit, the training dynamics reduce to a simple linear ordinary differential equation.
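For gradient flow on the squared loss with a constant kernel K, the outputs on the training set obey df/dt = -K(f - y), with the learning rate absorbed into the time variable, and this linear ODE has the closed-form solution f(t) = y + exp(-Kt)(f(0) - y). The sketch below, with an assumed toy kernel matrix, labels and initial outputs, just evaluates that solution.

```python
# Sketch: with a constant tangent kernel K, training under gradient flow on
# squared loss follows the linear ODE df/dt = -K (f - y), whose solution on
# the training set is f(t) = y + exp(-K t) (f(0) - y).
import jax.numpy as jnp

def outputs_at_time(K, f0, y, t):
    # Closed-form solution, computed via the eigendecomposition of the
    # symmetric kernel matrix K.
    evals, evecs = jnp.linalg.eigh(K)
    exp_neg_Kt = (evecs * jnp.exp(-evals * t)) @ evecs.T
    return y + exp_neg_Kt @ (f0 - y)

# Assumed toy kernel, labels and initial network outputs.
K = jnp.array([[2.0, 0.5, 0.1],
               [0.5, 1.5, 0.3],
               [0.1, 0.3, 1.0]])
y = jnp.array([1.0, -1.0, 0.5])
f0 = jnp.zeros(3)

for t in (0.0, 1.0, 10.0):
    print(t, outputs_at_time(K, f0, y, t))   # outputs converge towards the labels y
```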

The authors of the seminal paper insist that the limit of the NTK is a powerful tool for understanding the generalisation properties of neural networks, and that it allows one to study the influence of depth and nonlinearity on the learning abilities of the network.

The key difference between the NTK and previously proposed kernels is that the NTK is defined through the inner product between the gradients of the network outputs with respect to the network parameters. 

Another fruitful direction is to “translate” different tricks of neural networks to kernels and to check their practical performance. 

Researchers at Carnegie Mellon University hope that tricks like batch normalisation, dropout and max-pooling can also benefit kernels, since it has been established that global average pooling can significantly boost the performance of kernels.
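As a flavour of what such a "translation" can look like, here is a hedged sketch of a convolutional kernel with global average pooling built with the same neural-tangents stax API as above; the architecture, channel counts and image sizes are again illustrative assumptions.

```python
# Sketch: an infinite-width convolutional kernel with global average pooling,
# using the neural-tangents stax API.
from jax import random
from neural_tangents import stax

init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Conv(64, (3, 3), padding='SAME'), stax.Relu(),
    stax.Conv(64, (3, 3), padding='SAME'), stax.Relu(),
    stax.GlobalAvgPool(),           # the pooling trick "translated" to the kernel
    stax.Dense(1)
)

key = random.PRNGKey(0)
x = random.normal(key, (4, 8, 8, 3))    # 4 tiny 8x8 RGB images (toy data)
print(kernel_fn(x, x, 'ntk').shape)     # (4, 4) kernel matrix
```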

Similarly, they posit that one can try to translate other architectures like recurrent neural networks, graph neural networks, and transformers, to kernels as well.

Drawing insights from the inner workings of over-parameterised deep neural networks remains a challenging theoretical question. However, with recent developments such as the ones discussed above, researchers are optimistic that we can now have a better understanding of ultra-wide neural networks, whose behaviour can be captured by neural tangent kernels.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.