The general consensus in the machine learning community is that making a model smaller leads to a larger training error, while a bigger model results in a larger generalisation gap. That is why developers usually hunt for the sweet spot between training error and generalisation.

However, the best test error is often achieved by the largest model, which is counterintuitive.

[Illustration of the double-descent risk curve by Belkin et al. (2018)]

As one increases the model complexity past the point where the model can perfectly fit the training data (the interpolation regime), the test error continues to drop.

The inner training dynamics of neural networks have long been a mystery, and unlocking them would lead to a better understanding of their predictions.

To this end, a paper titled Neural Tangent Kernel (NTK) was presented at the prestigious NIPS conference in 2018 and has been making noise ever since.

It also occupied a majority of the talks at the recently concluded Workshop on Theory of Deep Learning at the Institute for Advanced Study.

The authors of the paper, Arthur Jacot and his colleagues at the Swiss Federal Institute of Technology Lausanne, introduced the NTK as a new tool for studying ANNs, one that describes the local dynamics of an ANN during gradient descent.

"This led to a new connection between ANN training and kernel methods," wrote the authors in their paper.
In the infinite-width limit, an ANN can be described in function space directly by the limit of the NTK, an explicit constant kernel that depends only on the network's depth, nonlinearity and parameter initialisation variance.

The practical significance of the NTK theory is that the kernel for infinite-width networks can be computed exactly and treated as a new kernel machine.

The whole idea behind the NTK is to draw a parallel between the convergence of neural networks and that of kernel methods, and then to define a new kernel that describes the generalisation of neural networks.

Hence a connection is established between the large models that achieve the best test error and kernel methods, which in turn paves the way to assessing the performance of neural networks.

This works because, at initialisation, artificial neural networks (ANNs) in the infinite-width limit are equivalent to Gaussian processes and can therefore be likened to kernel methods.

Why Explore The Kernel Regime

[Illustration source: Rajat's blog]

A kernel method can be thought of as a trick for learning patterns or correlations in high-dimensional data without learning a fixed set of parameters. This comes in handy when the parameterisation is complex or the data is unlabelled. Kernel methods learn through instances, remembering weights and producing similarity scores without needing to know the whole picture.
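To make the instance-based, similarity-score flavour of kernel methods concrete, here is a minimal sketch of kernel ridge regression with a Gaussian (RBF) kernel in plain NumPy. The toy data, the `gamma` value and the function names are our own illustrative choices, not anything from the NTK paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise similarity scores: k(x, z) = exp(-gamma * ||x - z||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Toy 1-D regression problem (hypothetical data)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0])

# Kernel ridge regression: solve (K + lam*I) alpha = y
K = rbf_kernel(X, X)
lam = 1e-3
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predictions at new points are weighted sums of similarities
# to the stored training instances -- no parametric model is learned.
X_test = np.array([[0.0], [1.5]])
y_pred = rbf_kernel(X_test, X) @ alpha
print(y_pred)  # should land close to sin(0) and sin(1.5)
```

Note that the model is nothing but the stored training instances plus the weights `alpha`; prediction is purely a matter of comparing a new point against those instances.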
This makes the training computationally cheap.

Now, in the case of infinite-width networks, the neural tangent kernel (NTK) consists of the pairwise inner products between the feature maps of the data points at initialisation.

And since the tangent kernel stays constant during training, the training dynamics reduce to a simple linear ordinary differential equation.

The authors of the seminal paper argue that the limit of the NTK is a powerful tool for understanding the generalisation properties of neural networks, and that it allows one to study the influence of depth and nonlinearity on the learning abilities of the network.

The key difference between the NTK and previously proposed kernels is that the NTK is defined through the inner product between the gradients of the network outputs with respect to the network parameters.

Another fruitful direction is to "translate" different tricks of neural networks to kernels and to check their practical performance.

Researchers at Carnegie Mellon University hope that tricks like batch normalisation, dropout and max-pooling can also benefit kernels, since it has been established that global average pooling can significantly boost the performance of kernels.

Similarly, they posit that other architectures, such as recurrent neural networks, graph neural networks and transformers, can be translated to kernels as well.

Drawing insights from the inner workings of over-parameterised deep neural networks remains a challenging theoretical question. However, with recent developments such as the ones discussed above, researchers are optimistic that we can now gain a better understanding of ultra-wide neural networks through neural tangent kernels.
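The defining inner product between parameter gradients can be sketched in a few lines of NumPy. Below is a minimal, hand-rolled computation of the empirical NTK for a one-hidden-layer ReLU network with 1/sqrt(m) output scaling; the architecture, width and data here are our own toy choices for illustration, not the paper's experiments.

```python
import numpy as np

def init_params(d, m, rng):
    # Weights drawn from N(0, 1); the 1/sqrt(m) scaling lives in the forward pass
    return rng.standard_normal((m, d)), rng.standard_normal(m)

def grad_f(x, W, v):
    # Gradient of the scalar output f(x) = v . relu(W x) / sqrt(m)
    # with respect to ALL parameters (W and v), flattened into one vector.
    m = len(v)
    pre = W @ x
    act = np.maximum(pre, 0.0)
    d_act = (pre > 0).astype(float)
    dW = np.outer(v * d_act, x) / np.sqrt(m)   # df/dW
    dv = act / np.sqrt(m)                       # df/dv
    return np.concatenate([dW.ravel(), dv])

rng = np.random.default_rng(0)
d, m = 3, 4096                 # input dimension, hidden width (wide)
W, v = init_params(d, m, rng)

xs = [rng.standard_normal(d) for _ in range(3)]
J = np.stack([grad_f(x, W, v) for x in xs])

# Empirical NTK at initialisation: pairwise inner products
# of the output gradients, Theta(x, x') = grad f(x) . grad f(x')
K = J @ J.T
print(np.round(K, 2))
```

As the width m grows, this randomly initialised Gram matrix concentrates around the deterministic limiting kernel that the paper studies, and (in the limit) stays fixed throughout training.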