In 1989, Yann LeCun published a paper, “Backpropagation Applied to Handwritten Zip Code Recognition”. The paper demonstrated how constraints can be integrated into a backpropagation network through its architecture to enhance the network’s ability to generalise. The authors showed that a single network can learn the entire recognition operation, from the normalised image of the character to the final classification.
It has been 33 years since the paper was first published, but according to a fun experiment conducted by Tesla’s director of AI, Andrej Karpathy, it holds up even now. What’s more, he concluded that it will also hold up 33 years from now, in 2055.
According to Karpathy, the main limitation of the 1989 paper was its scale: a small dataset of 7,291 16×16 grayscale images of digits, and a tiny neural network of just about 1,000 neurons. Beyond that, the other factors, such as the neural network architecture, loss function and optimisation, mark the model as a ‘modern deep learning paper’.
Karpathy wrote in his blog that he re-implemented the whole procedure in PyTorch. The original network was implemented in Lisp using the backpropagation simulator SN (proposed by Léon Bottou and Yann LeCun, and later named Lush). On the software design side, Karpathy notes that it has three main components: a fast general tensor library for implementing basic mathematical operations; an autograd engine for tracking the forward compute graph and generating operations for the backward pass; and a scriptable high-level API of common deep learning operations.
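To give a sense of the scale involved, a 1989-style tiny convnet can be sketched in a few lines of modern PyTorch. The layer sizes and names below are illustrative assumptions loosely following the H1/H2/hidden/output structure the paper describes, not Karpathy’s exact reimplementation:

```python
import torch
import torch.nn as nn

class TinyNet1989(nn.Module):
    """Illustrative 1989-style net: two small conv stages and a dense head.

    Layer widths are stand-ins chosen to land near the paper's ~1,000-neuron
    scale; they are not the exact original or Karpathy's exact code.
    """
    def __init__(self):
        super().__init__()
        self.h1 = nn.Conv2d(1, 12, kernel_size=5, stride=2, padding=2)   # 16x16 -> 8x8
        self.h2 = nn.Conv2d(12, 12, kernel_size=5, stride=2, padding=2)  # 8x8 -> 4x4
        self.h3 = nn.Linear(12 * 4 * 4, 30)
        self.out = nn.Linear(30, 10)  # one unit per digit class

    def forward(self, x):
        x = torch.tanh(self.h1(x))           # tanh non-linearities, as in 1989
        x = torch.tanh(self.h2(x))
        x = torch.tanh(self.h3(x.flatten(1)))
        return self.out(x)

net = TinyNet1989()
y = net(torch.randn(1, 1, 16, 16))  # one 16x16 grayscale digit
print(y.shape)  # torch.Size([1, 10])
```

Even this toy version exercises all three software components Karpathy lists: the tensor library (the conv and linear ops), the autograd engine (invoked when `.backward()` is called on a loss), and the high-level `nn` API.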
While the original network trained for three days on a SUN-4/260 workstation, Karpathy ran his implementation on a MacBook Air (M1) CPU, where it took just 90 seconds, a roughly 3,000x naive speedup. Training required making 23 passes over the training set of 7,291 examples, for a total of 167,693 presentations to the neural network. Karpathy suggests that the process could be sped up further if full-batch training were performed instead of per-example SGD to maximise GPU utilisation, yielding an extra ~100x reduction in training latency.
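The contrast between per-example SGD and full-batch training can be sketched as follows. The linear model, learning rate and random data here are stand-ins for illustration only; the dataset size matches the paper’s 7,291 examples:

```python
import torch
import torch.nn as nn

# Stand-in model and data: a linear map over flattened 16x16 inputs.
model = nn.Linear(256, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.03)
loss_fn = nn.MSELoss()

X = torch.randn(7291, 256)   # dataset size from the 1989 paper
Y = torch.randn(7291, 10)

# Per-example SGD, as in 1989: one weight update per training example
# (23 passes over 7,291 examples gives the 167,693 presentations).
for i in range(3):           # truncated to three examples for illustration
    opt.zero_grad()
    loss_fn(model(X[i:i + 1]), Y[i:i + 1]).backward()
    opt.step()

# Full-batch alternative: one update per pass over all 7,291 examples.
# Far fewer, larger ops means much better hardware utilisation, which is
# where the suggested extra ~100x latency win would come from.
opt.zero_grad()
loss_fn(model(X), Y).backward()
opt.step()
```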
Karpathy said he was able to reproduce the numbers only roughly, not exactly. One reason was that the original dataset was no longer available, so he had to simulate it using the larger MNIST dataset: he took its 28×28 digits and scaled them down to the original 16×16 pixels using bilinear interpolation.
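That downsampling step is a one-liner in PyTorch. The random batch below stands in for real MNIST images; the resizing call itself follows the approach described:

```python
import torch
import torch.nn.functional as F

# Stand-in for a batch of 28x28 grayscale MNIST digits.
mnist_batch = torch.rand(8, 1, 28, 28)

# Scale down to the 16x16 resolution of the 1989 dataset
# using bilinear interpolation.
small = F.interpolate(mnist_batch, size=(16, 16),
                      mode='bilinear', align_corners=False)
print(small.shape)  # torch.Size([8, 1, 16, 16])
```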
Karpathy also pointed out that the paper was too abstract in its description of the weight initialisation scheme. The specific sparse connectivity between the H1 and H2 layers of the network was chosen by a scheme that is not disclosed in the original 1989 paper; here, Karpathy had to take a ‘sensible guess’ and use an overlapping block-sparse structure. He also expressed doubt about the paper’s claim of using a plain tanh non-linearity rather than the normalised tanh that was trending when the original paper was published. Other challenges included formatting errors in the PDF file. “I suspect that there are some formatting errors in the PDF file that, for example, erase dots “.”, making “2.5” look like “2 5”, and potentially (I think?) erasing square roots,” he wrote.
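The “normalised tanh” in question is presumably the scaled variant LeCun popularised in later work, f(x) = 1.7159 · tanh(2x/3), chosen so that an input of 1 maps to an output of roughly 1; that identification is an assumption here, since the 1989 paper does not spell it out:

```python
import torch

def scaled_tanh(x):
    # Scaled ("normalised") tanh assumed from LeCun's later writing:
    # f(x) = 1.7159 * tanh(2x/3), which keeps f(1) very close to 1.
    return 1.7159 * torch.tanh(2.0 / 3.0 * x)

x = torch.tensor([1.0])
print(float(scaled_tanh(x)))  # approximately 1.0 by design
```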
Karpathy concludes that not much has changed in the last 33 years, at least at the macro level: we are still using differentiable neural net architectures made up of layers of neurons and optimised end-to-end with backpropagation and stochastic gradient descent. The dataset and the size of the neural network, however, have grown considerably.
Karpathy managed to achieve better performance in terms of both speed and error rate. He mentioned that he was able to cut the error rate by 60 per cent without changing the dataset or the test-time latency of the model. “In particular, if I was transported to 1989, I would have ultimately become upper-bounded in my ability to further improve the system without a bigger computer,” he wrote.
Karpathy predicts that 2055 neural networks will look much the same as 2022 ones at the macro level; the only observable difference could be size, with datasets and models expected to be as much as 10,000,000x larger. Since today’s models are not optimally formulated, the error rate could be halved just by changing the details of the model, the loss function, augmentation and so on, and the gains could be enhanced further by scaling up the dataset.
“In its most extreme extrapolation, you will not want to train any neural networks at all. In 2055, you will ask a 10,000,000X-sized neural net mega brain to perform some task by speaking (or thinking) to it in English. And if you ask nicely enough, it will oblige. Yes, you could train a neural net too… but why would you?” he concluded.