Early artificial intelligence systems were rule-based. They applied logic and expert knowledge gathered over time to derive results. Later research allowed scientists to incorporate learning to set adjustable parameters, but these were limited in number.
This changed considerably with the advent of deep learning. Deep learning models are over-parameterised: they have more parameters than there are data points available for training. In traditional systems, this leads to overfitting, where the model learns not only the general trends but also the random vagaries of the data it was trained on. Deep learning avoids such overfitting by randomly initialising the parameters and then iteratively adjusting them to fit the data better, using a method called stochastic gradient descent. This method has also been shown to ensure that the learned model generalises well.
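The procedure described above (random initialisation followed by small, repeated adjustments) can be illustrated with a toy example. The Python sketch below is only an illustration and not how any production deep learning system is implemented: it fits a straight line with two parameters rather than an over-parameterised network, and the function name sgd_fit, the learning rate, the number of epochs and the data are all assumptions chosen for the example.

```python
import random


def sgd_fit(xs, ys, lr=0.05, epochs=500, seed=0):
    """Fit y = w*x + b by stochastic gradient descent on squared error.

    Hypothetical helper for illustration; lr, epochs and seed are
    assumed values, not taken from the article.
    """
    rng = random.Random(seed)
    # Random initialisation of the adjustable parameters.
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data points in a random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b,
            # computed from a single data point (the "stochastic" part).
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b


# Toy data generated from the line y = 2x + 1 (assumed for the example).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = sgd_fit(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

On this noise-free data the learned values should end up very close to the slope and intercept that generated it. Real deep learning models apply the same kind of update to millions or billions of parameters, which is where the computational cost comes from.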
This approach gives deep learning models great flexibility. On the downside, this flexibility comes at a computational cost.
The rule of thumb for all statistical models is that improving performance by a factor of k requires at least k^2 times more training data. The second part of the computational cost comes from over-parameterisation. Once this is accounted for, the total computational cost of the improvement is at least k^4. The ‘4’ in the exponent is very expensive: a 10-fold improvement, for example, would require at least a 10,000-fold increase in computation.
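To see how quickly these lower bounds grow, the two scaling rules can be written out directly. The short Python sketch below is an illustration only; the function names data_multiplier and compute_multiplier are hypothetical, and the values are the theoretical minimums described above, not measurements.

```python
def data_multiplier(k):
    """Lower bound on extra training data for a k-fold improvement (k^2)."""
    return k ** 2


def compute_multiplier(k):
    """Lower bound on extra computation once over-parameterisation
    is accounted for (k^4)."""
    return k ** 4


for k in (2, 5, 10):
    print(f"{k}-fold improvement: "
          f"at least {data_multiplier(k):,}x data, "
          f"at least {compute_multiplier(k):,}x computation")
```

For k = 10 this gives the 10,000-fold figure quoted above, alongside a 100-fold increase in data.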
In a 2020 research paper titled ‘The Computational Limits of Deep Learning,’ the authors collected data from more than 1,000 research papers on deep learning, spanning domains such as image classification, question answering, machine translation, and object detection. Taking image classification as an example, the authors found that the push to reduce classification errors had come at an enormous computational cost over the years. In 2012, AlexNet was trained for five to six days using only two GPUs. In 2018, another model, NASNet-A, cut the error rate of AlexNet by half, but it used 1,000 times more computational power to do so.
Theory says that computation needs to scale with at least the fourth power of the improvement in performance. But the data above showed that actual requirements have scaled with at least the ninth power. This means that halving the error rate requires more than 500 times the computational resources. Further, of the 1,000-fold difference in computational cost between AlexNet and NASNet-A, only a six-fold improvement came from better hardware. The rest came from using more processors or running them longer, both of which incur higher costs. The analysis also allowed the researchers to extrapolate the computational cost-performance curve and estimate what even more impressive performance benchmarks would require. For example, achieving a 5 per cent error rate would need 10^19 billion floating-point operations.
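The gap between the theoretical and the observed scaling can be checked with some back-of-the-envelope arithmetic. The Python sketch below is illustrative only and its names are hypothetical. Two of its numbers are rough derivations from the figures quoted in this article rather than values reported by the researchers: the exponent implied by the AlexNet and NASNet-A comparison alone, and the share of the 1,000-fold increase not explained by hardware, which assumes the contributions multiply.

```python
import math


def compute_factor(improvement, exponent):
    """Computation multiplier for a given improvement factor and exponent."""
    return improvement ** exponent


# Halving the error rate is a 2-fold improvement.
theoretical = compute_factor(2, 4)  # fourth-power lower bound
observed = compute_factor(2, 9)     # ninth-power scaling found in practice

# Exponent implied by one data point: half the error for 1,000x computation.
# Rough derivation for illustration, not a figure from the paper.
implied_exponent = math.log(1000) / math.log(2)

# Assuming the contributions multiply, the part of the 1,000-fold increase
# not explained by the six-fold hardware improvement.
non_hardware_factor = 1000 / 6

print(f"theoretical minimum to halve the error: {theoretical}x")
print(f"ninth-power scaling to halve the error: {observed}x")
print(f"exponent implied by 1,000x for half the error: {implied_exponent:.2f}")
print(f"increase from more processors or longer runs: ~{non_hardware_factor:.0f}x")
```

The ninth-power figure for a 2-fold improvement is 512, which is where "more than 500 times" comes from, against a theoretical minimum of 16.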
Credit: IEEE Spectrum
Huge economic costs and carbon emissions accompany the computational burden of deep learning models. For example, training the model extrapolated above would cost $100 billion and would produce as much carbon emissions as New York City does in a month. Faced with such skyrocketing costs, both economic and environmental, researchers would have to either come up with more efficient ways to solve these problems or abandon working on them altogether, with obvious repercussions for progress.
Interestingly, when OpenAI released GPT-3 (the largest language model at the time of its release), it decided not to open-source the model and instead gave an exclusive license to Microsoft. The explanation given by OpenAI was that training the model cost more than $4 million and that, at such a high cost, it would not be feasible to retrain it.
Businesses outside the tech industry are shying away from the computational expense of deep learning. For example, a large European supermarket chain recently abandoned a deep learning-based model that would have improved its ability to predict which products would be purchased. This was because the cost of training and running such a system would have been very high.
Over the last decade, CPUs gave way to GPUs and, in some cases, to field-programmable gate arrays and application-specific integrated circuits. The same strategy could be applied to design processors that are specifically efficient for deep learning calculations. For longer-term gains, researchers might look at adopting different hardware frameworks.
Another approach could be to reduce the computational burden by generating neural networks that are smaller when implemented. In such a case, too, it must be ensured that the total cost, training included, does not end up larger than that of simply training and running the original model.
Lastly, to evade the computational costs of deep learning, experts suggest that the community as a whole could move to yet-undiscovered or underutilised types of machine learning. Currently, neuro-symbolic methods and other techniques are being developed to combine the power of expert knowledge with the flexibility that is characteristic of neural networks.