Behind every successful scientific implementation, there is a theory that supports the results or allows one to anticipate the consequences. In the case of machine learning, however, the situation is a bit counterintuitive. Though the number of ML implementations grows every day, one still cannot pinpoint why a particular model makes the predictions it does.
Machine learning models are called black-box models for a reason!
Why does a certain model work? Is it the number of layers? Is it the depth? Is it the width? There are plenty of unanswered questions such as these. Trying to answer them leads to information theory and a host of complexities.
The Complexity Of Defining A Theory
The Kolmogorov complexity of a function is the length of the shortest possible program that produces the same outputs as the function for all given inputs.
The human brain has around 10^15 synapses, and 10^9 of these synapses are likely critical to the kind of natural language processing needed to pass the Turing test. The Kolmogorov complexity of the Turing test is therefore expected to be of the order of 10^9 bits. In other words, no substantially shorter algorithm is expected to solve this problem.
To have a better understanding of this complexity, let’s take a simple example.
If you have to display a repeating pattern such as 'abababababab', a very short program will suffice:
>>> print('ab' * 6)
If a random string such as 'ahfu354ht4bjjk5' has to be printed, however, there is no such shortcut: the whole string has to be stored in memory, assigned to a variable, and then printed.
>>> X = 'ahfu354ht4bjjk5'
>>> print(X)
In the first case, the '*' operator does the job; in the second, the string has to be memorised, so the program cannot be much shorter than the string itself. Well-defined rules cut down computational costs. However, how many rules can we write?
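The contrast between the two strings can be made measurable with a rough sketch. Kolmogorov complexity itself is not computable, so the snippet below uses the size of a zlib-compressed copy as a stand-in that only gives an upper bound. The helper name `compressed_size` and the two sample strings are assumptions made for illustration.

```python
import random
import string
import zlib


def compressed_size(text: str) -> int:
    """Bytes zlib needs to store the text: a crude upper-bound proxy, not the true complexity."""
    return len(zlib.compress(text.encode("utf-8"), 9))


rng = random.Random(0)  # fixed seed so repeated runs give the same string
alphabet = string.ascii_lowercase + string.digits

# A string produced by a short rule: repeat 'ab' 500 times.
patterned = "ab" * 500
# A string with no obvious rule: 1000 characters drawn at random.
scrambled = "".join(rng.choice(alphabet) for _ in range(1000))

print("patterned:", len(patterned), "chars ->", compressed_size(patterned), "bytes")
print("scrambled:", len(scrambled), "chars ->", compressed_size(scrambled), "bytes")
```

Both strings have the same length, but the patterned one should shrink to a small fraction of its size while the random one barely shrinks at all, which mirrors the rule-versus-memorisation distinction above.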
Let’s look at this animation below:
Designing this simulated dancing model involves many rules, from the folds of the dress to the shadow in the background to the movement of the limbs. Writing a program that generates such a simulation while satisfying the laws of physics quickly becomes complex. And growing complexity does not always go hand in hand with parallelisation of computational resources.
However, by applying principal component analysis (PCA) to the data, one can estimate how compressible the data is, and how to decompress it.
Along with insights about the “complexity” of the simulation, PCA has another useful property: it also extracts information about the movements within the simulation, as shown in this informative post.
Deep learning models generalise well in practice despite their large capacity, numerical instability, sharp minima, and non-robustness, which looks like a paradox.
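The idea that PCA hints at how compressible data is, and how to decompress it, can be sketched on a toy dataset. The example below is not the dancing simulation: it assumes hypothetical 2D points scattered closely around a line, and it uses the closed-form eigenvalues of a 2x2 covariance matrix instead of a library PCA routine.

```python
import math
import random

rng = random.Random(0)

# Hypothetical data: 200 points lying close to the line y = 2x.
points = []
for _ in range(200):
    t = rng.uniform(-1.0, 1.0)
    points.append((t + rng.gauss(0, 0.01), 2 * t + rng.gauss(0, 0.01)))

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centred = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the symmetric 2x2 covariance matrix.
sxx = sum(x * x for x, _ in centred) / n
syy = sum(y * y for _, y in centred) / n
sxy = sum(x * y for x, y in centred) / n

# Closed-form eigenvalues of a symmetric 2x2 matrix.
half_trace = (sxx + syy) / 2
spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
top, bottom = half_trace + spread, half_trace - spread

# Share of the variance captured by the first principal component.
explained = top / (top + bottom)

# Unit eigenvector for the top eigenvalue (valid here because sxy is non-zero).
vx, vy = sxy, top - sxx
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# Compress: one number per point. Decompress: map that number back to 2D.
codes = [x * vx + y * vy for x, y in centred]
restored = [(mean_x + c * vx, mean_y + c * vy) for c in codes]
max_error = max(math.dist(p, q) for p, q in zip(points, restored))

print("variance explained by one component:", round(explained, 5))
print("largest reconstruction error:", round(max_error, 4))
```

When one component explains nearly all the variance, each point can be stored as a single number with little loss, which is the sense in which the explained variance indicates compressibility.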
What Makes Deep Learning Successful
We are still groping along the walls in the dark: we are moving, but still blind. To shed some light on the principles underlying deep learning's practical successes, researchers from EPFL, Switzerland, published a paper that puts forward three conjectures, giving a direction to ML theory:
Most of the data from the current state of our universe and most of the problems we aim to solve with these data, as well as any good approximations of these data and problems, have a Kolmogorov complexity larger than 10^9 bits.
Most of the data from the current state of our universe and most of the problems we aim to solve with these data, as well as any good approximations of these data and problems, have a large non-parallelisable logical depth.
At equivalent Kolmogorov complexity, deeper neural networks compute functions with larger non-parallelisable logical depth.
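To build some intuition for the conjectures above, here is a loose illustration of a computation that is short to describe yet hard to parallelise, next to one that splits easily. It is not the formal definition of logical depth used in the paper. The function names are hypothetical, and the claim that an iterated hash cannot be shortcut is a widely held assumption rather than a proven fact.

```python
import hashlib


def hash_chain(seed: bytes, steps: int) -> str:
    """Apply SHA-256 repeatedly; each step needs the output of the previous one."""
    digest = seed
    for _ in range(steps):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def sum_of_squares(start: int, stop: int) -> int:
    """Every term is independent, so any sub-range can be computed separately."""
    return sum(k * k for k in range(start, stop))


# The sum can be cut into pieces that separate workers could handle at once.
whole = sum_of_squares(0, 1000)
in_two_halves = sum_of_squares(0, 500) + sum_of_squares(500, 1000)
print("same result when split:", whole == in_two_halves)

# The chain is a few lines of code, but its 100,000 steps must run in order.
print("chain output prefix:", hash_chain(b"seed", 100_000)[:16])
```

Both functions are only a few lines long, so their description length is small. Only the chain forces a long run of steps that must happen one after another, which is the flavour of depth that adding more parallel hardware does not reduce.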
Though the reason behind the superior performance of larger models remains unclear, researchers have leaned towards larger models because they are “easy” to optimise with methods like stochastic gradient descent (SGD). These methods help the models converge to global minima in over-parameterised regimes.
The researchers at EPFL posit that the practical success of deep learning is connected to a key feature of the data collected from the universe around us to feed machine learning algorithms: large non-parallelisable logical depth.
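A minimal sketch of SGD in an over-parameterised setting is shown below. It deliberately uses a linear model rather than a deep network, with 20 parameters fitted to 5 hypothetical random data points, so a global minimum with zero training loss exists. The sizes, learning rate, and epoch count are assumptions picked for illustration.

```python
import random

rng = random.Random(0)

n_samples, n_params = 5, 20  # more parameters than data points
xs = [[rng.gauss(0, 1) for _ in range(n_params)] for _ in range(n_samples)]
ys = [rng.gauss(0, 1) for _ in range(n_samples)]


def predict(w, x):
    """Linear model: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def mean_squared_error(w):
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / n_samples


w = [0.0] * n_params
learning_rate = 0.01
initial_loss = mean_squared_error(w)

for epoch in range(2000):
    order = list(range(n_samples))
    rng.shuffle(order)  # visit the samples in a fresh random order each epoch
    for i in order:
        error = predict(w, xs[i]) - ys[i]
        # The gradient of 0.5 * error**2 with respect to w is error * x.
        w = [wi - learning_rate * error * xi for wi, xi in zip(w, xs[i])]

final_loss = mean_squared_error(w)
print("initial training loss:", initial_loss)
print("final training loss:", final_loss)
```

Because the model has more free parameters than constraints, SGD can drive the training loss essentially to zero here. Whether and why the same happens for deep, non-linear networks is exactly the open question the article describes.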
The authors argue that the formidable logical depth of mathematics has been the key to understanding physical phenomena of large logical depth (and small Kolmogorov complexity), in a manner that human brains cannot match.
Drawing a parallel between the success of deep learning and the effectiveness of mathematics, the authors observe that the prevalence of depth is the common denominator.
The authors believe that determining the non-parallelisable logical depth of real data, as well as of specific functions related to this data, would be a significant step towards a theoretical understanding of deep learning.