“The no free lunch theorem calls for prudency when solving ML problems by requiring that you test multiple algorithms and solutions with a clear mind and without prejudice.”
In a paper titled, ‘The Lack of A Priori Distinctions Between Learning Algorithms’, that dates back to 1996, David Wolpert explored the following questions:
- Can we actually get something for nothing in supervised learning?
- Can we get useful, caveat-free theoretical results that link the training set and the learning algorithm to generalisation error, without making assumptions concerning the target?
- Are there useful practical techniques that require no such assumptions?
He showed that for any two algorithms, A and B, there are as many scenarios where A will perform worse than B as there are instances where A will outperform B. In short, for all possible problems, average performance of both the algorithms is the same.
Implications Of NFL In Deep Learning
Although the no free lunch theorem by Wolpert has a more theoretical than practical appeal, there are some implications that should still be taken into account by everyone working with machine learning algorithms. These theorems prove that under a uniform distribution over search problems or learning problems, all algorithms perform equally.
In recent work, Wolpert discusses the importance of NFL theorems for machine learning. Search and learning are key aspects of ML and the NFL theorems have something to deliver here.
To illustrate it, writes Wolpert, choose a set of objective functions on which a certain search algorithm performs better than the purely random search algorithm. Then the NFL for search theorem says that compared to random search, this favourite search algorithm “loses on as many” objective functions as it wins and this is true no matter what performance measure one uses. According to Wolpert, the primary significance of the NFL theorems for search is to give insights about the underlying mathematical ‘skeleton’ of optimisation theory before the ‘flesh’ of the probability distributions of a particular context, and set of optimisation problems are imposed.
For Supervised Learning
The performance of any supervised learning algorithm, the paper states, is governed by an inner product between two vectors, both indexed by the set of all target functions. As long as the loss function is symmetric, it indicates how close the results of the supervised learning algorithm are to the real world. This supervised learning inner product formula results in a set of NFL theorems, which can come in handy.
Why NFL Enables Prudency
Wolpert, who has been instrumental behind the proliferation of NFLs, in his recent work, wrote that there are many avenues of research related to the NFL theorems which are yet to be explored properly. Some of these involve free lunch theorems which concern fields closely related to search, e.g., coevolution.
“NFL theorems can give insights about the underlying mathematical ‘skeleton’ of optimisation theory before the ‘flesh’ of the probability distributions of a particular context and set of optimisation problems are imposed.”David Wolpert
Luca Massaron, a best selling machine learning author and a Google Developer Expert, who has been in the field of machine learning for nearly two decades, says that, though some learning algorithms are usually considered the best-in-class for certain types of problems, such as for instance gradient boosting with tabular data problems, any practitioner should not overlook that the ultimate goal of a learning algorithm is to correctly map an unknown function ruling how the predictions relate to the data inputs available. Since such function is unknown, there is no a-priori guarantee that the best learning algorithm is being chosen for the problem if one picks a state of the art algorithm.
The no free lunch theorem, explains Luca and calls for prudency when solving machine learning problems. Sometimes, by testing multiple solutions, one might even find that simpler solutions may work perfectly well for some problems, without resorting to more complex, state of the art ones. One may discover that in the long run, other algorithms could be more performant than the picked state of the art solution.