In one of the most revealing research papers written in recent times, researchers from Cornell Tech and Facebook AI quash the hype around the success of machine learning. They argue, and go on to demonstrate, that the trend appears to be overstated: so-called cutting-edge or benchmark methods perform similarly to one another even when they are a decade apart. In other words, the authors believe that metric learning algorithms have not made spectacular progress.
In this work, the authors try to demonstrate the significance of assessing algorithms more diligently and how a few practices can help reported ML success reflect reality.
Where Do Things Go Wrong
Over the past decade, deep convolutional networks have made tremendous progress. Their application in computer vision is almost everywhere, from classification to segmentation to object detection and even generative models. But has the evaluation carried out to track this progress been leakproof? And were the techniques employed unaffected by the improvements in deep learning methods?
The goal of metric learning is to map data to an embedding space where similar data are close together and dissimilar data are far apart. The authors begin with the notion that deep networks have had a similar effect on metric learning; the combination of the two is known as deep metric learning.
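To make the idea of "close together" and "far apart" concrete, here is a minimal sketch of a contrastive-style loss on a pair of embeddings. It is an illustration only, not the authors' code: the function names, the `margin` value, and the toy two-dimensional embeddings are all assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Contrastive-style pair loss (illustrative; margin is an assumed value).

    Similar pairs are penalised for being far apart; dissimilar pairs are
    penalised only when they sit closer than the margin.
    """
    d = euclidean(emb_a, emb_b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Toy embeddings (made up for illustration).
anchor = [0.0, 0.0]
positive = [0.1, 0.0]   # same class, already close to the anchor
negative = [0.3, 0.4]   # different class, distance 0.5 < margin

loss_pos = contrastive_loss(anchor, positive, same_class=True)
loss_neg = contrastive_loss(anchor, negative, same_class=False)
print(loss_pos, loss_neg)
```

Minimising a loss of this shape pulls same-class embeddings together and pushes different-class embeddings at least `margin` apart, which is the behaviour the definition above describes.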
The authors then examine flaws in current research papers, including unfair comparisons and the weaknesses of commonly used accuracy metrics. They propose a training and evaluation protocol that addresses these flaws and run experiments on a variety of loss functions.
For instance, the authors write, one benchmark paper from 2017 used ResNet50 and claimed huge performance gains, while the competing methods used GoogLeNet, which has significantly lower initial accuracies. The authors therefore conclude that much of the performance gain likely came from the choice of network architecture and not from the proposed method. Practices such as these can put ML in the headlines, but when we look at how many of these state-of-the-art models are actually deployed, the reality is not that impressive.
The authors underline the importance of keeping the parameters constant if one has to prove that a certain new algorithm outperforms its contemporaries.
To carry out the evaluations, the authors introduce settings that cover the following:
- Fair comparisons and reproducibility
- Hyperparameter search via cross-validation
- Informative accuracy metrics
As shown in the plot above, the actual trend is not far from the results of earlier related work, which indicates that those who claimed a dramatic improvement might not have been fair in their evaluation.
If a paper attempts to explain the performance gains of its proposed method, and it turns out that those performance gains are non-existent, then its explanation must be invalid as well.
The results show that when hyperparameters are properly tuned via cross-validation, most methods perform similarly to one another. The authors believe this work will lead to more investigation into the relationship between hyperparameters and datasets, and into the factors related to particular dataset/architecture combinations.
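As a rough sketch of what tuning hyperparameters via cross-validation looks like in practice, the snippet below scores each candidate setting on several train/validation splits and keeps the one with the best average. It is not the authors' protocol: the plain search over a list of candidates, the helper names, and the toy scoring function standing in for real training are all assumptions made for illustration.

```python
def k_fold_splits(items, k):
    """Yield (train, validation) partitions; each fold is validation once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

def cross_validated_score(train_and_score, items, params, k=4):
    """Average validation score of one hyperparameter setting over k folds."""
    scores = [train_and_score(train, validation, params)
              for train, validation in k_fold_splits(items, k)]
    return sum(scores) / len(scores)

def search(train_and_score, items, candidates, k=4):
    """Return (best_score, best_params) among the candidate settings."""
    scored = [(cross_validated_score(train_and_score, items, p, k), p)
              for p in candidates]
    return max(scored, key=lambda pair: pair[0])

def toy_train_and_score(train, validation, params):
    """Hypothetical stand-in for 'train a model, return validation accuracy'."""
    return 1.0 - abs(params["margin"] - 0.5)

items = list(range(12))  # e.g. class labels or sample ids (assumed)
candidates = [{"margin": m} for m in (0.1, 0.5, 1.0)]
best_score, best_params = search(toy_train_and_score, items, candidates)
print(best_params)
```

The point of the design is that the test set never appears in `search`: every method gets the same tuning budget on validation folds, so a difference in final accuracy is less likely to be an artefact of one method simply having been tuned more carefully.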
According to the authors, this work exposes the following:
- Changes in network architecture, embedding size, image augmentation method, and optimisers lead to unfair comparisons
- The accuracy metrics in use are either misleading or do not provide a complete picture of the embedding space
- Papers have been inconsistent in their choice of the optimiser, and most papers do not present confidence intervals for their results
- Papers do not check performance at regular intervals and instead report accuracy after training for a predetermined number of iterations
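On the point about confidence intervals, one simple way to report them is to repeat training several times and summarise the spread of the resulting accuracies. The sketch below assumes a normal approximation (z = 1.96 for a 95% interval) and uses made-up accuracy values; with only a handful of runs, a t-distribution would be the more careful choice.

```python
import math
import statistics

def mean_confidence_interval(scores, z=1.96):
    """Return (mean, low, high) using a normal approximation.

    z=1.96 corresponds to a 95% interval; this is an assumed convention,
    not something prescribed by the paper.
    """
    mean = statistics.mean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - z * sem, mean + z * sem

# Hypothetical accuracies from five repeated training runs.
runs = [0.60, 0.62, 0.61, 0.59, 0.63]
mean, low, high = mean_confidence_interval(runs)
print(f"accuracy {mean:.3f} (95% CI {low:.3f} to {high:.3f})")
```

If the intervals of two methods overlap heavily, a claimed improvement of a point or two is hard to distinguish from run-to-run noise.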
The authors conclude that if proper machine learning practices are followed, the results of metric learning papers will better reflect reality and can lead to better work in high-impact domains such as self-supervised learning.