“Even for people with similar levels of experience in tuning, the model performed differently.”
Hyperparameters are usually tuned by a human operator, such as an ML engineer. This remains standard practice despite the great success of AutoML platforms. Though businesses are more readily embracing AutoML tools, the role of the human operator cannot be disregarded. So the question is: does the performance of machine learning models depend on the competence of the human operator? The answer is, of course, a plain yes. But that alone doesn't suffice. Organisations invest heavily in picking the right candidate, so it is crucial to understand this aspect in more detail.
To find out, researchers from Delft University of Technology in the Netherlands surveyed a group of ML engineers of varying expertise. The results of this survey were published recently in a paper titled 'Black Magic in Deep Learning: How Human Skill Impacts Network Training'.
The extraordinary skill of a human expert at tuning hyperparameters, the researchers wrote, is informally referred to as "black magic" in deep learning.
Does Experience Really Matter?
For the experiment, the researchers selected the SqueezeNet model, as they found it efficient to train while achieving reasonable accuracy compared to more complex networks. To prevent participants from exploiting model-specific knowledge, the network design was not shared with them.
Participants were given access to 15 common hyperparameters. Four were mandatory: the number of epochs, batch size, loss function, and optimiser. The remaining 11 optional hyperparameters were set to their default values.
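A setup like this — a handful of mandatory knobs plus optional ones that fall back to defaults — can be sketched as follows. All names, default values, and the function itself are illustrative assumptions, not the study's actual interface:

```python
# Hypothetical sketch of the tuning interface described above. The
# hyperparameter names and defaults are assumptions for illustration only.
OPTIONAL_DEFAULTS = {
    "learning_rate": 0.01,
    "momentum": 0.9,
    "weight_decay": 0.0,
    # ... the study exposed 11 optional hyperparameters in total
}

def build_config(epochs, batch_size, loss_function, optimiser, **optional):
    """Combine the four mandatory hyperparameters with optional overrides."""
    unknown = set(optional) - set(OPTIONAL_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown hyperparameters: {unknown}")
    config = dict(OPTIONAL_DEFAULTS)   # optional values start at their defaults
    config.update(optional)            # apply participant overrides, if any
    config.update(epochs=epochs, batch_size=batch_size,
                  loss_function=loss_function, optimiser=optimiser)
    return config
```

For example, `build_config(30, 64, "cross_entropy", "sgd", learning_rate=0.005)` overrides only the learning rate, leaving momentum and weight decay at their defaults.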
Taking size and difficulty into account, the participants were given an image classification task on a subset of ImageNet. The dataset's name was kept under wraps; participants were told only that it was an image classification task, along with the dataset statistics: 10 classes, 13,000 training images, 500 validation images, and 500 test images.
The whole experimental procedure can be summarised as follows:
- The participants enter their information.
- They submit hyperparameter values and evaluate intermediate training results.
- Once training is finished, the participant can either submit a new hyperparameter configuration or end the experiment.
- This is repeated until the 120-minute time limit is reached.
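The session loop above can be sketched in a few lines. This is a minimal illustration of the described procedure, not the study's actual harness; the callback names are assumptions:

```python
import time

def run_session(submit_config, train_and_report, time_limit_s=120 * 60):
    """Sketch of the session described above: the participant repeatedly
    submits hyperparameters and inspects results until they choose to stop
    or the 120-minute budget runs out. Both callbacks are hypothetical
    stand-ins for the real participant interface and training routine."""
    start = time.monotonic()
    history = []
    while time.monotonic() - start < time_limit_s:
        config = submit_config(history)    # participant picks hyperparameters
        if config is None:                 # participant chose to end the experiment
            break
        result = train_and_report(config)  # train and report metrics back
        history.append((config, result))
    return history
```

Here `submit_config` returning `None` models the participant's option to end the experiment before the time limit.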
The researchers grouped the participants by months of deep learning experience. They collected a total of 463 different hyperparameter combinations from 31 participants: the Novice group contained 8 participants with no deep learning experience, 12 participants had less than nine months of experience, and the remaining 11 had more than nine months.
When a participant submitted their final choice of hyperparameters, the experiment ended, and that configuration was then trained 10 times. "Each of the 10 repeats has a different random seed, while the seeds are the same for each participant," stated the researchers.
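The evaluation protocol quoted above — retraining a fixed configuration once per seed, with the same seed list for everyone — might look like this. `train_fn` is a hypothetical stand-in for the real training routine:

```python
import random
import statistics

def evaluate_final_config(train_fn, config, seeds=range(10)):
    """Sketch of the repeat-evaluation protocol: retrain the participant's
    final configuration once per fixed seed and summarise the accuracies.
    `train_fn(config, seed)` is an assumed interface, not the study's code."""
    accuracies = []
    for seed in seeds:      # the same seed list is reused for every participant
        random.seed(seed)   # in practice, the DL framework's RNG would also be seeded
        accuracies.append(train_fn(config, seed))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Averaging over fixed seeds separates the effect of the chosen hyperparameters from run-to-run training noise, which is presumably why the study repeats each final configuration 10 times.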
The results showed that human skill does impact accuracy. A few other key findings from the survey:
- Even for people with similar levels of experience in tuning, the model performed differently.
- Even for experts, there can be an accuracy difference of 5%.
- More experience correlates with optimisation skill.
- The trend shows a strong positive correlation between experience and the final performance of the model.
- Inexperienced participants usually followed a random search strategy, often starting by tuning optional hyperparameters that may be best left at their defaults initially.
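The random-search strategy attributed to novices in the last point can be sketched in a few lines: sample each hyperparameter independently and keep the best-scoring configuration. The search space and scoring function below are illustrative, not those used in the study:

```python
import random

def random_search(search_space, objective, n_trials=20, seed=0):
    """Minimal random-search sketch: sample each hyperparameter uniformly
    at random from its candidate values and keep the best configuration
    found. This is a generic illustration, not the study's procedure."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in search_space.items()}
        score = objective(config)          # e.g. validation accuracy
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Random search treats every hyperparameter as equally worth exploring, which matches the observation that novices spend trials on optional knobs an expert would leave at their defaults.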
On a concluding note, the team behind this work shared a couple of insightful recommendations. The authors underlined the importance of reproducibility and urged researchers to share their final hyperparameter settings. And since it is difficult to tell whether a purported superior performance is simply due to a massive supercomputer, they advise reviewers to pay more attention to reproducibility and baseline comparisons, and to put less emphasis on superior performance.