Automated machine learning – or AutoML – was introduced to fill the talent gap in the ML industry and take mundane tasks off ML engineers' plates. Over the years, many AutoML tools have been released. But how good are these tools? Do they accomplish what they promise? Have they really become a solution to the dearth of talent in the data science industry?
To answer these long-standing questions, researchers from the Fraunhofer Institute in Germany investigated state-of-the-art AutoML frameworks. To their surprise, they found that AutoML tools perform on par with, or even better than, their human counterparts.
AutoML was introduced to cut down the time spent on iterative model-development tasks. AutoML tools have helped developers build scalable models with minimal domain expertise. So, how do they fare when pitted against humans?
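The core idea – automating the try-evaluate-select loop that engineers would otherwise iterate by hand – can be illustrated with a deliberately tiny sketch. This is plain Python for illustration only, not any real framework's API: candidate models are fitted and scored on a holdout set within a time budget, and the best one is kept.

```python
import time

# Toy illustration of the AutoML loop: fit each candidate model,
# score it on holdout data, keep the best - all within a time budget.

def mean_model(xs, ys):
    # Baseline: always predict the training mean.
    m = sum(ys) / len(ys)
    return lambda x: m

def linear_model(xs, ys):
    # Least-squares fit y = a*x + b for a single feature.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    b = my - a * mx
    return lambda x: a * x + b

def rmse(model, xs, ys):
    return (sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

def auto_select(train, valid, budget_seconds=1.0):
    """Try each candidate until the time budget runs out; keep the best."""
    deadline = time.monotonic() + budget_seconds
    best, best_err = None, float("inf")
    for fit in (mean_model, linear_model):
        if time.monotonic() > deadline:
            break
        model = fit(*train)
        err = rmse(model, *valid)
        if err < best_err:
            best, best_err = model, err
    return best, best_err
```

Real frameworks search over far larger spaces (pipelines, hyperparameters, ensembles), but the selection-under-a-time-budget structure is the same.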
How Does AutoML Perform Against Humans?
The researchers considered 12 popular datasets from OpenML: six supervised classification tasks and six supervised regression tasks – the two most popular machine learning task types on OpenML. For the experiment, they used the open-source AutoML Benchmark tool, which ships with full OpenML dataset integration for many AutoML frameworks as well as automated benchmarking functions.
The benchmarks were run with the default settings defined in config.yaml in the AutoML Benchmark project. These defaults include using all CPU cores, reserving 2GiB of memory for the OS, and computing the framework's memory allotment from the memory available to the OS, among others.
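The memory default described above can be read as: reserve a fixed 2 GiB for the operating system and hand the remainder of the available memory to the framework. A minimal sketch of that policy (an assumption about the intent, not the benchmark's actual code):

```python
# Sketch of the memory policy described above (an assumed reading:
# reserve 2 GiB for the OS, give the AutoML framework the rest).

GIB = 1024 ** 3

def framework_memory(available_bytes, os_reserve=2 * GIB):
    """Memory handed to the AutoML framework: available minus the OS reserve."""
    return max(available_bytes - os_reserve, 0)
```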
The researchers considered four AutoML frameworks, among them:
- Auto-sklearn
- H2O AutoML
The frameworks were chosen as a mix of very recent ones and frameworks that have been around a bit longer. The selection also encompasses Deep Learning-only AutoML frameworks as well as scikit-learn-based AutoML frameworks.
The runtime per fold was set to one hour. For the supervised classification tasks, the best of the four AutoML frameworks was additionally given a runtime of five hours per fold to compare its results with those of humans.
For the supervised classification tasks, ROC AUC (auc) and accuracy were used as evaluation metrics. For the supervised regression tasks, root-mean-square error (rmse) and mean absolute error (mae) were chosen, since the only regression metric preconfigured by AutoML Benchmark was R2.
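The four metrics are straightforward to state precisely. Below are minimal pure-Python reference implementations, for illustration only – the benchmark itself uses its own preconfigured scorers:

```python
import math

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root-mean-square error: penalises large errors quadratically.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    # Mean absolute error: average magnitude of the errors.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    # Binary ROC AUC: probability that a random positive example is
    # scored above a random negative one (ties count as 0.5).
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```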
Hardware Used: The server was equipped with two Intel Xeon Silver 4114 CPUs @ 2.20GHz (20 cores in total), four 64GB DDR4 synchronous 2666MHz DIMM memory modules (256GB in total) and two NVIDIA GeForce GTX 1080 Ti GPUs (22GB of VRAM in total).
According to the researchers, the results of this survey can be summarised as follows:
- AutoML performed as well as or better than humans on the primary metric in 7 out of 12 cases. All seven of these cases are either “easy” classification tasks (tasks that both humans and AutoML solved perfectly) or regression tasks.
- In these cases, AutoML performed as well as or better than humans on both metrics.
- There does not seem to be a significant performance difference between the primary metric and the secondary one.
The researchers conclude that most results achieved by AutoML are only slightly better or worse than those achieved by humans. Interestingly, H2O, the best AutoML framework on the credit-g supervised classification task, achieves an AUC of 0.7892 with a five-hour time limit per fold, versus 0.799 with a one-hour limit.
Going forward, the researchers believe there will be major leaps toward bridging the gap between domain expertise and AutoML. Machine learning applications are predominantly interdisciplinary, so AutoML tools cannot serve as standalone solutions. AutoML should be seen as complementing the skills of data scientists, not as a magical one-stop solution.
Check the original paper here.