Researchers Claim Inconsistent Model Performance In Most ML Research Work

Benchmarking is considered one of the most crucial drivers of progress in AI and machine learning research. Benchmark datasets are usually fixed sets of data, generated manually, semi-automatically or automatically, that form a representative sample of the specific tasks a model is meant to solve.

Recently, researchers from the Institute for Artificial Intelligence and Decision Support, Vienna claimed that a considerable part of the metrics currently used to evaluate classification AI benchmark tasks might be inconsistent. This may result in a poor reflection of a classifier's performance, especially when the metrics are used with imbalanced datasets.

For the research, they analysed the current landscape of performance metrics, based on data covering more than 3500 ML model performance results from a web-based open platform.



Why This Analysis?

According to the researchers, the performance of a machine learning model on a benchmark dataset is most commonly measured by a single performance metric or a small set of them. A single metric enables a fast and simple comparison of different models. However, condensing the performance characteristics of a model into a single metric carries the risk of providing only one projection of model performance and errors.

One such metric is accuracy. It is widely used by researchers but shows various shortcomings, including an inadequate reflection of a classifier's performance when used on unbalanced datasets. These shortcomings led to accuracy being replaced by, or supplemented with, more informative classification performance measures such as precision, recall or the F1 score.
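The shortcoming of accuracy on unbalanced data can be illustrated with a minimal sketch. The dataset below is hypothetical (5 positive and 95 negative samples, not taken from the paper), and the classifier is a trivial one that always predicts the negative class. Treating an undefined ratio as 0.0 is an assumption made here for simplicity.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def safe_div(numerator, denominator):
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


# Hypothetical imbalanced dataset: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# Trivial classifier that always predicts the majority (negative) class.
y_pred = [0] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = safe_div(tp + tn, tp + fp + fn + tn)
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * precision * recall, precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The classifier never finds a single positive sample, yet its accuracy is high because the negative class dominates. Recall and F1 expose the failure.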

Moreover, the researchers claimed that even when accuracy is used on well-balanced datasets, it and related metrics do not fulfil the criteria of proper scoring rules, and they remain the subject of extensive criticism.
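One way to see what a proper scoring rule adds is to compare accuracy with the Brier score, a well-known proper scoring rule for probabilistic predictions. This is an illustrative sketch only: the article does not say which scoring rules the researchers considered, and the labels, probabilities and 0.5 decision threshold below are hypothetical.

```python
def accuracy_at_threshold(y_true, probabilities, threshold=0.5):
    """Accuracy after turning predicted probabilities into hard labels."""
    hits = sum(
        1 for truth, prob in zip(y_true, probabilities)
        if (1 if prob >= threshold else 0) == truth
    )
    return hits / len(y_true)


def brier_score(y_true, probabilities):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((prob - truth) ** 2
               for truth, prob in zip(y_true, probabilities)) / len(y_true)


# Hypothetical balanced dataset with two positives and two negatives.
y_true = [1, 0, 1, 0]
# Two hypothetical models that rank every sample correctly,
# one with confident probabilities and one that barely commits.
confident_model = [0.9, 0.1, 0.8, 0.2]
hesitant_model = [0.6, 0.4, 0.55, 0.45]

for name, probs in [("confident", confident_model), ("hesitant", hesitant_model)]:
    print(name,
          "accuracy:", accuracy_at_threshold(y_true, probs),
          "brier:", round(brier_score(y_true, probs), 5))
```

Both models reach the same accuracy because accuracy only looks at the thresholded labels, while the Brier score separates them by how well calibrated their probabilities are.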

To bring clarity to these discussions, the researchers conducted this analysis to give a comprehensive overview of the performance measures currently used by benchmark datasets to track progress in AI and machine learning research.

Behind The Analysis

The researchers extracted data on 3867 ML model performance results reported in arXiv submissions and peer-reviewed journals from the database called ‘Papers with Code’ (PWC).

The model performance results were annotated with at least one performance metric as of June 2020. The researchers examined 32209 benchmark results across 2298 distinct benchmark datasets, reported in a total of 3867 machine learning papers.

The benchmark datasets covered 15 higher-level processes, including vision process, natural language processing, fundamental AI process, robotic process, classification or detection and more.

According to the researchers, the raw dataset exported from PWC contained a total of 812 different metric names. They manually curated this raw list to map the performance metrics into a canonical hierarchy. After the manual curation, the list of metrics covered by the dataset was reduced from 812 to 187 distinct top-level performance metrics.

The top-level metrics were further categorised based on the task types they are usually applied to. For example, ‘accuracy’ was mapped to ‘classification’, ‘mean squared error’ was mapped to ‘regression’, and ‘BLEU’ was mapped to ‘natural language processing’.

For Classification Tasks

The researchers claimed that accuracy was the most commonly used performance metric, used by around 38% of all benchmark datasets covered in this research. The second and third most commonly reported metrics were precision and the F-measure, used by 16% and 13% of all benchmark datasets respectively.

The analysis suggested that accuracy, F1 score, precision and recall were the most frequently used metrics for evaluating classification tasks to report model performance on benchmark datasets. 

Despite their widespread use in classification tasks, metrics like accuracy, F1 score, precision and recall exhibit a number of problematic properties. For instance, the major deficiency of accuracy is its inability to yield informative results when dealing with unbalanced datasets. There are also inconsistencies in the definition of the F1 score for multi-class classification tasks.
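One commonly discussed example of such an inconsistency, which may or may not be the one the researchers had in mind, is that a "macro F1" can be computed in two ways: as the average of the per-class F1 scores, or as the F1 of the macro-averaged precision and recall. The sketch below uses hypothetical three-class labels, and all function names are illustrative.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} using a one-vs-rest view of each class."""
    stats = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        stats[label] = (precision, recall)
    return stats


def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b else 0.0


def averaged_f1(stats):
    # Definition 1: compute F1 per class, then take the arithmetic mean.
    return sum(harmonic_mean(p, r) for p, r in stats.values()) / len(stats)


def f1_of_averages(stats):
    # Definition 2: average precision and recall over classes, then take their harmonic mean.
    macro_precision = sum(p for p, _ in stats.values()) / len(stats)
    macro_recall = sum(r for _, r in stats.values()) / len(stats)
    return harmonic_mean(macro_precision, macro_recall)


# Hypothetical labels: class "a" has high precision but low recall,
# class "b" the opposite, and class "c" is predicted perfectly.
y_true = ["a"] * 8 + ["b"] * 2 + ["c"] * 2
y_pred = ["a"] * 4 + ["b"] * 4 + ["b"] * 2 + ["c"] * 2

stats = per_class_precision_recall(y_true, y_pred)
print("averaged F1:   ", round(averaged_f1(stats), 4))
print("F1 of averages:", round(f1_of_averages(stats), 4))
```

The two definitions give clearly different numbers on the same predictions, so a reported "macro F1" is ambiguous unless the paper states which one was used.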

The researchers stated that these shortcomings led to the proposal of a number of alternative confusion matrix-derived metrics, including informedness, markedness (MK), Matthews Correlation Coefficient (MCC) and macro average arithmetic (MAvA), among others.
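As a rough illustration of how some of these metrics are derived from a binary confusion matrix, the sketch below uses their standard textbook definitions: informedness is sensitivity plus specificity minus one, markedness is positive predictive value plus negative predictive value minus one, and MCC is the correlation between predicted and true labels. The confusion-matrix counts are hypothetical and not taken from the paper.

```python
import math


def safe_div(numerator, denominator):
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def informedness(tp, fp, fn, tn):
    # Sensitivity + specificity - 1.
    return safe_div(tp, tp + fn) + safe_div(tn, tn + fp) - 1


def markedness(tp, fp, fn, tn):
    # Positive predictive value + negative predictive value - 1.
    return safe_div(tp, tp + fp) + safe_div(tn, tn + fn) - 1


def matthews_corrcoef(tp, fp, fn, tn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return safe_div(tp * tn - fp * fn, denominator)


# Hypothetical imbalanced result: 5 positive and 95 negative samples.
tp, fp, fn, tn = 4, 6, 1, 89

accuracy = (tp + tn) / (tp + fp + fn + tn)
bm = informedness(tp, fp, fn, tn)
mk = markedness(tp, fp, fn, tn)
mcc = matthews_corrcoef(tp, fp, fn, tn)

print(f"accuracy={accuracy:.3f} informedness={bm:.3f} "
      f"markedness={mk:.3f} MCC={mcc:.3f}")
```

Accuracy looks strong here because the many true negatives dominate it, whereas markedness and MCC are pulled down by the six false positives among only ten positive predictions. For a binary confusion matrix, the square of MCC equals the product of informedness and markedness.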

AUC metrics have also been proposed as alternatives to accuracy and F1 score when dealing with small or unbalanced datasets, and cost curves have frequently been proposed as an alternative to ROC curves when dealing with imbalanced datasets.


In the case of NLP, the BLEU (Bilingual Evaluation Understudy) score is one of the most frequently used metrics for language-generating tasks such as machine translation.

However, the researchers pointed out various weaknesses of BLEU, such as its sole focus on n-gram precision without considering recall and its reliance on exact n-gram matches. To address these weaknesses, METEOR (Metric for Evaluation of Translation with Explicit Ordering) was proposed in 2005.
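The two weaknesses mentioned above can be seen in a minimal sketch of clipped ("modified") n-gram precision, the building block of BLEU. This is not a full BLEU implementation: it uses a single reference, omits the brevity penalty and the geometric mean over n-gram orders, and the example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    candidate_counts = Counter(ngrams(candidate, n))
    reference_counts = Counter(ngrams(reference, n))
    total = sum(candidate_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / total


reference = "the cat is on the mat".split()

# A very short candidate: perfect precision although most of the reference is missing.
short = "the cat".split()
# A candidate that swaps one word for a near-synonym: the exact match fails.
synonym = "the cat is on the rug".split()
# A degenerate candidate: clipping limits the credit for the repeated word.
repeated = "the the the the the the the".split()

for name, candidate in [("short", short), ("synonym", synonym), ("repeated", repeated)]:
    print(name, round(modified_precision(candidate, reference, 1), 4))
```

The short candidate scores perfectly on precision alone, which is why full BLEU adds a brevity penalty rather than a recall term, and the near-synonym earns no credit because only exact matches count.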

Because of the various shortcomings of the automatic evaluation metrics currently in use, metric development for language-generation tasks remains an open research question. Difficulties associated with the automatic evaluation of machine-generated texts include poor correlation with human judgement and language bias, among others.

The researchers also stated that many NLP metrics use very specific sets of features, such as specific word embeddings or linguistic elements, which may complicate comparability and replicability. To address the issue of replicability, reference open-source implementations have been published for some metrics, such as ROUGE, sentBleu-moses (part of the Moses toolkit) and sacreBLEU.

Wrapping Up

The analysis focusses on classification metrics and on performance metrics used to evaluate NLP-specific tasks. According to the researchers, the vast majority of metrics currently used to assess classification AI benchmark tasks have properties that may result in a poor reflection of a classifier's performance, especially when used with imbalanced datasets.

Also, the alternative metrics that were proposed to address the problematic properties are currently rarely applied as performance metrics in benchmarking tasks. Furthermore, the researchers noticed that the reporting of metrics was partly inconsistent and partly unspecific, which may further lead to ambiguities when comparing model performances.

Read the paper here.


Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
