ML-Based COVID-19 Research Work Is a Dud. Here’s Why

Organisations, both public and private, turned to artificial intelligence and machine learning to take on COVID-19. Though several ML/AI-based solutions have been developed on a war footing, the clinical utility of most of them is suspect. 

A team from the University of Cambridge and the University of Manchester has found that many studies conducted between January 2020 and October 2020 suffered from methodological flaws, underlying bias, or both.


The authors of the study analysed papers uploaded to bioRxiv, medRxiv and arXiv, along with all entries in EMBASE and MEDLINE, on machine learning methods for the detection and prognostication of COVID-19 from standard-of-care chest radiographs (CXR) and chest computed tomography (CT) images. The study identified 2,212 research works, of which 415 were included in the initial screening. Finally, 62 works made the cut for a systematic review.

The major issues detected in the studies are:

  • The use of public datasets compounded the problem of duplication. Data sourced from different public datasets is redistributed to form what the authors call ‘Frankenstein’ datasets. “This repackaging of datasets, although pragmatic, inevitably leads to problems with algorithms being trained and tested on identical or overlapping datasets while believing them to be from distinct sources,” the team observed.
  • The prediction model risk of bias assessment tool (PROBAST) showed most studies ran a high risk of bias in at least one domain. Even in the other studies, results were not clear in at least one domain. The risk for bias was attributed to factors such as:

–Extensive use of public datasets, which have underlying biases as anyone can contribute images.

–Papers used only a subset of the original datasets, making their results difficult to reproduce.

–Large differences in the demographics of the groups considered for the study.

–The sample size was very small in many cases, leading to highly imbalanced datasets.

–Lack of appropriate techniques for evaluating performance metrics.

  • There are two main approaches to validating algorithm performance: internal and external validation. In internal validation, the test data comes from the same source as the development data, whereas external validation uses test data from a different source. Ideally, including both validation techniques helps the algorithm generalise better. But most papers used only internal validation.
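
The overlap problem described in the first point above can be checked mechanically before any training starts. The sketch below is a minimal illustration and not the method used by the review's authors: it assumes each image is available as raw bytes and flags test images whose content hash also appears in the training set. The function names and the sample data are hypothetical. Exact hashing only catches byte-identical copies, so resized or re-encoded duplicates would need a perceptual comparison instead.

```python
import hashlib


def content_hash(image_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of the raw image bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def find_overlap(train: dict[str, bytes], test: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each test image name to the training image names with identical bytes."""
    # Index training images by content hash; several names may share one hash.
    by_hash: dict[str, list[str]] = {}
    for name, data in train.items():
        by_hash.setdefault(content_hash(data), []).append(name)

    overlap: dict[str, list[str]] = {}
    for name, data in test.items():
        matches = by_hash.get(content_hash(data))
        if matches:
            overlap[name] = sorted(matches)
    return overlap


# Hypothetical data: two repositories that redistribute one identical image.
train_images = {"repo_a/scan_001.png": b"pixels-1", "repo_a/scan_002.png": b"pixels-2"}
test_images = {"repo_b/img_17.png": b"pixels-2", "repo_b/img_18.png": b"pixels-3"}

leaked = find_overlap(train_images, test_images)
print(leaked)  # {'repo_b/img_17.png': ['repo_a/scan_002.png']}
```

A non-empty result means the supposedly separate test set shares images with the training set, so reported performance on it would not reflect external validation.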


The team makes the following recommendations:

  • The authors advised caution over the use of public repositories for datasets which can lead to bias and Frankenstein datasets.
  • They recommended using well-curated external validation datasets for generalisation of the algorithms to other cohorts. “Any useful model for diagnosis or prognostication must be robust enough to give reliable results for any sample from the target population rather than just on the sampled population.” Further, they recommended applying clinical judgement to identify the sensitivity and specificity of the model.
  • Updates to publicly available datasets or code can create ambiguity. The authors therefore recommend that a cached version of the public dataset be saved, or the date/version quoted, and that specific versions of data or code be appropriately referenced.
  • The team advises authors to assess their papers against industry-standard frameworks such as the checklist for artificial intelligence in medical imaging (CLAIM), radiomic quality score (RQS) criteria, transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD), PROBAST and Quality Assessment of Diagnostic Accuracy Studies (QUADAS).
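
The recommendation to cache public datasets and quote their date or version can be made concrete with a small manifest stored next to the cached copy. The sketch below is one possible approach under stated assumptions, not a procedure taken from the review: the manifest fields, the function names, and the example dataset name, URL and retrieval date are all hypothetical.

```python
import hashlib
import json
from datetime import date


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(name: str, source_url: str, files: dict[str, bytes], retrieved: date) -> dict:
    """Describe a cached snapshot: its source, retrieval date and per-file checksums."""
    return {
        "name": name,
        "source_url": source_url,
        "retrieved": retrieved.isoformat(),
        "files": {path: sha256_hex(data) for path, data in sorted(files.items())},
    }


def verify(manifest: dict, files: dict[str, bytes]) -> list[str]:
    """Return paths that were added, removed or changed relative to the manifest."""
    expected = manifest["files"]
    current = {path: sha256_hex(data) for path, data in files.items()}
    return sorted(p for p in expected.keys() | current.keys() if expected.get(p) != current.get(p))


# Hypothetical snapshot of a public chest X-ray repository.
snapshot = {"cxr/0001.png": b"a", "cxr/0002.png": b"b"}
manifest = build_manifest("example-cxr", "https://example.org/cxr", snapshot, date(2020, 10, 3))
print(json.dumps(manifest, indent=2))

# Later, the upstream repository silently revises one file.
updated = {**snapshot, "cxr/0002.png": b"b-revised"}
print(verify(manifest, updated))  # ['cxr/0002.png']
```

Publishing such a manifest with a paper lets readers confirm they are working with the same snapshot of the data, even if the public repository has changed since.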

Wrapping Up

Stanford Social Innovation Review, in one of its articles, raised similar concerns. It said, “The algorithms driving these systems are human creations, and as such, they are subject to biases that can deepen societal inequities and pose risks to businesses and society more broadly.” Apart from the technical discrepancies in designing such models, the article also pointed out how these systems exclude racial and ethnic minorities and can have far-reaching consequences.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at
