Organisations, both public and private, turned to artificial intelligence and machine learning to take on COVID-19. Though several ML/AI-based solutions have been developed on a war footing, the clinical utility of most of them is suspect.
A team from the University of Cambridge and the University of Manchester has found that many studies, conducted between January 2020 and October 2020, suffered from methodological flaws, underlying bias or, in some cases, both.
The authors of the study analysed papers uploaded to bioRxiv, medRxiv and arXiv, along with all entries in EMBASE and MEDLINE, on machine learning methods for the detection and prognostication of COVID-19 from standard-of-care chest radiographs (CXR) and chest computed tomography (CT) images. The study identified 2,212 research works, of which 415 were included in the initial screening. Finally, 62 works made the cut for a systematic review.
The major issues detected in the studies are:
- The use of public datasets compounded the problem of duplication. Data sourced from different public datasets is redistributed to form what the authors call ‘Frankenstein’ datasets. “This repackaging of datasets, although pragmatic, inevitably leads to problems with algorithms being trained and tested on identical or overlapping datasets while believing them to be from distinct sources,” the team observed.
- The prediction model risk of bias assessment tool (PROBAST) showed most studies ran a high risk of bias in at least one domain. Even in the remaining studies, the risk of bias was unclear in at least one domain. The risk of bias was attributed to factors such as:
– Extensive use of public datasets, which have underlying biases as anyone can contribute images.
– Papers used only a subset of the original datasets, making the results difficult to reproduce.
– Large differences in the demographics of the groups considered for the study.
– The sample size in many cases was very small, leading to highly imbalanced datasets.
– Lack of appropriate performance metrics and evaluation techniques.
- There are two main approaches to validating algorithm performance: internal and external validation. In internal validation, the test data comes from the same source as the development data, whereas external validation uses test data from a different source. Ideally, using both validation techniques gives a better picture of how well the algorithm generalises. But most papers used only internal validation.
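The overlap problem behind ‘Frankenstein’ datasets can be checked mechanically, at least for exact copies. The following is a minimal sketch, not the method used in the study: it assumes each image is available as raw bytes and flags files that are byte-for-byte identical across a training set and a test set. The function names and file names are hypothetical, and resized or re-encoded copies of the same scan would not be caught by this approach.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw image bytes."""
    return hashlib.sha256(data).hexdigest()


def find_overlap(
    train: Dict[str, bytes], test: Dict[str, bytes]
) -> List[Tuple[str, str]]:
    """Return (train_name, test_name) pairs whose bytes are identical."""
    # Index the training files by the hash of their content.
    by_hash = defaultdict(list)
    for name, data in train.items():
        by_hash[content_hash(data)].append(name)

    # Look up every test file in that index.
    overlap = []
    for name, data in test.items():
        for train_name in by_hash.get(content_hash(data), []):
            overlap.append((train_name, name))
    return sorted(overlap)


# Toy stand-ins for image files; real use would read the bytes from disk.
train_images = {"repo_a/cxr_001.png": b"scan-1", "repo_a/cxr_002.png": b"scan-2"}
test_images = {"repo_b/img_17.png": b"scan-2", "repo_b/img_18.png": b"scan-3"}

print(find_overlap(train_images, test_images))
# [('repo_a/cxr_002.png', 'repo_b/img_17.png')]
```

An empty result only rules out exact duplicates. It does not show that the two sets come from distinct patients or sources.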
The team makes the following recommendations:
- The authors advised caution over the use of public repositories for datasets, which can lead to bias and Frankenstein datasets.
- They recommended using well-curated external validation datasets for generalisation of the algorithms to other cohorts. “Any useful model for diagnosis or prognostication must be robust enough to give reliable results for any sample from the target population rather than just on the sampled population.” Further, they recommended applying clinical judgement to identify the sensitivity and specificity of the model.
- Ambiguity can arise when publicly available datasets or code are updated. Therefore, the authors recommend that a cached version of the public dataset be saved, or the date/version quoted, and that specific versions of data or code be appropriately referenced.
- The team advises authors to assess their papers against industry-standard frameworks such as the checklist for artificial intelligence in medical imaging (CLAIM), radiomic quality score (RQS) criteria, transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD), PROBAST and Quality Assessment of Diagnostic Accuracy Studies (QUADAS).
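The recommendation on sensitivity and specificity can be made concrete with a small worked example. The sketch below is illustrative only: the labels and scores are made up, the function name is hypothetical, and it assumes a binary classifier whose output score is compared against a threshold. It shows how the two measures trade off as the threshold moves, which is where clinical judgement comes in.

```python
from typing import List, Tuple


def sensitivity_specificity(
    labels: List[int], scores: List[float], threshold: float
) -> Tuple[float, float]:
    """Compute sensitivity and specificity, treating scores >= threshold as positive.

    Assumes the labels contain at least one positive (1) and one negative (0) case.
    """
    tp = fn = tn = fp = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn)  # share of true cases that are flagged
    specificity = tn / (tn + fp)  # share of non-cases that are cleared
    return sensitivity, specificity


# Made-up labels (1 = COVID-19 positive) and model scores, for illustration only.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

for threshold in (0.25, 0.5, 0.75):
    sens, spec = sensitivity_specificity(labels, scores, threshold)
    print(f"threshold={threshold:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
# threshold=0.25 sensitivity=1.00 specificity=0.50
# threshold=0.50 sensitivity=0.75 specificity=0.75
# threshold=0.75 sensitivity=0.50 specificity=1.00
```

A lower threshold misses fewer true cases but raises more false alarms. Which trade-off is acceptable depends on the clinical setting, not on the model alone.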
Stanford Social Innovation Review, in one of its articles, raised similar concerns. It said, “The algorithms driving these systems are human creations, and as such, they are subject to biases that can deepen societal inequities and pose risks to businesses and society more broadly.” Apart from the technical discrepancies in designing such models, the article also pointed to how these systems exclude racial and ethnic minorities, which can have far-reaching consequences.