A recent study has revealed glaring gaps in the clinical readiness of image-based diagnostic AI models. The Department of Dermatology at the University of California has assessed the performance of dermatologist-level off-the-shelf CNN models on real-world non-curated images. The scientists believe the findings point to the need for models to be validated with computational stress tests, in addition to meeting conventionally reported metrics, before they are judged clinically ready.
We look at how the scientists stress-tested these algorithms and what they recommend to improve the models.
While many studies have shown CNNs to perform on par with or better than dermatologists, the models can mislead clinicians with incorrect predictions if images get altered slightly. ‘Discrimination’ and ‘calibration’ remain major concerns in the real-world use of CNNs as the proof of practice for published models has not been demonstrated, according to the study.
Discrimination, a model’s ability to tell disease classes apart, is assessed by testing on controlled, curated datasets and then comparing those results with performance on test data from new and potentially more diverse non-curated datasets.
The study found that CNNs perform at the same level as dermatologists when it comes to curated benchmark datasets, but fall short when applied to non-curated datasets.
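The article does not say which metric the study used for discrimination. A common choice for a binary task is the area under the ROC curve (AUROC), so the sketch below assumes it. The function name and the toy scores are illustrative, not taken from the study. It shows how the same model can score lower on a second, more varied dataset.

```python
def auroc(scores, labels):
    """Probability that a random positive case is scored above a random negative one.

    scores: model outputs, higher means more likely malignant.
    labels: 1 for malignant, 0 for benign.
    Ties count as half a win. Pairwise version, fine for small examples.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy numbers only: a "curated" set the model separates cleanly,
# and a "non-curated" set where one malignant case is scored too low.
curated = auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
non_curated = auroc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
print(curated, non_curated)
```

Comparing the two values is the kind of curated versus non-curated check described above; a real evaluation would use held-out clinical images rather than hand-picked scores.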
Calibration quantifies how well an ML model can forecast its own accuracy. If a model reports 90% confidence in its predictions, then about 90% of those predictions should be correct. Calibration helps in deciding whether human intervention is needed when the model displays low confidence. Previous studies have shown that CNNs tend to be overconfident.
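One common way to put a number on this is the expected calibration error (ECE), which groups predictions into confidence bins and averages the gap between stated confidence and observed accuracy. The sketch below is a minimal version under that assumption; the article does not specify the study's calibration metric, and the function name and bin count are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction matched the true label.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a bin; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Toy case: the model claims 90% confidence but is right 7 times out of 10,
# so it is overconfident by about 0.2.
overconfident = expected_calibration_error([0.9] * 10, [True] * 7 + [False] * 3)
print(round(overconfident, 3))
```

A value near zero means stated confidence tracks observed accuracy; larger values mean the model's confidence cannot be taken at face value.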
Model development for CNNs does not currently take calibration into account. The models used for this study were calibrated to see whether that narrows the gap between predicted and observed accuracy, yet some remained overconfident. Calibration worsened on the non-curated, real-world datasets, where observed accuracy was 77.2% against an expected 90.2%.
Also, instead of abstaining or showing lower confidence when asked to predict diseases it was not trained on, one of the models showed the same confidence as it did for the diseases it was trained to predict.
A model’s decisions also need to be independent of factors such as ink markings, hair, zoom, lighting, and focus.
The experiment measured the algorithms’ robustness by changing the magnification or angle of the images used for skin cancer testing. Almost 30 percent of the image classifications differed from the original predictions. Image transformation also surfaced inconsistencies in model predictions across datasets.
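A robustness check of this kind can be sketched as a flip rate: the share of images whose predicted label changes after a harmless transformation. The code below is an illustration under stated assumptions, not the study's protocol. The classifier, the transforms, and the tiny nested-list "images" are all hypothetical stand-ins; a real test would pass a trained CNN and real rotations or zooms.

```python
def prediction_flip_rate(predict, images, transforms):
    """Fraction of images whose label changes under at least one transform."""
    if not images:
        raise ValueError("need at least one image")
    flipped = 0
    for image in images:
        baseline = predict(image)
        if any(predict(transform(image)) != baseline for transform in transforms):
            flipped += 1
    return flipped / len(images)


# --- Hypothetical stand-ins for a model and image transforms ---

def toy_predict(image):
    """Toy classifier: calls an image malignant if mean brightness exceeds 0.5."""
    pixels = [value for row in image for value in row]
    return "malignant" if sum(pixels) / len(pixels) > 0.5 else "benign"


def rotate_90(image):
    """Rotate a nested-list image a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def darken(image):
    """Simulate a lighting change by scaling every pixel by 0.8."""
    return [[value * 0.8 for value in row] for row in image]


images = [
    [[0.6, 0.6], [0.6, 0.6]],    # near the decision boundary, flips when darkened
    [[0.9, 0.9], [0.9, 0.9]],    # stays malignant
    [[0.1, 0.2], [0.1, 0.2]],    # stays benign
    [[0.55, 0.55], [0.55, 0.55]],  # near the boundary, flips when darkened
]
print(prediction_flip_rate(toy_predict, images, [rotate_90, darken]))
```

The toy classifier is unaffected by rotation but sensitive to lighting, which is exactly the sort of dependence a stress test is meant to expose.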
Room For Improvement
The study makes a case for algorithms to have better discrimination capabilities for the target population, express uncertainty when they are likely to be wrong, and produce robust results that are immune to variations in image capture.
While the study found lower discrimination capability when dealing with non-curated datasets, the differences were not significant. Models trained on dermoscopic images performed comparatively better, even when classifying non-dermoscopic images. Hence, the study makes a case for the use of dermoscopic images for training.
The study also showed that optimising calibration performance on the validation dataset was insufficient for optimisation on test datasets, even when the validation dataset and test dataset came from the same source. The study suggests the models may forecast their accuracy better on curated benchmark datasets than on real-world datasets.
To improve discrimination and calibration, the study recommends training and calibrating the model on data from the population it is being deployed for.
In terms of robustness, the study recommends diversifying training datasets further by capturing training images in different ways, or using specialised computational techniques such as modifying the CNN architecture, adversarial training, or leveraging unlabelled examples.
Finally, to address the ‘out-of-distribution’ problem, the study recommends allowing models to abstain or express lower confidence for disease classes not seen during training.
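The simplest form of abstention is a confidence threshold: if the top predicted probability is too low, the model returns no label and the case goes to a clinician. The sketch below assumes that approach; the function name, the class names, and the 0.8 threshold are illustrative and not taken from the study.

```python
def predict_or_abstain(probabilities, threshold=0.8):
    """Return (label, confidence), with label None when confidence is too low.

    probabilities: dict mapping class name to predicted probability.
    threshold: hypothetical cut-off below which the model defers to a human.
    """
    if not probabilities:
        raise ValueError("need at least one class probability")
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < threshold:
        return None, confidence
    return label, confidence


# A confident prediction is returned as-is.
print(predict_or_abstain({"melanoma": 0.92, "nevus": 0.08}))

# An uncertain prediction is withheld and can be routed for human review.
print(predict_or_abstain({"melanoma": 0.55, "nevus": 0.45}))
```

A threshold only helps if the confidence itself is trustworthy. As the study observed, a model can be just as confident on diseases it was never trained on, so abstention of this kind works best alongside calibration on the deployment population.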
Experts have said that letting AI do the decision making can leave healthcare professionals over-reliant on algorithms and lead to misdiagnosis. When introduced in a clinical setting, AI models should be able to accurately project their confidence level, or they may do more harm than good in real-world situations.