How To Make Algorithms Really Work For Clinical Trials

“Researchers propose a framework for regulating algorithms that will ensure world-class clinical performance and build trust.”

Machine learning models are expected to gradually take over many of the mundane tasks of clinicians and other medical professionals, assisting them with insights and reducing turnaround time. These models are already getting good at helping radiologists. But how legitimate are these algorithms? Have they been thoroughly vetted by government bodies like the FDA?

Researchers at Stanford noted a few reasons why algorithms cannot yet be readily accepted into clinical setups:

  • Conflation of the diagnostic task with the diagnostic algorithm
  • Insufficiently thorough treatment of the diagnostic task definition
  • No mechanism to directly compare similar algorithms
  • Insufficient characterisation of safety and performance elements
  • Lack of resources to evaluate algorithms, and inherent conflicts of interest

To address these issues, a team from the Stanford Institute for Human-Centered Artificial Intelligence (HAI), in a new paper titled “Regulatory Frameworks for Development and Evaluation of Artificial Intelligence”, proposes a framework for FDA regulation of AI-based diagnostic imaging software.

Challenges Of AI-Based Diagnostics

Flow chart of the proposed evaluation framework (Source: Larson et al.)

Though the FDA in the US already has a framework in place for evaluating software developed for medical applications, the authors argue that these checkpoints cannot simply be carried over to AI-based tools. According to the head of the FDA, AI algorithms continually learn from the medical images they review, so the traditional approach to reviewing and approving SaMD (software as a medical device) upgrades may not apply. A protocol is needed to regulate the entire lifecycle of AI-based algorithms.

Explaining this, one of the authors cited the example of the early days of the coronavirus pandemic, when medical professionals worldwide proposed multiple different scoring systems to categorise lung scans of patients. Though the scoring systems differed, they were corrected over time, and this kind of adjustment is common in clinical trials. In the case of AI, however, the authors write, this would have been problematic: hard-coded, proprietary diagnostic task definitions make it difficult to compare the performance of algorithms.

The authors propose that algorithm manufacturers be required to go through four phases of development and evaluation, similar to the phased process the FDA already uses. They suggest testing for:

  1. Feasibility of algorithms on a small test set
  2. Capability in a controlled environment simulating real-world conditions
  3. Effectiveness in a real clinical setting
  4. Durability, including monitoring and improvement over time

The above flow chart illustrates how the evaluation of diagnostic algorithm performance is linked, from the defined task through to local site performance. Algorithms are developed according to a defined diagnostic task standard. Performance is compared with other algorithms in a controlled environment, which becomes the internal benchmark for general real-world performance and for local site performance, which in turn becomes the benchmark for ongoing monitoring.
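As a rough illustration of the final link in that chain, the sketch below shows how a site might check an algorithm's ongoing local performance against the internal benchmark established during controlled-environment testing. The class, metric choices, and tolerance threshold here are hypothetical assumptions for illustration, not details from the paper.

```python
# Hypothetical sketch: ongoing local-site monitoring against an internal benchmark.
# Names, metrics, and thresholds are illustrative assumptions, not from the paper.
from dataclasses import dataclass

@dataclass
class Benchmark:
    sensitivity: float       # established during controlled-environment testing
    specificity: float
    tolerance: float = 0.05  # allowed drop before the site flags the algorithm

def evaluate_site(predictions, labels):
    """Compute sensitivity and specificity on locally collected cases."""
    tp = sum(p and y for p, y in zip(predictions, labels))
    tn = sum((not p) and (not y) for p, y in zip(predictions, labels))
    fn = sum((not p) and y for p, y in zip(predictions, labels))
    fp = sum(p and (not y) for p, y in zip(predictions, labels))
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity

def monitor(benchmark, predictions, labels):
    """Return True if local performance stays within tolerance of the benchmark."""
    sens, spec = evaluate_site(predictions, labels)
    ok = (sens >= benchmark.sensitivity - benchmark.tolerance and
          spec >= benchmark.specificity - benchmark.tolerance)
    if not ok:
        print(f"ALERT: local performance (sens={sens:.2f}, spec={spec:.2f}) "
              "fell below the benchmark; review before continued clinical use.")
    return ok

# Example: internal benchmark from controlled testing vs. recent local cases.
benchmark = Benchmark(sensitivity=0.92, specificity=0.88)
preds = [True, True, False, True, False, False, True, False]
truth = [True, False, False, True, False, True, True, False]
monitor(benchmark, preds, truth)
```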

Key Takeaways

  • Strategies outlined by regulatory bodies address many key aspects to help ensure the safety, effectiveness, and performance of SaMD applications, but a number of gaps remain.
  • Appropriate evaluation and improvement methods should be incorporated into phases of development, similar to what the FDA does. 
  • Algorithms should be thoroughly tested and refined before being deployed in the clinical environment, just as is expected of other medical devices.
  • Regulatory frameworks should strive to establish conditions that set up a race to the top for consistently excellent algorithm performance at each installed site.

One of the authors likened their proposed regulatory framework to the safety requirements for an aircraft. “… if they stop working, they need to fail in a way that’s not going to hurt people.” In their paper, the authors have listed 12 measures of performance that should be applied to diagnostic algorithms. 

To know more, download the original paper here.


Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.
