While OpenAI's capabilities have made their way into every domain possible, there is one field where LLMs, if utilised correctly, can have the highest impact by directly affecting lives: medicine. Earlier this year, ChatGPT cleared all three parts of the United States Medical Licensing Examination (USMLE), and we even saw it help save a dog's life through an accurate medical diagnosis. However, we have not seen many practical applications in the medical field. Do GPT-4's capabilities make it a suitable player in medicine?
In March this year, OpenAI and Microsoft released a paper on the capabilities of GPT-4 on medical challenge problems. In this research, GPT-4 showed impressive language understanding and generation abilities in medicine. The study evaluates GPT-4's performance on medical competency exams and benchmark datasets, even though the model was not specialised for medicine.
The researchers assess GPT-4's performance on official USMLE practice materials and MultiMedQA datasets. GPT-4 surpasses the USMLE passing score by over 20 points, outperforming previous models (including GPT-3.5) and even models fine-tuned for medical knowledge. Additionally, GPT-4 demonstrates improved probability calibration, meaning its confidence in an answer better matches how likely that answer is to be correct. The study also explores how GPT-4 can explain medical reasoning, customise explanations, and create hypothetical scenarios, showcasing its potential for medical education and practice. The findings highlight GPT-4's capabilities while acknowledging challenges related to accuracy and safety in real-world applications.
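To make the calibration claim concrete, here is a minimal sketch of expected calibration error (ECE), a standard way to measure the gap between a model's stated confidence and its actual accuracy. The function and the toy data below are illustrative assumptions, not the paper's own evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |accuracy - confidence| over equal-width confidence bins,
    weighted by how many answers fall in each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece

# A well-calibrated model: 90% confidence, 9 of 10 answers correct.
confs = [0.9] * 10
hits = [1] * 9 + [0]
print(round(expected_calibration_error(confs, hits), 3))  # → 0.0
```

A lower ECE means the model's reported probabilities can be trusted more when deciding whether an answer needs human review, which is why the paper treats better calibration as a safety-relevant property.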
In comparison to its older models, GPT-4 has become much better when tested on official medical exams such as the USMLE, improving by more than 30 percentage points over GPT-3.5. While GPT-3.5 was only getting close to the passing score (roughly 60% of multiple-choice questions answered correctly), GPT-4 cleared it by a wide margin.
Alignment and Safety In Place
When an earlier version of GPT-4, referred to as the base model, was compared with the released GPT-4, the base model performed slightly better, by about 3-5%, on some of the tests. This suggests that making the model safer and better at following instructions may have cost it a bit of raw performance. The researchers suggested that future work could focus on balancing accuracy and safety more effectively by refining the training process or by using specialised medical data.
Where does Med-PaLM fit in?
The above research did not compare GPT-4 with models such as Med-PaLM and Flan-PaLM 540B, as those models were not publicly available at the time of the study.
Google recently launched its multimodal healthcare LLM, Med-PaLM M, a large multimodal generative model that encodes and interprets biomedical data. Its capabilities go beyond GPT-4's in that it can handle various types of medical data, such as clinical language, medical images and genomics, and perform a wide range of tasks. The model can generalise to new medical tasks and carry out multimodal reasoning without specific training, and it can precisely recognise and explain medical conditions in images using only instructions and prompts given in language.
GPT-4's applications, by contrast, are not as diverse as the ones Med-PaLM offers. Though GPT-4 was announced with multimodal features, they are not yet available to users. Furthermore, there have been negative observations about GPT-4's capabilities in medical diagnosis: problematic and biased results were part of the outcome, raising concerns that GPT-4's inclination to embed societal biases may hamper its suitability for aiding clinical decisions.
The prevalent problem of hallucination also persists, with GPT-4 producing incorrect information; the model has been found to generate errors in over 20% of medical citations.
While GPT-4 might not be completely reliable as a diagnostic aid at its current level of performance, there are other functions the model can assist with. Hospitals are looking at AI to help relieve doctor burnout. With applications that can write notes for electronic health records and draft empathetic messages to patients, AI can help streamline the process. Transcribing doctor and patient comments and then producing a physician's summary for electronic health records is one of the best use cases in the medical field. Given its current limitations, GPT-4 still has a long way to go before it can be fully adopted in the medical field.
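The note-drafting workflow described above can be sketched as a prompt-assembly step: the transcript and patient details are wrapped in an instruction asking the model for a structured physician's summary. The section names, wording, and function below are hypothetical illustrations, not a real hospital template, and the actual LLM API call is deliberately left out.

```python
def build_summary_prompt(transcript: str, patient_name: str) -> str:
    """Assemble an instruction prompt asking a model to draft a
    SOAP-style (Subjective, Objective, Assessment, Plan) note.
    Illustrative only; not a clinical template."""
    return (
        "You are drafting a physician's summary for an electronic health record.\n"
        f"Patient: {patient_name}\n"
        "Structure the note as Subjective, Objective, Assessment, Plan.\n"
        "Flag any statement you are unsure about for clinician review.\n\n"
        f"Visit transcript:\n{transcript}"
    )

prompt = build_summary_prompt(
    "Patient reports a mild headache for two days, no fever.", "Jane Doe"
)
print(prompt)
```

The resulting string would then be sent to the model; keeping the "flag uncertain statements" instruction in the template is one way to keep a clinician in the loop, given the citation-error rates noted above.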