Can AI improve the accuracy of medical diagnoses? Researchers at UVA Health, a health care network affiliated with the University of Virginia, set out to answer that question. The result of their study is surprising: while AI can indeed outperform doctors in certain diagnostic tasks, its integration into their workflow did not significantly improve their overall performance.
Large language models (LLMs) have shown promising results on medical reasoning exams, whether the questions are multiple-choice or open-ended. However, their impact on the diagnostic reasoning of doctors in real clinical situations has yet to be established.
Andrew S. Parsons, who oversees the teaching of clinical skills to medical students at the University of Virginia School of Medicine and co-leads the Clinical Reasoning Research Collaborative, and his colleagues at UVA Health wanted to put ChatGPT Plus (GPT-4) to the test. Their study was published in the scientific journal JAMA Network Open and accepted this month by the 2024 symposium of the American Medical Informatics Association.
Study methodology
Researchers recruited 50 physicians practicing family medicine, internal medicine, and emergency medicine for a randomized controlled clinical trial at three leading hospitals: UVA Health, Stanford, and Harvard’s Beth Israel Deaconess Medical Center. Half of them were randomly assigned to use ChatGPT in addition to conventional methods such as Google or medical reference sites like UpToDate, while the other half relied solely on these conventional methods.
Participants were given 60 minutes to review up to six clinical vignettes, educational case descriptions used in medicine to assess and improve the clinical skills of healthcare professionals. These vignettes, based on real cases, included details of patient histories, physical examinations, and laboratory test results.
Results
The study found that doctors using ChatGPT Plus achieved a median diagnostic accuracy of 76.3%, slightly higher than the 73.7% of doctors relying solely on traditional tools. While that difference is modest, ChatGPT Plus used on its own achieved an impressive accuracy of 92%.
While trial participants using ChatGPT Plus reached a diagnosis slightly faster overall (519 seconds versus 565 seconds per case), they paradoxically dragged the AI’s diagnostic accuracy below what it achieved on its own.
For the researchers, this drop in accuracy could be due to the prompts used. They highlight the need to train clinicians in the optimal use of AI, in particular in writing more effective prompts. Alternatively, healthcare organizations could purchase predefined prompts to build into clinical workflows and documentation, as the sketch below illustrates.
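To make the idea of predefined prompts concrete, here is a minimal sketch of how a standardized diagnostic prompt could be wired into a workflow through the OpenAI Python client. This is not the study's setup (participants used the ChatGPT Plus interface directly); the prompt wording, model identifier, and example vignette are illustrative assumptions.

```python
# Minimal sketch (not from the study): embedding a predefined diagnostic
# prompt into a workflow via the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# A predefined prompt template a health system might standardize on,
# rather than leaving the prompt wording to each clinician.
DIAGNOSTIC_PROMPT = (
    "You are assisting a licensed physician. Based on the clinical vignette "
    "below, list the three most likely diagnoses, the key findings supporting "
    "each, and one test that would help confirm or rule each out."
)

def suggest_differential(vignette: str) -> str:
    """Send a vignette with the predefined prompt and return the model's answer."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[
            {"role": "system", "content": DIAGNOSTIC_PROMPT},
            {"role": "user", "content": vignette},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Hypothetical vignette, for illustration only.
    print(suggest_differential(
        "62-year-old man with 2 days of fever, productive cough, and "
        "right-sided pleuritic chest pain; crackles at the right lung base."
    ))
```

The point of such a template is that the quality of the model's output no longer depends on each clinician's prompt-writing skill, one of the weaknesses the researchers suspect behind the modest gains observed in the trial.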
They say ChatGPT Plus would likely perform less well in real life, where many other aspects of clinical reasoning come into play, particularly in determining the downstream effects of diagnoses and treatment decisions. They are calling for additional studies to assess the capabilities of large language models in these areas and are conducting a similar study on management decision making.
Conclusions
The results reveal a key nuance: although LLMs are capable of impressive stand-alone performance, their use alongside traditional methods did not significantly improve physicians’ diagnostic accuracy.
Researchers warn that “the results of this study should not be interpreted as indicating that LLMs should be used for diagnosis on a stand-alone basis without physician supervision,” adding that “further developments in human-machine interactions are needed to realize the potential of AI in clinical decision support systems.”
They have also launched a bicoastal AI evaluation network called ARiSE (AI Research and Science Evaluation) to further evaluate the results of GenAI in healthcare.
Article references
“Influence of a Large Language Model on Diagnostic Reasoning: A Randomized Clinical Trial,” JAMA Network Open, doi:10.1001/jamanetworkopen.2024.40969
Research team: Ethan Goh, Robert Gallo, Jason Hom, Eric Strong, Yingjie Weng, Hannah Kerman, Josephine A. Cool, Zahir Kanjee, Andrew S. Parsons, Neera Ahuja, Eric Horvitz, Daniel Yang, Arnold Milstein, Andrew P.J. Olson, Adam Rodman, and Jonathan H. Chen.