ChatGPT may recommend unnecessary medical imaging in the emergency department, suggest findings published October 8 in Nature Communications.
Researchers led by Christopher Williams, MD, from the University of California, San Francisco, found that ChatGPT tends to recommend unnecessary emergency care, including imaging, and is less accurate than resident physicians.
“Before large language models can be integrated within the clinical environment, it is important to fully understand both their capabilities and limitations,” Williams and colleagues wrote. “Otherwise, there is a risk of unintended harmful consequences, especially if models have been deployed at scale.”
Large language models continue to be explored by medical researchers for their utility in the clinic. In radiology, these models have demonstrated mixed results. While they can create accurate radiology reports and provide information for patients, they have also performed poorly on board exams and generated responses with technical language that is not easy for patients to understand. One earlier study led by Williams showed that ChatGPT is slightly better at deciding which of two emergency patients is more acutely unwell.
However, skeptics say these chatbots should not be relied upon too heavily, since their medical recommendations are not always accurate.
Williams and co-authors had ChatGPT provide the recommendations a physician would make after initially examining a patient in the emergency department: whether to admit the patient, order x-rays or other imaging, or prescribe antibiotics.
The researchers curated a set of 1,000 emergency department visits for each of these decisions, with each set containing the same ratio of “yes” to “no” responses for decisions on admission, radiology, and antibiotics. The team entered doctors’ notes on each patient’s symptoms and exam findings into ChatGPT-3.5 and ChatGPT-4, tested the accuracy of each set with a series of four increasingly detailed prompts, and compared the models’ responses with those of resident physicians.
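The paper’s own prompting pipeline is not reproduced here, but a minimal sketch of how a clinical note might be passed to a chat model for a yes/no decision could look like the following. The model name, prompt wording, and note text are illustrative assumptions, not the study’s actual setup.

```python
# Hypothetical sketch: asking a chat model for a binary ED decision.
# Prompt wording and the clinical note below are illustrative, not from the study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def recommend(note: str, decision: str, model: str = "gpt-4") -> str:
    """Ask the model whether a given ED decision is warranted for one visit note."""
    prompt = (
        "You are an emergency department physician. Based on the clinical note "
        f"below, should the patient receive the following: {decision}? "
        f"Answer with a single word, 'yes' or 'no'.\n\nClinical note:\n{note}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output makes yes/no parsing easier
    )
    return response.choices[0].message.content.strip().lower()

# Example call for the radiology decision on one synthetic note
answer = recommend(
    note="58-year-old with pleuritic chest pain, afebrile, SpO2 98% on room air...",
    decision="radiological investigation",
)
print(answer)  # expected to be "yes" or "no"
```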
Responses from the residents achieved lower sensitivity but higher specificity than those of ChatGPT-3.5. The team observed similar trends when comparing ChatGPT-4’s responses with the residents’, except on the antibiotic prescription task, where ChatGPT-4 demonstrated higher specificity but lower sensitivity.
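For readers unfamiliar with the terms, sensitivity, specificity, and accuracy can all be computed from the counts of true and false positives and negatives. The brief sketch below uses made-up counts, not the study’s data, to show how an “overly cautious” model that says “yes” too often trades specificity for sensitivity.

```python
# Illustrative metric definitions with made-up counts (not the study's data).
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    return {
        "sensitivity": tp / (tp + fn),                 # share of true "yes" cases caught
        "specificity": tn / (tn + fp),                 # share of true "no" cases correctly rejected
        "accuracy": (tp + tn) / (tp + fp + tn + fn),   # share of all cases classified correctly
    }

# A model that answers "yes" too often gains sensitivity but loses specificity,
# producing more false positives and dragging down overall accuracy.
print(metrics(tp=450, fp=300, tn=200, fn=50))
# {'sensitivity': 0.9, 'specificity': 0.4, 'accuracy': 0.65}
```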
For the most part, both versions of ChatGPT were also less accurate than the residents in their recommendations, although ChatGPT-4 achieved higher accuracy than the residents on the antibiotic prescription task.
Accuracy of ChatGPT in emergency department recommendations (with 1 as perfect accuracy)

| Task | Residents | ChatGPT-3.5 (range) | ChatGPT-4 (range for admission status) |
|---|---|---|---|
| Admission status | 0.83 | 0.29 to 0.53 | 0.43 to 0.58 |
| Radiological investigation | 0.78 | 0.68 to 0.71 | 0.74 |
| Antibiotic prescription status | 0.79 | 0.35 to 0.43 | 0.83 |
The results translate to ChatGPT-4 and ChatGPT-3.5 being 8% and 24% less accurate than resident physicians, respectively.
The study authors highlighted that their results point to large language models being “overly cautious,” with the consequence being a higher rate of false-positive cases.
“Such a finding is problematic given the need to both prioritize hospital resource availability and reduce overall healthcare costs,” they wrote.
The authors concluded that while these chatbots have shown promising early signs of clinical utility, there remains much room for improvement, especially on increasingly complex tasks.
The full results can be found in Nature Communications.