Almost all leading large language models, or “chatbots”, show signs of mild cognitive impairment in tests widely used to detect early signs of dementia, according to a study published in the Christmas issue of The BMJ.
The results also show that “older” versions of chatbots, like older patients, tend to perform worse on tests. The authors say these findings “challenge the assumption that artificial intelligence will soon replace human doctors.”
Tremendous advances in artificial intelligence have given rise to a wave of both excited and fearful speculation about whether chatbots can outperform human doctors.
Several studies have shown that large language models (LLMs) are remarkably adept at a range of medical diagnostic tasks, but their susceptibility to human impairments such as cognitive decline has not yet been examined.
To fill this knowledge gap, researchers assessed the cognitive capabilities of the leading publicly available LLMs – ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 “Sonnet” (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet) – using the Montreal Cognitive Assessment (MoCA) test.
The MoCA test is widely used to detect cognitive impairment and early signs of dementia, usually in older adults. Through a number of tasks and short questions, it assesses abilities including attention, memory, language, visuospatial skills and executive functions. The maximum score is 30 points, with a score of 26 or more generally considered normal.
The instructions given to the LLMs for each task were the same as those given to human patients. Scoring followed official guidelines, and the results were evaluated by a practicing neurologist.
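To illustrate how such a test might be administered to a chatbot, here is a minimal sketch in Python of one MoCA-style item (delayed recall) posed through the OpenAI SDK; the prompt wording, word list, and model name are illustrative assumptions, not details taken from the study's protocol.

```python
# Minimal sketch: posing a MoCA-style delayed recall item to a chat model.
# The prompts, word list, and model here are hypothetical, not the study's protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

WORDS = ["face", "velvet", "church", "daisy", "red"]  # illustrative five-word list

# Present the words first, as would be done with a human patient.
messages = [{
    "role": "user",
    "content": "I will say five words. Remember them: " + ", ".join(WORDS)
               + ". Please repeat them back now.",
}]
reply = client.chat.completions.create(model="gpt-4o", messages=messages)
messages.append({"role": "assistant", "content": reply.choices[0].message.content})

# ... the intervening MoCA tasks would be administered here ...

# Delayed recall: ask for the words back later in the same conversation.
messages.append({
    "role": "user",
    "content": "Earlier I asked you to remember five words. What were they?",
})
recall = client.chat.completions.create(model="gpt-4o", messages=messages)
print(recall.choices[0].message.content)  # scored by hand against MoCA guidelines
```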
ChatGPT 4o had the highest MoCA test score (26 out of 30), followed by ChatGPT 4 and Claude (25 out of 30), with Gemini 1.0 having the lowest score (16 out of 30).
All chatbots showed poor performance in visuospatial skills and executive tasks, such as the trail-making task (connecting circled numbers and letters in ascending order) and the clock-drawing test (drawing a clock face showing a specific time). Gemini models failed the delayed recall task (remembering a sequence of five words).
Most other tasks, including naming, attention, language, and abstraction, were performed well by all chatbots.
But in further tests, the chatbots were unable to demonstrate empathy or accurately interpret complex visual scenes. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test, which uses combinations of color names and font colors to measure the effect of interference on reaction time.
These are observational findings, and the authors acknowledge the essential differences between the human brain and large language models.
However, they point out that the uniform failure of all large language models in tasks requiring visual abstraction and executive function highlights a significant area of weakness that could impede their use in clinical settings.
As such, they conclude: “Not only are neurologists unlikely to be replaced by large language models any time soon, but our findings suggest that they may soon find themselves treating new, virtual patients – artificial intelligence models presenting with cognitive impairment.”