OpenAI measures the hallucinations of its LLMs (and those of its competitors)

Wednesday 13th November 2024 03:58 AM

To better evaluate the accuracy of its large language models, OpenAI has developed SimpleQA, a benchmark dedicated to factual answers. It’s confirmed: GPT models still have limits.

Anyone who has used an LLM has seen it happen: the model confidently hallucinates an incorrect answer. OpenAI hopes to improve the reliability of its models on this front. To that end, the company has introduced SimpleQA, its own open-source benchmark for measuring the accuracy of large language models' answers. Building the tool has also highlighted how limited current LLMs remain on certain questions.

SimpleQA was designed to evaluate the ability of OpenAI models to answer short, focused, fact-based questions. The methodology relies on a set of 4,326 clearly worded questions with verifiable answers, chosen to simplify grading. By limiting the scope of the benchmark to well-defined questions, OpenAI claims that SimpleQA allows for a more precise measurement of factuality.

Deliberately difficult questions

One important clarification concerns how the questions were selected. AI trainers wrote factual questions with precise answers that had to meet strict criteria: a single correct answer, an answer that does not change over time, and a tendency to cause hallucinations. The researchers kept only questions that at least one version of OpenAI’s LLMs had answered incorrectly. The results therefore reflect the models’ performance on deliberately difficult questions, not their general ability to answer factual questions.
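To make the selection rule concrete, here is a minimal Python sketch of that kind of filter. It is an illustration only, not OpenAI's actual pipeline: the `Candidate` class, the `is_correct` and `keep_hard_questions` helpers, the model names and the sample data are all hypothetical, and the exact-match grading is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate question (hypothetical structure, for illustration)."""
    question: str
    reference_answer: str
    model_answers: dict[str, str]  # model name -> answer given by that model


def is_correct(answer: str, reference: str) -> bool:
    # Simplifying assumption: exact match after trimming and lowercasing.
    return answer.strip().lower() == reference.strip().lower()


def keep_hard_questions(candidates: list[Candidate]) -> list[Candidate]:
    # Keep a question only if at least one model answered it incorrectly.
    return [
        c for c in candidates
        if any(not is_correct(a, c.reference_answer)
               for a in c.model_answers.values())
    ]


# Placeholder data, not taken from the benchmark.
sample = [
    Candidate("Q1", "Paris", {"model_a": "Paris", "model_b": "paris "}),
    Candidate("Q2", "1889", {"model_a": "1889", "model_b": "1887"}),
]
hard = keep_hard_questions(sample)
print([c.question for c in hard])  # prints ['Q2']
```

Because questions every model answers correctly are discarded, a score on the retained set is by construction lower than it would be on the full pool of candidates.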

A maximum of 40% correct answers

The results indicate that GPT-4o, the current version of GPT-4, answers around 40% of the questions correctly, while the o1-preview model scores slightly higher. Smaller models score lower still.

According to OpenAI researchers, SimpleQA could encourage research into more reliable AI. “The correlation between the ability to provide short factual answers and the ability to write detailed responses remains an open question,” they added.

Making LLMs more reliable is probably an even more crucial issue now that OpenAI has launched its own search engine.
