Study suggests that even the best AI models often hallucinate

All generative AI models hallucinate, from Google’s Gemini to Anthropic’s Claude to the latest stealth version of OpenAI’s GPT-4o. In other words, the models are unreliable narrators – sometimes to hilarious effect, sometimes to problematic effect.

But not all models make things up at the same rate. And the kinds of untruths they spread depend on which sources of information they’ve been exposed to.

A recent study by researchers at Cornell University, the Universities of Washington and Waterloo, and the nonprofit research institute AI2 set out to compare hallucinations across models, fact-checking the outputs of models like GPT-4o against trusted sources on topics ranging from law and health to history and geography. They found that no model performed particularly well across all topics, and that the models that hallucinated the least did so partly because they refused to answer questions they would otherwise get wrong.

“The most important lesson from our work is that we cannot yet fully trust the results of model generations,” Wenting Zhao, a doctoral student at Cornell University and co-author of the study, told TechCrunch. “Currently, even the best models can only generate hallucination-free text about 35% of the time.”

There have been other academic attempts to probe the “factuality” of models, including one by a separate team affiliated with AI2. But Zhao notes that these earlier tests asked the models questions whose answers could easily be found on Wikipedia – not exactly the most difficult questions, considering that most models are trained on Wikipedia data.

To make their benchmark more challenging – and to better reflect the kinds of questions people actually ask of models – the researchers identified topics around the web that are not covered by Wikipedia. Just over half of the questions in their test can’t be answered using Wikipedia (some questions with Wikipedia sources were added for good measure), and they touch on topics such as culture, geography, astronomy, pop culture, finance, medicine, computer science, and celebrities.

For their study, the researchers examined over a dozen popular models, many of them released in the past year. In addition to GPT-4o, they tested “open” models such as Meta’s Llama 3 70B, Mistral’s Mixtral 8x22B and Cohere’s Command R+, as well as models gated behind an API, such as Perplexity’s Sonar-Large (which is based on Llama), Google’s Gemini 1.5 Pro and Anthropic’s Claude 3 Opus.

The results suggest that models are not much less likely to hallucinate these days, despite OpenAI, Anthropic and other major players in the generative AI space claiming otherwise.

GPT-4o and OpenAI’s much older flagship GPT-3.5 performed about equally in the benchmark in terms of the percentage of questions they answered factually correctly. (GPT-4o was slightly better.) OpenAI’s models were the least hallucinatory overall, followed by Mixtral 8x22B, Command R, and Perplexity’s Sonar models.

Questions about celebrities and finance gave the models the most trouble, while questions about geography and computer science were the easiest for them to answer (perhaps because their training data contained more references to those subjects). In cases where the source of an answer was not Wikipedia, every model answered less factually on average (especially GPT-3.5 and GPT-4o), suggesting that they are all heavily informed by Wikipedia content.

Even models that can search the Internet for information, such as Command R and Perplexity’s Sonar models, struggled in the benchmark with “non-wiki” questions. Model size didn’t matter much; smaller models (e.g. Anthropic’s Claude 3 Haiku) hallucinated about as often as larger, seemingly more powerful models (e.g. Claude 3 Opus).

So what does all this mean – and where are the improvements that the providers promised?

Well, we wouldn’t put it past the vendors to exaggerate their claims. But a more charitable view is that the benchmarks they use aren’t fit for purpose. As we’ve written before, many, if not most, AI assessments are cursory and lack important context, doomed to fall victim to Goodhart’s Law.

Regardless, Zhao believes that the problem of hallucinations will “remain for a long time.”

“The empirical results of our work show that despite the promise that certain methods reduce or eliminate hallucinations, the actual improvement that can be achieved with these methods is limited,” she said. “In addition, our analysis shows that even the knowledge found on the Internet can often be contradictory, in part because the training data created by humans can also contain hallucinations.”

A workaround might be to simply program the models to refuse to answer questions more often – the technical equivalent of telling a know-it-all to stop.

In the researchers’ tests, Claude 3 Haiku answered only about 72% of the questions put to it and abstained from the rest. When those abstentions are taken into account, Claude 3 Haiku was actually the most accurate model of the lot – at least in the sense that it lied the least.
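
To make that accounting concrete, here is a rough sketch with invented numbers (a hypothetical illustration, not figures from the study): a model that abstains from a share of questions can end up making the fewest false statements overall, even while answering far fewer of them.

```python
# Hypothetical illustration of abstention accounting.
# The counts below are invented and are NOT results from the study.

def rates(total, answered, correct):
    """Return (answer rate, accuracy on answered questions, false-statement rate)."""
    wrong = answered - correct                 # answered incorrectly, i.e. hallucinated
    return answered / total, correct / answered, wrong / total

# Model A answers every question; Model B abstains on 28% of them.
models = {
    "model_A": (100, 100, 60),  # attempts all 100 questions, gets 60 right
    "model_B": (100, 72, 50),   # attempts 72 questions, gets 50 right
}

for name, (total, answered, correct) in models.items():
    answer_rate, accuracy, false_rate = rates(total, answered, correct)
    print(f"{name}: answers {answer_rate:.0%} of questions, "
          f"{accuracy:.0%} accuracy on those it answers, "
          f"false statements on {false_rate:.0%} of all questions asked")
```

With these made-up numbers, model_B makes false statements on only 22% of all questions versus 40% for model_A, simply because refusing to answer is not counted as a lie.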

But will people use a model that doesn’t answer many questions? Zhao doesn’t think so, and says vendors should spend more time and effort on research to reduce hallucinations. Eliminating hallucinations entirely may not be possible, but they can be mitigated by human intervention in fact-checking and citation during the development of a model, she claims.

“Policies and regulations need to be developed to ensure that human experts are always involved in the process to verify and validate the information generated by generative AI models,” Zhao added. “There are still numerous opportunities to make a significant impact in this area, such as developing advanced fact-checking tools for any free text, providing citations for factual content, and offering corrections for hallucinated text.”
