A study of newer, larger versions of three major AI chatbots finds that they are more likely to give an incorrect answer than to admit they don’t know. The study, published in Nature on Wednesday (Sept. 25), also found that people often struggle to spot these errors.
ReadWrite has previously reported on how chatbots can “hallucinate” answers to queries. José Hernández-Orallo of the Valencian Research Institute for Artificial Intelligence in Spain, along with his colleagues, examined these misfires to understand how they evolve as AI models scale up: trained on more data, built with more parameters (decision-making nodes), and consuming more computing power.
They also investigated whether the errors align with human perceptions of question difficulty, and how well people can recognize incorrect answers.
Are AI LLMs trustworthy?
The team discovered that larger, more refined versions of large language models (LLMs) are more accurate, largely thanks to fine-tuning methods such as reinforcement learning from human feedback. However, they are also less reliable: among responses that are not accurate, the share of outright wrong answers has risen, because these models are now less likely to avoid answering a question, for example by admitting they don’t know or by steering away from the topic.
One of the researchers, Lexin Zhou, wrote on X: “LLMs are indeed less correct on tasks that humans consider difficult, but they still do succeed at difficult tasks before being flawless on easy tasks, leading to no safe operation conditions humans can identify where LLMs can be trusted.”
1/ New paper @Nature!
Discrepancy between human expectations of task difficulty and LLM errors harms reliability. In 2022, Ilya Sutskever @ilyasut predicted: "perhaps over time that discrepancy will diminish" (https://t.co/HADDUztzhu, min 61-64).
We show this is *not* the case! pic.twitter.com/u2HYQbWE4j
— Lexin Zhou (@lexin_zhou) September 25, 2024
He added that it was “concerning” that the latest LLMs mainly improve on “high-difficulty instances,” exacerbating the mismatch between human expectations of difficulty and LLM success.
The team evaluated OpenAI’s GPT, Meta’s LLaMA, and BLOOM, testing both early and refined versions of each on prompts covering arithmetic, geography, and information transformation. Accuracy improved as the models grew larger but fell on more challenging questions.
Models, including GPT-4, tended to answer difficult questions rather than declining, and wrong answers exceeded 60 percent of responses for some refined models. Surprisingly, even easy questions were sometimes answered incorrectly. Volunteers misclassified inaccurate answers as correct between 10 and 40 percent of the time, underscoring how difficult it is to supervise these models.
Hernández-Orallo suggests that developers should “boost AI performance on easy questions” and encourage chatbots to avoid answering difficult ones, allowing users to more accurately assess when AIs are reliable. He states, “We need humans to understand: ‘I can use it in this area, and I shouldn’t use it in that area’.”
Featured image: Ideogram