LLMs Will Always Hallucinate, and We Need to Live with This
The paper by Sourav Banerjee and colleagues argues that hallucinations in Large Language Models are inherent and unavoidable, rooted in computational theory, and cannot be fully eliminated by improvements.
The paper titled "LLMs Will Always Hallucinate, and We Need to Live With This" by Sourav Banerjee and colleagues discusses the inherent limitations of Large Language Models (LLMs), particularly focusing on the phenomenon of hallucinations. The authors argue that hallucinations are not merely occasional errors but an unavoidable characteristic of LLMs due to their fundamental mathematical and logical structures. They assert that improvements in architecture, datasets, or fact-checking will not eliminate these hallucinations, which are rooted in computational theory and concepts such as Gödel's First Incompleteness Theorem. The paper introduces the idea of Structural Hallucination, emphasizing that every phase of the LLM process, from data compilation to text generation, carries a non-zero probability of producing inaccuracies. By establishing the mathematical inevitability of hallucinations, the authors challenge the belief that such errors can be completely mitigated.
- Hallucinations in LLMs are an inherent feature, not just occasional errors.
- Improvements in architecture or datasets cannot fully eliminate hallucinations.
- The concept of Structural Hallucination is introduced as a fundamental aspect of LLMs.
- The paper draws on computational theory and Gödel's First Incompleteness Theorem to support its claims.
- Every stage of the LLM process, from data compilation to text generation, carries a non-zero probability of producing hallucinations (a minimal compounding sketch follows this list).
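To make the compounding argument concrete, here is a minimal sketch in Python. The per-stage error rates are invented for illustration and are not taken from the paper; only the arithmetic (the overall error probability is one minus the product of the per-stage success probabilities) reflects the claim above.

```python
# Illustrative only: the per-stage error rates below are made up, not from the paper.
stages = {
    "data compilation": 0.01,
    "information retrieval": 0.02,
    "intent understanding": 0.01,
    "text generation": 0.03,
}

p_all_stages_correct = 1.0
for p_error in stages.values():
    p_all_stages_correct *= 1.0 - p_error

p_at_least_one_error = 1.0 - p_all_stages_correct
print(f"P(at least one stage errs) = {p_at_least_one_error:.3f}")  # ~0.068
```

However small each per-stage rate is, the product of the success probabilities stays strictly below one, so the overall probability of an inaccuracy never reaches zero; that is the sense in which the paper calls hallucination structural.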
Related
Detecting hallucinations in large language models using semantic entropy
Researchers devised a method to detect hallucinations in large language models like ChatGPT and Gemini by measuring semantic entropy. This approach enhances accuracy by filtering out unreliable answers, improving model performance significantly (a rough sketch of the entropy computation follows this related list).
Large Language Models are not a search engine
Large Language Models (LLMs) from Google and Meta generate algorithmic content, causing nonsensical "hallucinations." Companies struggle to manage errors post-generation due to factors like training data and temperature settings. LLMs aim to improve user interactions but raise skepticism about delivering factual information.
Overcoming the Limits of Large Language Models
Large language models (LLMs) like chatbots face challenges such as hallucinations, lack of confidence estimates, and citations. MIT researchers suggest strategies like curated training data and diverse worldviews to enhance LLM performance.
Have we stopped to think about what LLMs model?
Recent discussions critique claims that large language models understand language, emphasizing their limitations in capturing human linguistic complexities. The authors warn against deploying LLMs in critical sectors without proper regulation.
GPTs and Hallucination
Large language models, such as GPTs, generate coherent text but can produce hallucinations, leading to misinformation. Trust in their outputs is shifting from expert validation to crowdsourced consensus, affecting accuracy.
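For the semantic-entropy item above, the core computation can be sketched roughly as follows. This is a minimal sketch, not the researchers' implementation: `sample_answers` and `same_meaning` are hypothetical placeholders for drawing several answers from a model and judging whether two answers mean the same thing (e.g. with an entailment model).

```python
import math

def semantic_entropy(answers, same_meaning):
    """Cluster sampled answers by meaning, then compute entropy over the clusters."""
    clusters = []  # each cluster holds answers judged semantically equivalent
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    probs = [len(c) / len(answers) for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical usage: high entropy means the sampled answers scatter across many
# distinct meanings, a sign the model is guessing and more likely hallucinating.
# answers = sample_answers("Who wrote 'Middlemarch'?", n=10)
# if semantic_entropy(answers, same_meaning) > threshold:
#     flag_answer_as_unreliable()
```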
- Many commenters argue that hallucinations are an inherent feature of LLMs, not a malfunction, and suggest that this is a fundamental aspect of their design.
- There is a consensus that both LLMs and humans experience hallucinations, raising questions about the nature of intelligence and learning.
- Some participants emphasize the importance of understanding and managing hallucinations rather than attempting to eliminate them entirely.
- Several comments highlight the need for better calibration of model confidence and the implications of hallucinations in critical fields like law and science.
- The conversation also touches on the limitations of current LLM architectures and the potential for future improvements.
Having a mathematical proof is nice, but honestly this whole misunderstanding could have been avoided if we'd just picked a different name for the concept of "producing false information in the course of generating probabilistic text".
"Hallucination" makes it sound like something is going awry in the normal functioning of the model, which subtly suggests that if we could just identify what went awry we could get rid of the problem and restore normal cognitive function to the LLM. The trouble is that the normal functioning of the model is simply to produce plausible-sounding text.
A "hallucination" is not a malfunction of the model, it's a value judgement we assign to the resulting text. All it says is that the text produced is not fit for purpose. Seen through that lens it's obvious that mitigating hallucinations and creating "alignment" are actually identical problems, and we won't solve one without the other.
A human does not do this.
First of all, we have been asked most questions before. We have made mistakes answering them, and we remember those mistakes, so we don’t repeat them.
Secondly, we (at least some of us) think before we speak. We have an initial reaction to the question, and before expressing it, we relate that thought to other things we know. We may do “sanity checks” internally, often habitually, without even realizing it.
Therefore, we should not expect an LLM to generate the correct answer immediately without giving it space for reflection.
In fact, if you observe your thinking, you might notice that your thought process often takes on different roles and personas. Rarely do you answer a question from just one persona. Instead, most of your answers are the result of internal discussion and compromise.
We also create additional context, such as imagining the consequences of saying the answer we have in mind. Thoughts like that are only possible once an initial “draft” answer is formed in your head.
So, to evaluate the intelligence of an LLM based on its first “gut reaction” to a prompt is probably misguided.
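A rough sketch of the kind of “space for reflection” this comment describes, assuming a hypothetical `generate` callable that stands in for any LLM call. The draft/critique/revise loop is one common way to give a model that space; it is an illustration, not the commenter's specific proposal.

```python
def answer_with_reflection(question: str, generate, rounds: int = 2) -> str:
    """Draft an answer, critique it, and revise it, rather than trusting the first 'gut reaction'."""
    draft = generate(f"Question: {question}\nGive a first-pass answer.")
    for _ in range(rounds):
        critique = generate(
            f"Question: {question}\nDraft answer: {draft}\n"
            "List any factual or logical problems with this draft."
        )
        draft = generate(
            f"Question: {question}\nDraft answer: {draft}\nCritique: {critique}\n"
            "Rewrite the answer, fixing the problems identified."
        )
    return draft
```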
It essentially restates well known fundamental limitations of formal systems and mechanistic computation and then presents the trivial result that LLMs also share these limitations.
Unless some dualism or speculative supercomputational quantum stuff is invoked, this holds very much for humans too.
Isn’t incomplete data the whole point of learning in general? The reason we have machine learning is that the data is incomplete. If we had complete data, we wouldn’t need ML; we would just build a function that maps inputs to outputs based on the complete data. Machine learning is about filling in the gaps with predictions.
In fact, this is what learning in general does. It means this whole argument about incomplete data applies to human intelligence and learning as well.
Everything this theory is going after basically applies to learning and intelligence in general.
So sure you can say that LLMs will always hallucinate. But humans will also always hallucinate.
The real problem that needs to be solved is: how do we get LLMs to hallucinate in the same way humans hallucinate?
Consider that when models hallucinate, they are still doing what we trained them to do quite well, which is to produce text that is at least likely under the training distribution. So they implicitly fall back on more general patterns in the training data, i.e. grammar and simple word choice.
I have to imagine that the right architectural changes could still completely or mostly solve the hallucination problem. But it still seems like an open question as to whether we could make those changes and still get a model that can be trained efficiently.
Update: I took out the first sentence where I said "I don't agree" because I don't feel that I've given the paper a careful enough read to determine if the authors aren't in fact agreeing with me.
A lot of people appear to find this hurdle almost impossible to overcome.
I just recommend you don't pigeonhole yourself as an AI professional, because it's gonna be awfully cold outside pretty soon.
We cover the halting problem and intractable problems in the related work.
Of course LLMs cannot give answers to intractable problems.
I also don’t see why you should call an answer of “I cannot compute that” to a halting problem question a hallucination.
That seems like the lowest hanging fruit to me, like we would do that long before we have AI going over someone's medical records.
If the major game studios aren't confident enough in the tech to have it write dialogue for a Disney character, for fear of it saying the wrong thing, I'm not ready for it to do anything in the real world.
This challenge is particularly concerning in fields where accuracy is critical, such as scientific research, politics, or legal matters. For instance, the study noted that LLMs could produce inaccurate citations, misattribute quotes, or provide factually wrong information that might appear convincing but lacks a solid foundation. Such errors can lead to real-world consequences, as seen in cases where professionals have relied on LLM-generated content for tasks like legal research or coding, only to discover later that the information was incorrect. https://www.lycee.ai/blog/llm-hallucinations-report
Confabulate - To fill in gaps in one's memory with fabrications that one believes to be facts.
Hallucinate - To wander; to go astray; to err; to blunder; -- used of mental processes
Confabulation sounds a lot more like what LLMs actually do.
Example: The first 10 pages are meaningless bla
Jest aside, there is a long list of "flaws" in LLMs that no one seems to be addressing: hallucinations, cutoff dates, lack of true reasoning (the parlor tricks to get there don't cut it), size/cost constraints...
LLMs face the same issues as expert systems: without the constant input of subject-matter experts, your LLM quickly becomes outdated and useless for all but the most trivial of tasks.
It's kind of cool that we can make mathematical arguments for this, but the idea that generative models can function as universal automation is a fiction mostly being pushed by non-technical business and finance people, and it's a good demonstration of how we've let such people drive the priorities of technological development and adoption for far too long.
A common argument I see folks make is that humans are fallible too. Yes, no shit. No automation anywhere near as fallible as a human at its task could function as an automation. When we automate, we remove human accountability and human versatility from the equation entirely, and we can scale the error accumulation far beyond human capability. Thus, an automation that actually works needs drastically superhuman reliability, which is why functioning automations are usually narrow-domain machines.
To me this means two things:
1. Generative models can only be helpful for tasks where the user can already decide whether the output is useful. Retrieving a fact the user doesn’t already know is not one of those use cases. Making memes or emojis or stories that the user finds enjoyable might be. Writing pro forma texts that the user can proofread also might be.
2. There’s probably no successful business model for LLMs or generative models that is not already possible with the current generation of models. If you haven’t figured out a business model for an LLM that is “60% accurate” on some benchmark, there won’t be anything acceptable for an LLM that is “90% accurate”, so boiling yet another ocean to get there is not the golden path to profit. Rather, it will be up to companies and startups to create features that leverage the existing models and profit that way rather than investing in compute, etc.
Pure LLMs are better for brainstorming or thinking through a task.
It's like LLMs know all possible alternative theories (including contradictory ones), and which one they bring up depends on how you phrase the question and how much you already know about the subject.
The more accurate information you bring into the question, the more accurate information you get out of it.
If you're not very knowledgeable, you will only be able to tap into junior level knowledge. If you ask the kinds of questions that an expert would ask, then it will answer like an expert.
Something that often gives me pause is the possibility that we could come up with an architecture that has a good chance of being capable of AGI (RNNs, transformers, etc. as dynamical systems), yet the model weights that would make it happen cannot be found, because gradient descent fails or is not even viable.
A 100% correct LLM may be impossible. An LLM checker that produces a confidence value may be possible. We sure need one, although last week's proposal for one wasn't very good.
When someone says something practical can't be done because of the halting problem, they're probably going in the wrong direction.
The authors are all from something called "UnitedWeCare", which offers "AI-Powered Holistic Mental Health Solutions". Not sure what to make of that.
What is the likelihood that a junior college student with access to Google will generate a "hallucination" after reading a textbook and doing some basic research on a given topic? Probably pretty high.
In our culture, we're often told to fake it till you make it. How many of us are probabilistically hallucinating knowledge we've regurgitated from other sources?
> All of the LLMs knowledge comes from data. Therefore,… a larger more complete dataset is a solution for hallucination.
Not being able to include everything in the training data is the whole point of intelligence. This also holds for humans. If the system is sufficiently intelligent, it should be able to infer new knowledge, which refutes the very first assumption at the core of the work.
Is there a reason to believe this is not solvable as literally an API change? The necessary data are all there.
And humans habitually stray from the “truth” too. It’s always seemed to me that getting AI to be more accurate isn’t a math problem; it’s a matter of getting AI to “care” about what is true, i.e. better defining what truth is and which sources should be cited with what weights.
We can’t even keep humans in society from believing in the stupidest conspiracy theories. When humans get their knowledge from sources indiscriminately, they also parrot stupid shit that isn’t real.
Now enter Gödel’s Incompleteness Theorem: there is no perfect tie between language and reality. Super interesting. But this isn’t the issue, or at least it’s not more of an issue for robots than it is for humans.
If and when humans deliver “accurate” results in our dialogs, it’s because we’ve been trained to care about what “accuracy” is (as defined by society’s chosen sources).
Remember that AI “doesn’t live here.” It’s swimming in a mess of noisy context without guidance for what it should care about.
IMHO, as soon as we train AI to “care” at a basic level about what we culturally agree is “true” the hallucinations will diminish to be far smaller than the hallucinations of most humans.
I’m honestly not sure if that will be a good thing or the start of something horrifying.
In that sense, a hallucinating system seems like a promising step towards stronger AI. AI systems simply lack a way to test their beliefs against the real world in the way we can, so natural laws, historical information, art, and fiction all exist on the same epistemological level. This is a problem when integrating them into a useful theory, because there is no cost to getting the fundamentals wrong.