LLMs Will Always Hallucinate, and We Need to Live with This
The paper by Sourav Banerjee and colleagues argues that hallucinations in Large Language Models are inherent and unavoidable, rooted in computational theory, and cannot be fully eliminated by improvements.
The paper titled "LLMs Will Always Hallucinate, and We Need to Live With This" by Sourav Banerjee and colleagues discusses the inherent limitations of Large Language Models (LLMs), particularly focusing on the phenomenon of hallucinations. The authors argue that hallucinations are not merely occasional errors but an unavoidable characteristic of LLMs due to their fundamental mathematical and logical structures. They assert that improvements in architecture, datasets, or fact-checking will not eliminate these hallucinations, which are rooted in computational theory and concepts such as Gödel's First Incompleteness Theorem. The paper introduces the idea of Structural Hallucination, emphasizing that every phase of the LLM process, from data compilation to text generation, carries a non-zero probability of producing inaccuracies. By establishing the mathematical inevitability of hallucinations, the authors challenge the belief that such errors can be completely mitigated.
- Hallucinations in LLMs are an inherent feature, not just occasional errors.
- Improvements in architecture or datasets cannot fully eliminate hallucinations.
- The concept of Structural Hallucination is introduced as a fundamental aspect of LLMs.
- The paper draws on computational theory and Gödel's First Incompleteness Theorem to support its claims.
- Every stage of the LLM process, from data compilation to text generation, carries a non-zero probability of producing hallucinations (a minimal compounding sketch follows this list).
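To make the compounding argument concrete, here is a minimal sketch in Python. The per-stage error rates are invented for illustration and are not taken from the paper; only the arithmetic (the overall error probability is one minus the product of the per-stage success probabilities) reflects the claim above.

```python
# Illustrative only: the per-stage error rates below are made up, not from the paper.
stages = {
    "data compilation": 0.01,
    "information retrieval": 0.02,
    "intent understanding": 0.01,
    "text generation": 0.03,
}

p_all_stages_correct = 1.0
for p_error in stages.values():
    p_all_stages_correct *= 1.0 - p_error

p_at_least_one_error = 1.0 - p_all_stages_correct
print(f"P(at least one stage errs) = {p_at_least_one_error:.3f}")  # ~0.068
```

However small each per-stage rate is, the product of the success probabilities stays strictly below one, so the overall probability of an inaccuracy never reaches zero; that is the sense in which the paper calls hallucination structural.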
Related
Detecting hallucinations in large language models using semantic entropy
Researchers devised a method to detect hallucinations in large language models like ChatGPT and Gemini by measuring semantic entropy. This approach enhances accuracy by filtering out unreliable answers, improving model performance significantly (a rough sketch of the entropy computation follows this related list).
Large Language Models are not a search engine
Large Language Models (LLMs) from Google and Meta generate algorithmic content, causing nonsensical "hallucinations." Companies struggle to manage errors post-generation due to factors like training data and temperature settings. LLMs aim to improve user interactions but raise skepticism about delivering factual information.
Overcoming the Limits of Large Language Models
Large language models (LLMs) like chatbots face challenges such as hallucinations, lack of confidence estimates, and citations. MIT researchers suggest strategies like curated training data and diverse worldviews to enhance LLM performance.
Have we stopped to think about what LLMs model?
Recent discussions critique claims that large language models understand language, emphasizing their limitations in capturing human linguistic complexities. The authors warn against deploying LLMs in critical sectors without proper regulation.
GPTs and Hallucination
Large language models, such as GPTs, generate coherent text but can produce hallucinations, leading to misinformation. Trust in their outputs is shifting from expert validation to crowdsourced consensus, affecting accuracy.
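For the semantic-entropy item above, the core computation can be sketched roughly as follows. This is a minimal sketch, not the researchers' implementation: `sample_answers` and `same_meaning` are hypothetical placeholders for drawing several answers from a model and judging whether two answers mean the same thing (e.g. with an entailment model).

```python
import math

def semantic_entropy(answers, same_meaning):
    """Cluster sampled answers by meaning, then compute entropy over the clusters."""
    clusters = []  # each cluster holds answers judged semantically equivalent
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    probs = [len(c) / len(answers) for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical usage: high entropy means the sampled answers scatter across many
# distinct meanings, a sign the model is guessing and more likely hallucinating.
# answers = sample_answers("Who wrote 'Middlemarch'?", n=10)
# if semantic_entropy(answers, same_meaning) > threshold:
#     flag_answer_as_unreliable()
```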
- Many commenters argue that hallucinations are an inherent feature of LLMs, not a malfunction, and suggest that this is a fundamental aspect of their design.
- There is a consensus that both LLMs and humans experience hallucinations, raising questions about the nature of intelligence and learning.
- Some participants emphasize the importance of understanding and managing hallucinations rather than attempting to eliminate them entirely.
- Several comments highlight the need for better calibration of model confidence and the implications of hallucinations in critical fields like law and science.
- The conversation also touches on the limitations of current LLM architectures and the potential for future improvements.
Having a mathematical proof is nice, but honestly this whole misunderstanding could have been avoided if we'd just picked a different name for the concept of "producing false information in the course of generating probabilistic text".
"Hallucination" makes it sound like something is going awry in the normal functioning of the model, which subtly suggests that if we could just identify what went awry we could get rid of the problem and restore normal cognitive function to the LLM. The trouble is that the normal functioning of the model is simply to produce plausible-sounding text.
A "hallucination" is not a malfunction of the model, it's a value judgement we assign to the resulting text. All it says is that the text produced is not fit for purpose. Seen through that lens it's obvious that mitigating hallucinations and creating "alignment" are actually identical problems, and we won't solve one without the other.
A human does not do this.
First of all, we have been asked most questions before. We have made mistakes answering them, and we remember those mistakes, so we don’t repeat them.
Secondly, we (at least some of us) think before we speak. We have an initial reaction to the question, and before expressing it, we relate that thought to other things we know. We may do “sanity checks” internally, often habitually, without even realizing it.
Therefore, we should not expect an LLM to generate the correct answer immediately without giving it space for reflection.
In fact, if you observe your thinking, you might notice that your thought process often takes on different roles and personas. Rarely do you answer a question from just one persona. Instead, most of your answers are the result of internal discussion and compromise.
We also create additional context, such as imagining the consequences of saying the answer we have in mind. Thoughts like that are only possible once an initial “draft” answer is formed in your head.
So, to evaluate the intelligence of an LLM based on its first “gut reaction” to a prompt is probably misguided.
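A rough sketch of the kind of “space for reflection” this comment describes, assuming a hypothetical `generate` callable that stands in for any LLM call. The draft/critique/revise loop is one common way to give a model that space; it is an illustration, not the commenter's specific proposal.

```python
def answer_with_reflection(question: str, generate, rounds: int = 2) -> str:
    """Draft an answer, critique it, and revise it, rather than trusting the first 'gut reaction'."""
    draft = generate(f"Question: {question}\nGive a first-pass answer.")
    for _ in range(rounds):
        critique = generate(
            f"Question: {question}\nDraft answer: {draft}\n"
            "List any factual or logical problems with this draft."
        )
        draft = generate(
            f"Question: {question}\nDraft answer: {draft}\nCritique: {critique}\n"
            "Rewrite the answer, fixing the problems identified."
        )
    return draft
```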
It essentially restates well known fundamental limitations of formal systems and mechanistic computation and then presents the trivial result that LLMs also share these limitations.
Unless some dualism or speculative supercomputational quantum stuff is invoked, this holds very much for humans too.
Isn’t incomplete data the whole point of learning in general? The reason we have machine learning is that the data is incomplete. If we had complete data, we wouldn’t need ML; we would just build a function that maps inputs to outputs based on the complete data. Machine learning is about filling in the gaps with predictions.
In fact, this is what learning in general does. It means this whole argument about incomplete data applies to human intelligence and learning as well.
Everything this theory is going after basically applies to learning and intelligence in general.
So sure you can say that LLMs will always hallucinate. But humans will also always hallucinate.
The real problem that needs to be solved is: how do we get LLMs to hallucinate in the same way humans hallucinate?
Consider that when models hallucinate, they are still doing what we trained them to do quite well, which is to produce text that is at least likely under the training distribution. So they implicitly fall back on more general patterns in the training data, i.e. grammar and simple word choice.
I have to imagine that the right architectural changes could still completely or mostly solve the hallucination problem. But it still seems like an open question as to whether we could make those changes and still get a model that can be trained efficiently.
Update: I took out the first sentence where I said "I don't agree" because I don't feel that I've given the paper a careful enough read to determine if the authors aren't in fact agreeing with me.
A lot of people appear to find this hurdle almost impossible to overcome.
I just recommend you don't pigeonhole yourself as an AI professional, because it's gonna be awfully cold outside pretty soon.
We cover the halting problem and intractable problems in the related work.
Of course LLMs cannot give answers to intractable problems.
I also don’t see why you should call an answer of “I cannot compute that” to a halting problem question a hallucination.
That seems like the lowest hanging fruit to me, like we would do that long before we have AI going over someone's medical records.
If the major game studios aren't confident enough in the tech to have it write dialogue for a Disney character, for fear of it saying the wrong thing, I'm not ready for it to do anything in the real world.
This challenge is particularly concerning in fields where accuracy is critical, such as scientific research, politics, or legal matters. For instance, the study noted that LLMs could produce inaccurate citations, misattribute quotes, or provide factually wrong information that might appear convincing but lacks a solid foundation. Such errors can lead to real-world consequences, as seen in cases where professionals have relied on LLM-generated content for tasks like legal research or coding, only to discover later that the information was incorrect. https://www.lycee.ai/blog/llm-hallucinations-report
Confabulate - To fill in gaps in one's memory with fabrications that one believes to be facts.
Hallucinate - To wander; to go astray; to err; to blunder; -- used of mental processes
Confabulation sounds a lot more like what LLMs actually do.
Example: The first 10 pages are meaningless bla
Jest aside, there is a long list of "flaws" in LLMs that no one seems to be addressing: hallucinations, cutoff dates, lack of true reasoning (the parlor tricks to get there don't cut it), size/cost constraints...
LLMs face the same issues as expert systems: without the constant input of subject-matter experts, your LLM quickly becomes outdated and useless for all but the most trivial of tasks.
It's kind of cool that we can make mathematical arguments for this, but the idea that generative models can function as universal automation is a fiction mostly being pushed by non-technical business and finance people, and it's a good demonstration of how we've let such people drive the priorities of technological development and adoption for far too long.
A common argument I see folks make is that humans are fallible too. Yes, no shit. No automation anywhere near as fallible as a human at its task could function as an automation. When we automate, we remove human accountability and human versatility from the equation entirely, and we can scale the error accumulation far beyond human capability. Thus, an automation that actually works needs drastically superhuman reliability, which is why functioning automations are usually narrow-domain machines.
To me this means two things:
1. Generative models can only be helpful for tasks where the user can already decide whether the output is useful. Retrieving a fact the user doesn’t already know is not one of those use cases. Making memes or emojis or stories that the user finds enjoyable might be. Writing pro forma texts that the user can proofread also might be.
2. There’s probably no successful business model for LLMs or generative models that is not already possible with the current generation of models. If you haven’t figured out a business model for an LLM that is “60% accurate” on some benchmark, there won’t be anything acceptable for an LLM that is “90% accurate”, so boiling yet another ocean to get there is not the golden path to profit. Rather, it will be up to companies and startups to create features that leverage the existing models and profit that way rather than investing in compute, etc.
Pure LLMs are better for brainstorming or thinking through a task.
It's like LLMs know all possible alternative theories (including contradictory ones), and which one they bring up depends on how you phrase the question and how much you already know about the subject.
The more accurate information you bring into the question, the more accurate information you get out of it.
If you're not very knowledgeable, you will only be able to tap into junior level knowledge. If you ask the kinds of questions that an expert would ask, then it will answer like an expert.
Something that often gives me pause is the possibility that we could come up with an architecture that has a good chance of being capable of AGI (RNNs, transformers, etc. as dynamical systems), yet the model weights that would make it happen cannot be found, because gradient descent fails or is not even viable.
A 100% correct LLM may be impossible. An LLM checker that produces a confidence value may be possible. We sure need one, although last week's proposal for one wasn't very good.
When someone says something practical can't be done because of the halting problem, they're probably going in the wrong direction.
The authors are all from something called "UnitedWeCare", which offers "AI-Powered Holistic Mental Health Solutions". Not sure what to make of that.
What is the likelihood that a junior college student with access to Google will generate a "hallucination" after reading a textbook and doing some basic research on a given topic? Probably pretty high.
In our culture, we're often told to fake it till you make it. How many of us are probabilistically hallucinating knowledge we've regurgitated from other sources?
> All of the LLMs knowledge comes from data. Therefore,… a larger more complete dataset is a solution for hallucination.
Not being able to include everything in the training data is the whole point of intelligence. This also holds for humans. If the system is sufficiently intelligent, it should be able to infer new knowledge, which refutes the very first assumption at the core of the work.
Is there a reason to believe this is not solvable as literally an API change? The necessary data are all there.
And humans habitually stray from the “truth” too. It’s always seemed to me that getting AI to be more accurate isn’t a math problem; it’s a matter of getting AI to “care” about what is true, i.e. better defining what truth is and which sources should be cited with what weights.
We can’t even keep humans in society from believing in the stupidest conspiracy theories. When humans get their knowledge from sources indiscriminately, they also parrot stupid shit that isn’t real.
Now enter Gödel’s Incompleteness Theorem: there is no perfect tie between language and reality. Super interesting. But this isn’t the issue, or at least it’s not more of an issue for robots than it is for humans.
If and when humans deliver “accurate” results in our dialogs, it’s because we’ve been trained to care about what “accuracy” is (as defined by society’s chosen sources).
Remember that AI “doesn’t live here.” It’s swimming in a mess of noisy context without guidance for what it should care about.
IMHO, as soon as we train AI to “care” at a basic level about what we culturally agree is “true” the hallucinations will diminish to be far smaller than the hallucinations of most humans.
I’m honestly not sure if that will be a good thing or the start of something horrifying.
In that sense, a hallucinating system seems like a promising step towards stronger AI. AI systems simply lack a way to test their beliefs against the real world in the way we can, so natural laws, historical information, art, and fiction all exist on the same epistemological level. This is a problem when integrating them into a useful theory, because there is no cost to getting the fundamentals wrong.