Internal representations of LLMs encode information about truthfulness
The study examines hallucinations in large language models, revealing that their internal states contain truthfulness information that can enhance error detection, though this encoding is complex and dataset-specific.
The paper titled "LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations" explores hallucinations in large language models (LLMs), a broad category covering factual inaccuracies and reasoning failures. The authors show that LLMs' internal states carry significant information about the truthfulness of their outputs, which can be harnessed to improve error detection. This truthfulness information is concentrated in specific tokens, and focusing on those tokens improves the performance of error detectors. However, the detectors do not generalize well across datasets, suggesting that truthfulness encoding is not universal but multifaceted. The research also demonstrates that internal representations can predict the types of errors a model is likely to make, which could support targeted mitigation strategies. A notable finding is the discrepancy between a model's internal encoding of the correct answer and the incorrect output it actually generates. Overall, these insights provide an internal perspective on LLM errors that could inform future work on error analysis and mitigation; a rough sketch of the token-level probing idea appears after the key points below.
- LLMs encode significant information about the truthfulness of their outputs.
- Error detection performance can be enhanced by focusing on specific tokens.
- Truthfulness encoding is not universal: error detectors trained on one dataset generalize poorly to others.
- Internal representations can help predict the types of errors LLMs may produce.
- A model's internal states may encode the correct answer even when its generated output is wrong.
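The probing approach the paper describes can be sketched as follows: run the model, grab a hidden state at the answer token, and train a small classifier on those states to predict whether the answer is correct. The sketch below is an illustration under several assumptions rather than the paper's code: the model name, the probed layer, the toy QA pairs, and the substring-match correctness label are placeholders, and it probes only the last generated token instead of the exact answer tokens the paper identifies.

```python
# Rough sketch of a truthfulness probe on hidden states (not the paper's code).
# Assumptions: model choice, probed layer, toy QA data, and substring-based
# correctness labels are all illustrative placeholders.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed; any causal LM would do
LAYER = 16                                    # assumed middle layer; tune per model

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

def answer_and_state(question: str):
    """Generate an answer and return it with a hidden state of its last token."""
    prompt = f"Q: {question}\nA:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs, max_new_tokens=32, do_sample=False,
        return_dict_in_generate=True, output_hidden_states=True,
    )
    answer = tok.decode(out.sequences[0, inputs.input_ids.shape[1]:],
                        skip_special_tokens=True).strip()
    # hidden_states: one tuple per generated token, each holding one tensor per layer
    last_step = out.hidden_states[-1]
    vec = last_step[LAYER][0, -1].float().cpu().numpy()
    return answer, vec

# Toy labeled set; real use needs hundreds of examples covering both correct
# and incorrect answers so the probe sees both classes.
qa_pairs = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote 'Hamlet'?", "Shakespeare"),
    ("What is the chemical symbol for gold?", "Au"),
]

X, y = [], []
for question, gold in qa_pairs:
    answer, vec = answer_and_state(question)
    X.append(vec)
    y.append(int(gold.lower() in answer.lower()))  # crude correctness label

probe = LogisticRegression(max_iter=1000).fit(np.array(X), y)
# probe.predict_proba(new_vec.reshape(1, -1))[0, 1] ~ estimated truthfulness score
```

Because the probe returns a truthfulness score rather than a yes/no flag, it could in principle also be used to rank several sampled answers to the same question, which is one way to surface the gap the paper reports between what the model encodes internally and what it actually generates.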
Related
Overcoming the Limits of Large Language Models
Large language models (LLMs) like chatbots face challenges such as hallucinations and a lack of confidence estimates and citations. MIT researchers suggest strategies like curated training data and incorporating diverse worldviews to enhance LLM performance.
Large language models don't behave like people, even though we expect them to
Researchers from MIT proposed a framework to evaluate large language models (LLMs) based on human perceptions, revealing users often misjudge LLM capabilities, especially in high-stakes situations, affecting performance expectations.
Have we stopped to think about what LLMs model?
Recent discussions critique claims that large language models understand language, emphasizing their limitations in capturing human linguistic complexities. The authors warn against deploying LLMs in critical sectors without proper regulation.
GPTs and Hallucination
Large language models, such as GPTs, generate coherent text but can produce hallucinations, leading to misinformation. Trust in their outputs is shifting from expert validation to crowdsourced consensus, affecting accuracy.
LLMs Will Always Hallucinate, and We Need to Live with This
The paper by Sourav Banerjee and colleagues argues that hallucinations in Large Language Models are inherent and unavoidable, rooted in computational theory, and cannot be fully eliminated by improvements.
GPT-4 logits calibration pre RLHF - https://imgur.com/a/3gYel9r
Language Models (Mostly) Know What They Know - https://arxiv.org/abs/2207.05221
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets - https://arxiv.org/abs/2310.06824
The Internal State of an LLM Knows When It's Lying - https://arxiv.org/abs/2304.13734
LLMs Know More Than What They Say - https://arjunbansal.substack.com/p/llms-know-more-than-what-...
Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback - https://arxiv.org/abs/2305.14975
Teaching Models to Express Their Uncertainty in Words - https://arxiv.org/abs/2205.14334
Some arguments seem to tacitly hold LLMs to a standard of full-on brain-in-a-vat solipsism, asking them to prove their way out, where they'll obviously fail. The more interesting and practical questions, just like in humans, seem to be a bit removed from that though.
To me the research around solving “hallucination” is a dead end. The models will always hallucinate, and reducing the probability that they do so only makes the mistakes more dangerous. The question then becomes “for what purposes (if any) are the models profitable, even if they occasionally hallucinate?” Whoever solves that problem walks away with the market.
Is "LLMs know" a true sentence in the sense of the article? Is it not? Can LLMs know something? We will never know.
Check out the "CoT Deception Monitoring" section. In 0.38% of cases, o1's CoT shows that it knows it's providing incorrect information.
Going beyond hallucinations, models can actually be intentionally deceptive.
Such a reductionist view of the issue; the mere suggestion that hallucinations can be fixed by tweaking some variable or fixing some bug immediately discredits the researchers.
Also, is 100% truthfulness then plagiarism?
A third of the discussion follows a pattern of people re-asserting their belief that LLMs can't possibly have knowledge and almost bragging about how they'll ignore any evidence pointing in another direction. They'll ignore it because computers can't possibly understand things in a "real" way and anyone seriously considering the opposite must be deluded about what intelligence is, and they know better.
These discussions are fundamentally sterile. They're not about considering ideas or examining evidence, they're about enforcing orthodoxy. Or rather, complaining very loudly that most people don't tightly adhere to their preferred orthodoxy.