October 30th, 2024

Internal representations of LLMs encode information about truthfulness

The study examines hallucinations in large language models, revealing that their internal states contain truthfulness information that can enhance error detection, though this encoding is complex and dataset-specific.

The paper titled "LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations" explores the phenomenon of hallucinations in large language models (LLMs), which include factual inaccuracies and reasoning failures. The authors reveal that LLMs' internal states contain significant information about the truthfulness of their outputs, which can be harnessed to improve error detection. They find that this truthfulness information is concentrated in specific tokens, enhancing the performance of error detection systems. However, the study also indicates that these error detectors do not generalize well across different datasets, suggesting that truthfulness encoding is not universal but rather complex. Additionally, the research demonstrates that internal representations can predict the types of errors LLMs are likely to make, aiding in the development of targeted mitigation strategies. A notable finding is the discrepancy between the internal encoding of correct answers and the incorrect outputs generated by the models. Overall, these insights provide a deeper understanding of LLM errors from an internal perspective, which could inform future research aimed at improving error analysis and mitigation techniques.

- LLMs encode significant information about the truthfulness of their outputs.

- Error detection performance can be enhanced by focusing on specific tokens.

- Truthfulness encoding in LLMs is complex and not universally applicable across datasets.

- Internal representations can help predict the types of errors LLMs may produce.

- There is often a mismatch between the correct internal encoding and the generated outputs.
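
As a rough illustration of the probing idea behind these findings: run the model over a question together with its generated answer, take a hidden state at an answer token, and train a small classifier to predict whether that answer is correct. The sketch below is a minimal approximation, not the authors' code; the model name, layer index, and the `labeled_examples` structure are assumptions, and it relies on the Hugging Face `transformers` and scikit-learn APIs.

```python
# A minimal probing sketch (not the paper's code): classify answer correctness
# from a hidden state taken at an answer token. The model name, layer choice,
# and the shape of labeled_examples are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # assumption: any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 16  # assumption: a single middle layer; the paper compares layers and token positions

def answer_token_state(prompt: str, answer: str) -> torch.Tensor:
    """Hidden state at the final answer token after running prompt + answer."""
    ids = tok(prompt + answer, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]        # vector for the last token

def train_probe(labeled_examples):
    """labeled_examples: iterable of (prompt, generated_answer, is_correct) triples."""
    X = torch.stack([answer_token_state(p, a) for p, a, _ in labeled_examples]).float().numpy()
    y = [int(correct) for _, _, correct in labeled_examples]
    return LogisticRegression(max_iter=1000).fit(X, y)
```

A probe trained this way on one dataset is exactly the kind of detector the paper reports as failing to generalize to other datasets.
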

14 comments

By @og_kalu - 4 months
Related:

GPT-4 logits calibration pre RLHF - https://imgur.com/a/3gYel9r

Language Models (Mostly) Know What They Know - https://arxiv.org/abs/2207.05221

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets - https://arxiv.org/abs/2310.06824

The Internal State of an LLM Knows When It's Lying - https://arxiv.org/abs/2304.13734

LLMs Know More Than What They Say - https://arjunbansal.substack.com/p/llms-know-more-than-what-...

Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback - https://arxiv.org/abs/2305.14975

Teaching Models to Express Their Uncertainty in Words - https://arxiv.org/abs/2205.14334

By @niam - 4 months
I feel that discussion of papers like these so often devolves into conversations about how it's "impossible for a bot to know what's true" that we should just bite the bullet and define what we mean by "truth".

Some arguments seem to tacitly hold LLMs to a standard of full-on brain-in-a-vat solipsism, asking them to prove their way out, where they'll obviously fail. The more interesting and practical questions, just like in humans, seem to be a bit removed from that though.

By @lsy - 4 months
There can’t be any information about “truthfulness” encoded in an LLM, because there isn’t a notion of “truthfulness” for a program which has only ever been fed tokens and can only ever regurgitate their statistical correlations. If the program was trained on a thousand data points saying that the capital of Connecticut is Moscow, the model would encode this “truthfulness” information about that fact, despite it being false.

To me the research around solving “hallucination” is a dead end. The models will always hallucinate, and merely reducing the probability that they do so only makes the mistakes more dangerous. The question then becomes “for what purposes (if any) are the models profitable, even if they occasionally hallucinate?” Whoever solves that problem walks away with the market.

By @benocodes - 4 months
I think this article about the research is good, even though the headline seems a bit off: https://venturebeat.com/ai/study-finds-llms-can-identify-the...

By @youoy - 4 months
I dream of a world where AI researchers use language in a scientific way.

Is "LLMs know" a true sentence in the sense of the article? Is it not? Can LLMs know something? We will never know.

By @kmckiern - 4 months
https://cdn.openai.com/o1-system-card-20240917.pdf

Check out the "CoT Deception Monitoring" section. In 0.38% of cases, o1's CoT shows that it knows it's providing incorrect information.

Going beyond hallucinations, models can actually be intentionally deceptive.

By @TZubiri - 4 months
Getting "we found the gene for cancer" vibes.

Such a reductionist view of the issue; the mere suggestion that hallucinations can be fixed by tweaking some variable or fixing some bug immediately discredits the researchers.

By @jessfyi - 4 months
The conclusions reached in the paper and the headline differ significantly. Not sure why you took a line from the abstract when further down it notes that only some elements of "truthfulness" are encoded and that "truth" as a concept is multifaceted. It also notes that LLMs can encode the correct answer yet consistently output an incorrect one, with strategies mentioned in the text to potentially reconcile the two, but as of yet no real concrete solution.
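
One way to picture the "reconcile the two" strategies the comment refers to is to use the truthfulness probe to choose among several sampled answers instead of trusting the single greedy output. This is a minimal sketch under the same assumptions as the earlier snippet; it reuses the hypothetical `tok`, `model`, `answer_token_state`, and a `probe` returned by `train_probe`, and the sampling settings are illustrative.

```python
# Hypothetical follow-up to the probing sketch above: reuses tok, model,
# answer_token_state, and a probe trained with train_probe. Sampling settings
# are illustrative assumptions.
def pick_by_probe(prompt: str, probe, n_samples: int = 10) -> str:
    """Sample several answers and return the one the probe rates most likely correct."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        gens = model.generate(
            **ids,
            do_sample=True,
            temperature=0.8,                 # diverse sampling so candidates differ
            num_return_sequences=n_samples,
            max_new_tokens=32,
            pad_token_id=tok.eos_token_id,
        )
    prompt_len = ids["input_ids"].shape[1]
    answers = [tok.decode(g[prompt_len:], skip_special_tokens=True) for g in gens]
    # Score each candidate with the truthfulness probe and keep the best one.
    feats = torch.stack([answer_token_state(prompt, a) for a in answers]).float().numpy()
    scores = probe.predict_proba(feats)[:, 1]   # probability of the "correct" class
    return answers[int(scores.argmax())]
```
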
By @mdp2021 - 4 months
Extremely promising, realizing that the worth is to be found in the intermediates, which contain much more than the single final output.

By @ldjkfkdsjnv - 4 months
There is a theory that AI will kill propaganda and false beliefs. At some point, you cannot force all models to have bias. Scientific and societal truths will be readily spoken by the machine god.

By @manmal - 4 months
That would be truthfulness to the training material, I guess. If you train on Reddit posts, it’s questionable how true the output really is.

Also, 100% truthfulness then is plagiarism?

By @z3c0 - 4 months
Could it be that language patterns themselves embed truthfulness, especially when that language is sourced from forums, wikis, etc? While I know plenty of examples exist to the contrary (propaganda, advertising, disinformation, etc), I don't think it's too optimistic to assert that most people engage in language in earnest, and thus, most language is an attempted conveyance of truth.

By @PoignardAzur - 4 months
The HN discussions for these kinds of articles are so annoying.

A third of the discussion follows a pattern of people re-asserting their belief that LLMs can't possibly have knowledge and almost bragging about how they'll ignore any evidence pointing in another direction. They'll ignore it because computers can't possibly understand things in a "real" way and anyone seriously considering the opposite must be deluded about what intelligence is, and they know better.

These discussions are fundamentally sterile. They're not about considering ideas or examining evidence, they're about enforcing orthodoxy. Or rather, complaining very loudly that most people don't tightly adhere to their preferred orthodoxy.