Detecting hallucinations in large language models using semantic entropy
Researchers devised a method to detect hallucinations in large language models such as ChatGPT and Gemini by measuring semantic entropy. By filtering out answers the model is likely to confabulate, the approach noticeably improves question-answering accuracy.
Researchers have developed a method to detect hallucinations, specifically confabulations, in large language models (LLMs) like ChatGPT and Gemini. These hallucinations produce incorrect and arbitrary outputs, posing risks in many fields. The method measures semantic entropy to identify when an LLM is likely to generate unreliable answers: by clustering sampled answers with similar meanings, it can pinpoint confabulations without prior knowledge of the task and without labeled examples. It improves question-answering accuracy by declining to answer questions prone to confabulation. Evaluated across different domains and LLM sizes, the approach proves robust at detecting confabulations, outperforms supervised techniques, and enhances model performance by filtering out uncertain responses, offering a valuable tool for improving the reliability of LLMs in free-form text generation.
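For the flavor of the method, here is a minimal sketch of the clustering-and-entropy idea: sample several answers, group the ones that mean the same thing, and take the entropy over the groups. The `sample_answers` and `entails` callables are hypothetical placeholders (the paper decides whether two answers share a meaning via bidirectional entailment with an NLI model):

```python
import math

def semantic_entropy(question, sample_answers, entails, n_samples=10):
    """Discrete semantic entropy over sampled answers (illustrative sketch).

    sample_answers(question, n) -> list of n sampled answer strings  (placeholder)
    entails(a, b) -> bool, does answer a entail answer b             (placeholder,
                     e.g. backed by an NLI model)
    """
    answers = sample_answers(question, n_samples)

    # Greedily group answers into semantic-equivalence clusters: two answers
    # share a cluster if each entails the other.
    clusters = []
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]
            if entails(ans, rep) and entails(rep, ans):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])

    # Estimate cluster probabilities from frequencies and take the entropy:
    # many clusters with similar mass -> high semantic entropy -> likely confabulation.
    probs = [len(c) / len(answers) for c in clusters]
    return -sum(p * math.log(p) for p in probs)
```

A high semantic entropy flags the question as one the model should probably decline rather than answer.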
Related
Optimizing AI Inference at Character.ai
Character.AI optimizes AI inference for LLMs, handling 20,000+ queries/sec globally. Innovations like Multi-Query Attention and int8 quantization reduced serving costs by 33x since late 2022, aiming to enhance AI capabilities worldwide.
GitHub – Karpathy/LLM101n: LLM101n: Let's Build a Storyteller
The GitHub repository "LLM101n: Let's build a Storyteller" offers a course on creating a Storyteller AI Large Language Model using Python, C, and CUDA. It caters to beginners, covering language modeling, deployment, programming, data types, deep learning, and neural nets. Additional chapters and appendices are available for further exploration.
Lessons About the Human Mind from Artificial Intelligence
In 2022, a Google engineer claimed AI chatbot LaMDA was self-aware, but further scrutiny revealed it mimicked human-like responses without true understanding. This incident underscores AI limitations in comprehension and originality.
Delving into ChatGPT usage in academic writing through excess vocabulary
A study by Dmitry Kobak et al. examines ChatGPT's impact on academic writing, finding increased usage in PubMed abstracts. Concerns arise over accuracy and bias despite advanced text generation capabilities.
Francois Chollet – LLMs won't lead to AGI – $1M Prize to find solution [video]
The video discusses limitations of large language models in AI, emphasizing genuine understanding and problem-solving skills. A prize incentivizes AI systems showcasing these abilities. Adaptability and knowledge acquisition are highlighted as crucial for true intelligence.
For any given input text there is a corresponding output distribution (e.g., the probabilities over token sequences from which the model draws samples).
The problem with drawing several samples and evaluating the entropy and/or disagreement between those draws is that it relies on already knowing the properties of the output distribution. One distribution may legitimately be much more uniformly random than another that has high certainty. It's not clear to me that they have demonstrated this underlying assumption.
Take celebrity info, for example: "What is Tom Cruise known for?" The phrases "movie star", "Katie Holmes", "Top Gun", and "Scientology" sit in quite different places in the word vector space and would score low on semantic similarity, yet all are accurate outputs.
On the other hand, for "What is Taylor Swift known for?", the answers "standup comedy", "comedian", and "comedy actress" are semantically similar but represent hallucinations. Without knowing the distribution's characteristics (e.g., multivariate moments and estimates), we couldn't say for certain whether answers are correct merely from their proximity in vector space.
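To make that concrete, here is a rough sketch that scores pairwise similarity of those candidate answers with an off-the-shelf sentence-embedding model. The sentence-transformers package and the model name are assumptions for illustration, not something from the paper, and the paper itself clusters by bidirectional entailment rather than raw embedding distance:

```python
from itertools import combinations
from sentence_transformers import SentenceTransformer, util  # assumed installed

# Example model choice; any sentence-embedding model would do.
model = SentenceTransformer("all-MiniLM-L6-v2")

cruise = ["movie star", "Katie Holmes", "Top Gun", "Scientology"]  # diverse but all true
swift = ["standup comedy", "comedian", "comedy actress"]           # similar but all false

for label, answers in [("Tom Cruise", cruise), ("Taylor Swift", swift)]:
    emb = model.encode(answers, convert_to_tensor=True)
    sims = [util.cos_sim(emb[i], emb[j]).item()
            for i, j in combinations(range(len(answers)), 2)]
    print(label, "mean pairwise cosine similarity:", round(sum(sims) / len(sims), 3))
```

The diverse-but-true Tom Cruise answers would tend to score lower average similarity than the similar-but-false Taylor Swift answers, which is exactly the commenter's worry: proximity alone does not certify correctness.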
As some have pointed out in this thread, knowing the correct distribution of word sequences for a given input sequence is the very job the LLM is solving, so there is no way of evaluating the output distribution to determine its correctness.
There are actual statistical models to evaluate the amount of uncertainty in output from ANNs (albeit a bit limited), but they are probably not feasible at the scale of LLMs. Perhaps a layer or two could be used to create a partial estimate of uncertainty (e.g. final 2 layers), but this would be a severe truncation of overall network uncertainty.
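One classic example of such a technique is Monte Carlo dropout, which estimates predictive uncertainty by averaging several stochastic forward passes. The sketch below assumes a generic torch.nn.Module classifier containing dropout layers and only illustrates the idea; as the comment notes, applying this to a full LLM is another matter:

```python
import torch

def mc_dropout_uncertainty(model, x, n_passes=20):
    """Monte Carlo dropout: average several stochastic forward passes and use
    the predictive entropy as an uncertainty score. `model` is assumed to be a
    torch.nn.Module classifier that contains dropout layers.
    """
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_passes)]
        )
    mean = probs.mean(dim=0)
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)
    return mean, entropy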
Another reason I mention this is most hallucinations I encounter are very plausible and often close to the right thing (swapping a variable name, confabulating a config key), which appear very convincing and "in sample", but are actually incorrect.
The process could look like this:
1. Use existing large models to convert the same dataset they were trained on into formal logical relationships, letting them generate multiple candidate solutions.
2. Take this enriched dataset and train a new LLM that outputs not only the next token but also the formal relationships between prior knowledge and the newly generated text.
3. The network can then optimize its weights until the generated formal code scores highly with a proof checker, alongside the usual token-prediction objective (a rough sketch of such a combined objective follows the list).
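As a purely hypothetical illustration of step 3, this is what mixing a token-prediction loss with a proof-checker signal might look like; `proof_checker` is a placeholder, and nothing here describes an existing system:

```python
import torch
import torch.nn.functional as F

def combined_objective(token_logits, token_targets, formal_statements,
                       proof_checker, alpha=0.5):
    """Hypothetical objective mixing next-token prediction with a proof-checker
    signal, as in step 3 above.

    token_logits:      (batch, seq, vocab) model outputs
    token_targets:     (batch, seq) gold next tokens
    formal_statements: generated formal-logic strings, one per example
    proof_checker:     placeholder callable returning 1.0 if a statement checks out
    """
    lm_loss = F.cross_entropy(
        token_logits.reshape(-1, token_logits.size(-1)),
        token_targets.reshape(-1),
    )
    # The proof-checker term is non-differentiable as written; in practice it
    # would have to enter training via an RL-style reward or a learned verifier.
    proof_acc = sum(proof_checker(s) for s in formal_statements) / len(formal_statements)
    return lm_loss - alpha * proof_acc  # reward high proof accuracy
```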
In my own mind I feel language is secondary; it's not the base of my intelligence. The base seems more like a dreamy simulation where things are consistent with each other, and language is just what I use to describe it.
One expression of that idea is in this paper: https://link.springer.com/article/10.1007/s10676-024-09775-5
The only way to know if it did “hallucinate” is to already know the correct answer. If you can make a system that knows when an answer is right or not, you no longer need the LLM!
Yes, there seems to be a little bit of grokking and the models can be made to approximate step-by-step reasoning a little bit. But 95% of the function of these black boxes is text generation. Not fact generation, not knowledge generation. They are more like improv partners than encyclopedias and everyone in tech knows it.
I don't know whether LLMs misleading people needs a clever answer-entropy solution. It is a very interesting solution that really does seem like it would improve things, effectively attaching certainty scores to statements. But what if we just stopped marketing machine-learning text generators as near-AGI, which they are not? Wouldn't that undo most of the damage, and arguably help us much more?
That's reasonable for questions with a single objective answer. It probably won't help when multiple, equally valid answers are possible.
However, that's good enough for search engine applications.
Of course, altering the input without changing the meaning is the hard part, but doesn't seem entirely infeasible. At the least, you could just ask the LLM to try to alter the input without changing the meaning, although you might end up in a situation where it alters the input in a way that aligns with its own faulty understanding of an input, meaning it could match the hallucinated output better after modification.
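A minimal sketch of that reword-and-compare idea, with every callable a hypothetical placeholder (e.g., `paraphrase` could itself prompt the LLM, `agree` could use an NLI model):

```python
def paraphrase_consistency(question, paraphrase, answer, agree, n_variants=5):
    """Reword the question several times and check whether the answers agree.

    paraphrase(question) -> a reworded question        (placeholder, could prompt the LLM)
    answer(question)     -> the model's answer string  (placeholder)
    agree(a, b)          -> bool, semantic agreement   (placeholder, e.g. an NLI model)
    """
    baseline = answer(question)
    variants = [paraphrase(question) for _ in range(n_variants)]
    agreements = [agree(baseline, answer(v)) for v in variants]
    # Low agreement across rewordings suggests the original answer is unstable,
    # though (as noted above) a model may paraphrase in line with its own
    # misunderstanding and agree with itself anyway.
    return sum(agreements) / len(agreements)
```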
“checking” is being done with another model.
“differently” is being measured with entropy.
It’s an interesting idea to measure certainty this way. The problem remains that the model can be certain in this way and wrong. But the author did say this was a partial solution.
Still, wouldn’t we be able to already produce a confidence score at the model level like this? Instead of a “post-processor”?
So this is basically saying we shouldn't try to estimate entropy over logits, but should be able to learn a function from activations earlier in the network to a degree of uncertainty that would signal (aka be classifiable as) confabulation.
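A minimal sketch of such a probe: a linear classifier mapping intermediate activations to a confabulation label. The data below is a random placeholder; in a real setup the activations would come from a chosen hidden layer and the labels from an accuracy or semantic-entropy signal:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: one hidden-state vector per generated answer (here random)
# and a 0/1 label for whether that answer turned out to be a confabulation.
rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 4096))  # stand-in for real intermediate activations
labels = rng.integers(0, 2, size=1000)       # stand-in for confabulation labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))  # ~0.5 on random data
```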
As an example: while we were building our AI chatbot for Ora2Pg, the main challenge was that we started with OpenAI and several other models. To avoid hallucinations as far as possible, we went through several stages, including PDR and then knowledge graphs, added FAQs, and then used an agentic approach to back it with as much information as possible from every available context.
Since it is very challenging for most teams to build their own models trained on their own data, hallucination cannot really be avoided with general-purpose LLMs unless they are trained on our data sets.
That is the chatbot we built to avoid hallucination as much as we could.
I use LLMs daily and get crappy results more often than not, but I had the impression that would be normal, as the training data can be contradictory.
The trick isn't in how to spot the lies, but how to properly apply them. We cannot teach the AI how not to lie, without first teaching it when it must lie, and then how to apply the lie properly.
"AI, tell me, do these jeans make me look fat?"
AI: NO. You are fat. The jeans are fine.
That is not an acceptable reply. Learning when and how to apply semantic truth-stretching is imperative.
They must first understand where and when, then how, and finally why.
It's how we teach our young. Isn't it?
Computers cannot "hallucinate."