Detecting hallucinations in large language models using semantic entropy
Researchers devised a method to detect hallucinations in large language models such as ChatGPT and Gemini by measuring semantic entropy. By filtering out answers the model is likely to confabulate, the approach noticeably improves question-answering accuracy.
Researchers have developed a method to detect hallucinations, specifically confabulations, in large language models (LLMs) like ChatGPT and Gemini. These hallucinations produce incorrect and arbitrary outputs, posing risks in many fields. The method measures semantic entropy to identify when an LLM is likely to generate unreliable answers: by clustering sampled answers with similar meanings, it can pinpoint confabulations without prior knowledge of the task and without labeled examples. It improves question-answering accuracy by declining to answer questions prone to confabulation. Evaluated across different domains and LLM sizes, the approach proves robust at detecting confabulations, outperforms supervised techniques, and enhances model performance by filtering out uncertain responses, offering a valuable tool for improving the reliability of LLMs in free-form text generation.
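For the flavor of the method, here is a minimal sketch of the clustering-and-entropy idea: sample several answers, group the ones that mean the same thing, and take the entropy over the groups. The `sample_answers` and `entails` callables are hypothetical placeholders (the paper decides whether two answers share a meaning via bidirectional entailment with an NLI model):

```python
import math

def semantic_entropy(question, sample_answers, entails, n_samples=10):
    """Discrete semantic entropy over sampled answers (illustrative sketch).

    sample_answers(question, n) -> list of n sampled answer strings  (placeholder)
    entails(a, b) -> bool, does answer a entail answer b             (placeholder,
                     e.g. backed by an NLI model)
    """
    answers = sample_answers(question, n_samples)

    # Greedily group answers into semantic-equivalence clusters: two answers
    # share a cluster if each entails the other.
    clusters = []
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]
            if entails(ans, rep) and entails(rep, ans):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])

    # Estimate cluster probabilities from frequencies and take the entropy:
    # many clusters with similar mass -> high semantic entropy -> likely confabulation.
    probs = [len(c) / len(answers) for c in clusters]
    return -sum(p * math.log(p) for p in probs)
```

A high semantic entropy flags the question as one the model should probably decline rather than answer.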
Related
Optimizing AI Inference at Character.ai
Character.AI optimizes AI inference for LLMs, handling 20,000+ queries/sec globally. Innovations like Multi-Query Attention and int8 quantization reduced serving costs by 33x since late 2022, aiming to enhance AI capabilities worldwide.
GitHub – Karpathy/LLM101n: LLM101n: Let's Build a Storyteller
The GitHub repository "LLM101n: Let's build a Storyteller" offers a course on creating a Storyteller AI Large Language Model using Python, C, and CUDA. It caters to beginners, covering language modeling, deployment, programming, data types, deep learning, and neural nets. Additional chapters and appendices are available for further exploration.
Lessons About the Human Mind from Artificial Intelligence
In 2022, a Google engineer claimed AI chatbot LaMDA was self-aware, but further scrutiny revealed it mimicked human-like responses without true understanding. This incident underscores AI limitations in comprehension and originality.
Delving into ChatGPT usage in academic writing through excess vocabulary
A study by Dmitry Kobak et al. examines ChatGPT's impact on academic writing, finding increased usage in PubMed abstracts. Concerns arise over accuracy and bias despite advanced text generation capabilities.
Francois Chollet – LLMs won't lead to AGI – $1M Prize to find solution [video]
The video discusses limitations of large language models in AI, emphasizing genuine understanding and problem-solving skills. A prize incentivizes AI systems showcasing these abilities. Adaptability and knowledge acquisition are highlighted as crucial for true intelligence.
For any given input text there is a corresponding output distribution (e.g., the probabilities over token sequences from which the model draws samples).
The problem with drawing several samples and evaluating the entropy and/or disagreement between those draws is that it relies on already knowing the properties of the output distribution. One distribution may legitimately be much more uniformly random than another that has high certainty. It's not clear to me that they have demonstrated this underlying assumption.
Take celebrity info, for example: "What is Tom Cruise known for?" The phrases "movie star", "Katie Holmes", "Top Gun", and "Scientology" sit in quite different places in the word vector space and would score low on semantic similarity, yet all are accurate outputs.
On the other hand, for "What is Taylor Swift known for?", the answers "standup comedy", "comedian", and "comedy actress" are semantically similar but represent hallucinations. Without knowing the distribution's characteristics (e.g., multivariate moments and estimates), we couldn't say for certain whether answers are correct merely from their proximity in vector space.
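To make that concrete, here is a rough sketch that scores pairwise similarity of those candidate answers with an off-the-shelf sentence-embedding model. The sentence-transformers package and the model name are assumptions for illustration, not something from the paper, and the paper itself clusters by bidirectional entailment rather than raw embedding distance:

```python
from itertools import combinations
from sentence_transformers import SentenceTransformer, util  # assumed installed

# Example model choice; any sentence-embedding model would do.
model = SentenceTransformer("all-MiniLM-L6-v2")

cruise = ["movie star", "Katie Holmes", "Top Gun", "Scientology"]  # diverse but all true
swift = ["standup comedy", "comedian", "comedy actress"]           # similar but all false

for label, answers in [("Tom Cruise", cruise), ("Taylor Swift", swift)]:
    emb = model.encode(answers, convert_to_tensor=True)
    sims = [util.cos_sim(emb[i], emb[j]).item()
            for i, j in combinations(range(len(answers)), 2)]
    print(label, "mean pairwise cosine similarity:", round(sum(sims) / len(sims), 3))
```

The diverse-but-true Tom Cruise answers would tend to score lower average similarity than the similar-but-false Taylor Swift answers, which is exactly the commenter's worry: proximity alone does not certify correctness.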
As some have pointed out in this thread, knowing the correct distribution of word sequences for a given input sequence is the very job the LLM is solving, so there is no way of evaluating the output distribution to determine its correctness.
There are actual statistical models to evaluate the amount of uncertainty in output from ANNs (albeit a bit limited), but they are probably not feasible at the scale of LLMs. Perhaps a layer or two could be used to create a partial estimate of uncertainty (e.g. final 2 layers), but this would be a severe truncation of overall network uncertainty.
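One classic example of such a technique is Monte Carlo dropout, which estimates predictive uncertainty by averaging several stochastic forward passes. The sketch below assumes a generic torch.nn.Module classifier containing dropout layers and only illustrates the idea; as the comment notes, applying this to a full LLM is another matter:

```python
import torch

def mc_dropout_uncertainty(model, x, n_passes=20):
    """Monte Carlo dropout: average several stochastic forward passes and use
    the predictive entropy as an uncertainty score. `model` is assumed to be a
    torch.nn.Module classifier that contains dropout layers.
    """
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_passes)]
        )
    mean = probs.mean(dim=0)
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)
    return mean, entropy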
Another reason I mention this is most hallucinations I encounter are very plausible and often close to the right thing (swapping a variable name, confabulating a config key), which appear very convincing and "in sample", but are actually incorrect.
The process could look like this:
1. Use existing large models to convert the same dataset they were trained on into formal logical relationships, letting them generate multiple candidate solutions.
2. Take this enriched dataset and train a new LLM that outputs not only the next token but also the formal relationships between prior knowledge and the newly generated text.
3. The network can then optimize its weights until the generated formal code scores highly with a proof checker, alongside the usual token-prediction objective (a rough sketch of such a combined objective follows the list).
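As a purely hypothetical illustration of step 3, this is what mixing a token-prediction loss with a proof-checker signal might look like; `proof_checker` is a placeholder, and nothing here describes an existing system:

```python
import torch
import torch.nn.functional as F

def combined_objective(token_logits, token_targets, formal_statements,
                       proof_checker, alpha=0.5):
    """Hypothetical objective mixing next-token prediction with a proof-checker
    signal, as in step 3 above.

    token_logits:      (batch, seq, vocab) model outputs
    token_targets:     (batch, seq) gold next tokens
    formal_statements: generated formal-logic strings, one per example
    proof_checker:     placeholder callable returning 1.0 if a statement checks out
    """
    lm_loss = F.cross_entropy(
        token_logits.reshape(-1, token_logits.size(-1)),
        token_targets.reshape(-1),
    )
    # The proof-checker term is non-differentiable as written; in practice it
    # would have to enter training via an RL-style reward or a learned verifier.
    proof_acc = sum(proof_checker(s) for s in formal_statements) / len(formal_statements)
    return lm_loss - alpha * proof_acc  # reward high proof accuracy
```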
In my own mind I feel language is secondary; it's not the base of my intelligence. The base seems more like a dreamy simulation where things are consistent with each other, and language is just what I use to describe it.
One expression of that idea is in this paper: https://link.springer.com/article/10.1007/s10676-024-09775-5
The only way to know if it did “hallucinate” is to already know the correct answer. If you can make a system that knows when an answer is right or not, you no longer need the LLM!
Yes, there seems to be a little bit of grokking and the models can be made to approximate step-by-step reasoning a little bit. But 95% of the function of these black boxes is text generation. Not fact generation, not knowledge generation. They are more like improv partners than encyclopedias and everyone in tech knows it.
I don't know whether LLMs misleading people needs a clever answer-entropy solution. It is a very interesting solution that really does seem like it would improve things, effectively attaching certainty scores to statements. But what if we just stopped marketing machine-learning text generators as near-AGI, which they are not? Wouldn't that undo most of the damage, and arguably help us much more?
That's reasonable for questions with a single objective answer. It probably won't help when multiple, equally valid answers are possible.
However, that's good enough for search engine applications.
Of course, altering the input without changing the meaning is the hard part, but doesn't seem entirely infeasible. At the least, you could just ask the LLM to try to alter the input without changing the meaning, although you might end up in a situation where it alters the input in a way that aligns with its own faulty understanding of an input, meaning it could match the hallucinated output better after modification.
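A minimal sketch of that reword-and-compare idea, with every callable a hypothetical placeholder (e.g., `paraphrase` could itself prompt the LLM, `agree` could use an NLI model):

```python
def paraphrase_consistency(question, paraphrase, answer, agree, n_variants=5):
    """Reword the question several times and check whether the answers agree.

    paraphrase(question) -> a reworded question        (placeholder, could prompt the LLM)
    answer(question)     -> the model's answer string  (placeholder)
    agree(a, b)          -> bool, semantic agreement   (placeholder, e.g. an NLI model)
    """
    baseline = answer(question)
    variants = [paraphrase(question) for _ in range(n_variants)]
    agreements = [agree(baseline, answer(v)) for v in variants]
    # Low agreement across rewordings suggests the original answer is unstable,
    # though (as noted above) a model may paraphrase in line with its own
    # misunderstanding and agree with itself anyway.
    return sum(agreements) / len(agreements)
```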
“checking” is being done with another model.
“differently” is being measured with entropy.
It’s an interesting idea to measure certainty this way. The problem remains that the model can be certain in this way and wrong. But the author did say this was a partial solution.
Still, wouldn’t we be able to already produce a confidence score at the model level like this? Instead of a “post-processor”?
So this is basically saying we shouldn't try to estimate entropy over logits, but should be able to learn a function from activations earlier in the network to a degree of uncertainty that would signal (aka be classifiable as) confabulation.
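A minimal sketch of such a probe: a linear classifier mapping intermediate activations to a confabulation label. The data below is a random placeholder; in a real setup the activations would come from a chosen hidden layer and the labels from an accuracy or semantic-entropy signal:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: one hidden-state vector per generated answer (here random)
# and a 0/1 label for whether that answer turned out to be a confabulation.
rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 4096))  # stand-in for real intermediate activations
labels = rng.integers(0, 2, size=1000)       # stand-in for confabulation labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))  # ~0.5 on random data
```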
As an example: while we were building our AI chatbot for Ora2Pg, the main challenge was that we started with OpenAI and several other models. To avoid hallucinations as far as possible, we went through several stages, including PDR and then knowledge graphs, added FAQs, and then used an agentic approach to back it with as much information as possible from every available context.
Since it is very challenging for most teams to build their own models trained on their own data, hallucination cannot really be avoided with general-purpose LLMs unless they are trained on our data sets.
That is the chatbot we built to avoid hallucination as much as we could.
I use LLMs daily and get crappy results more often than not, but I had the impression that would be normal, as the training data can be contradictory.
The trick isn't in how to spot the lies, but how to properly apply them. We cannot teach the AI how not to lie, without first teaching it when it must lie, and then how to apply the lie properly.
"AI, tell me, do these jeans make me look fat?"
AI: NO. You are fat. The jeans are fine.
That is not an acceptable reply. Learning when and how to apply semantic truth-stretching is imperative.
They must first understand where and when, then how, and finally why.
It's how we teach our young. Isn't it?
Computers cannot "hallucinate."