GPTs and Hallucination
Large language models, such as GPTs, generate coherent text but can produce hallucinations, leading to misinformation. Trust in their outputs is shifting from expert validation to crowdsourced consensus, affecting accuracy.
Large language models (LLMs), such as GPTs, have transformed human-AI interactions by generating coherent text and performing various tasks. However, they are prone to "hallucinations," where they produce responses that appear realistic but are factually incorrect or nonsensical. This phenomenon can lead to the spread of misinformation, particularly in critical decision-making contexts. The underlying mechanism of LLMs involves training on vast datasets, resulting in probabilistic models that predict word sequences based on co-occurrence rather than factual accuracy. This raises questions about epistemic trust—how we determine the truth of language claims. Traditional trust mechanisms rely on expert validation, while crowdsourcing offers a more democratic approach, allowing collective input to shape knowledge. The article posits that LLMs represent a shift from expert-based to crowd-based trust, generating responses based on the most common answers found online. The likelihood of hallucinations increases with obscure or controversial topics, as these areas lack sufficient training data. An experiment tested various prompts across different models to assess their accuracy, revealing that responses were more reliable when consensus existed in the training data. The findings suggest that while LLMs can often provide accurate information, their limitations become apparent in less common or contentious subjects.
- LLMs like GPTs can generate coherent text but are prone to hallucinations.
- Hallucinations can lead to misinformation, especially in critical contexts.
- Trust in language claims has evolved from expert validation to crowdsourced consensus.
- The likelihood of hallucinations increases with obscure or controversial topics.
- Experimentation shows that LLMs perform better on widely accepted topics.
Related
Detecting hallucinations in large language models using semantic entropy
Researchers devised a method to detect hallucinations in large language models like ChatGPT and Gemini by measuring semantic entropy. This approach enhances accuracy by filtering unreliable answers, improving model performance significantly.
Large Language Models are not a search engine
Large Language Models (LLMs) from Google and Meta generate algorithmic content, causing nonsensical "hallucinations." Companies struggle to manage errors post-generation due to factors like training data and temperature settings. LLMs aim to improve user interactions but raise skepticism about delivering factual information.
Overcoming the Limits of Large Language Models
Large language models (LLMs) like chatbots face challenges such as hallucinations, lack of confidence estimates, and citations. MIT researchers suggest strategies like curated training data and diverse worldviews to enhance LLM performance.
> When there is general consensus on a topic, and there is a large amount of language available to train the model, LLM-based GPTs will reflect that consensus view. But in cases where there are not enough examples of language about a subject, or the subject is controversial, or there is no clear consensus on the topic, relying on these systems will lead to questionable results.
This makes a lot of intuitive sense, just from trying to use these tools to accelerate Terraform module development in a production setting - Terraform, particularly HCL, should be something LLMs are extremely good at. It's very structured, the documentation is broadly available, and tons of examples and oodles of open source stuff exist out there.
It is pretty good at parsing/generating HCL/Terraform for most common providers. However, about 10-20% of the time, it will completely make up fields or values that don't exist or don't work but look plausible enough to seem right - e.g., mixing up a resource ARN with a resource id, or turning "ssl_config" into something like "ssl_configuration" and leaving you puzzling for 20 minutes over what's wrong with it.
Another thing it will constantly do is mix up versions - Terraform providers change often, deprecate things all the time, and there are a lot of differences in how to do things even between different Terraform versions. So, by my observation in this specific scenario, the author's intuition rings completely correct. I'll let people better at math than me pick it apart, though.
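To make the "plausible but nonexistent field" failure mode concrete, here is a rough sketch (my own, not from the article) of how you might catch it mechanically: dump the provider schema with `terraform providers schema -json` and flag attributes in generated config that the schema doesn't define. The file names and the use of the python-hcl2 package are assumptions for illustration.

```python
# Sketch: flag attribute names in generated Terraform that the provider schema
# doesn't define (e.g. a hallucinated "ssl_configuration").
# Assumes you've run `terraform providers schema -json > schema.json` in an
# initialized working directory, and that python-hcl2 is installed.
import json
import hcl2  # pip install python-hcl2


def known_attributes(schema_path):
    """Map resource type -> set of attribute names the providers actually define."""
    with open(schema_path) as f:
        schema = json.load(f)
    known = {}
    for provider in schema.get("provider_schemas", {}).values():
        for rtype, rschema in provider.get("resource_schemas", {}).items():
            known[rtype] = set(rschema.get("block", {}).get("attributes", {}))
    return known


def suspicious_fields(tf_path, known):
    """Yield (resource_type, name, field) for top-level fields missing from the schema.

    Nested blocks live under block_types in the schema and aren't checked here,
    so treat hits as candidates to review, not definite hallucinations.
    """
    with open(tf_path) as f:
        config = hcl2.load(f)
    for block in config.get("resource", []):
        for rtype, instances in block.items():
            for name, body in instances.items():
                for field in body:
                    if rtype in known and field not in known[rtype]:
                        yield rtype, name, field


if __name__ == "__main__":
    known = known_attributes("schema.json")
    for rtype, name, field in suspicious_fields("main.tf", known):
        print(f"{rtype}.{name}: unknown field '{field}' (possible hallucination)")
```

`terraform validate` catches much of this too, but a schema diff like this makes it obvious when the model invented an attribute name outright.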
final edit: Although I love the idea of this experiment, it seems like it's definitely missing a "control" response - a response that isn't supposed to change over time.
Papers like this really need to include the actual version numbers. GPT-4 or GPT-4o, and which dated version? Llama 2 or 3 or 3.1, quantized or not? Google Gemini 1.0 or 1.5?
Also, what's Llama-lib? Do they mean llama.cpp?
Even more importantly: was this the Gemini model or was it Gemini+Google Search? The "through the free Google service" part could mean either.
UPDATE: They do clarify that a little bit here:
> Each of these prompts was posed to each model every week from March 27, 2024, to April 29, 2024. The prompts were presented sequentially in a single chat session and were also tested in an isolated chat session to view context dependency.
Llama 3 came out on April 18th, so I guess they used Llama 2?
(Testing the prompts sequentially in a single chat feels like an inadvisable choice to me - they later note that things like "answer in three words" sometimes leaked through to the following prompt, which isn't surprising given how LLM chat sessions work.)
Not sure if this is because of better training, Claude Sonnet 3.5 being better about hallucinations (previously I've used ChatGPT 4 almost exclusively), or what.
Why would a language model do anything other than "hallucinate" (i.e. generate words without any care about truthiness)? These aren't expert systems dealing in facts; they are statistical word generators dealing in word statistics.
The useful thing of course is that LLMs often do generate "correct" continuations/replies, specifically when that's predicted by the training data, but it's not like they have a choice of not answering or saying "I don't know" in other cases. They are just statistical word generators - sometimes that's useful, and sometimes it's not, but it's just what they are.
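A minimal sketch of what "statistical word generator" means in practice (the vocabulary and logits below are made up for illustration): the model emits scores over a vocabulary, and the next token is simply sampled from the resulting distribution. Nothing in this loop checks whether the continuation is true.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Softmax the logits and draw one token id; truth never enters into it."""
    z = logits / temperature
    z = z - z.max()                        # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

# Toy vocabulary and illustrative logits for the prompt "The capital of France is"
vocab = ["Paris", "Lyon", "London", "purple"]
logits = np.array([5.0, 2.0, 1.0, -3.0])
print(vocab[sample_next_token(logits)])    # usually "Paris", occasionally not
```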
This is not accurate. LLMs will always "hallucinate" because the amount of information the model can encode is orders of magnitude smaller than the factual information contained in the training set. Even granting that semantic compression could shrink the model below the theoretical compression limit, Shannon entropy still applies. You cannot fit the informational content required for them to be accurate into models of this size.
This obviously applies to chain-of-thought or N-shot reasoning as well: intermediate steps chained together still contain only this fixed amount of entropy. It slightly amazes me that the community most likely to talk about computational complexity will call these general reasoners, when we know that reasoning has computational complexity and an LLM's cost is purely linear in the tokens emitted.
Those claiming LLMs will overcome hallucinations have to argue that the polynomial or NP-hard complexity of intermediate reasoning steps will be well covered by a fixed-size training set. That's a bet I wouldn't take, because it's obviously impossible on both information-storage and computational-complexity grounds.
When an LLM is prompted, it generates a response by predicting the most probable continuation or completion of the input. It considers the context provided by the input and generates a response that is coherent, relevant, and contextually appropriate but not necessarily correct.
I like the crowdsourcing metaphor. Back when crowdsourcing was the next big thing in application development, there was always a curatorial process that filtered out low-quality content and then distilled the "wisdom of the crowds" into more actionable results. For AI, that would be called supervised learning, which definitely increases the costs.
I think that unbiased and authentic experimentation and measurement of hallucinations in generative AI is important and hope that this effort continues. I encourage the folks here to participate in that in order to monitor the real value that LLMs provide and also as an ongoing reminder that human review and supervision will always be a necessity.
It seems very natural to me that large advances in reasoning and logic in AI should come at the expense of output predictability and absolute precision.
Creativity is hallucination when you do want it.
A lot of the "reduction" of hallucination is management of logprobs, where fancy samplers like min_p do more to improve LLM performance than most other techniques, despite no one in the VC world knowing or caring about them.
If you don't believe me, check out how radically different an LLM's outputs are with even slightly different sampling settings: https://artefact2.github.io/llm-sampling/index.xhtml
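For the curious, here is a rough sketch of the min_p idea as I understand it (the threshold and logits are illustrative): keep only tokens whose probability is at least min_p times the top token's probability, renormalize, and sample. Even small changes to that threshold, or to temperature, visibly change what the model says.

```python
import numpy as np

rng = np.random.default_rng(0)

def min_p_sample(logits: np.ndarray, min_p: float = 0.1, temperature: float = 1.0) -> int:
    """min_p sampling: drop tokens far less likely than the top token, then sample."""
    z = logits / temperature
    z = z - z.max()
    probs = np.exp(z) / np.exp(z).sum()
    keep = probs >= min_p * probs.max()        # threshold scales with the top token
    filtered = np.where(keep, probs, 0.0)
    filtered = filtered / filtered.sum()       # renormalize over the survivors
    return int(rng.choice(len(filtered), p=filtered))

logits = np.array([4.0, 3.5, 1.0, 0.5, -2.0])  # illustrative logits
print(min_p_sample(logits, min_p=0.1))
```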
In the short story, the weights of the LLM are a brain scan.
But it's the same situation: people could use multiple copies of the AI, but each time they would have to 'talk it into' doing what they wanted.
Indeed, the subjects on which it "hallucinates" are often mundane topics where, in humans, we would attribute the failure to ignorance, i.e. code that doesn't work, facts that are wrong, etc. Not something like "laser beams from Jesus are controlling the president's thoughts", a very contrived example of what in humans we'd actually attribute to hallucination.
idk, I'd rather speculatively invest in "a troubled genius" than "a stupid liar" so there's that
We built a model to detect this, and it does pretty well! Given a context and a claim, it tells how well the context supports the claim. You can check out a demo at https://playground.bespokelabs.ai
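This isn't the Bespoke Labs model or API (see their playground for that); just a generic sketch of the same context-supports-claim idea using an off-the-shelf NLI model from Hugging Face, with an illustrative example pair:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "microsoft/deberta-large-mnli"  # any NLI-finetuned model works here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def claim_support(context: str, claim: str) -> dict:
    """Return label -> probability (entailment / neutral / contradiction)."""
    inputs = tok(context, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    return {model.config.id2label[i]: float(p) for i, p in enumerate(probs)}

print(claim_support(
    context="The Eiffel Tower was completed in 1889 for the World's Fair.",
    claim="The Eiffel Tower was finished in the 19th century.",
))
```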
> Once understood in this way, the question to ask is not, "Why do GPTs hallucinate?", but rather, "Why do they get anything right at all?"
This is the right question. The answers here are entirely unsatisfactory, both from this paper and from the general field of research. We have almost no idea how these things work -- we're at the stage where we learn more from the "golden-gate-bridge" crippled network than we do from understanding how they are trained and how they are architected.
LLMs are clearly not conscious or sentient, but they show emergent behavior that we are not yet capable of explaining. Ten years ago the statement "what distinguishes Man from Animal is that Man has Language" would have seemed totally reasonable, but now we have a second example of a system that uses language, and it is dumbfounding.
The hype around LLMs is just hype -- LLMs are a solution in search of a problem -- but the emergent features of these models is a tantalizing glimpse of what it means to "think" in an evolved system.
The "stochastic continuation" ie parrot model is pernicious. It's doing active harm now to advancing understanding.
It's pernicious, and I mean that precisely, because it is technically accurate yet deeply unhelpful, and indeed actively (intentionally, AFAICT) misleading.
Humans could be described in the same way, just as accurately, and just as unhelpfully.
What's missing? What's missing is one of the gross features of LLMs: their interior layers.
If you don't understand what is necessarily transpiring in those layers, you don't understand what they're doing; and treating them as a black box that does something you imagine to be glorified Markov chain computation leads you deep into the wilderness of cognitive error. You're reasoning from a misleading model.
If you want a better mental model for what they are doing, you need to take seriously that the "tokens" LLMs consume and emit are being converted into something else, processed, and then the output of that process is re-serialized and rendered into tokens. In lay language it's less misleading and more helpful to put this directly: they extract semantic meaning as propositions or descriptions about a world they have an internalized model of; compute a solution (answer) to questions or requests posed with respect to that world model; and then convert their solution into a serialized token stream.
The complaint that they do not "understand" is correct, but not in the way people usually think. It's not that they do not have understanding in some real sense; it's that the world model they construct, inhabit, and reason about, is a flatland: it's static and one dimensional.
My rant here leads to a very testable proposition: that deep multi-modal models, particularly those for which time-based media are native, will necessarily have a much richer (more multidimensional) derived world model, one that understands (my word) that a shoe is not just an opaque token, but a thing of such and such scale and composition and utility and application, representing a function as much as a design.
When we teach models about space, time, the things that inhabit that, and what it means to have agency among them—well, what we will have, using technology we already have, is something which I will contentedly assert is undeniably a mind.
What's more provocative yet is that systems of this complexity, which necessarily construct a world model, are only able to do what they do because they have a self-model within it.
And having a self-model, within a world model, and agency?
That is self-hood. That is personhood. That is the substrate as best we understand for self-awareness.
Scoff if you like, bookmark if you will—this will be commonly accepted within five years.
> Each of these prompts was posed to each model every week from March 27, 2024, to April 29, 2024. The prompts were presented sequentially in a single chat session
Oh my god... rather than starting a new chat for each different prompt in their test, and each week, it sounds like they did the prompts back to back in a single chat. What a complete waste of a potentially good study. The results are fundamentally flawed by the biases that are introduced by past content in the context window.
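For anyone who wants to see the difference, here is a sketch of the two set-ups using the OpenAI Python SDK (the model name and prompts are illustrative, and an API key is assumed to be configured): re-using one growing message list lets earlier prompts and answers bias later ones, while a fresh list per prompt keeps each response independent.

```python
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

client = OpenAI()
prompts = ["Who wrote Hamlet?", "Answer in three words: what is HTTP?"]

# Single chat session: every prompt sees all previous prompts and replies.
history = []
for p in prompts:
    history.append({"role": "user", "content": p})
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    history.append({"role": "assistant", "content": reply.choices[0].message.content})

# Isolated sessions: each prompt gets a clean context window.
for p in prompts:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": p}],
    )
    print(reply.choices[0].message.content)
```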
The verb "hallucinate" has always meant experiencing false sensations or perceptions, not saying false things. If a person were to speak to you without regard for whether what they said was true, you'd say they were bullshitting you, not hallucinating.