GPTs and Hallucination
Large language models, such as GPTs, generate coherent text but can produce hallucinations, leading to misinformation. Trust in their outputs is shifting from expert validation to crowdsourced consensus, affecting accuracy.
Large language models (LLMs), such as GPTs, have transformed human-AI interactions by generating coherent text and performing various tasks. However, they are prone to "hallucinations," where they produce responses that appear realistic but are factually incorrect or nonsensical. This phenomenon can lead to the spread of misinformation, particularly in critical decision-making contexts. The underlying mechanism of LLMs involves training on vast datasets, resulting in probabilistic models that predict word sequences based on co-occurrence rather than factual accuracy. This raises questions about epistemic trust—how we determine the truth of language claims. Traditional trust mechanisms rely on expert validation, while crowdsourcing offers a more democratic approach, allowing collective input to shape knowledge. The article posits that LLMs represent a shift from expert-based to crowd-based trust, generating responses based on the most common answers found online. The likelihood of hallucinations increases with obscure or controversial topics, as these areas lack sufficient training data. An experiment tested various prompts across different models to assess their accuracy, revealing that responses were more reliable when consensus existed in the training data. The findings suggest that while LLMs can often provide accurate information, their limitations become apparent in less common or contentious subjects.
- LLMs like GPTs can generate coherent text but are prone to hallucinations.
- Hallucinations can lead to misinformation, especially in critical contexts.
- Trust in language claims has evolved from expert validation to crowdsourced consensus.
- The likelihood of hallucinations increases with obscure or controversial topics.
- Experimentation shows that LLMs perform better on widely accepted topics.
Related
Detecting hallucinations in large language models using semantic entropy
Researchers devised a method to detect hallucinations in large language models like ChatGPT and Gemini by measuring semantic entropy. This approach enhances accuracy by filtering unreliable answers, improving model performance significantly.
Large Language Models are not a search engine
Large Language Models (LLMs) from Google and Meta generate algorithmic content, causing nonsensical "hallucinations." Companies struggle to manage errors post-generation due to factors like training data and temperature settings. LLMs aim to improve user interactions but raise skepticism about delivering factual information.
Overcoming the Limits of Large Language Models
Large language models (LLMs) like chatbots face challenges such as hallucinations, lack of confidence estimates, and citations. MIT researchers suggest strategies like curated training data and diverse worldviews to enhance LLM performance.
> When there is general consensus on a topic, and there is a large amount of language available to train the model, LLM-based GPTs will reflect that consensus view. But in cases where there are not enough examples of language about a subject, or the subject is controversial, or there is no clear consensus on the topic, relying on these systems will lead to questionable results.
This makes a lot of intuitive sense, just from trying to use these tools to accelerate Terraform module development in a production setting - Terraform, particularly HCL, should be something LLMs are extremely good at. It's very structured, the documentation is broadly available, and tons of examples and oodles of open source stuff exist out there.
It is pretty good at parsing/generating HCL/Terraform for most common providers. However, about 10-20% of the time, it will completely make up fields or values that don't exist or don't work but look plausible enough to seem right - e.g., mixing up a resource ARN with a resource id, or turning "ssl_config" into something like "ssl_configuration" and leaving you puzzling for 20 minutes over what's wrong with it.
Another thing it will constantly do is mix up versions - Terraform providers change often, deprecate things all the time, and there are a lot of differences in how to do things even between different Terraform versions. So, by my observation in this specific scenario, the author's intuition rings completely correct. I'll let people better at math than me pick it apart, though.
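To make the "plausible but nonexistent field" failure mode concrete, here is a rough sketch (my own, not from the article) of how you might catch it mechanically: dump the provider schema with `terraform providers schema -json` and flag attributes in generated config that the schema doesn't define. The file names and the use of the python-hcl2 package are assumptions for illustration.

```python
# Sketch: flag attribute names in generated Terraform that the provider schema
# doesn't define (e.g. a hallucinated "ssl_configuration").
# Assumes you've run `terraform providers schema -json > schema.json` in an
# initialized working directory, and that python-hcl2 is installed.
import json
import hcl2  # pip install python-hcl2


def known_attributes(schema_path):
    """Map resource type -> set of attribute names the providers actually define."""
    with open(schema_path) as f:
        schema = json.load(f)
    known = {}
    for provider in schema.get("provider_schemas", {}).values():
        for rtype, rschema in provider.get("resource_schemas", {}).items():
            known[rtype] = set(rschema.get("block", {}).get("attributes", {}))
    return known


def suspicious_fields(tf_path, known):
    """Yield (resource_type, name, field) for top-level fields missing from the schema.

    Nested blocks live under block_types in the schema and aren't checked here,
    so treat hits as candidates to review, not definite hallucinations.
    """
    with open(tf_path) as f:
        config = hcl2.load(f)
    for block in config.get("resource", []):
        for rtype, instances in block.items():
            for name, body in instances.items():
                for field in body:
                    if rtype in known and field not in known[rtype]:
                        yield rtype, name, field


if __name__ == "__main__":
    known = known_attributes("schema.json")
    for rtype, name, field in suspicious_fields("main.tf", known):
        print(f"{rtype}.{name}: unknown field '{field}' (possible hallucination)")
```

`terraform validate` catches much of this too, but a schema diff like this makes it obvious when the model invented an attribute name outright.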
final edit: Although I love the idea of this experiment, it seems like it's definitely missing a "control" response - a response that isn't supposed to change over time.
Papers like this really need to include the actual version numbers. GPT-4 or GPT-4o, and which dated version? Llama 2 or 3 or 3.1, quantized or not? Google Gemini 1.0 or 1.5?
Also, what's Llama-lib? Do they mean llama.cpp?
Even more importantly: was this the Gemini model or was it Gemini+Google Search? The "through the free Google service" part could mean either.
UPDATE: They do clarify that a little bit here:
> Each of these prompts was posed to each model every week from March 27, 2024, to April 29, 2024. The prompts were presented sequentially in a single chat session and were also tested in an isolated chat session to view context dependency.
Llama 3 came out on April 18th, so I guess they used Llama 2?
(Testing the prompts sequentially in a single chat feels like an inadvisable choice to me - they later note that things like "answer in three words" sometimes leaked through to the following prompt, which isn't surprising given how LLM chat sessions work.)
Not sure if this is because of better training, Claude Sonnet 3.5 being better about hallucinations (previously I've used ChatGPT 4 almost exclusively), or what.
Why would a language model do anything other than "hallucinate" (i.e. generate words without any care about truthiness)? These aren't expert systems dealing in facts; they are statistical word generators dealing in word statistics.
The useful thing of course is that LLMs often do generate "correct" continuations/replies, specifically when that's predicted by the training data, but it's not like they have a choice of not answering or saying "I don't know" in other cases. They are just statistical word generators - sometimes that's useful, and sometimes it's not, but it's just what they are.
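A minimal sketch of what "statistical word generator" means in practice (the vocabulary and logits below are made up for illustration): the model emits scores over a vocabulary, and the next token is simply sampled from the resulting distribution. Nothing in this loop checks whether the continuation is true.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Softmax the logits and draw one token id; truth never enters into it."""
    z = logits / temperature
    z = z - z.max()                        # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

# Toy vocabulary and illustrative logits for the prompt "The capital of France is"
vocab = ["Paris", "Lyon", "London", "purple"]
logits = np.array([5.0, 2.0, 1.0, -3.0])
print(vocab[sample_next_token(logits)])    # usually "Paris", occasionally not
```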
This is not accurate. LLMs will always "hallucinate" because the amount of information the model can encode is orders of magnitude smaller than the factual information contained in the training set. Even granting that semantic compression could shrink the model below the theoretical compression limit, Shannon entropy still applies. You cannot fit the informational content required for them to be accurate into models of this size.
This obviously applies to chain-of-thought or N-shot reasoning as well: intermediate steps chained together still contain only this fixed amount of entropy. It slightly amazes me that the community most likely to talk about computational complexity will call these general reasoners, when we know that reasoning has computational complexity and an LLM's cost is purely linear in the tokens emitted.
Those claiming LLMs will overcome hallucinations have to argue that the polynomial or NP-hard complexity of intermediate reasoning steps will be well covered by a fixed-size training set. That's a bet I wouldn't take, because it's obviously impossible on both information-storage and computational-complexity grounds.
When an LLM is prompted, it generates a response by predicting the most probable continuation or completion of the input. It considers the context provided by the input and generates a response that is coherent, relevant, and contextually appropriate but not necessarily correct.
I like the crowdsourcing metaphor. Back when crowdsourcing was the next big thing in application development, there was always a curatorial process that filtered out low-quality content and then distilled the "wisdom of the crowds" into more actionable results. For AI, that would be called supervised learning, which definitely increases the costs.
I think that unbiased and authentic experimentation and measurement of hallucinations in generative AI is important and hope that this effort continues. I encourage the folks here to participate in that in order to monitor the real value that LLMs provide and also as an ongoing reminder that human review and supervision will always be a necessity.
It seems very natural to me that large advances in reasoning and logic in AI should come at the expense of output predictability and absolute precision.
Creativity is hallucination when you do want it.
A lot of the "reduction" of hallucination is management of logprobs, where fancy samplers like min_p do more to improve LLM performance than most other techniques, despite no one in the VC world knowing or caring about them.
If you don't believe me, check out how radically different an LLM's outputs are with even slightly different sampling settings: https://artefact2.github.io/llm-sampling/index.xhtml
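For the curious, here is a rough sketch of the min_p idea as I understand it (the threshold and logits are illustrative): keep only tokens whose probability is at least min_p times the top token's probability, renormalize, and sample. Even small changes to that threshold, or to temperature, visibly change what the model says.

```python
import numpy as np

rng = np.random.default_rng(0)

def min_p_sample(logits: np.ndarray, min_p: float = 0.1, temperature: float = 1.0) -> int:
    """min_p sampling: drop tokens far less likely than the top token, then sample."""
    z = logits / temperature
    z = z - z.max()
    probs = np.exp(z) / np.exp(z).sum()
    keep = probs >= min_p * probs.max()        # threshold scales with the top token
    filtered = np.where(keep, probs, 0.0)
    filtered = filtered / filtered.sum()       # renormalize over the survivors
    return int(rng.choice(len(filtered), p=filtered))

logits = np.array([4.0, 3.5, 1.0, 0.5, -2.0])  # illustrative logits
print(min_p_sample(logits, min_p=0.1))
```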
In the short story, the weights of the LLM are a brain scan.
But it's the same situation: people could use multiple copies of the AI, but each time they would have to 'talk it into' doing what they wanted.
Indeed, the subjects on which it "hallucinates" are often mundane topics where, in humans, we would attribute the failure to ignorance, i.e. code that doesn't work, facts that are wrong, etc. Not something like "laser beams from Jesus are controlling the president's thoughts", a very contrived example of what in humans we'd actually attribute to hallucination.
idk, I'd rather speculatively invest in "a troubled genius" than "a stupid liar" so there's that
We built a model to detect this, and it does pretty well! Given a context and a claim, it tells how well the context supports the claim. You can check out a demo at https://playground.bespokelabs.ai
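This isn't the Bespoke Labs model or API (see their playground for that); just a generic sketch of the same context-supports-claim idea using an off-the-shelf NLI model from Hugging Face, with an illustrative example pair:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "microsoft/deberta-large-mnli"  # any NLI-finetuned model works here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def claim_support(context: str, claim: str) -> dict:
    """Return label -> probability (entailment / neutral / contradiction)."""
    inputs = tok(context, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    return {model.config.id2label[i]: float(p) for i, p in enumerate(probs)}

print(claim_support(
    context="The Eiffel Tower was completed in 1889 for the World's Fair.",
    claim="The Eiffel Tower was finished in the 19th century.",
))
```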
> Once understood in this way, the question to ask is not, "Why do GPTs hallucinate?", but rather, "Why do they get anything right at all?"
This is the right question. The answers here are entirely unsatisfactory, both from this paper and from the general field of research. We have almost no idea how these things work -- we're at the stage where we learn more from the "golden-gate-bridge" crippled network than we do from understanding how they are trained and how they are architected.
LLMs are clearly not conscious or sentient, but they show emergent behavior that we are not yet capable of explaining. Ten years ago the statement "what distinguishes Man from Animal is that Man has Language" would have seemed totally reasonable, but now we have a second example of a system that uses language, and it is dumbfounding.
The hype around LLMs is just hype -- LLMs are a solution in search of a problem -- but the emergent features of these models is a tantalizing glimpse of what it means to "think" in an evolved system.
The "stochastic continuation" ie parrot model is pernicious. It's doing active harm now to advancing understanding.
It's pernicious, and I mean that precisely, because it is technically accurate yet deeply unhelpful, and indeed actively (intentionally, AFAICT) misleading.
Humans could be described in the same way, just as accurately, and just as unhelpfully.
What's missing? What's missing is one of the gross features of LLMs: their interior layers.
If you don't understand what is necessarily transpiring in those layers, you don't understand what they're doing; and treating them as a black box that does something you imagine to be glorified Markov chain computation leads you deep into the wilderness of cognitive error. You're reasoning from a misleading model.
If you want a better mental model for what they are doing, you need to take seriously that the "tokens" LLMs consume and emit are being converted into something else, processed, and then the output of that process is re-serialized and rendered into tokens. In lay language it's less misleading and more helpful to put this directly: they extract semantic meaning as propositions or descriptions about a world they have an internalized model of; compute a solution (answer) to questions or requests posed with respect to that world model; and then convert their solution into a serialized token stream.
The complaint that they do not "understand" is correct, but not in the way people usually think. It's not that they do not have understanding in some real sense; it's that the world model they construct, inhabit, and reason about, is a flatland: it's static and one dimensional.
My rant here leads to a very testable proposition: that deep multi-modal models, particularly those for which time-based media are native, will necessarily have a much richer (more multidimensional) derived world model, one that understands (my word) that a shoe is not just an opaque token, but a thing of such and such scale and composition and utility and application, representing a function as much as a design.
When we teach models about space, time, the things that inhabit that, and what it means to have agency among them—well, what we will have, using technology we already have, is something which I will contentedly assert is undeniably a mind.
What's more provocative yet is that systems of this complexity, which necessarily construct a world model, are only able to do what they do because they have a self-model within it.
And having a self-model, within a world model, and agency?
That is self-hood. That is personhood. That is the substrate as best we understand for self-awareness.
Scoff if you like, bookmark if you will—this will be commonly accepted within five years.
> Each of these prompts was posed to each model every week from March 27, 2024, to April 29, 2024. The prompts were presented sequentially in a single chat session
Oh my god... rather than starting a new chat for each different prompt in their test, and each week, it sounds like they did the prompts back to back in a single chat. What a complete waste of a potentially good study. The results are fundamentally flawed by the biases that are introduced by past content in the context window.
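For anyone who wants to see the difference, here is a sketch of the two set-ups using the OpenAI Python SDK (the model name and prompts are illustrative, and an API key is assumed to be configured): re-using one growing message list lets earlier prompts and answers bias later ones, while a fresh list per prompt keeps each response independent.

```python
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

client = OpenAI()
prompts = ["Who wrote Hamlet?", "Answer in three words: what is HTTP?"]

# Single chat session: every prompt sees all previous prompts and replies.
history = []
for p in prompts:
    history.append({"role": "user", "content": p})
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    history.append({"role": "assistant", "content": reply.choices[0].message.content})

# Isolated sessions: each prompt gets a clean context window.
for p in prompts:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": p}],
    )
    print(reply.choices[0].message.content)
```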
The verb "hallucinate" has always meant experiencing false sensations or perceptions, not saying false things. If a person were to speak to you without regard for whether what they said was true, you'd say they were bullshitting you, not hallucinating.