July 18th, 2024

Overcoming the Limits of Large Language Models

Large language models (LLMs), such as those behind chatbots, face challenges including hallucinations, a lack of confidence estimates, and missing citations. MIT researchers suggest strategies like curated training data and diverse worldviews to enhance LLM performance.


Large language models (LLMs) face challenges such as hallucinations, a lack of confidence estimates, and missing citations. Hallucinations occur when LLMs generate content that is convincing but inaccurate. Confidence estimates are crucial for assessing factuality, while citations provide sources for text. Improving LLM chatbots involves addressing these limitations. Strategies include excluding contradictory training data, supervising the training process, and curating a consistent text corpus for training. MIT researchers have explored similar approaches to mitigate biases in LLMs. By gradually expanding training data with consistent text, a more reliable LLM can be developed. Further advancements could involve training models with diverse worldviews. Research is ongoing to enhance LLMs through consistent data bootstrapping methods. The article cites several references on these challenges and on potential solutions for improving LLM performance.


AI: What people are saying
The article on enhancing LLM performance through curated training data and diverse worldviews sparked a discussion on various challenges and potential solutions for LLMs.
  • Hallucinations: Many commenters argue that hallucinations are inherent to LLMs due to their nature of remixing and interpolating data, and cannot be fully eliminated by curated data alone.
  • Data Quality: Some believe that creating a universally coherent dataset is impossible due to the contextual nature of "truth" and the immense effort required for manual curation.
  • Alternative Approaches: Suggestions include using logical reasoning capabilities, agentic systems for feedback loops, and combining different models or parameters to reduce hallucinations.
  • Practical Applications: Commenters discuss the practical use of LLMs in various fields, noting improvements over time and the importance of effective prompting and iterative querying.
  • Philosophical Concerns: There is a debate on whether focusing resources on developing AI chatbots is truly beneficial for society.
23 comments
By @mitthrowaway2 - 6 months
LLMs don't only hallucinate because of mistaken statements in their training data. It just comes hand-in-hand with the model's ability to remix, interpolate, and extrapolate answers to other questions that aren't directly answered in the dataset. For example if I ask ChatGPT a legal question, it might cite as precedent a case that doesn't exist at all (but which seems plausible, being interpolated from cases that do exist). It's not necessarily because it drew that case from a TV episode. It works the same way that GPT-3 wrote news releases that sounded convincing, matching the structure and flow of real articles.

Training only on factual data won't solve this.

Anyway, I can't help but feel saddened sometimes to see our talented people and investment resources being drawn in to developing these AI chatbots. These problems are solvable, but are we really making a better world by solving them?

By @RodgerTheGreat - 6 months
One of the main factors that makes LLMs popular today is that scaling up the models is a simple and (relatively) inexpensive matter of buying compute capacity and scraping together more raw text to train them. Without large and highly diverse training datasets to construct base models, LLMs cannot produce even the superficial appearance of good results.

Manually curating "tidy", properly-licensed and verified datasets is immensely more difficult, expensive, and time-consuming than stealing whatever you can find on the open internet. Wolfram Alpha is one of the more successful attempts in that curation-based direction (using good-old-fashioned heuristic techniques instead of opaque ML models), and while it is very useful and contains a great deal of factual information, it does not conjure appealing fantasies of magical capabilities springing up from thin air and hands-off exponential improvement.

By @nyrikki - 6 months
> ...manually curate a high-quality (consistent) text corpus based on undisputed, well curated wikipedia articles and battle tested scientific literature.

This is based on the mistaken assumption that science is about objective truth.

It is confusing the map for the territory. Scientific models are intended to be useful, not perfect.

Statistical learning vs. symbolic learning is about existential quantification vs. universal quantification, respectively.

All models are wrong, some are useful; this applies even to the most unreasonably accurate ones like QFT and GR.

Spherical cows, no matter how useful, are hotly debated outside of the didactic half-truths of low-level courses.

The corpus that the above seeks doesn't exist in academic circles, only in popular science, where people don't see that practical, useful models are far more important than 'correct' ones.

By @lsy - 6 months
We can't develop a universally coherent data set because what we understand as "truth" is so intensely contextual that we can't hope to cover the amount of context needed to make the things work how we want, not to mention the numerous social situations where writing factual statements would be awkward or disastrous.

Here are a few examples of statements that are not "factual" in the sense of being derivable from a universally coherent data set, and that nevertheless we would expect a useful intelligence to be able to generate:

"There is a region called Hobbiton where someone named Frodo Baggins lives."

"We'd like to announce that Mr. Ousted is transitioning from his role as CEO to an advisory position while he looks for a new challenge. We are grateful to Mr. Ousted for his contributions and will be sad to see him go."

"The earth is round."

"Nebraska is flat."

By @darby_nine - 6 months
Man it seems like the ship has sailed on "hallucination" but it's such a terrible name for the phenomenon we see. It is a major mistake to imply the issue is with perception rather than structural incompetence. Why not just say "incoherent output"? It's actually descriptive and doesn't require bastardizing a word we already find meaningful to mean something completely different.
By @ainoobler - 6 months
The article suggests a useful line of research. Train an LLM to detect logical fallacies and then see if that can be bootstrapped into something useful, because it's pretty clear that all the issues with LLMs come down to the lack of logical capabilities. If an LLM were capable of logical reasoning, then it would be obvious when it was generating made-up nonsense instead of referencing existing sources of consistent information.
By @RamblingCTO - 6 months
My biggest problem with them is that I can't quite get it to behave like I want it to. I built myself a "therapy/coaching" telegram bot (I'm healthy, but like to reflect a lot, no worries). I even built a self-reflecting memory component that generates insights (sometimes spot on, sometimes random af). But the more I use it, the more I notice that neither the memory nor the prompt matters much. I just can't get it to behave like a therapist would. So in other words: I can't find the inputs to achieve a desirable prediction from the SOTA LLMs. And I think that's a bigger problem for them not to be a shallow hype.
By @trte9343r4 - 6 months
> One could spin this idea even further and train several models with radically different world views by curating different training corpi that represent different sets of beliefs / world views.

You can get good results by combining different models in chat, or even the same model with different parameters. The model usually gives up on hallucinations when challenged. Sometimes it pushes back and provides an explanation with sources.

I have a script that puts models into dialog, moderates discussion and takes notes. I run this stuff overnight, so getting multiple choices speeds up iteration.

By @fatbird - 6 months
In my mind LLMs are already fatally compromised. Proximity matching via vector embeddings, which offers no guarantees of completeness or correctness, has already surrendered the essential advantage of technological advances.

Imagine a dictionary where the words are only mostly in alphabetical order. If you look up a word and don't find it, you can't be certain it's not in there. It's as useful as asking someone else, or several other people, but its value as a reference is zero, and there's no shortage of other people on the planet.
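
To make the dictionary analogy concrete, here is a minimal sketch (not from the comment; the word lists and the `lookup` helper are made up for illustration). It assumes a lookup that relies on sorted order, as a reader flipping through a dictionary does, and shows that a single swapped pair is enough for a word to be present yet not found.

```python
from bisect import bisect_left

def lookup(words, target):
    """Binary search that trusts `words` to be fully sorted."""
    i = bisect_left(words, target)
    return i < len(words) and words[i] == target

sorted_words = ["apple", "banana", "cherry", "date", "fig", "grape", "kiwi"]
# Same words with two neighbours swapped: only "mostly" alphabetical.
mostly_sorted = ["apple", "cherry", "banana", "date", "fig", "grape", "kiwi"]

found_sorted = lookup(sorted_words, "banana")   # search in the ordered list
found_mostly = lookup(mostly_sorted, "banana")  # search in the perturbed list

print(found_sorted, found_mostly, "banana" in mostly_sorted)
```

In the perturbed list the search misses "banana" even though a full scan confirms it is there, so a failed lookup no longer proves absence.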

By @FrameworkFred - 6 months
I'm playing around with LangChain and LangGraph (https://www.langchain.com/) and it seems like these enable just the sort of mechanisms mentioned.
By @wokwokwok - 6 months
Does anyone really believe that having a good corpus will remove hallucinations?

Is this article even written by a person? Hard to know; they have a real blog with real articles, but stuff like this reads strangely. Maybe it's just not a native English speaker?

> Hallucinations are certainly the toughest nut to crack and their negative impact is basically only slightly lessened by good confidence estimates and reliable citations (sources).

> The impact of contradictions in the training data.

(was this a prompt header you forgot to remove?)

> LLM are incapable of "self-inspection" on their training data to find logical inconsistencies in it but in the input context window they should be able to find logical inconsistencies.

Annnnyway...

Hallucinations cannot be fixed by a good corpus in a non-deterministic (i.e. temp > 0) LLM system where you've introduced a random factor.

Period. QED. If you think it can, do more reading.

Whether a good corpus can significantly improve the error rate is an open question, but the research I've seen tends to fall on the side of "to some degree, but curating a 'perfect' dataset like that, of a sufficiently large size, is basically impossible".

So, it's a pipe dream.

Yes, if you could have a perfect corpus, absolutely, you would get a better model.

...but how do you plan to get that perfect corpus of training data?

If it was that easy, the people spending millions and millions of dollars making LLMs would have, I guess, probably come up with a solution for it. They're not stupid. If you could easily do it, it would already have been done.

my $0.02:

This is a dead end of research, because it's impossible.

Using LLMs which are finetuned to evaluate the output of other LLMs and using multi-sample / voting to reduce the incidence of hallucinations that make it past the API barrier is both actively used and far, far more effective.

(i.e. it doesn't matter if your LLM hallucinates 1 time in 10 if you can reliably detect that 1 instance, sample again, and return a non-hallucination.)
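
As a rough illustration of the multi-sample / voting idea, the sketch below simulates it with a stand-in for the model. Everything here is an assumption made for the example: `sample_answer` is a hypothetical stub rather than a real API, the 1-in-10 error rate is taken from the comment, and errors are treated as independent between samples, which real models do not guarantee. It covers only the voting half, not the finetuned evaluator.

```python
import random
from collections import Counter

def sample_answer(rng):
    """Hypothetical stand-in for one LLM call: wrong about 1 time in 10."""
    return "wrong" if rng.random() < 0.1 else "right"

def majority_vote(rng, n_samples=5):
    """Draw several samples and return the most common answer."""
    votes = Counter(sample_answer(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

def error_rate(answer_fn, trials=10_000, seed=0):
    """Estimate how often `answer_fn` returns the wrong answer."""
    rng = random.Random(seed)
    wrong = sum(answer_fn(rng) == "wrong" for _ in range(trials))
    return wrong / trials

single = error_rate(sample_answer)
voted = error_rate(majority_vote)
print(f"single sample: {single:.3f}, majority of 5: {voted:.3f}")
```

Under the independence assumption, a majority of five is wrong only when at least three samples are wrong, which works out to just under 1%. If the model repeats the same mistake across samples, voting helps far less, which is one reason a separate evaluator is used alongside it.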

Other solutions... I'm skeptical; most of the ones I've seen haven't worked when you actually try to use them.

By @luke-stanley - 6 months
As I understand it, the Phi models are trained with much more selective training data. The Tiny Stories research was one of the starts of that: they used GPT-4 to make stories and encyclopedia-like training data for Phi to learn from, as well as code, which probably helps with logical structuring too. I think they did add in real web data too, though I think it was fairly selective.

Maybe something between Cyc and Google's math and geometry LLMs could help.

By @thntk - 6 months
We know high-quality data can help, as evidenced by the Phi models. However, this alone can never eliminate hallucination because data can never be both consistent and complete. Moreover, hallucination is an inherent flaw of intelligence in general if we think of intelligence as (lossy) compression.
By @xarope - 6 months
I do feel like we've reached a local maximum with the current state of LLMs, and researchers need to find something completely different to hit a new maximum (whether that is the global maximum or not, we'll know when we hail our new AI overlords).
By @DolphinAsa - 6 months
I'm surprised he didn't mention the way that we are solving the issue at Amazon. It's not a secret at this point: giving the LLMs hands, or agentic systems to run code or do things that get feedback in a loop, DRAMATICALLY REDUCES hallucinations.
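
The comment gives no implementation details, so the following is only a generic sketch of a run-code-and-feed-back loop, not the system described. `fake_model` is a hypothetical stub that returns a buggy function first and a fixed one after it receives an error message; a real setup would call an actual model and run the candidate in a sandbox rather than with a bare `exec`.

```python
def fake_model(task, feedback):
    """Hypothetical stand-in for an LLM call; returns source code as text.

    The first attempt contains a bug; once it sees feedback it returns a fix.
    """
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"

def run_check(source):
    """Execute the candidate; return None on success or an error message."""
    namespace = {}
    try:
        exec(source, namespace)  # illustration only: never exec untrusted code unsandboxed
        result = namespace["add"](2, 3)
        if result != 5:
            return f"add(2, 3) returned {result}, expected 5"
    except Exception as exc:  # any failure becomes feedback for the next round
        return f"{type(exc).__name__}: {exc}"
    return None

def solve(task, max_rounds=3):
    """Ask for code, run it, and feed failures back until the check passes."""
    feedback = None
    for attempt in range(1, max_rounds + 1):
        source = fake_model(task, feedback)
        feedback = run_check(source)
        if feedback is None:
            return source, attempt
    raise RuntimeError(f"no passing candidate after {max_rounds} rounds: {feedback}")

source, attempts = solve("write add(a, b)")
print(attempts)
print(source)
```

The point of the loop is that the check, not the model, decides when an answer is acceptable, so a fabricated result only survives if it also passes execution.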
By @fsndz - 6 months
The thing is we probably can't build AGI: https://www.lycee.ai/blog/why-no-agi-openai
By @Carrok - 6 months
I wish he went into how to improve confidence scores, though I guess training on better data to begin with should improve results and thus confidence.
By @MR4D - 6 months
Q: is hallucination a milestone towards consciousness?

Given how inevitable it is, it seems to me that it might be.

By @jillesvangurp - 6 months
There has been steady improvement since the release of ChatGPT into the wild, which is still less than two years ago (easy to forget). I've been getting a lot of value out of ChatGPT 4o, like lots of other people. I find with each model generation my dependence on this stuff for day-to-day work goes up as the soundness of its answers and reasoning improves.

There are still lots of issues and limitations but it's a very different experience than with gpt 3 early on. A lot of the smaller OSS models are a bit of a mixed bag in terms of hallucinations and utility. But they can be useful if you apply some skills. Half the success is actually learning to prompt these things and learning to spot when it starts to hallucinate.

One thing I find useful is to run ideas by it in kind of a socratic mode where I try to get it to flesh out brain farts I have for algorithms or other kinds of things. This can be coding related topics but also non technical kinds of things. It will get some things wrong and when you spot it, you can often get a better answer simply by pointing it out and maybe nudging it in a different direction. A useful trick with code is to also let it generate tests for its own code. When the tests fail to run, you can ask it to fix it. Or you can ask it for some alternative implementation of the same thing. Often you get something that is 95% close to what you asked for and then you can just do the remaining few percent yourself.

Doing TDD with an LLM is a power move. Good tests are easy enough to understand and once they pass, it's hard to argue with the results. And you can just ask it to identify edge cases and add more tests for those. LLMs take a lot of the tediousness out of writing tests. I'm a big picture kind of guy and my weakness is skipping unit tests to fast forward to having working code. Spelling out all the stupid little assertions is mindnumbingly stupid work that I don't have to bother with anymore. I just let AI generate good test cases. LLMs make TDD a lot less tedious. It's like having a really diligent junior pair programmer doing all the easy bits.

And if you apply SOLID principles to your own code (which is a good thing in any case), a lot of code is self-contained enough that you can easily fit it in a small file that is small enough to fit into the context window of ChatGPT (which is quite large these days). So, a thing I often do is just gather relevant code, copy-paste it, and then tell it to make some reasonable assumptions about missing things and make some modifications to the code. Add a function that does X; how would I need to modify this code to address Y; etc. I also get it to iterate on its own code. And a neat trick is to ask it to compare its solution to other solutions out there and then get it to apply some of the same principles and optimizations.

One thing with RAG is that we're still underutilizing LLMs for this. It's a lot easier to get an LLM to ask good questions than it is to get it to provide the right answers. With RAG, you can use good old information retrieval to answer the questions. IMHO limiting RAG to just vector search is a big mistake. It actually doesn't work that well for structured data, and you could just ask it to query some API based on a specification, or use some SQL, XPath, or whatever query language. And why just ask 1 question? Maybe engage in a dialog where it zooms in on the solution via querying and iteratively coming up with better questions until the context has all the data needed to come up with the answer.
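
A minimal sketch of that iterative, query-driven retrieval might look like the following. It is an assumption-laden toy: `fake_planner` is a hypothetical stand-in for the LLM deciding which query to run next, and the `orders` table and its rows are invented for the example. Only the retrieval loop over an in-memory SQLite database is real.

```python
import sqlite3

def make_db():
    """Build a small in-memory table to query (made-up example data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, item TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("ada", "book", 12.0), ("ada", "pen", 3.0), ("bob", "book", 12.0)],
    )
    return conn

def fake_planner(question, context):
    """Hypothetical stand-in for the LLM: pick the next query, or None when done."""
    if "customers" not in context:
        return "customers", "SELECT DISTINCT customer FROM orders ORDER BY customer"
    if "totals" not in context:
        return "totals", (
            "SELECT customer, SUM(amount) FROM orders "
            "GROUP BY customer ORDER BY customer"
        )
    return None  # the context now holds everything needed to answer

def gather_context(question, conn, max_steps=5):
    """Run planner-chosen queries until the planner says the context suffices."""
    context = {}
    for _ in range(max_steps):
        step = fake_planner(question, context)
        if step is None:
            break
        name, sql = step
        context[name] = conn.execute(sql).fetchall()
    return context

context = gather_context("Who spent the most?", make_db())
top_customer = max(context["totals"], key=lambda row: row[1])[0]
print(context)
print(top_customer)
```

Here the answer is computed from retrieved rows rather than recalled by the model; the model's job shrinks to choosing the next question, which is the division of labour the comment argues for.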

If you think about it, this is how most knowledge workers address problems themselves. They are not oracles of wisdom that know everything but merely aggregators and filters of external knowledge. A good knowledge worker / researcher / engineer is one that knows how to ask the right questions in order to come up with an iterative process that converges on a solution.

Once you stop using LLMs as one shot oracles that give you an answer given a question, they become a lot more useful.

As for AGI, a human enhanced by AGI is a powerful combination. I kind of like the vision behind Neuralink, where the core idea is basically improving the bandwidth between our brains and external tools and intelligence. Using a chat bot is a low-bandwidth kind of thing. I actually find it tedious.

By @Animats - 6 months
Plausible idea which needs a big training budget. Was it funded?
By @simplysparsh - 6 months
I came here thinking I would learn how to make LLMs better. But I'm leaving with more complicated questions:

1. Do I want LLMs to be trained with licensed data that's arguably well curated? Or do I want LLMs to scrape the web because it is more democratic in opinions?

2. If hallucination is not about training data but how LLM uses that data to extrapolate info that's not directly present in training data - can we teach it this skill to make better choices?

3. It's easy to define good data for facts. How to define good data for subjective topics?

4. For subjective topics, is it better to have separate LLMs trained with each theme of opinions or one big LLM with a mix of all opinions?

5. Is using an LLM to improve its own training data truly helpful, as the author claims? If yes, is this recursion method better, or is it better to use multiple LLMs together?

Dang! If I interview for a position that requires knowledge of AI - every question they ask will be answered with more questions. smh!