Overcoming the Limits of Large Language Models
LLM-based chatbots face challenges such as hallucinations and a lack of confidence estimates and citations. The article, echoing related MIT research, suggests strategies like curated training data and models trained with diverse worldviews to improve LLM performance.
Large language models (LLMs) face challenges such as hallucinations, a lack of confidence estimates, and missing citations. Hallucinations occur when LLMs generate content that is convincing but inaccurate; confidence estimates are crucial for assessing factuality, and citations provide sources for generated text. Improving LLM chatbots means addressing these limitations. Proposed strategies include excluding contradictory training data, supervising the training process, and curating a consistent text corpus for training. MIT researchers have explored similar approaches to mitigate biases in LLMs. By gradually expanding the training data with consistent text, a more reliable LLM could be developed, and further advances could involve training models with diverse worldviews. Research into bootstrapping LLMs from consistent data is ongoing, and various references provide insight into these challenges and potential solutions for improving LLM performance.
Related
- Hallucinations: Many commenters argue that hallucinations are inherent to LLMs due to their nature of remixing and interpolating data, and cannot be fully eliminated by curated data alone.
- Data Quality: Some believe that creating a universally coherent dataset is impossible due to the contextual nature of "truth" and the immense effort required for manual curation.
- Alternative Approaches: Suggestions include using logical reasoning capabilities, agentic systems for feedback loops, and combining different models or parameters to reduce hallucinations.
- Practical Applications: Commenters discuss the practical use of LLMs in various fields, noting improvements over time and the importance of effective prompting and iterative querying.
- Philosophical Concerns: There is a debate on whether focusing resources on developing AI chatbots is truly beneficial for society.
Training only on factual data won't solve this.
Anyway, I can't help but feel saddened sometimes to see our talented people and investment resources being drawn into developing these AI chatbots. These problems are solvable, but are we really making a better world by solving them?
Manually curating "tidy", properly-licensed and verified datasets is immensely more difficult, expensive, and time-consuming than stealing whatever you can find on the open internet. Wolfram Alpha is one of the more successful attempts in that curation-based direction (using good-old-fashioned heuristic techniques instead of opaque ML models), and while it is very useful and contains a great deal of factual information, it does not conjure appealing fantasies of magical capabilities springing up from thin air and hands-off exponential improvement.
This rests on the mistaken assumption that science is about objective truth.
It mistakes the map for the territory: scientific models are intended to be useful, not perfect.
Statistical learning versus symbolic learning is a matter of existential quantification versus universal quantification, respectively.
All models are wrong, but some are useful; this applies even to the most unreasonably accurate ones, like QFT and GR.
Spherical cows, no matter how useful, are hotly debated outside the didactic half-truths of low-level courses.
The corpus the above seeks doesn't exist in academic circles, only in popular science, where people don't see that practical, useful models are far more important than 'correct' ones.
Here are a few examples of statements that are not "factual" in the sense of being derivable from a universally coherent data set, and that nevertheless we would expect a useful intelligence to be able to generate:
"There is a region called Hobbiton where someone named Frodo Baggins lives."
"We'd like to announce that Mr. Ousted is transitioning from his role as CEO to an advisory position while he looks for a new challenge. We are grateful to Mr. Ousted for his contributions and will be sad to see him go."
"The earth is round."
"Nebraska is flat."
You can get good results by combining different models in a chat, or even the same model with different parameters. A model usually gives up on a hallucination when challenged; sometimes it pushes back and provides an explanation with sources.
I have a script that puts models into a dialog, moderates the discussion, and takes notes. I run this stuff overnight, so getting multiple choices speeds up iteration.
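For anyone curious, here is a minimal sketch of what such a dialog script might look like. It assumes an OpenAI-compatible chat API via the `openai` package; the model names, prompts, and turn count are made up.

```python
# Sketch of a script that puts two models into a dialog and takes notes.
# Model names are placeholders; assumes the `openai` package (v1+ client).
from openai import OpenAI

client = OpenAI()
MODELS = ["model-a", "model-b"]  # hypothetical: any two chat models


def ask(model, history, question):
    messages = [{"role": "system", "content": "Answer carefully and cite sources."}]
    messages += history + [{"role": "user", "content": question}]
    reply = client.chat.completions.create(model=model, messages=messages)
    return reply.choices[0].message.content


question = "When was the first transatlantic telegraph cable completed?"
history, notes = [], []
for turn in range(4):                      # alternate models, challenge each answer
    model = MODELS[turn % 2]
    answer = ask(model, history, question)
    notes.append((model, answer))          # the "notes" the moderator keeps
    history.append({"role": "assistant", "content": answer})
    question = (f"Another model answered:\n{answer}\n"
                "Do you agree? Point out any errors and cite sources.")

for model, answer in notes:
    print(f"--- {model} ---\n{answer}\n")
```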
Imagine a dictionary where the words are only mostly in alphabetical order. If you look up a word and don't find it, you can't be certain it's not in there. It's as useful as asking someone else, or several other people, but its value as a reference is zero, and there's no shortage of other people on the planet.
Is this article even written by a person? Hard to know; they have a real blog with real articles, but stuff like this reads strangely. Maybe it's just not a native English speaker?
> Hallucinations are certainly the toughest nut to crack and their negative impact is basically only slightly lessened by good confidence estimates and reliable citations (sources).
> The impact of contradictions in the training data.
(was this a prompt header you forgot to remove?)
> LLM are incapable of "self-inspection" on their training data to find logical inconsistencies in it but in the input context window they should be able to find logical inconsistencies.
Annnnyway...
Hallucinations cannot be fixed by a good corpus in a non-deterministic LLM system (i.e. temp > 0) where you've introduced a random factor.
Period. QED. If you think it can, do more reading.
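For reference, a toy sketch of that random factor: with temp > 0 the next token is drawn from temperature-scaled probabilities rather than taken greedily, so the same prompt can yield different outputs. The numbers below are purely illustrative, not from a real model.

```python
# Toy illustration of temperature sampling (illustrative logits, not a real model).
import numpy as np

logits = np.array([3.2, 3.1, 0.5])          # scores for three candidate tokens
temperature = 0.8                            # temp > 0: sample; temp -> 0: greedy argmax

probs = np.exp(logits / temperature)
probs /= probs.sum()                         # softmax over scaled logits

rng = np.random.default_rng()
samples = [rng.choice(len(logits), p=probs) for _ in range(5)]
print(probs.round(3), samples)               # repeated runs can pick different tokens
```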
The idea that a good corpus can significantly improve the error rate is an open question, but the research I've seen tends to fall on the side of "to some degree, but curating a 'perfect' dataset like that, of a sufficiently large size, is basically impossible".
So, it's a pipe dream.
Yes, if you could have a perfect corpus, absolutely, you would get a better model.
...but how do you plan to get that perfect corpus of training data?
If it was that easy, the people spending millions and millions of dollars making LLMs would have, I guess, probably come up with a solution for it. They're not stupid. If you could easily do it, it would already have been done.
my $0.02:
This is a dead end of research, because it's impossible.
Using LLMs that are fine-tuned to evaluate the output of other LLMs, and using multi-sample / voting to reduce the incidence of hallucinations that make it past the API barrier, is both actively used and far, far more effective.
(i.e. it doesn't matter if your LLM hallucinates 1 time in 10; if you can reliably detect that 1 instance, sample again, and return a non-hallucination).
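Roughly, that sample-and-judge pattern looks like the sketch below. The model names are placeholders, the judging prompt is deliberately simplified, and an OpenAI-compatible chat API is assumed.

```python
# Sketch of multi-sample generation plus an LLM judge that filters hallucinations.
from openai import OpenAI

client = OpenAI()

def generate(question, n=5):
    out = client.chat.completions.create(
        model="generator-model",           # placeholder model name
        messages=[{"role": "user", "content": question}],
        n=n,                               # draw several independent samples
        temperature=0.7,
    )
    return [c.message.content for c in out.choices]

def judged_ok(question, answer):
    verdict = client.chat.completions.create(
        model="judge-model",               # placeholder: a model tuned to evaluate answers
        messages=[{
            "role": "user",
            "content": (f"Question: {question}\nAnswer: {answer}\n"
                        "Is the answer fully supported and free of fabrication? Reply YES or NO."),
        }],
        temperature=0,
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")

def answer_with_filtering(question):
    candidates = generate(question)
    accepted = [a for a in candidates if judged_ok(question, a)]
    # crude vote: return the most common accepted answer, or None if all were rejected
    return max(set(accepted), key=accepted.count) if accepted else None
```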
Other solutions... I'm skeptical; most of the ones I've seen haven't worked when you actually try to use them.
Maybe something between Cyc and Google's math and geometry LLMs could help.
Given how inevitable it is, it seems to me that it might be.
There are still lots of issues and limitations, but it's a very different experience than with GPT-3 early on. A lot of the smaller OSS models are a bit of a mixed bag in terms of hallucinations and utility, but they can be useful if you apply some skill. Half the success is learning to prompt these things and learning to spot when they start to hallucinate.
One thing I find useful is to run ideas by it in a kind of Socratic mode where I try to get it to flesh out brain farts I have for algorithms or other kinds of things. This can be coding-related topics but also non-technical ones. It will get some things wrong, and when you spot it, you can often get a better answer simply by pointing it out and maybe nudging it in a different direction. A useful trick with code is to let it generate tests for its own code. When the tests fail to run, you can ask it to fix them. Or you can ask it for an alternative implementation of the same thing. Often you get something that is 95% of what you asked for, and then you can just do the remaining few percent yourself.
Doing TDD with an LLM is a power move. Good tests are easy enough to understand, and once they pass, it's hard to argue with the results. And you can just ask it to identify edge cases and add more tests for those. LLMs take a lot of the tediousness out of writing tests. I'm a big-picture kind of guy, and my weakness is skipping unit tests to fast-forward to having working code. Spelling out all the little assertions is mind-numbing work that I don't have to bother with anymore; I just let the AI generate good test cases. It's like having a really diligent junior pair programmer doing all the easy bits.
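One way that loop can look in practice is sketched below: the model writes the tests first, then an implementation, and the failures get fed back for a few repair rounds. The prompts, file names, spec, and model name are all made up; pytest and the `openai` client are assumed.

```python
# Sketch of a test-first loop with an LLM: generate tests, run them, feed failures back.
import subprocess
from openai import OpenAI

client = OpenAI()

def llm(prompt):
    out = client.chat.completions.create(
        model="code-model",                # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content

spec = "slugify(title): lowercase, strip punctuation, join words with '-'."

# Tests first, then an implementation, then a few repair iterations.
open("test_slugify.py", "w").write(llm(f"Write pytest tests (code only) for: {spec}"))
open("slugify.py", "w").write(llm(f"Implement (code only) the function under test: {spec}"))

for _ in range(3):
    run = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if run.returncode == 0:
        break                              # tests pass: hard to argue with the result
    fixed = llm(f"These tests failed:\n{run.stdout}\n"
                f"Fix this implementation:\n{open('slugify.py').read()}")
    open("slugify.py", "w").write(fixed)
```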
And if you apply SOLID principles to your own code (which is a good thing in any case), a lot of code is self-contained enough that you can easily fit it into a small file that fits in the context window of ChatGPT (which is quite large these days). So a thing I often do is gather the relevant code, copy-paste it, and tell it to make some reasonable assumptions about the missing pieces and then make some modifications: add a function that does X; how would I need to modify this code to address Y; etc. I also get it to iterate on its own code. A neat trick is to ask it to compare its solution to other solutions out there and then apply some of the same principles and optimizations.
One thing with RAG is that we're still underutilizing LLMs for it. It's a lot easier to get an LLM to ask good questions than it is to get it to provide the right answers, and with RAG you can use good old information retrieval to answer those questions. IMHO, limiting RAG to vector search is a big mistake: it doesn't work that well for structured data, and you could just ask the model to query some API based on a specification, or use some SQL, XPath, or whatever query language. And why ask just one question? Maybe engage in a dialog where it zooms in on the solution by querying and iteratively coming up with better questions until the context has all the data needed to produce the answer.
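As a sketch, that query-generation flavour of RAG over structured data might look like this. The schema, model name, prompts, and database file are invented; the point is that the database, not the model's memory, supplies the facts.

```python
# Sketch of RAG over structured data: the LLM writes a SQL query for a known
# schema, the database answers, and the results go back into the prompt.
import sqlite3
from openai import OpenAI

client = OpenAI()
SCHEMA = "orders(id INTEGER, customer TEXT, total REAL, placed_at TEXT)"  # made-up schema

def llm(prompt):
    out = client.chat.completions.create(
        model="some-model",                # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content.strip()

def answer(question, db_path="shop.db"):
    # Let the model write the query; let the database supply the facts.
    sql = llm(f"Schema: {SCHEMA}\nWrite one SQLite query (SQL only) that answers: {question}")
    rows = sqlite3.connect(db_path).execute(sql).fetchall()
    return llm(f"Question: {question}\nQuery results: {rows}\nAnswer using only these results.")
```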
If you think about it, this is how most knowledge workers address problems themselves. They are not oracles of wisdom that know everything but merely aggregators and filters of external knowledge. A good knowledge worker / researcher / engineer is one that knows how to ask the right questions in order to come up with an iterative process that converges on a solution.
Once you stop using LLMs as one shot oracles that give you an answer given a question, they become a lot more useful.
As for AGI, a human enhanced by AI is a powerful combination. I kind of like the vision behind Neuralink, where the core idea is basically improving the bandwidth between our brains and external tools and intelligence. Using a chatbot is a low-bandwidth kind of thing; I actually find it tedious.
1. Do I want LLMs to be trained on licensed data that's arguably well curated, or do I want them to scrape the web because that is more democratic in its range of opinions?
2. If hallucination is not about the training data but about how the LLM uses that data to extrapolate information that isn't directly present, can we teach it this skill so it makes better choices?
3. It's easy to define good data for facts. How do we define good data for subjective topics?
4. For subjective topics, is it better to have separate LLMs trained on each theme of opinion, or one big LLM with a mix of all opinions?
5. Is using an LLM to improve its own training data truly helpful, as the author claims? If so, is this recursive method better, or is it better to use multiple LLMs together?
Dang! If I interview for a position that requires knowledge of AI - every question they ask will be answered with more questions. smh!