Hallucinations in code are the least dangerous form of LLM mistakes
Hallucinations in code generated by large language models are less harmful than in prose, as errors are quickly detected. Active testing, context provision, and improved code review skills are essential for developers.
Hallucinations in code generated by large language models (LLMs) are a common issue faced by developers, often leading to a loss of confidence in these tools. However, these hallucinations, which involve the creation of non-existent methods or libraries, are considered the least harmful mistakes made by LLMs. Unlike errors in prose, which require careful scrutiny to identify, coding errors are typically caught immediately when the code is executed, allowing for quick corrections. Developers are encouraged to actively test and validate the code produced by LLMs, as relying solely on its appearance can lead to false confidence. To mitigate hallucinations, developers can experiment with different models, provide context through example code, and choose established libraries that are more likely to be recognized by LLMs. The author emphasizes the importance of developing skills in reviewing and understanding code, suggesting that those who find LLM-generated code untrustworthy may need to improve their coding review abilities. Overall, while LLMs can produce impressive code, manual testing and validation remain essential to ensure functionality and correctness.
- Hallucinations in code are less dangerous than those in prose due to immediate error detection.
- Active testing of LLM-generated code is crucial for ensuring its correctness.
- Developers should explore different models and provide context, such as example code, to reduce hallucinations (see the sketch after this list).
- Choosing well-established libraries can improve LLM performance.
- Improving code review skills is essential for effectively using LLMs in programming.
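As a concrete illustration of the context-provision advice above, here is a hedged sketch of pasting a known-good snippet from your own codebase into the prompt so the model imitates real APIs rather than inventing them. It uses the openai Python SDK; the model name, example snippet, and prompt wording are illustrative, not taken from the article.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A known-good snippet from the codebase, included as context so the
# model sticks to libraries and patterns that actually exist.
EXAMPLE = '''
import httpx

def fetch_json(url: str) -> dict:
    resp = httpx.get(url, timeout=10.0)
    resp.raise_for_status()
    return resp.json()
'''

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[
        {"role": "system",
         "content": "Follow the style and libraries used in the example code."},
        {"role": "user",
         "content": f"Example from our codebase:\n{EXAMPLE}\n"
                    "Write a companion function that POSTs JSON to a URL."},
    ],
)
print(response.choices[0].message.content)
```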
Related
Overcoming the Limits of Large Language Models
Large language models (LLMs) like chatbots face challenges such as hallucinations, lack of confidence estimates, and citations. MIT researchers suggest strategies like curated training data and diverse worldviews to enhance LLM performance.
LLMs Will Always Hallucinate, and We Need to Live with This
The paper by Sourav Banerjee and colleagues argues that hallucinations in Large Language Models are inherent and unavoidable, rooted in computational theory, and cannot be fully eliminated by improvements.
A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs
The paper analyzes package hallucinations in code-generating LLMs, revealing a 5.2% rate in commercial models and 21.7% in open-source models, urging the research community to address this issue.
AI hallucinations: Why LLMs make things up (and how to fix it)
AI hallucinations in large language models can cause misinformation and ethical issues. A three-layer defense strategy and techniques like chain-of-thought prompting aim to enhance output reliability and trustworthiness.
Hallucinations in code are the least dangerous form of LLM mistakes
Hallucinations in code from large language models are less harmful than in prose. Manual testing is essential, and developers should engage with and review LLM-generated code to enhance their skills.
- Many commenters express skepticism about the reliability of LLM-generated code, highlighting issues such as hallucinated libraries and subtle logical errors that can lead to significant problems.
- There is a consensus that while LLMs can assist in coding, they require careful review and understanding from developers, as relying solely on them can lead to misunderstandings and errors.
- Some users feel that the time spent reviewing LLM-generated code may negate the time saved by using the LLM in the first place, suggesting that it might be faster to write code manually.
- Concerns are raised about the potential for LLMs to introduce security vulnerabilities, especially if they generate code that relies on non-existent libraries.
- Commenters emphasize the importance of human oversight and the need for developers to maintain a deep understanding of the code they work with, regardless of LLM assistance.
As much as I've agreed with the author's other posts/takes, I find myself resisting this one:
> I'll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people.
No, that does not follow.
1. Reviewing depends on what you know about the expertise (and trust) of the person writing it. Spending most of your day reviewing code written by familiar human co-workers is very different from the same time reviewing anonymous contributions.
2. Reviews are not just about the code's potential mechanics, but inferring and comparing the intent and approach of the writer. For LLMs, that ranges between non-existent and schizoid, and writing it yourself skips that cost.
3. Motivation is important, for some developers that means learning, understanding and creating. Not wanting to do code reviews all day doesn't mean you're bad at them. Also, reviewing an LLM's code has no social aspect.
However you do it, somebody else should still be reviewing the change afterwards.
Perhaps I'm just not that great of a coder, but I do have lots of code where, if someone took a look at it, it might look crazy but it really is the best solution I could find. I'm concerned LLMs won't do that; they won't take the risks a human would, or understand the implications of a block of code beyond its application in that specific context.
Other times, I feel like I'm pretty good at figuring out things and struggling in a time-efficient manner before arriving at a solution. LLM generated code is neat but I still have to spend similar amounts of time, except now I'm doing more QA and clean up work instead of debugging and figuring out new solutions, which isn't fun at all.
I would have stated this a bit differently: No amount of running or testing can prove the code correct. You actually have to reason through it. Running/testing is merely a sanity/spot check of your reasoning.
The video below uses a task well-represented in the LLM's training data, so development should be easy. The task is presented as a cumulative series of modifications to a codebase:
https://www.youtube.com/watch?v=NW6PhVdq9R8
This is the actual reality of LLM code generators in practice: iterative development converging on useless code, with the LLM increasingly unable to make progress.
This seems like a very flawed assumption to me. My take is that people look at hallucinations and say "wow, if it can't even get the easiest things consistently right, no way am I going to trust it with harder things".
[0] https://www.qut.edu.au/news/realfocus/deaths-linked-to-chatb...
[1] https://www.theguardian.com/uk-news/2023/jul/06/ai-chatbot-e...
This is an appeal against innovation.
> I’ll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
As someone who has spent [an incredible amount of time reviewing other people's code](https://github.com/ziglang/zig/pulls?q=is%3Apr+is%3Aclosed), my perspective is that reviewing code is fundamentally slower than writing it oneself. The purpose of reviewing code is mentorship, investing in the community, and building trust, so that those reviewees can become autonomous and eventually help out with reviewing.
You get none of that from reviewing code generated by an LLM.
The specification was to only look at clinical appointments, and find the most recent appointment. However if the patient didn't have a clinical appointment, it was supposed to find the most recent appointment of any sort.
I wrote the code by sorting the data (first by clinical-non-clinical and then by date). I asked chatgpt to document it. It misunderstood the code and got the sorting backwards.
I was pretty surprised, and after testing with foo-bar examples eventually realised that I had called the clinical-non-clinical column "Clinical", which confused the LLM.
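For reference, a minimal sketch of the sort-then-pick approach described above; the field names and records are hypothetical, not the commenter's actual schema.

```python
from datetime import date

# Hypothetical appointment records; "clinical" is the flag the commenter
# says confused the LLM when it documented the sort.
appointments = [
    {"clinical": False, "date": date(2024, 3, 1)},
    {"clinical": True,  "date": date(2023, 11, 5)},
    {"clinical": True,  "date": date(2024, 1, 20)},
]

def most_recent(appts):
    """Most recent clinical appointment, falling back to the most recent
    appointment of any kind if no clinical one exists."""
    # reverse=True puts clinical (True) first, then newest date first.
    return sorted(appts, key=lambda a: (a["clinical"], a["date"]), reverse=True)[0]

print(most_recent(appointments))  # -> the 2024-01-20 clinical appointment
```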
This is the kind of mistake that is a lot worse than "code doesn't run" - being seemingly right but wrong is much worse than being obviously wrong.
For example, I had it generate some C code to be used with ZeroMQ a few months ago. The code looked absolutely fine, and it mostly worked fine, but it made a mistake with its memory allocation stuff that caused it to segfault sometimes, and corrupt memory other times.
Fortunately, this was such a small project and I already know how to write code, so it wasn't too hard for me to find and fix, though I am slightly concerned that some people are copypasting large swaths of code from ChatGPT that looks mostly fine but hides subtle bugs.
If I have to spend lots of time learning how to use something, fix its errors, review its output, etc., it may just be faster and easier to just write it myself from scratch.
The burden of proof is not on me to justify why I choose not to use something. It's on the vendor to explain why I should turn the software development process into perpetually reviewing a junior engineer's hit-or-miss code.
It is nice that the author uses the word "assume" -- there is mixed data on actual productivity outcomes of LLMs. That is all you are doing -- making assumptions without conclusive data.
This is not nearly as strong an argument as the author thinks it is.
> As a Python and JavaScript programmer my favorite models right now are Claude 3.7 Sonnet with thinking turned on, OpenAI’s o3-mini-high and GPT-4o with Code Interpreter (for Python).
This is similar to Neovim users who talk about "productivity" while ignoring all the time spent tweaking dotfiles that could be spent doing your actual job. Every second I spend toying with models is me doing something that does not directly accomplish my goals.
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
You have no idea how much code I read, so how can you make such claims? Anyone who reads plenty of code knows that it often feels like reading other people's code is often harder than just writing it yourself.
The level of hostility towards just sitting down and thinking through something without having an LLM insert text into your editor is unwarranted and unreasonable. A better policy is: if you like using coding assistants, great. If you don't and you still get plenty of work done, great.
That certainly punctures the hype. What are LLMs good for, if the best you can hope for is to spend years learning to prompt it for unreliable results?
If you’re writing code in Python against well documented APIs, sure. But it’s an issue for less popular languages and frameworks, when you can’t immediately tell if the missing method is your fault due to a missing dependency, version issue, etc.
Now I am at the point that I am cleaning up the code and making it pretty. My script is less than 300 lines and Chatgpt regularly just leaves out whole chunks of the script when it suggests improvements. The first couple times this led to tons of head scratching over why some small change to make one thing more resilient would make something totally unrelated break.
Now I've learned to take Chatgpt's changes and diff it with the working version before I try to run it.
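A minimal sketch of that diff-before-running habit, assuming the working script and the model's suggestion are saved to two hypothetical files:

```python
import difflib
from pathlib import Path

# Hypothetical filenames: the known-good script vs. the LLM's suggestion.
working = Path("script_working.py").read_text().splitlines(keepends=True)
suggested = Path("script_suggested.py").read_text().splitlines(keepends=True)

# Any unexpectedly deleted chunk shows up as a block of "-" lines.
for line in difflib.unified_diff(working, suggested,
                                 fromfile="working", tofile="suggested"):
    print(line, end="")
```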
Unlike the positive case (the code compiles), the negative case (it forgets about a core feature) can be extremely difficult to spot. Worse still, a feature can slightly drift, based on code that's expected to be outside of the dialogue / context window.
I've had multiple times where the model completely forgot about features in my original piece of code, after it makes a modification. I didn't notice these missing / subtle changes until much later.
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
Not only is this a massive bundle of assumptions, but it's also just wrong from multiple angles. Maybe if you're only doing basic CRUDware you can spend five seconds and give a thumbs up, but in any complex system you should be spending time deeply reading code, which is naturally going to take longer than using the knowledge you already have to throw out a solution.
OK, sure, it writes test-code boilerplate for me.
Honestly, the kind of work I'm doing requires that I understand the code I'm reading more than it requires the ability to quickly churn out more of it.
I think an LLM is probably going to greatly speed up web development, or anything else where the emphasis is on adding to a codebase quickly. As for maintaining older code, performing precise upgrades, and fixing bugs, so far I've seen zero benefits. And trust me, I would like my job to be easier! It's not like I haven't tried to use these tools.
The real problem with hallucination is that we started using LLMs as search engines, so when it invents a function, you have to go and actually search the API on a real search engine.
Interestingly though, this only works if there is an error. There are cases where you will not get an error; consider a loosely typed programming language like JS or Python, or simply any programming language when some of the API interface is unstructured, like using stringly-typed information (e.g. Go struct tags.) In some cases, this will just silently do nothing. In other cases, it might blow up at runtime, but that does still require you to hit the code path to trigger it, and maybe you don't have 100% test coverage.
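A small Python illustration of that silent failure mode, using a hypothetical API that swallows unknown options instead of raising:

```python
# Hypothetical connection helper: unknown options are silently ignored
# because everything funnels through **options.
def connect(host: str, port: int = 5432, **options) -> str:
    timeout = options.get("timeout", 30)  # the option the API actually knows
    return f"connecting to {host}:{port} (timeout={timeout})"

# An LLM might hallucinate a plausible-sounding `connection_timeout`.
# No exception is raised and no test fails unless one asserts on timing;
# the setting simply never takes effect.
print(connect("db.example.com", connection_timeout=5))
```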
So I'd argue hallucinations are not always safe, either. The scariest thing about LLMs in my mind is just the fact that they have completely different failure modes from humans, making it much harder to reason about exactly how "competent" they are: even humans are extremely difficult to compare with regards to competency, but when you throw in the alien behavior of LLMs, there's just no sense of it.
And btw, it is not true that feeding an error into an LLM will always result in it correcting the error. I've been using LLMs experimentally, even trying to guide them towards solving problems I know how to solve, and sometimes they simply can't and will just make a bigger and bigger mess. Because LLMs confidently pretend to know the exact answer ahead of time, presumably due to the way they're trained, they will confidently do things that would make more sense to try and then undo when they don't work, like messing with the linker order or adding dependencies to a target to fix undefined reference errors (which are actually caused by e.g. ABI issues).

I still think LLMs are a useful programming tool, but we could use a bit more reality. If LLMs were as good as people sometimes imply, I'd expect an explosion in quality software to show up. (There are exceptions of course. I believe the first versions of Stirling PDF were GPT-generated so long ago.) I mean, machine-generated illustrations have flooded the Internet despite their shortcomings, but programming with AI assistance remains tricky and not yet the force multiplier it is often made out to be. I do not believe AI-assisted coding has hit its Stable Diffusion moment, if you will.
Now whether it will or not, is another story. Seems like the odds aren't that bad, but I do question if the architectures we have today are really the ones that'll take us there. Either way, if it happens, I'll see you all at the unemployment line.
Why am I reminded of people who say you first have to become a biblical scholar before you can criticize the bible?
1. I know that a problem requires a small amount of code, but I also know it's difficult to write (as I am not an expert in this particular subfield) and it will take me a long time, like maybe a day. Maybe it's not worth doing at all, as the effort is not worth the result.
2. So why not ask the LLM, right?
3. It gives me some code that doesn't do exactly what is needed, and I still don't understand the specifics, but now I have a false hope that it will work out relatively easily.
4. I spend a day until I finally manage to make it work the way it's supposed to work. Now I am also an expert in the subfield and I understand all the specifics.
5. After all I was correct in my initial assessment of the problem, the LLM didn't really help at all. I could have taken the initial version from Stack Overflow and it would have been the same experience and would have taken the same amount of time. I still wasted a whole day on a feature of questionable value.
But that's for methods. For libraries, the scenario is different, and possibly a lot more dangerous. For example, the LLM generates code that imports a library that does not exist. An attacker notices this too while running tests against the LLM. The attacker decides to create these libraries on the public package registry and injects malware. A developer may think: "oh, this newly generated code relies on an external library, I will just install it," and gets owned, possibly without even knowing for a long time (as is the case with many supply chain attacks).
And no, I'm not looking for a way to dismiss the technology, I use LLMs all the time myself. But what I do think is that we might need something like a layer in between the code generation and the user that will catch things like this (or something like Copilot might integrate safety measures against this sort of thing).
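As a sketch of what such a layer could look like (a heuristic written here as an assumption, not an existing tool): parse the generated code's imports and flag anything that is neither in the standard library nor already installed, instead of blindly installing whatever name the model produced. Import names don't always match distribution names (e.g. PIL vs Pillow), so flagged names still need human judgment; the filename at the end is hypothetical.

```python
import ast
import sys
from importlib.metadata import distributions

def imported_top_level_names(source: str) -> set[str]:
    """Collect the top-level module names a piece of code imports."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names

def suspicious_imports(source: str) -> set[str]:
    """Names that are neither stdlib nor an installed distribution."""
    installed = {dist.metadata["Name"].lower().replace("-", "_")
                 for dist in distributions()}
    stdlib = set(sys.stdlib_module_names)  # Python 3.10+
    return {name for name in imported_top_level_names(source)
            if name not in stdlib and name.lower() not in installed}

# Anything reported here deserves scrutiny before `pip install`.
print(suspicious_imports(open("generated_code.py").read()))
```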
I'm tempted to pay someone in Poland or whatever another $500 to just finish the project. Claude Code is like a temp with a code quota to reach: after they reach it, they're done. You've reached the context limit.
A lot of stuff is just weird. For example, I'm basically building a website with Supabase. Claude does not understand the concept of shared style sheets; instead it will just re-implement the same style sheets over and over again on like every single page and subcomponent.
Multiple incorrect implementations of relatively basic concepts. Over engineering all over the place.
A part of this might be on Supabase though. I really want to create a FOSS project, so firebase, while probably being a better fit, is out.
Not wanting to burn out, I took a break after a 4 hour Claude session. It's like reviewing code for a living.
However, I'm optimistic that a competitor with better pricing will emerge soon. I would absolutely love to run three coding agents at once, maybe even a fourth that can run integration tests against the first three.
Even if one is very good at code review, I'd assume the vast majority of people would still end up with pretty different kinds of bugs they are better at finding while writing vs reviewing. Writing code and having it reviewed by a human gets both classes, whereas reviewing LLM code gets just one half of that. (maybe this can be compensated-ish by LLM code review, maybe not)
And I'd be wary of equating reviewing human vs LLM code; sure, the explicit goal of LLMs is to produce human-like text, but they also have prompting to request being "correct" over being "average human" so they shouldn't actually "intentionally" reproduce human-like bugs from training data, resulting in the main source of bugs being model limitations, thus likely producing a bug type distribution potentially very different to that of humans.
Should we even be asking AI to write code? Shouldn't we just be building and training AI to solve these problems without writing any code at all? Replace every app with some focused, trained, and validated AI. Want to find the cheapest flights? Who cares what algorithm the AI uses to find them, just let it do that. Want to track your calorie intake, process payroll every two weeks, do your taxes, drive your car, keep airplanes from crashing into each other, encrypt your communications, predict the weather? Don't ask AI to clumsily write code to do these things. Just tell it to do them!
Isn't that the real promise of AI?
Um. No.
This is an oversimplification that falls apart in any system beyond the most minimal.
Over my career I've encountered plenty of consequences caused by reliability failures: code that would run, but where the side effects of not processing something, processing it too slowly, or processing it twice had serious consequences - financial and personal ones.
And those weren't „nuclear power plant management” kind of critical. I often think back to an educational game used at school, where the loss of a single player's saved progress meant a couple thousand dollars of reimbursement.
https://xlii.space/blog/network-scenarios/
This is a cheatsheet I made for my colleagues. These are the things we need to keep in mind when designing the systems I work on. Rarely does any LLM think about them. It's not popular engineering by any means, but it's here.
As of today I've yet to name a single instance where ChatGPT-produced code actually saved me time. I've seen macro-generation code recommendations for Go (Go doesn't have macros), object mutations for Elixir (Elixir doesn't have objects, only immutable structs), list splicing in Fennel (Fennel doesn't have splicing), a language feature pragma ported from another language, and a pure-byte representation of memory in Rust where the code used UTF-8 string parsing to do it. My trust toward any non-ephemeral generated code is sub-zero.
It’s exhausting and annoying. It feels like interacting with Calvin’s (of Calvin and Hobbes) dad but with all the humor taken away.
So he's also using LLMs to steer his writing style towards the lowest common denominator :)
The more constraints we can place on its behavior, the harder it is to mess up.
If it's riskier code, constrain it more with better typing, testing, design, and analysis.
Constraints are to errors (including hallucinations) as water is to fire.
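A tiny Python sketch of what "constrain it more with better typing and testing" can look like in practice (the domain and names are made up):

```python
from typing import NewType

# A distinct type for money-in-cents: a static checker such as mypy will
# reject generated code that passes a bare float or mixes up units.
Cents = NewType("Cents", int)

def add_fee(balance: Cents, fee: Cents) -> Cents:
    return Cents(balance + fee)

# A test pins down behaviour that the types alone cannot express.
def test_add_fee() -> None:
    assert add_fee(Cents(100), Cents(50)) == 150
```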
If you don't, don't.
However, this 'let's move past hallucinations' discourse is just disingenuous.
The OP is conflating hallucinations, which are a real and undisputed failure mode of LLMs that no one has any solution for...
...and people not spending enough time and effort learning to use the tools.
I don't like it. It feels bad. It feels like a rage bait piece, cast out of frustration that the OP doesn't have an answer for hallucinations, because there isn't one.
> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.
People aren't stupid.
If they use a tool and it sucks, they'll stop using it and say "this sucks".
If people are saying "this sucks" about AI, it's because the LLM tool they're using sucks, not because they're idiots, or there's a grand 'anti-AI' conspiracy.
People are lazy; if the tool is good (eg. cursor), people will use it.
If they use it, and the first thing it does is hallucinate some BS (eg. intellij full line completion), then you'll get people uninstalling it and leaving reviews like "blah blah hallucination blah blah. This sucks".
Which is literally what is happening. Right. Now.
To be fair 'blah blah hallucinations suck' is a common 'anti-AI' trope that gets rolled out.
...but that's because it is a real problem
Pretending 'hallucinations are fine, people are the problem' is... it's just disingenuous and embarrassing from someone of this caliber.
If someone asks me a question about something I've worked on, I might be able to give an answer about some deep functionality.
At the moment I'm working with a LLM on a 3D game and while it works, I would need to rebuild it to understand all the elements of it.
For me this is my biggest fear - not that LLMs can code, but that they do so at such a volume that in a generation or two no one will understand how the code works.
Are these not considered hallucinations still?
> I think a simpler explanation is that hallucinating a non-existent library is a such an inhuman error it throws people. A human making such an error would be almost unforgivably careless.
This might explain why so many people see hallucinations in generated code as an inexcusable red flag.
Good point
Well, those types of errors won't be happening next year will they?
> No amount of meticulous code review—or even comprehensive automated tests—will demonstrably prove that code actually does the right thing. You have to run it yourself!
What rot. The test is the problem definition. If properly expressed, the code passing the test means the code is good.
Even better, this can carry on for a few iterations. And both LLMs can be:
1. Budgeted ("don't exceed X amount")
2. Improved (another LLM can improve their prompts)
and so on. I think we are fixating on how _we_ do things, not how this new world will do their _own_ thing. That to me is the real danger.
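A hedged sketch of the budgeted generate-and-review loop described above; the two callables stand in for calls to two different models and are not a real vendor API.

```python
from typing import Callable

def iterate(task: str,
            generate: Callable[[str, str], str],   # (task, feedback) -> code
            review: Callable[[str, str], str],     # (task, code) -> critique
            budget_calls: int = 4) -> str:
    """Let one model draft code and another critique it, within a budget."""
    code = generate(task, "")                      # first draft, no feedback yet
    for _ in range(budget_calls - 1):              # budget: "don't exceed X amount"
        feedback = review(task, code)              # the second model critiques
        if feedback.strip().upper() == "LGTM":     # reviewer is satisfied
            break
        code = generate(task, feedback)            # revise using the critique
    return code
```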
I've also tried Cursor with similar mixed results.
But I'll say that we are getting tremendous pressure at work to use AI to write code. I've discussed it with fellow engineers and we're of the opinion that the managerial desire is so great that we are better off keeping our heads down and reporting success vs saying the emperor wears no clothes.
It really feels like the billionaire class has fully drunk the kool-aid and needs AI to live up to the hype.
Ah so you mean... actually doing work. Yeah writing code has the same difficulty, you know. It's not enough to merely get something to compile and run without errors.
> With code you get a powerful form of fact checking for free. Run the code, see if it works.
No, this would be coding by coincidence. Even the most atrociously bad prose writers don't exactly go around just saying random words from a dictionary or vaguely (mis)quoting Shakespeare hoping to be understood.
> I genuinely find myself picking libraries that have been around for a while partly because that way it’s much more likely that LLMs will be able to use them.
People will pick solutions that have a lot of training data, rather than the best solution.
Any entity, human or otherwise, lacking understanding of the problem being solved will, by definition, produce systems which contain some combination of defects, logic errors, and inapplicable functionality for the problem at hand.
Edit: oh and steel capped boots.
Edit 2: and a face shield and ear defenders. I'm all tuckered out like Grover in his own alphabet.
> My cynical side suspects they may have been looking for a reason to dismiss the technology and jumped at the first one they found.
My cynical side suggests the author is an LLM fanboi who prefers not to think that hallucinating easy stuff strongly implies hallucinating harder stuff, and therefore jumps at the first reason to dismiss the criticism.

The article says that hallucinations are not a big deal, that there are great dangers that are hard to spot in LLM-generated code... and then presents tips on fixing hallucinations with a general theme of positivity towards using LLMs to generate code, with no more time dedicated to the other dangers.
It sure gives the impression that the article itself was written by an LLM and barely edited by a human.
Absolutely not. If your testing requires a human to run it, your testing has already failed. Your tests do need to include both positive and negative cases, though: if they don't include "things should crash and burn given ...", they're incomplete.
> If you’re using an LLM to write code without even running it yourself, what are you doing?
Running code through tests is literally running the code. Have code coverage turned on, so that you get yelled at for LLM code that you don't have tests for, and CI/CD that refuses to accept code that has no tests. By all means push to master on your own projects, but for production code, you better have checks in place that don't allow not-fully-tested code (coverage, unit, integration, and ideally, docs) to land.
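A sketch of that kind of gate, assuming pytest with the pytest-cov plugin; the function under test is made up and the coverage threshold is arbitrary:

```python
# CI command (fails the build if coverage drops below the threshold):
#   pytest --cov=myproject --cov-fail-under=90
import pytest

def parse_port(value: str) -> int:
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port

# The negative case: untested "crash and burn" paths are exactly where
# LLM-generated code tends to slip through unreviewed.
def test_parse_port_rejects_out_of_range() -> None:
    with pytest.raises(ValueError):
        parse_port("99999")
```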
The real problem comes from LLMs happily not just giving you code but also test cases. The same prudence applies as with test cases someone added to a PR/MR: just because there are tests doesn't mean they're good tests, or enough tests, review them in the assumption that they're testing the wrong thing entirely.
It's not hallucinating, Jim; it's statistical coding errors. It's floating-point rounding mistakes. It's the wrong cell in the Excel table.
Hallucinating
Then the biggest mistake it could make is running `gh repo delete`