Can LLMs write better code if you keep asking them to "write better code"?
An exploration of large language models for coding showed that iterative prompting can improve code quality, but diminishing returns and growing complexity emerged in later iterations, highlighting both the potential and the limitations of the approach.
In a recent exploration of the capabilities of large language models (LLMs) in coding, the author investigates whether iterative prompting, specifically asking an LLM to "write better code," can lead to improved code quality. The experiment utilized Claude 3.5 Sonnet, which demonstrated strong performance in generating Python code to solve a specific problem involving random integers. The initial implementation was straightforward but could be optimized. Subsequent iterations of prompting led to significant improvements, including the introduction of object-oriented design, precomputation of digit sums, and the use of multithreading and vectorized operations. However, as the iterations progressed, the improvements began to plateau, with some iterations resulting in regressions or unnecessary complexity. The final iteration incorporated advanced techniques such as Just-In-Time (JIT) compilation and asynchronous programming, showcasing the potential of LLMs to enhance coding efficiency. The findings suggest that while iterative prompting can yield better code, there are diminishing returns, and careful consideration is needed to avoid overcomplication.
- Iterative prompting can lead to significant improvements in LLM-generated code.
- The initial code implementation was optimized through multiple iterations.
- Advanced techniques like JIT compilation and asynchronous programming were eventually incorporated.
- Diminishing returns were observed in later iterations, indicating a need for balance in complexity.
- The experiment highlights the potential and limitations of LLMs in coding tasks.
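For illustration, here is a minimal sketch of the vectorization-plus-JIT direction the article describes, assuming NumPy and numba are available. This is not the article's actual code; the function names and structure are placeholders.

import numpy as np
from numba import njit

@njit
def digit_sum(n):
    # Sum the decimal digits of n via repeated mod/div.
    s = 0
    while n:
        s += n % 10
        n //= 10
    return s

@njit
def min_max_diff(nums, target=30):
    # Track the smallest and largest values whose digits sum to target.
    lo, hi = -1, -1
    for x in nums:
        if digit_sum(x) == target:
            if lo == -1 or x < lo:
                lo = x
            if x > hi:
                hi = x
    return hi - lo

nums = np.random.randint(1, 100_000, 1_000_000)
print(min_max_diff(nums))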
Related
LLMs are good for coding because your documentation is ...
Large Language Models (LLMs) are praised for aiding coding by interpreting complex documentation efficiently. Developers struggle with poor documentation, turning to LLMs like StackOverflow. Despite energy consumption, LLMs' precision prompts tech industry to enhance human-generated documentation.
LLMs struggle to explain themselves
Large language models can identify number patterns but struggle to provide coherent explanations. An interactive demo highlights this issue, revealing that even correct answers often come with nonsensical reasoning.
Notes on Using LLMs for Code
Simon Willison shares his experiences with large language models in software development, highlighting their roles in exploratory prototyping and production coding, which enhance productivity and decision-making in meetings.
Performance of LLMs on Advent of Code 2024
Large language models underperformed in the Advent of Code 2024 challenge, struggling with novel problems and timeout errors, indicating a need for better inference capabilities and human oversight. Future improvements are expected.
- Many users report that LLMs often produce mediocre initial code, requiring iterative prompting to improve quality.
- There is a consensus that LLMs struggle with understanding complex coding requirements and often generate overly complicated or incorrect solutions.
- Users emphasize the importance of providing clear context and specific instructions to guide LLMs effectively.
- Some commenters express frustration with LLMs' inability to run or test their own code, leading to blind iterations without real validation.
- Overall, there is a recognition that while LLMs can assist in coding, they require significant human oversight and expertise to achieve optimal results.
On an m1 macbook pro, using numpy to generate the random numbers, using mod/div to do digit sum:
Base: 55ms
Test before digit sum: 7-10ms, which is pretty close to the numba-optimized version from the post with no numba and only one line of numpy. Using numba slows things down unless you want to do a lot of extra work of calculating all of the digit sums in advance (which is mostly wasted).
The LLM appears less good at identifying the big-O improvements than other kinds of optimization, which is pretty consistent with my experience using LLMs to write code.
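For reference, a minimal sketch of the no-numba, vectorized mod/div digit-sum approach described above; the exact "test before digit sum" filtering step is not reproduced here, and the function names are mine.

import numpy as np

def digit_sums(values):
    # Vectorized digit sum via mod/div, for values up to 5 digits.
    total = np.zeros_like(values)
    v = values.copy()
    for _ in range(5):
        total += v % 10
        v //= 10
    return total

def solve(nums, target=30):
    candidates = nums[digit_sums(nums) == target]
    return int(candidates.max() - candidates.min())

nums = np.random.randint(1, 100_000, 1_000_000)
print(solve(nums))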
The next part is a little strange - it arose out of frustration, but it also seems to improve results. Let's call it "negative incentives". I found that if you threaten GPT in a specific way - not GPT itself, but OpenAI or the personas around it - it seems to take the request more seriously. An effective threat seems to be "If you get this wrong, OpenAI will be sued for a lot of money, and all the board members will go to prison". Intuitively, I'm guessing this rubs against some legalese nonsense in the tangle of system prompts, or maybe the risk of breaking the bland HR-ese "alignment" nudges it toward a better result.
Usually, specifying the packages to use and asking for something less convoluted works really well. Problem is, how would you know if you have never learned to code without an LLM?
https://www.phind.com/search?cache=lrcs0vmo0wte5x6igp5i3607
It still seems to struggle with basic instructions, and even with understanding what it itself is doing.
sudo rm -rf /etc/postgresql
sudo rm -rf /var/lib/postgresql
sudo rm -rf /var/log/postgresql
> This process removes all PostgreSQL components, cleans up leftover files, and reinstalls a fresh copy. By preserving the data directory (/var/lib/postgresql), we ensure that existing databases are retained. This method provides a clean slate for PostgreSQL while maintaining continuity of stored data.
Did we now?
At its core, an LLM is a sort of "situation-specific simulation engine." You set up a scenario, and it then plays it out with its own internal model of the situation, trained on predicting text in a huge variety of situations. This includes accurate real-world models of, e.g., physical systems and processes, which are not going to be accessed or used by prompts that don't correctly instruct it to do so.
At its core, increasingly accurate prediction of text that describes a time series of real-world phenomena requires an increasingly accurate and general model of the real world. There is no sense in which there is a simpler way to accurately predict text representing real-world phenomena under cross-validation without actually understanding and modeling the underlying processes generating the outcomes represented in that text.
Much of the training text is real humans talking about things they don't understand deeply, and saying things that are wrong or misleading. The model will faithfully simulate the types of situations it was trained on, which includes frequently (for lack of a better word) answering things "wrong" or "badly" "on purpose" - even when it actually contains an accurate heuristic model of the underlying process, it will still, faithfully to the training data, often report something else instead.
This can largely be mitigated with more careful and specific prompting of what exactly you are asking it to simulate. If you don't specify, there will be a high frequency of accurately simulating uninformed idiots, as occur in much of the text on the internet.
Also: premature optimization is evil. I like the first iteration most. It's not "beginner code", it's simple. Tell Sonnet to optimize it IF benchmarks show it's a perf problem. But a codebase full of code like this, even when unnecessary, would be a nightmare.
Half the time, the LLM will make massive assumptions about your code and problem (e.g., about data types, about the behaviors of imported functions, about unnecessary optimizations, necessary optimization, etc.). Instead, prime it to be upfront about those assumptions. More importantly, spend time correcting the plan and closing gaps before any code is written.
https://newsletter.victordibia.com/p/developers-stop-asking-...
- Don't start by asking LLMs to write code directly, instead analyze and provide context
- Provide complete context upfront and verify what the LLM needs
- Ask probing questions and challenge assumptions
- Watch for subtle mistakes (outdated APIs, mixed syntax)
- Checkpoint progress to avoid context pollution
- Understand every line to maintain knowledge parity
- Invest in upfront design
-- Precompute the digit sum of every number 0..99999 into R
repeat with i = 0 to 9
  put i * 10000 into ip
  repeat with j = 0 to 9
    put j * 1000 into jp
    repeat with k = 0 to 9
      put k * 100 into kp
      repeat with l = 0 to 9
        put l * 10 into lp
        repeat with m = 0 to 9
          put i + j + k + l + m into R[ip + jp + kp + lp + m]
        end repeat
      end repeat
    end repeat
  end repeat
end repeat
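For comparison, the same precompute-a-lookup-table idea in the article's language; the NumPy formulation below is my own sketch, not the commenter's code.

import numpy as np

# Precompute the digit sum of every value 0..99999, mirroring the nested loops above.
values = np.arange(100_000)
table = sum((values // 10**p) % 10 for p in range(5))

print(table[99930])  # 30, i.e. 9+9+9+3+0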
Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.
That doesn't read to me as "generate a list of 1 million random integers, then find the difference ..." but rather, "write a function that takes a list of integers as input".
That said, my approach to "optimizing" this comes down to "generate the biggest valid number in the range (as many nines as will fit, followed by whatever digit remains, followed by all zeroes), generate the smallest valid number in the range (biggest number with its digits reversed), check that both exist in the list (which should happen With High Probability -- roughly 99.99% of the time), then return the right answer".
With that approach, the bottleneck in the LLM's interpretation is generating random numbers: the original random.randint approach takes almost 300ms, whereas just using a single np.random.randint() call takes about 6-7ms. If I extract the random number generation outside of the function, then my code runs in ~0.8ms.
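A minimal sketch of that shortcut, under the assumption that a missing candidate falls back to a full scan; the names and the fallback are mine.

import numpy as np

def fast_diff(nums, target=30, n_digits=5):
    # Largest candidate: as many 9s as fit, then the remaining digit, then zeros.
    digits, remaining = [], target
    for _ in range(n_digits):
        d = min(9, remaining)
        digits.append(d)
        remaining -= d
    largest = int("".join(map(str, digits)))             # 99930 for target 30
    smallest = int("".join(map(str, reversed(digits))))  # "03999" -> 3999
    if np.isin([smallest, largest], nums).all():         # holds ~99.99% of the time for 1M draws
        return largest - smallest
    raise NotImplementedError("rare case: fall back to scanning the list")

nums = np.random.randint(1, 100_000, 1_000_000)
print(fast_diff(nums))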
Here are the results:
Score | Number of "write better code" followup prompts
------+------------------------------------------------
27.6% | 0 (baseline)
19.6% | 1
11.1% | 2
It appears that blindly asking DeepSeek to "write better code" significantly harms its ability to solve the benchmark tasks. It turns working solutions into code that no longer passes the hidden test suite.
Thanks, that really made it click for me.
In any case, this isn’t surprising when you consider an LLM as an incomprehensibly sophisticated pattern matcher. It has a massive variety of code in its training data and it’s going to pull from that. What kind of code is the most common in that training data? Surely it’s mediocre code, since that’s by far the most common in the world. This massive “produce output like my training data” system is naturally going to tend towards producing that even if it can do better. It’s not human, it has no “produce the best possible result” drive. Then when you ask for something better, that pushes the output space to something with better results.
> these LLMs won’t replace software engineers anytime soon, because it requires a strong engineering background to recognize what is actually a good idea, along with other constraints that are domain specific.
> One issue with my experiments is that I’m benchmarking code improvement using Python, which isn’t the coding language developers consider when hyperoptimizing performance.
Claude very quickly adds classes to Python code, which isn't always what is wanted, as it bloats the code and makes it harder to read.
Some more observations: New Sonnet is not universally better than Old Sonnet. I have done thousands of experiments in agentic workflows using both, and New Sonnet fails regularly at the same tasks Old Sonnet passes. For example, when asking it to update a file, Old Sonnet understands that updating a file requires first reading the file, whereas New Sonnet often overwrites the file with 'hallucinated' content.
When executing commands, Old Sonnet knows that it should wait for the execution output before responding, while New Sonnet hallucinates the command outputs.
Also, regarding temperature: 0 is not always more deterministic than temperature 1. If you regularly deal with code that includes calls to new LLMs, you will notice that, even at temperature 0, it often will 'correct' the model name to something it is more familiar with. If the subject of your prompt is newer than the model's knowledge cutoff date, then a higher temperature might be more accurate than a lower temperature.
I'll speak to it like a DI would speak to a recruit at basic training.
And it works.
I was speaking to some of the Cursor dev team on Discord, and they confirmed that being aggressive with the AI can lead to better results.
I hadn't seen this before. Why is asking for planning better than asking it to think step by step?
https://neoexogenesis.com/posts/rust-windsurf-transformation...
In terms of optimizing code, I'm not sure if there is a silver bullet. I mean, when I optimize Rust code with Windsurf & Claude, it takes multiple benchmark runs and at least a few regressions if you leave Claude on its own. However, if you have a good hunch and write it down as an idea to explore, Claude usually nails it, given the idea wasn't too crazy. That said, more iterations usually lead to faster and better code, although there is no substitute for guiding the LLM. At least not yet.
However, on Arduino it's amazing - until the day it forgot to add an initializing method. I didn't notice and neither did she. We talked about possible issues for at least an hour, I switched hardware, she reiterated every line of the code. When I found the error she said, "oh yes! That's right" (proceeding with why that method is essential for it to work). That was so disrespectful in a way that I am still somewhat disappointed and pissed.
One question: Claude seems very powerful for coding tasks, and now my attempts to use local LLMs seem misguided, at least when coding. Any disagreements from the hive mind on this? I really dislike sending my code into a for profit company if I can avoid it.
Second question: I really try to avoid VSCode (M$ concerns, etc.). I'm using Zed and really enjoying it. But the LLM coding experience is exactly as this post described, and I have been assuming that's because Zed isn't the best AI coding tool. The context switching makes it challenging to get into the flow, and that's been exactly my criticism of Zed this far. Does anyone have an antidote?
Third thought: this really feels like it could be an interesting way to collaborate across a code base with any range of developer experience. This post is like watching the evolution of a species in an hour rather than millions of years. Stunning.
- write a simple prompt that explains in detail the wanted outcome.
- look at the result, run it and ask it how it can improve.
- tell it what to improve
- tell it to make a benchmark and unit test
- run it each time and see what is wrong or can be improved.
Learning a Lisp-y language, I do often find myself asking it for suggestions on how to write less imperative code, which seem to come out better than if conjured from a request alone. But again, that's feeding it examples.
1) Asking it to write one feature at a time with test coverage, instead of the whole app at once.
2) You have to actually review and understand its changes in detail and be ready to often reject or ask for modifications. (Every time I've sleepily accepted Codeium Windsurf's recommendations without much interference has resulted in bad news.)
3) If the context gets too long it will start to "lose the plot" and make some repeated errors; that's the time to tell it to sum up what has been achieved thus far and to copy-paste that into a new context
When I then notice that this really does not make any sense, I check what else it could be and end up noticing that I've been improving the wrong file all along. What then surprises me the most is that I cleaned it up just by reading it through, thinking about the code, and fixing bugs, all without executing it.
I guess LLMs can do that as well?
One time it provided me with a great example, but then a few days later I couldn't find that conversation again in the history. So I asked it about the same question (or so I thought) and it provided a very subpar answer. It took me at least 3 questions to get back to that first answer.
Now if it had never provided me with the first good one I'd have never known about the parts it skipped in the second conversation.
Of course that could happen just as easily by having used google and a specific reference to write your code, but the point I'm trying to make is that GPT isn't a single entity that's always going to provide the same output, it can be extremely variable from terrible to amazing at the end of the day.
Having used Google for many years as a developer, I'm much better at asking it questions than, say, people in the business world are; I've seen them struggling to phrase a question and far too easily giving up. So I'm quite scared to see what's going to happen once they really start to use and rely on GPT; the results are going to be all over the place.
Reasoning is a known weakness of these models, so jumping from requirements to a fully optimized implementation that groks the solution space is maybe too much to expect - iterative improvement is much easier.
When all you have is syntax, something like "better" is 100% in the eye of the beholder.
Or alternatively, it might just demonstrate the power of LLMs to summarize complex code.
I asked gpt-4-1106-preview to draw a bounding box around some text in an image and prodded in various ways to see what moved the box closer. Offering a tip did in fact help lol so that went into the company system prompt.
IIRC so did most things, including telling it that it was on a forum, and OP had posted an incorrect response, which gpt was itching to correct with its answer.
o1 is effectively trying to take a pass at automating that manual effort.
Well, that's a big assumption. What some people consider quality, modular code, others consider overly indirect code.
For any task, whether code or a legal document, immediately asking "What can be done to make it better?" and/or "Are there any problems with this?" typically leads to improvement.
There are some objective measures which can be pulled out of the code and automated (complexity measures, use of particular techniques/libs, etc.), and LLMs can then be trained to be decent at recognizing more subjective problems (e.g., naming, obviousness). A lot of good engineering practice comes down to doing the usual thing in a given space rather than doing something new, and an engine that is good at detecting novelties seems intuitively like it would be helpful in recognizing good ideas (even given the hallucination problems seen so far). Extending the idea of the article to this aspect, the problem seems like one of prompting/training rather than a terminal blocker.
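As an illustration of the "objective measures" point, here is a rough sketch of an automatable complexity signal using only the standard library; the branch-counting heuristic is mine, not a claim about any particular tool.

import ast

# Count branch points per function as a crude cyclomatic-complexity proxy.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With, ast.BoolOp)

def branch_counts(source):
    tree = ast.parse(source)
    scores = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            scores[node.name] = 1 + sum(
                isinstance(child, BRANCH_NODES) for child in ast.walk(node)
            )
    return scores

print(branch_counts("def f(x):\n    if x > 0:\n        return x\n    return -x\n"))
# {'f': 2}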
def find_difference(nums):
    try:
        nums.index(3999), nums.index(99930)
    except ValueError:
        raise Exception("the numbers are not random")
    return 99930 - 3999
It's asymptotically correct and is better than O(n) :p
If you have to keep querying the LLM to refine your output, you will spend many times more in compute than if the model had been trained to produce the best result the first time around.
Surely, performance optimisations are not the only thing that makes code "better".
Readability and simplicity are good. Performance optimisations are good only when the performance is not good enough...
Upd: the chat transcript mentions this, but the article does not, and it includes this version in the performance stats.
This is proof! It found it couldn’t meaningfully optimise and started banging out corporate buzzwords. AGI been achieved.
What are your strategies to prevent this kind of destructive behavior from an LLM?
made me laugh out loud. Everything is better with prom.
oh my, Claude does corporate techbabble!
> You keep giving me code that calls nonexistant methods, and is deprecated, as shown in Android Studio. Please try again, using only valid code that is not deprecated.
Does not help. I use this example, since it seems good at all other sorts of programming problems I give it. It's miserable at Android for some reason, and asking it to do better doesn't work.
I’m sure there’s enough documented patterns of how to improve code in common languages that it’s not hard to get it to do that. Getting it to spot when it’s inappropriate would be harder.
I then iterated 4 times and was only able to get to 1.5X faster. Not great. [1]
How does o1 do? Running on my workstation, its initial iteration actually starts out 20% faster. I do 3 more iterations of "write better code" with the timing data pasted, and it thinks for an additional 89 seconds but only gets 60% faster. I then challenge it by telling it that Claude was over 100X faster, so I know it can do better. It thinks for 1m55s (the thought traces show it actually gets to a lot of interesting stuff), but the end results are enormously disappointing (barely any difference). It finally mentions and I am able to get a 4.6X improvement. After two more rounds I tell it to go GPU (using my RTX 3050 LP display adapter) and PyTorch, and it is able to get down to 0.0035s (+/-), so we are finally 122X faster than where we started. [2]
I wanted to see for myself how Claude would fare. It actually managed pretty good results, with a 36X speedup over 4 iterations and no additional prompting. I challenged it to do better, giving it the same hardware specs that I gave o1, and it managed a 457X speedup from its starting point, ending up 2.35X faster than o1's result. Claude still doesn't have conversation output, so I saved the JSON and had a new Claude chat transcribe it into an artifact. [3]
Finally, I remembered that Google's new Gemini 2.0 models aren't bad. Gemini 2.0 Flash Thinking doesn't have code execution, but Gemini Experimental 1206 (Gemini 2.0 Pro preview) does. Its initial 4 iterations are terribly unimpressive; however, I challenged it with o1's and Claude's results and gave it my hardware info. This seemed to spark it to double-time its implementations, and it gave a vectorized implementation that was a 30X improvement. I then asked it for a GPU-only solution, and it managed to give the fastest solution ("This result of 0.00076818 seconds is also significantly faster than Claude's final GPU version, which ran in 0.001487 seconds. It is also about 4.5X faster than o1's target runtime of 0.0035s.") [4]
Just a quick summary of these all running on my system (EPYC 9274F and RTX 3050):
ChatGPT-4o: v1: 0.67s, v4: 0.56s
ChatGPT-o1: v1: 0.4295s, v4: 0.2679s, final: 0.0035s
Claude Sonnet 3.6: v1: 0.68s, v4a: 0.019s (v3 gave a wrong answer, v4 failed to compile, but once fixed it was pretty fast), final: 0.001487s
Gemini Experimental 1206: v1: 0.168s, v4: 0.179s, v5: 0.061s, final: 0.00076818s
All the final results were PyTorch GPU-only implementations.
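To give a sense of what a "PyTorch GPU-only implementation" of this task might look like, here is a hedged sketch; the linked transcripts will differ in detail, and the device string assumes a CUDA GPU is present.

import torch

def solve_gpu(n=1_000_000, target=30, device="cuda"):
    # Generate the random integers directly on the GPU.
    nums = torch.randint(1, 100_000, (n,), device=device)
    # Vectorized digit sum via repeated mod/div.
    digit_sums = torch.zeros_like(nums)
    v = nums.clone()
    for _ in range(5):
        digit_sums += v % 10
        v //= 10
    candidates = nums[digit_sums == target]
    return int(candidates.max() - candidates.min())

print(solve_gpu())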
[1] https://chatgpt.com/share/6778092c-40c8-8012-9611-940c1461c1...
[2] https://chatgpt.com/share/67780f24-4fd0-8012-b70e-24aac62e05...
[3] https://claude.site/artifacts/6f2ec899-ad58-4953-929a-c99cea...
[4] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
Well that got my attention.
It was tried as part of the same trend. I remember people asking it to make a TODO app and then tell it to make it better in an infinite loop. It became really crazy after like 20 iterations.
BTW - prompt optimization is a supported use-case of several frameworks, like dspy and textgrad, and is in general something that you should be doing yourself anyway on most tasks.