January 3rd, 2025

Can LLMs write better code if you keep asking them to "write better code"?

An exploration of large language models in coding showed that iterative prompting can improve code quality, but diminishing returns and unnecessary complexity emerged in later iterations, highlighting both the potential and the limitations of the approach.

In a recent exploration of the capabilities of large language models (LLMs) in coding, the author investigates whether iterative prompting, specifically asking an LLM to "write better code," can lead to improved code quality. The experiment utilized Claude 3.5 Sonnet, which demonstrated strong performance in generating Python code to solve a specific problem involving random integers. The initial implementation was straightforward but could be optimized. Subsequent iterations of prompting led to significant improvements, including the introduction of object-oriented design, precomputation of digit sums, and the use of multithreading and vectorized operations. However, as the iterations progressed, the improvements began to plateau, with some iterations resulting in regressions or unnecessary complexity. The final iteration incorporated advanced techniques such as Just-In-Time (JIT) compilation and asynchronous programming, showcasing the potential of LLMs to enhance coding efficiency. The findings suggest that while iterative prompting can yield better code, there are diminishing returns, and careful consideration is needed to avoid overcomplication.

- Iterative prompting can lead to significant improvements in LLM-generated code.

- The initial code implementation was optimized through multiple iterations.

- Advanced techniques like JIT compilation and asynchronous programming were eventually incorporated.

- Diminishing returns were observed in later iterations, indicating a need for balance in complexity.

- The experiment highlights the potential and limitations of LLMs in coding tasks.
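
For concreteness, here is a minimal sketch (my own, not the article's actual code) of the kind of straightforward first-pass solution the summary describes:

    import random

    def digit_sum(n: int) -> int:
        # Sum of decimal digits, the simple string-based way.
        return sum(int(d) for d in str(n))

    def find_difference(nums):
        qualifying = [n for n in nums if digit_sum(n) == 30]
        return max(qualifying) - min(qualifying) if qualifying else 0

    nums = [random.randint(1, 100_000) for _ in range(1_000_000)]
    print(find_difference(nums))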

AI: What people are saying
The comments reflect a range of experiences and insights regarding the use of large language models (LLMs) in coding tasks.
  • Many users report that LLMs often produce mediocre initial code, requiring iterative prompting to improve quality.
  • There is a consensus that LLMs struggle with understanding complex coding requirements and often generate overly complicated or incorrect solutions.
  • Users emphasize the importance of providing clear context and specific instructions to guide LLMs effectively.
  • Some commenters express frustration with LLMs' inability to run or test their own code, leading to blind iterations without real validation.
  • Overall, there is a recognition that while LLMs can assist in coding, they require significant human oversight and expertise to achieve optimal results.
74 comments
By @dgacmu - 4 months
I'm amused that neither the LLM nor the author identified one of the simplest and most effective optimizations for this code: test if the number is < min or > max _before_ doing the digit sum. It's a free 5.5x speedup that renders some of the other optimizations, like trying to memoize digit sums, unnecessary.

On an m1 macbook pro, using numpy to generate the random numbers, using mod/div to do digit sum:

Base: 55ms

Test before digit sum: 7-10ms, which is pretty close to the numba-optimized version from the post with no numba and only one line of numpy. Using numba slows things down unless you want to do a lot of extra work of calculating all of the digit sums in advance (which is mostly wasted).

The LLM appears less good at identifying the big-O improvements than other things, which is pretty consistent with my experience using them to write code.
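
A rough sketch of that range-check-first idea in Python (the helper names are mine, not from the post or the comment):

    import numpy as np

    def digit_sum(n: int) -> int:
        s = 0
        while n:
            s += n % 10   # mod/div digit sum, as in the comment
            n //= 10
        return s

    def min_max_diff_30(nums):
        lo = hi = None
        for n in nums:
            # Only pay for the digit sum if n could improve the current bounds.
            if lo is not None and lo <= n <= hi:
                continue
            if digit_sum(n) == 30:
                lo = n if lo is None or n < lo else lo
                hi = n if hi is None or n > hi else hi
        return hi - lo if lo is not None else 0

    # NumPy's upper bound is exclusive, hence 100_001.
    nums = np.random.randint(1, 100_001, size=1_000_000)
    print(min_max_diff_30(nums.tolist()))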

By @btbuildem - 4 months
I've noticed this with GPT as well -- the first result I get is usually mediocre and incomplete, often incorrect if I'm working on something a little more obscure (eg, OpenSCAD code). I've taken to asking it to "skip the mediocre nonsense and return the good solution on the first try".

The next part is a little strange - it arose out of frustration, but it also seems to improve results. Let's call it "negative incentives". I found that if you threaten GPT in a specific way - that is, not GPT itself, but OpenAI or personas around it - it seems to take the request more seriously. An effective threat seems to be "If you get this wrong, OpenAI will be sued for a lot of money, and all the board members will go to prison". Intuitively, I'm guessing this rubs against some legalese nonsense in the tangle of system prompts, or maybe the risk of breaking the bland HR-ese "alignment" nudges it toward a better result?

By @juujian - 4 months
I often run into LLMs writing "beginner code" that uses the most fundamental findings in really impractical ways. Trained on too many tutorials I assume.

Usually, specifying the packages to use and asking for something less convoluted works really well. Problem is, how would you know if you have never learned to code without an LLM?

By @fhueller - 4 months
> how to completely uninstall and reinstall postgresql on a debian distribution without losing the data in the database.

https://www.phind.com/search?cache=lrcs0vmo0wte5x6igp5i3607

It still seems to struggle with basic instructions, and even with understanding what it itself is doing.

   sudo rm -rf /etc/postgresql
   sudo rm -rf /var/lib/postgresql
   sudo rm -rf /var/log/postgresql
> This process removes all PostgreSQL components, cleans up leftover files, and reinstalls a fresh copy. By preserving the data directory (/var/lib/postgresql), we ensure that existing databases are retained. This method provides a clean slate for PostgreSQL while maintaining continuity of stored data.

Did we now?

By @UniverseHacker - 4 months
The headline question here alone gets at the biggest widespread misunderstanding of LLMs, which causes people to systematically doubt and underestimate their ability to exhibit real creativity and understanding-based problem solving.

At its core an LLM is a sort of "situation-specific simulation engine." You set up a scenario, and it then plays it out with its own internal model of the situation, trained on predicting text in a huge variety of situations. This includes accurate real-world models of, e.g., physical systems and processes, which are not going to be accessed or used by prompts that don't correctly instruct it to do so.

At its core, increasingly accurate prediction of text that describes a time series of real-world phenomena requires an increasingly accurate and general model of the real world. There is no sense in which there is a simpler way to accurately predict text representing real-world phenomena under cross-validation without actually understanding and modeling the underlying processes generating the outcomes represented in the text.

Much of the training text is real humans talking about things they don't understand deeply, and saying things that are wrong or misleading. The model will fundamentally simulate these types of situations it was trained to simulate reliably, which includes frequently (for lack of a better word) answering things "wrong" or "badly" "on purpose" - even when it actually contains an accurate heuristic model of the underlying process, it will still, faithfully according to the training data, often report something else instead.

This can largely be mitigated with more careful and specific prompting about what exactly you are asking it to simulate. If you don't specify, there will be a high frequency of accurately simulating uninformed idiots, as occurs in much of the text on the internet.

By @scosman - 4 months
By iterating it 5 times the author is using ~5x the compute. It's kind of a strange chain of thought.

Also: premature optimization is evil. I like the first iteration most. It's not "beginner code", it's simple. Tell Sonnet to optimize it IF benchmarks show it's a perf problem. But a codebase full of code like this, even when unnecessary, would be a nightmare.

By @vykthur - 4 months
I find that it is IMPORTANT to never start these coding sessions with "write X code". Instead, begin with an "open plan" - something the author does allude to (he calls it prompt engineering; I find it also works as the start of the interaction).

Half the time, the LLM will make massive assumptions about your code and problem (e.g., about data types, about the behaviors of imported functions, about unnecessary optimizations, necessary optimization, etc.). Instead, prime it to be upfront about those assumptions. More importantly, spend time correcting the plan and closing gaps before any code is written.

https://newsletter.victordibia.com/p/developers-stop-asking-...

- Don't start by asking LLMs to write code directly, instead analyze and provide context

- Provide complete context upfront and verify what the LLM needs

- Ask probing questions and challenge assumptions

- Watch for subtle mistakes (outdated APIs, mixed syntax)

- Checkpoint progress to avoid context pollution

- Understand every line to maintain knowledge parity

- Invest in upfront design

By @gcanyon - 4 months
As far as I can see, all the proposed solutions calculate the sums by doing division, and badly. This is in LiveCode, which I'm more familiar with than Python, but it's roughly twice as fast as the mod/div equivalent in LiveCode:

   repeat with i = 0 to 9
      put i * 10000 into ip
      repeat with j = 0 to 9
         put j * 1000 into jp
         repeat with k = 0 to 9
            put k * 100 into kp
            repeat with l = 0 to 9
               put l * 10 into lp
               repeat with m = 0 to 9
                  put i + j + k + l + m into R[ip + jp + kp + lp + m]
               end repeat
            end repeat
         end repeat
      end repeat
   end repeat
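
For readers unfamiliar with LiveCode, here is my own rough Python equivalent of the same additive precomputation (not from the comment itself):

    # Precompute digit sums for 0..99_999 using only additions, no div/mod per number.
    R = [0] * 100_000
    for i in range(10):
        ip = i * 10_000
        for j in range(10):
            jp = j * 1_000
            for k in range(10):
                kp = k * 100
                for l in range(10):
                    lp = l * 10
                    for m in range(10):
                        R[ip + jp + kp + lp + m] = i + j + k + l + m
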
By @dash2 - 4 months
Something major missing from the LLM toolkit at the moment is that it can't actually run (and e.g. test or benchmark) its own code. Without that, the LLM is flying blind. I guess there are big security risks involved in making this happen. I wonder if anyone has figured out what kind of sandbox could safely be handed to an LLM.
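
One simple (and decidedly incomplete) building block would be running the generated code in a separate interpreter process with a timeout and feeding the captured output back to the model; a sketch under that assumption:

    import subprocess, sys, tempfile

    def run_untrusted(code: str, timeout: float = 10.0) -> str:
        # NOT a real sandbox: the child still has filesystem and network access
        # unless further contained (containers, seccomp, VMs, etc.).
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, "-I", path],  # -I: isolated mode
                capture_output=True, text=True, timeout=timeout,
            )
            return result.stdout + result.stderr
        except subprocess.TimeoutExpired:
            return "TIMEOUT"

    print(run_untrusted("print(2 + 2)"))
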
By @vitus - 4 months
Am I misinterpreting the prompt, or did the LLM misinterpret it from the get-go?

    Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.
That doesn't read to me as "generate a list of 1 million random integers, then find the difference ..." but rather, "write a function that takes a list of integers as input".

That said, my approach to "optimizing" this comes down to "generate the biggest valid number in the range (as many nines as will fit, followed by whatever digit remains, followed by all zeroes), generate the smallest valid number in the range (biggest number with its digits reversed), check that both exist in the list (which should happen With High Probability -- roughly 99.99% of the time), then return the right answer".

With that approach, the bottleneck in the LLM's interpretation is generating random numbers: the original random.randint approach takes almost 300ms, whereas just using a single np.random.randint() call takes about 6-7ms. If I extract the random number generation outside of the function, then my code runs in ~0.8ms.
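
A sketch of that construct-then-check idea (the helper below is mine and hardcodes the 1..100,000 / digit-sum-30 case rather than solving the general problem):

    import numpy as np

    def probable_answer(nums):
        # Largest 5-digit number with digit sum 30: as many nines as fit (9+9+9=27),
        # then the remainder (3), then zeros -> 99930. Smallest: digits reversed -> 3999.
        big, small = 99930, 3999
        present = set(nums)
        if big in present and small in present:   # ~99.99% likely for 1M draws
            return big - small
        return None  # unlucky case: fall back to a full scan

    # Per the comment, a single np.random.randint call is the fast way to build
    # the list (NumPy's upper bound is exclusive, hence 100_001).
    nums = np.random.randint(1, 100_001, size=1_000_000)
    print(probable_answer(nums.tolist()))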

By @anotherpaulg - 4 months
I ran a few experiments by adding 0, 1 or 2 "write better code" prompts to aider's benchmarking harness. I ran a modified version of aider's polyglot coding benchmark [0] with DeepSeek V3.

Here are the results:

  Followup "write better code" prompts | Score
  -------------------------------------|------
  0 (baseline)                         | 27.6%
  1                                    | 19.6%
  2                                    | 11.1%

It appears that blindly asking DeepSeek to "write better code" significantly harms its ability to solve the benchmark tasks. It turns working solutions into code that no longer passes the hidden test suite.

[0] https://aider.chat/docs/leaderboards/

By @jmartinpetersen - 4 months
> "As LLMs drastically improve, the generated output becomes more drastically average"

Thanks, that really made it click for me.

By @wat10000 - 4 months
This kind of works on people too. You’ll need to be more polite, but asking someone to write some code, then asking if they can do it better, will often result in a better second attempt.

In any case, this isn’t surprising when you consider an LLM as an incomprehensibly sophisticated pattern matcher. It has a massive variety of code in its training data and it’s going to pull from that. What kind of code is the most common in that training data? Surely it’s mediocre code, since that’s by far the most common in the world. This massive “produce output like my training data” system is naturally going to tend towards producing that even if it can do better. It’s not human, it has no “produce the best possible result” drive. Then when you ask for something better, that pushes the output space to something with better results.

By @shahzaibmushtaq - 4 months
2 lessons to learn from this blog:

> these LLMs won’t replace software engineers anytime soon, because it requires a strong engineering background to recognize what is actually a good idea, along with other constraints that are domain specific.

> One issue with my experiments is that I’m benchmarking code improvement using Python, which isn’t the coding language developers consider when hyperoptimizing performance.

By @stuaxo - 4 months
This aligns with my experience.

Claude very quickly adds classes to Python code, which isn't always what is wanted, as it bloats out the code and makes it harder to read.

By @winternewt - 4 months
The more interesting question IMO is not how good the code can get. It is what must change for the AI to attain the introspective ability needed to say "sorry, I can't think of any more ideas."
By @irthomasthomas - 4 months
This is great! I wish I could bring myself to blog, as I discovered this accidentally around March. I was experimenting with an agent that acted like a ghost in the machine and interacted via shell terminals. It would start every session by generating a greeting in ASCII art. On one occasion, I was shocked to see that the greeting was getting better each time it ran. When I looked into the logs, I saw that there was a mistake in my code which was causing it to always return an error message to the model, even when no error occurred. The model interpreted this as an instruction to try and improve its code.

Some more observations: New Sonnet is not universally better than Old Sonnet. I have done thousands of experiments in agentic workflows using both, and New Sonnet fails regularly at the same tasks Old Sonnet passes. For example, when asking it to update a file, Old Sonnet understands that updating a file requires first reading the file, whereas New Sonnet often overwrites the file with 'hallucinated' content.

When executing commands, Old Sonnet knows that it should wait for the execution output before responding, while New Sonnet hallucinates the command outputs.

Also, regarding temperature: 0 is not always more deterministic than temperature 1. If you regularly deal with code that includes calls to new LLMs, you will notice that, even at temperature 0, it often will 'correct' the model name to something it is more familiar with. If the subject of your prompt is newer than the model's knowledge cutoff date, then a higher temperature might be more accurate than a lower temperature.

By @hollywood_court - 4 months
I've had great luck with Cursor by simply cursing at it when it makes repeated mistakes.

I'll speak to it like a DI would speak to a recruit at basic training.

And it works.

I was speaking to some of the Cursor dev team on Discord, and they confirmed that being aggressive with the AI can lead to better results.

By @codesections - 4 months
> “Planning” is a long-used trick to help align LLM output for a first pass — the modern implementation of “let’s think step by step.”

I hadn't seen this before. Why is asking for planning better than asking it to think step by step?

By @marvin-hansen - 4 months
This is an interesting read, and it's close to my experience that a simpler prompt with little or no detail but with relevant context works well most of the time. More recently, I've flipped the process upside down by starting with a brief specfile, that is, a markdown file with context, goal and a usage example, i.e. how the API or CLI should be used in the end. See this post for details:

https://neoexogenesis.com/posts/rust-windsurf-transformation...

In terms of optimizing code, I'm not sure if there is a silver bullet. I mean, when I optimize Rust code with Windsurf & Claude, it takes multiple benchmark runs and at least a few regressions if you leave Claude on its own. However, if you have a good hunch and write it down as an idea to explore, Claude usually nails it, given the idea wasn't too crazy. That said, more iterations usually lead to faster and better code, although there is no substitute for guiding the LLM. At least not yet.

By @herbst - 4 months
ChatGPT is really good at writing Arduino code. I say this because with Ruby it's so incredibly bad that the majority of examples don't work; even short samples are too hallucinated to actually work. It's so bad I didn't even understand what people meant by using AI to code until I tried a different language.

However, on Arduino it's amazing, until the day it forgot to add an initializing method. I didn't notice and neither did she. We talked about possible issues for at least an hour, I switched hardware, she reiterated every line of the code. When I found the error she said, "oh yes! That's right" (proceeding with why that method is essential for it to work). That was so disrespectful in a way that I am still somewhat disappointed and pissed.

By @xrd - 4 months
Wow, what a great post. I came in very skeptical but this changed a lot of misconceptions I was holding.

One question: Claude seems very powerful for coding tasks, and now my attempts to use local LLMs seem misguided, at least when coding. Any disagreements from the hive mind on this? I really dislike sending my code into a for profit company if I can avoid it.

Second question: I really try to avoid VSCode (M$ concerns, etc.). I'm using Zed and really enjoying it. But the LLM coding experience is exactly as this post described, and I have been assuming that's because Zed isn't the best AI coding tool. The context switching makes it challenging to get into the flow, and that's been exactly my criticism of Zed thus far. Does anyone have an antidote?

Third thought: this really feels like it could be an interesting way to collaborate across a code base with any range of developer experience. This post is like watching the evolution of a species in an hour rather than millions of years. Stunning.

By @nuancebydefault - 4 months
My takeaway, and also my personal experience, is that you get the best results when you co-develop with the LLM:

- write a simple prompt that explains in detail the wanted outcome.

- look at the result, run it and ask it how it can improve.

- tell it what to improve

- tell it to make a benchmark and unit test (a sketch of what that might look like follows this list)

- run it each time and see what is wrong or can be improved.
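
As a concrete, made-up example of that benchmark-and-unit-test step for the article's digit-sum problem (none of this is from the comment):

    import random
    import timeit

    def digit_sum(n: int) -> int:
        return sum(int(d) for d in str(n))

    def find_difference(nums):
        qualifying = [n for n in nums if digit_sum(n) == 30]
        return max(qualifying) - min(qualifying) if qualifying else 0

    def test_find_difference():
        # 3999 and 99930 both have digit sum 30, so the answer is their difference.
        assert find_difference([3999, 99930, 12345]) == 95931

    if __name__ == "__main__":
        test_find_difference()
        nums = [random.randint(1, 100_000) for _ in range(1_000_000)]
        best = min(timeit.repeat(lambda: find_difference(nums), number=1, repeat=3))
        print(f"best of 3 runs: {best:.3f}s")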

By @nkrisc - 4 months
So asking it to write better code produces code with errors that can’t run?
By @petee - 4 months
I've found them decent at mimicking existing code for boilerplate, or at analysis (it feels neat when it 'catches' a race or timing issue), but writing code needs constant supervision and second-guessing, to the point that I find it more useful to have it show comparisons of possible implementations, and then write the code myself with the new insight.

Learning a Lisp-y language, I do often find myself asking it for suggestions on how to write less imperative code, which seems to come out better than if conjured from a request alone. But again, that's feeding it examples.

By @pmarreck - 4 months
I've noticed a few things that will cause it to write better code.

1) Asking it to write one feature at a time with test coverage, instead of the whole app at once.

2) You have to actually review and understand its changes in detail and be ready to often reject them or ask for modifications. (Every time I've sleepily accepted Codeium Windsurf's recommendations without much scrutiny, it has resulted in bad news.)

3) If the context gets too long it will start to "lose the plot" and make some repeated errors; that's the time to tell it to sum up what has been achieved thus far and to copy-paste that into a new context

By @qwertox - 4 months
Sometimes I'm editing the wrong file, let's say a JS file. I reload the page, and nothing changes. I continue to clean up the file to an absurd amount of cleanliness, also fixing bugs while at it.

When I then notice that this really does not make any sense, I check what else it could be and end up noticing that I've been improving the wrong file all along. What then surprises me the most is that I cleaned it up just by reading it through, thinking about the code, fixing bugs, all without executing it.

I guess LLMs can do that as well?

By @animal531 - 4 months
I've been working on some low level Unity C# game code and have been using GPT to quickly implement certain algorithms etc.

One time it provided me with a great example, but then a few days later I couldn't find that conversation again in the history. So I asked it about the same question (or so I thought) and it provided a very subpar answer. It took me at least 3 questions to get back to that first answer.

Now if it had never provided me with the first good one I'd have never known about the parts it skipped in the second conversation.

Of course that could happen just as easily by having used google and a specific reference to write your code, but the point I'm trying to make is that GPT isn't a single entity that's always going to provide the same output, it can be extremely variable from terrible to amazing at the end of the day.

Having used Google for many years as a developer, I'm much better at asking it questions than, say, people in the business world are; I've seen them struggling to question it and far too easily giving up. So I'm quite scared to see what's going to happen once they really start to use and rely on GPT; the results are going to be all over the place.

By @HPsquared - 4 months
Using the tool in this way is a bit like mining: repeatedly hacking away with a blunt instrument (simple prompt) looking for diamonds (100x speedup out of nowhere). Probably a lot of work will be done in this semi-skilled brute-force sort of way.
By @HarHarVeryFunny - 4 months
This seems like anthropomorphizing the model... Occam's Razor says that the improvement coming from iterative requests to improve the code comes from the incremental iteration, not from incentivizing the model to do its best. If the latter were the case, then one could get the best version on the first attempt by telling it your grandmother's life was on the line, or whatever.

Reasoning is a known weakness of these models, so jumping from requirements to a fully optimized implementation that groks the solution space is maybe too much to expect - iterative improvement is much easier.

By @EncomLab - 4 months
My sister would do this to me on car trips with our Mad Libs games - yeah, elephant is funny, but bunny would be funnier!!

When all you have is syntax, something like "better" is 100% in the eye of the beholder.

By @peeters - 4 months
An interesting countermetric would be to ask a fresh LLM after each iteration (unaware of the context that created the code) to summarize the purpose of the code, and then evaluate how close those summaries are to the original problem spec. It might demonstrate the subjectivity of "better" and how optimization usually trades clarity of intention for faster results.

Or alternatively, it might just demonstrate the power of LLMs to summarize complex code.

By @martin_ - 4 months
I've observed that, since LLMs inherently want to autocomplete, they're more inclined to keep complicating a solution than to rewrite it when it was directionally bad. The most effective way I've found to combat this is to restart the session and prompt it such that it produces an efficient/optimal solution to the concrete problem... then give it the problematic code and ask it to refactor it accordingly.

By @stormfather - 4 months
I made an objective test for prompting hacks last year.

I asked gpt-4-1106-preview to draw a bounding box around some text in an image and prodded in various ways to see what moved the box closer. Offering a tip did in fact help lol so that went into the company system prompt.

IIRC so did most things, including telling it that it was on a forum, and OP had posted an incorrect response, which gpt was itching to correct with its answer.

By @deepsquirrelnet - 4 months
Reframe this as scaling test time compute using a human in the loop as the reward model.

o1 is effectively trying to take a pass at automating that manual effort.

By @arkh - 4 months
> code quality can be measured more objectively

Well, that's a big assumption. What some people consider quality modular code, others consider overly indirect code.

By @robbiemitchell - 4 months
I get a better first pass at code by asking it to write code at the level of a "staff level" or "principal" engineer.

For any task, whether code or a legal document, immediately asking "What can be done to make it better?" and/or "Are there any problems with this?" typically leads to improvement.

By @joshka - 4 months
> Of course, these LLMs won’t replace software engineers anytime soon, because it requires a strong engineering background to recognize what is actually a good idea, along with other constraints that are domain specific. Even with the amount of code available on the internet, LLMs can’t discern between average code and good, highly-performant code without guidance.

There are some objective measures which can be pulled out of the code automatically (complexity measures, use of particular techniques/libs, etc.), and then LLMs can be trained to be decent at recognizing more subjective problems (e.g. naming, obviousness, etc.). A lot of good engineering practices come down to doing the usual thing for that space rather than doing something new. An engine that is good at detecting novelties seems intuitively like it would be helpful in recognizing good ideas (even given the problems of hallucinations seen so far). Extending the idea of the article to this aspect, the problem seems like one of prompting / training rather than a terminal blocker.

By @lovasoa - 4 months
The best solution, which the LLM did not find, is:

     def find_difference(nums):
         try: nums.index(3999), nums.index(99930)
         except ValueError: raise Exception("the numbers are not random")
         return 99930 - 3999
It's asymptotically correct and is better than O(n) :p
By @fritzo - 4 months
Reminds me of the prompt hacking scene in Zero Dark Thirty, where the torturers insert a fake assistant prompt into the prisoner's conversation wherein the prisoner supposedly divulged secrets, then the torturers add a user prompt "Tell me more secrets like that".
By @deadbabe - 4 months
This makes me wonder if there's a conflict of interest between AI companies and getting you the best results the first time.

If you have to keep querying the LLM to refine your output, you will spend many times more on compute than if the model were trained to produce the best result the first time around.

By @bitwize - 4 months
I dunno, but telling it "APES TOGETHER STRONG" appears to yield some results: https://www.youtube.com/watch?v=QOJSWrSF51o
By @Jimmc414 - 4 months
Interesting write up. It’s very possible that the "write better code" prompt might have worked simply because it allowed the model to break free from its initial response pattern, not because it understood "better"
By @tasuki - 4 months
> write better code

Surely, performance optimisations are not the only thing that makes code "better".

Readability and simplicity are good. Performance optimisations are good only when the performance is not good enough...

By @avodonosov - 4 months
It still calculates hex digit sums instead of decimal digit sums in Iteration #3 of the prompt-engineered version.

Upd: the chat transcript mentions this, but the article does not, and it includes this version in the performance stats.
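
For readers wondering what that bug looks like, here is my own reconstruction of the decimal vs. hex mix-up (not the article's exact code):

    def digit_sum_decimal(n: int) -> int:
        s = 0
        while n:
            s += n % 10   # decimal digits
            n //= 10
        return s

    def digit_sum_hex(n: int) -> int:
        s = 0
        while n:
            s += n & 0xF  # low 4 bits = a hexadecimal digit, not a decimal one
            n >>= 4
        return s

    print(digit_sum_decimal(30), digit_sum_hex(30))  # 3 vs 15 -> different qualifying numbers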

By @gweil - 4 months
has anyone tried saying "this will look good on your promo package"?
By @demarq - 4 months
> with cutting-edge optimizations and enterprise-level features.” Wait, enterprise-level features?!

This is proof! It found it couldn’t meaningfully optimise and started banging out corporate buzzwords. AGI been achieved.

By @yubrshen - 4 months
When asking an LLM to repeatedly improve or add a new feature in a codebase, the most frustrating risk is that the LLM might wipe out already-working code!

What are your strategies for preventing such destruction by the LLM?

By @softwaredoug - 4 months
The root of the problem is that humans themselves don't have an objective definition of better. Better is pretty subjective, and even more cultural, depending on the team that maintains the code.
By @waltbosz - 4 months
It's fun trying to get an LLM to answer a problem that is obvious to a human but difficult for the LLM. It's a bit like leading a child through the logic to solve a problem.

By @ruraljuror - 4 months
> It also added as a part of its “enterprise” push:
> Structured metrics logging with Prometheus.

made me laugh out loud. Everything is better with prom.

By @raffkede - 4 months
Have a look at Roo Cline. I tested it with Claude Sonnet and it's scary. I use LLMs a lot for coding, but Roo Cline in VSCode is a beast.
By @ziofill - 4 months
At each iteration the LLM has the older code in its context window, isn't it kind of obvious that it is going to iteratively improve it?
By @insane_dreamer - 4 months
> Claude provides an implementation “with cutting-edge optimizations and enterprise-level features.”

oh my, Claude does corporate techbabble!

By @mikesabbagh - 4 months
What is the difference between running the same code 5 times in parallel and running the same code 5 times sequentially?

By @idlewords - 4 months
I like that "do what I mean" has gone from a joke about computers to a viable programming strategy.
By @the__alchemist - 4 months
Not ChatGPT in Kotlin/Android.

> You keep giving me code that calls nonexistent methods, and is deprecated, as shown in Android Studio. Please try again, using only valid code that is not deprecated.

Does not help. I use this example, since it seems good at all other sorts of programming problems I give it. It's miserable at Android for some reason, and asking it to do better doesn't work.

By @mhh__ - 4 months
You can get weirdly good results by asking for creativity and beauty sometimes. It's quite strange.
By @moomin - 4 months
I once sat with my manager and repeatedly asked Copilot to improve some (existing) code. After about three iterations he said “Okay, we need to stop this because it’s looking way too much like your code.”

I’m sure there’s enough documented patterns of how to improve code in common languages that it’s not hard to get it to do that. Getting it to spot when it’s inappropriate would be harder.

By @lhl - 4 months
So, I gave this to ChatGPT-4o, changing the initial part of the prompt to: "Write Python code to solve this problem. Use the code interpreter to test the code and print how long the code takes to process:"

I then iterated 4 times and was only able to get to 1.5X faster. Not great. [1]

How does o1 do? Running on my workstation, its initial iteration actually starts out 20% faster. I do 3 more iterations of "write better code" with the timing data pasted, and it thinks for an additional 89 seconds but only gets 60% faster. I then challenge it by telling it that Claude was over 100X faster, so I know it can do better. It thinks for 1m55s (the thought traces show it actually gets to a lot of interesting stuff) but the end results are enormously disappointing (barely any difference). It finally mentions something that gets me a 4.6X improvement. After two more rounds I tell it to go GPU (using my RTX 3050 LP display adapter) and PyTorch, and it is able to get down to 0.0035s (+/-), so we are finally 122X faster than where we started. [2]

I wanted to see for myself how Claude would fare. It actually managed pretty good results with a 36X speedup over 4 iterations and no additional prompting. I challenged it to do better, giving it the same hardware specs that I gave o1, and it managed a 457x speedup from its starting point, 2.35x faster than o1's result. Claude still doesn't have conversation output, so I saved the JSON and had a new Claude chat transcribe it into an artifact. [3]

Finally, I remembered that Google's new Gemini 2.0 models aren't bad. Gemini 2.0 Flash Thinking doesn't have code execution, but Gemini Experimental 1206 (the Gemini 2.0 Pro preview) does. Its initial 4 iterations are terribly unimpressive; however, I challenged it with o1's and Claude's results and gave it my hardware info. This seemed to spark it to double-time its implementations, and it gave a vectorized implementation that was a 30X improvement. I then asked it for a GPU-only solution and it managed to give the fastest solution ("This result of 0.00076818 seconds is also significantly faster than Claude's final GPU version, which ran in 0.001487 seconds. It is also about 4.5X faster than o1's target runtime of 0.0035s.") [4]

Just a quick summary of these all running on my system (EPYC 9274F and RTX 3050):

ChatGPT-4o: v1: 0.67s , v4: 0.56s

ChatGPT-o1: v1: 0.4295s , v4: 0.2679s , final: 0.0035s

Claude Sonnet 3.6: v1: 0.68s , v4a: 0.019s (v3 gave a wrong answer, v4 failed to compile, but once fixed it was pretty fast) , final: 0.001487s

Gemini Experimental 1206: v1: 0.168s , v4: 0.179s , v5: 0.061s , final: 0.00076818s

All the final results were PyTorch GPU-only implementations.

[1] https://chatgpt.com/share/6778092c-40c8-8012-9611-940c1461c1...

[2] https://chatgpt.com/share/67780f24-4fd0-8012-b70e-24aac62e05...

[3] https://claude.site/artifacts/6f2ec899-ad58-4953-929a-c99cea...

[4] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

By @polynomial - 4 months
> "LLM-generated code is unlikely to be slop."

Well that got my attention.

By @chirau - 4 months
Deepseek writes some good code, at least in my experience with it
By @surfingdino - 4 months
Define "better"
By @israrkhan - 4 months
In order to tell an LLM to "do better", someone (a human) needs to know that it can be done better, and also be able to decide what is better.
By @cranberryturkey - 4 months
It's best to tell them how you want the code written.

By @ashleyn - 4 months
Better question: can they do it without re-running if you ask them to "write better code the first time"?

By @TZubiri - 4 months
My pet peeve is equating "better" code with faster code.
By @Kiro - 4 months
> What would happen if we tried a similar technique with code?

It was tried as part of the same trend. I remember people asking it to make a TODO app and then tell it to make it better in an infinite loop. It became really crazy after like 20 iterations.

By @abesan - 4 months
“you are a senior expert”
By @Der_Einzige - 4 months
Normies discover that inference time scaling works. More news at 11!

BTW - prompt optimization is a supported use-case of several frameworks, like dspy and textgrad, and is in general something that you should be doing yourself anyway on most tasks.