OpenAI Researchers Find That AI Is Unable to Solve Most Coding Problems
OpenAI's research indicates that advanced AI models struggle with coding tasks, failing to identify deeper bugs and producing mostly incorrect solutions, highlighting their unreliability compared to human coders.
OpenAI researchers have revealed that even the most advanced AI models struggle significantly with coding tasks, failing to solve the majority of them. In a recent study utilizing a benchmark called SWE-Lancer, which includes over 1,400 software engineering tasks from Upwork, three large language models (LLMs) were tested: OpenAI's o1 reasoning model, GPT-4o, and Anthropic's Claude 3.5 Sonnet. The evaluation focused on two task types: individual tasks involving bug resolution and management tasks requiring higher-level decision-making. Despite their speed, the models were only able to address surface-level issues and could not identify bugs in larger projects or understand their context, leading to incorrect or incomplete solutions. Claude 3.5 Sonnet outperformed the OpenAI models but still produced mostly wrong answers. The findings indicate that while LLMs have made significant advancements, they are not yet reliable enough to replace human coders in real-world scenarios. This raises concerns as some companies consider replacing human engineers with these immature AI systems.
- OpenAI's research shows advanced AI models struggle with coding tasks.
- The SWE-Lancer benchmark tested LLMs on over 1,400 software engineering tasks.
- AI models can resolve surface-level issues but fail to identify deeper bugs.
- Claude 3.5 Sonnet performed better than OpenAI's models but still had a high error rate.
- Current AI capabilities are insufficient to replace human coders in practical applications.
Related
Reasoning skills of large language models are often overestimated
Large language models like GPT-4 rely heavily on memorization over reasoning, excelling in common tasks but struggling in novel scenarios. MIT CSAIL research emphasizes the need to enhance adaptability and decision-making processes.
LLMs still can't reason like humans
Recent discussions reveal that large language models (LLMs) struggle with basic reasoning tasks, scoring significantly lower than humans. A project called "Simple Bench" aims to quantify these shortcomings in LLM performance.
LLMs don't do formal reasoning
A study by Apple researchers reveals that large language models struggle with formal reasoning, relying on pattern matching. They suggest neurosymbolic AI may enhance reasoning capabilities, as current models are limited.
Apple study proves LLM-based AI models are flawed because they cannot reason
Apple's study reveals significant reasoning shortcomings in large language models from Meta and OpenAI, introducing the GSM-Symbolic benchmark and highlighting issues with accuracy due to minor query changes and irrelevant context.
Apple Study Reveals Critical Flaws in AI's Logical Reasoning Abilities
Apple's study reveals significant flaws in large language models' logical reasoning, showing they rely on pattern matching. Minor input changes lead to inconsistent answers, suggesting a need for neurosymbolic AI integration.
The first couple back and forths went ok but it quickly gave me some SQL that was invalid. I sent back the exact error and line number and it responded by changing all of the aliases but repeated the same logical error. I tried again and this time it rewrote more of the code, but still used the exact same invalid operation.
At that point I just went ahead and read some docs and other resources and solved things the traditional way.
Given all of the hype around LLMs I'm honestly surprised to see top models still failing in such basic and straightforward ways. I keep trying to use LLMs in my regular work so that I'm not missing out on something potentially great but I still haven't hit a point where they're all that useful.
Maybe this is because of explicitness in my prompts and preempting edge cases. Maybe it's because I know exactly what should be done. In these cases, I will still sometimes be surprised by a more complete answer than I was envisioning, or a few edge cases that weren't front of mind.
But if I have _no_ idea things go wildly off course. I was doing some tricky frontend work with dynamically placed reactflow nodes and bezier curve edges. It took me easily 6 hours of bashing my head against the problem, and it was hard to stop using the assistant because of sunk cost. But I probably would have gotten more out of it and been faster if I'd just sat down and really broken down the problem for a few hours and then moved to implement.
The most tempting part of LLMs is letting them figure out design when you're in a time crunch. And the way they solve things when you already understand the domain and have a bottom-up view of the work gives a deceptive impression of their capability.
And in this case, it's hoping that people on upwork understand their problems deeply. If they did, they probably wouldn't be posting on upwork. That's what they're trying to pay for.
"low/high level" starts to lose its meaning to me because it gets used in opposite ways
The real issue is that people are not providing proper context to the models. Take any random coding library you’re interfacing with, like a Postgres database connection client. The LLM isn’t going to inherently know all of the different configurations and nuances of that client. However, if you pass in the source code for the client along with the relevant portions of your own codebase, you’re equipping the model with the exact information it needs.
Every time you do this with a large prompt, maybe 50,000 to 100,000 tokens, you dramatically improve the model's ability to generate an accurate and useful response. With a strong model like o1 pro, the results can be exceptional. The key isn't that these models are incapable; it's that users aren't feeding them the right data.
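As a minimal sketch of what "feeding the model the right data" can look like in practice (the helper name, the file paths, and the 400,000-character cap standing in for a 50k-100k token budget are all illustrative assumptions, not anything from the article or the paper):

    from pathlib import Path

    def build_prompt(question: str, paths: list[str], max_chars: int = 400_000) -> str:
        """Concatenate the client library's source plus the relevant parts of your
        own codebase into one large prompt, so the model sees the exact
        configuration surface it is being asked about."""
        parts = []
        for p in paths:
            text = Path(p).read_text(encoding="utf-8", errors="replace")
            parts.append(f"### File: {p}\n{text}")
        context = "\n\n".join(parts)[:max_chars]  # rough stand-in for a token budget
        return f"{context}\n\n### Task\n{question}"

    # Hypothetical paths: the Postgres client you depend on plus your own db module.
    prompt = build_prompt(
        "Why does this connection pool exhaust under load, and how should it be configured?",
        ["vendor/pg_client/pool.py", "app/db.py"],
    )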
The purpose of new benchmarks is to gather tasks that today's LLMs can't solve comprehensively.
If an AI lab built a benchmark that their models scored 100% on, they would have been wasting everyone's time!
Writing a story that effectively says "ha ha ha, look at OpenAI's models failing to beat the new benchmark they created!" is a complete misunderstanding of the research.
I’ve been trying so many things to automate solving bugs and adding features 100% by AI, and I have to admit it’s been a failure. Without someone who can read the code, fully understand the AI-generated code, and suggest improvements (a SWE in the loop), AI code is mostly not good.
So AI is not going to answer your question right on its first attempt in many cases. It is forced to make a lot of assumptions based on the limited info you gave it, and some of those may not match your individual case. Learn to prompt better and it will work better for you. It is a skill, just like everything else in life.
Imagine going into a job today and saying "I tried Google but it didn't give me what I was looking for as the first result, so I don't use Google anymore". I just wouldn't hire a dev that couldn't learn to use AI as a tool to get their job done 10x faster. If that is your attitude, 2026 might really be a wake-up call for your new life.
This is the “self-driving cars next year, definitely” of the 20s, at this point.
Not these HackerRank, LeetCode, or previous IOI and IMO problems that we already have the solutions to, where producing the most optimal solution means copying it from someone else.
If it can't manage most unseen coding problems with no previous solutions to them, what hope does it have of explaining and fixing bugs correctly in very complex repositories with over 1M-10M+ lines of code?
How many software developers could solve even simple programming problems (beyond 'Hello world') in a zero-shot style (write it in Notepad, then compile once and execute once), without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
I think it's then not the best comparison on which to base any judgement. A future benchmark should test agents that are allowed 5-10 minutes to solve the problem, with access to the internet, documentation, a linter, and a terminal via MCP servers.
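A rough sketch of what such a tool-using harness could look like, assuming a hypothetical agent interface and a pytest/ruff setup; none of this comes from the benchmark itself:

    import subprocess

    # Illustrative tools the harness might expose; names and commands are assumptions.
    def run_tests() -> str:
        """Run the project's test suite once and return its combined output."""
        r = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        return r.stdout + r.stderr

    def run_linter() -> str:
        """Run a linter over the repository and return its findings."""
        r = subprocess.run(["ruff", "check", "."], capture_output=True, text=True)
        return r.stdout + r.stderr

    TOOLS = {"run_tests": run_tests, "run_linter": run_linter}

    def evaluate_agent(propose_action, max_steps: int = 10) -> bool:
        """Give the model a budget of tool calls instead of a single shot;
        it passes only if the test suite is green when it stops."""
        transcript: list[str] = []
        for _ in range(max_steps):
            action = propose_action(transcript)  # e.g. {"tool": "run_linter"} or {"done": True}
            if action.get("done"):
                break
            transcript.append(TOOLS[action["tool"]]())
        return "failed" not in run_tests()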
TL;DR:
They tested with programming tasks and manager's tasks.
The vast majority of tasks given require bugfixes.
Claude 3.5 Sonnet (the best performing LLM) passed 21.1% of programmer tasks and 47.0% of manager tasks.
The LLMs have a higher probability of passing the tests when they are given more attempts, but there's not a lot of data showing where the improvement tails off (probably due to how expensive it is to run the tests).
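For reference, that improvement-with-attempts curve is usually summarized with the standard pass@k estimator; this is the generic formula from the code-generation literature, not necessarily the exact bookkeeping used in the SWE-Lancer paper, and the numbers below are purely illustrative:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: probability that at least one of k sampled attempts
        passes, given n total attempts of which c passed."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Illustrative numbers only: a ~21% single-attempt pass rate over 100 samples.
    print(pass_at_k(100, 21, 1))  # ~0.21
    print(pass_at_k(100, 21, 5))  # ~0.70, but the marginal gain shrinks as k grows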
Personally, I have other concerns:
- A human asked to review repeated LLM attempts at a problem is going to review less thoroughly after a few attempts, and over time will let false positives slip through
- An LLM asked to review repeated LLM attempts at a problem is going to convince itself that it is correct, with no regard for the reality of the situation
- LLM use increases code churn in a code base
- Increased code churn is known to be bad for the health of projects
Spreadsheets becoming mainstream made it easy to do computing that once took a lot of manual human labor quite quickly. And it made plenty of jobs and people who do them obsolete. But they didn’t upend society fundamentally or the need for intelligence and they didn’t get rolled out overnight.
This is why even most software projects (built by humans) go through multiple iterations before they work perfectly.
We should consider a few things before asking, "Can AI code like humans?":
- How did AI learn to code? What structured curriculum was used?
- Did AI receive mentoring from an experienced senior who has solved real-life issues that the AI hasn't encountered yet?
- Did the AI learn through hands-on coding or just by reading Stack Overflow?
If we want to model AI as being on par with (or even superior to) human intelligence, don’t we at least need to consider how humans learn these complex skills?
Right now, it's akin to giving a human thousands of coding books to "read" and "understand," but offering no opportunity to test their programs on a computer. That’s essentially what's happening!
Without doing that, I don't think we'll ever be able to determine whether the limitation of current AI is due to its "low intelligence" or because it hasn’t been given a proper opportunity to learn.
Current LLMs will change the world, but it won't be by completing pull requests quickly.
Although a "stargate level" LLM could accelerate pink-plane traversal so much that you don't even need to find the correct use case. LLM scaling will be the computer graphics scaling of this generation. In terms of intelligence, GPT-4-based o3 is but a postage stamp. As LLMs scale, a picture of intelligence will emerge.
Apparently it has to do with overflow anchor or something in React? Idk. I gave up.
What's not mentioned here (I think) is that the tasks in this benchmark are priced, and they sum up to a million dollars.
And current AIs were able to earn nearly half of that.
So while technically they can't solve most problems (yet), they are already perfectly capable of taking about 40% of the food off your plate.
They are not creative at all, but 99% of my job is not creative either.
Yet Claude 3.5 Sonnet "earned" $403,325.00 according to the paper referenced. That is $403k worth of labour potentially replaced.
So… not the same basic tools that a human has when coding?
Transformers with memory would be different story.
But, no memory, no capability to reason. End of story, right?
So it's not ALL bad news.
I also include similar code interview platforms like leetcode, hackerrank and so on.
Ironically it actually refutes Altman’s claims mentioned in the same article. Hard to replace engineers when you create a benchmark you can’t score decently on.
still, chain of thought is great for LeetCode 75
Since interviewers “want to see how you think” (and get the right answer in less time than other candidates on average)
I can now see how you’re supposed to think (and get the right answer in less time than other candidates on average, for now)
I think researchers will find that human coders are unable to solve most coding problems without access to the internet.