OpenAI Researchers Find That AI Is Unable to Solve Most Coding Problems
OpenAI's research indicates that advanced AI models struggle with coding tasks, failing to identify deeper bugs and producing mostly incorrect solutions, highlighting their unreliability compared to human coders.
OpenAI researchers have revealed that even the most advanced AI models struggle significantly with coding tasks, failing to solve the majority of them. In a recent study utilizing a benchmark called SWE-Lancer, which includes over 1,400 software engineering tasks from Upwork, three large language models (LLMs) were tested: OpenAI's o1 reasoning model, GPT-4o, and Anthropic's Claude 3.5 Sonnet. The evaluation focused on two task types: individual tasks involving bug resolution and management tasks requiring higher-level decision-making. Despite their speed, the models were only able to address surface-level issues and could not identify bugs in larger projects or understand their context, leading to incorrect or incomplete solutions. Claude 3.5 Sonnet outperformed the OpenAI models but still produced mostly wrong answers. The findings indicate that while LLMs have made significant advancements, they are not yet reliable enough to replace human coders in real-world scenarios. This raises concerns as some companies consider replacing human engineers with these immature AI systems.
- OpenAI's research shows advanced AI models struggle with coding tasks.
- The SWE-Lancer benchmark tested LLMs on over 1,400 software engineering tasks.
- AI models can resolve surface-level issues but fail to identify deeper bugs.
- Claude 3.5 Sonnet performed better than OpenAI's models but still had a high error rate.
- Current AI capabilities are insufficient to replace human coders in practical applications.
Related
Reasoning skills of large language models are often overestimated
Large language models like GPT-4 rely heavily on memorization over reasoning, excelling in common tasks but struggling in novel scenarios. MIT CSAIL research emphasizes the need to enhance adaptability and decision-making processes.
LLMs still can't reason like humans
Recent discussions reveal that large language models (LLMs) struggle with basic reasoning tasks, scoring significantly lower than humans. A project called "Simple Bench" aims to quantify these shortcomings in LLM performance.
LLMs don't do formal reasoning
A study by Apple researchers reveals that large language models struggle with formal reasoning, relying on pattern matching. They suggest neurosymbolic AI may enhance reasoning capabilities, as current models are limited.
Apple study proves LLM-based AI models are flawed because they cannot reason
Apple's study reveals significant reasoning shortcomings in large language models from Meta and OpenAI, introducing the GSM-Symbolic benchmark and highlighting issues with accuracy due to minor query changes and irrelevant context.
Apple Study Reveals Critical Flaws in AI's Logical Reasoning Abilities
Apple's study reveals significant flaws in large language models' logical reasoning, showing they rely on pattern matching. Minor input changes lead to inconsistent answers, suggesting a need for neurosymbolic AI integration.
The first couple back and forths went ok but it quickly gave me some SQL that was invalid. I sent back the exact error and line number and it responded by changing all of the aliases but repeated the same logical error. I tried again and this time it rewrote more of the code, but still used the exact same invalid operation.
At that point I just went ahead and read some docs and other resources and solved things the traditional way.
Given all of the hype around LLMs I'm honestly surprised to see top models still failing in such basic and straightforward ways. I keep trying to use LLMs in my regular work so that I'm not missing out on something potentially great but I still haven't hit a point where they're all that useful.
Maybe this is because of explicitness in my prompts and preempting edge cases. Maybe it's because I know exactly what should be done. In these cases, I will still sometimes be surprised by a more complete answer than I was envisioning, or a few edge cases that weren't front of mind.
But if I have _no_ idea things go wildly off course. I was doing some tricky frontend work with dynamically placed reactflow nodes and bezier curve edges. It took me easily 6 hours of bashing my head against the problem, and it was hard to stop using the assistant because of sunk cost. But I probably would have gotten more out of it and been faster if I'd just sat down and really broken down the problem for a few hours and then moved to implement.
The most tempting part of LLMs is letting them figure out design when you're in a time crunch. And the way they solve things when you already understand the domain and have a bottom-up view of the work gives a deceptive impression of their capability.
And in this case, it's hoping that people on upwork understand their problems deeply. If they did, they probably wouldn't be posting on upwork. That's what they're trying to pay for.
"low/high level" starts to lose its meaning to me because it gets used in opposite ways
The real issue is that people are not providing proper context to the models. Take any random coding library you’re interfacing with, like a Postgres database connection client. The LLM isn’t going to inherently know all of the different configurations and nuances of that client. However, if you pass in the source code for the client along with the relevant portions of your own codebase, you’re equipping the model with the exact information it needs.
Every time you do this with a large prompt, maybe 50,000 to 100,000 tokens, you dramatically improve the model's ability to generate an accurate and useful response. With a strong model like o1 pro, the results can be exceptional. The key isn't that these models are incapable; it's that users aren't feeding them the right data.
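As a minimal sketch of what "feeding the model the right data" can look like in practice (the helper name, the file paths, and the 400,000-character cap standing in for a 50k-100k token budget are all illustrative assumptions, not anything from the article or the paper):

    from pathlib import Path

    def build_prompt(question: str, paths: list[str], max_chars: int = 400_000) -> str:
        """Concatenate the client library's source plus the relevant parts of your
        own codebase into one large prompt, so the model sees the exact
        configuration surface it is being asked about."""
        parts = []
        for p in paths:
            text = Path(p).read_text(encoding="utf-8", errors="replace")
            parts.append(f"### File: {p}\n{text}")
        context = "\n\n".join(parts)[:max_chars]  # rough stand-in for a token budget
        return f"{context}\n\n### Task\n{question}"

    # Hypothetical paths: the Postgres client you depend on plus your own db module.
    prompt = build_prompt(
        "Why does this connection pool exhaust under load, and how should it be configured?",
        ["vendor/pg_client/pool.py", "app/db.py"],
    )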
The purpose of new benchmarks is to gather tasks that today's LLMs can't solve comprehensively.
If an AI lab built a benchmark that their models scored 100% on, they would have been wasting everyone's time!
Writing a story that effectively says "ha ha ha, look at OpenAI's models failing to beat the new benchmark they created!" is a complete misunderstanding of the research.
I’ve been trying so many things to automate solving bugs and adding features 100% by AI, and I have to admit it’s been a failure. Without someone who can read the code, fully understand the AI-generated code, and suggest improvements (a SWE in the loop), AI code is mostly not good.
So AI is not going to answer your question right on its first attempt in many cases. It is forced to make a lot of assumptions based on the limited info you gave it, and some of those may not match your individual case. Learn to prompt better and it will work better for you. It is a skill, just like everything else in life.
Imagine going into a job today and saying "I tried Google but it didn't give me what I was looking for as the first result, so I don't use Google anymore". I just wouldn't hire a dev that couldn't learn to use AI as a tool to get their job done 10x faster. If that is your attitude, 2026 might really be a wake-up call for your new life.
This is the “self-driving cars next year, definitely” of the 20s, at this point.
Not these HackerRank, LeetCode, or previous IOI and IMO problems that we already have the solutions to, where producing the most optimal solution means copying it from someone else.
If it can't manage most unseen coding problems with no previous solutions to them, what hope does it have of explaining and fixing bugs correctly in very complex repositories with over 1M-10M+ lines of code?
How many software developers could solve even simple programming problems (beyond 'Hello world') in a zero-shot style (write it in Notepad, then compile once and execute once), without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
I think it's then not the best comparison on which to base any judgement. A future benchmark should test agents that are allowed 5-10 minutes to solve the problem, with access to the internet, documentation, a linter, and a terminal via MCP servers.
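A rough sketch of what such a tool-using harness could look like, assuming a hypothetical agent interface and a pytest/ruff setup; none of this comes from the benchmark itself:

    import subprocess

    # Illustrative tools the harness might expose; names and commands are assumptions.
    def run_tests() -> str:
        """Run the project's test suite once and return its combined output."""
        r = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        return r.stdout + r.stderr

    def run_linter() -> str:
        """Run a linter over the repository and return its findings."""
        r = subprocess.run(["ruff", "check", "."], capture_output=True, text=True)
        return r.stdout + r.stderr

    TOOLS = {"run_tests": run_tests, "run_linter": run_linter}

    def evaluate_agent(propose_action, max_steps: int = 10) -> bool:
        """Give the model a budget of tool calls instead of a single shot;
        it passes only if the test suite is green when it stops."""
        transcript: list[str] = []
        for _ in range(max_steps):
            action = propose_action(transcript)  # e.g. {"tool": "run_linter"} or {"done": True}
            if action.get("done"):
                break
            transcript.append(TOOLS[action["tool"]]())
        return "failed" not in run_tests()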
TL;DR:
They tested with programming tasks and manager's tasks.
The vast majority of tasks given require bugfixes.
Claude 3.5 Sonnet (the best performing LLM) passed 21.1% of programmer tasks and 47.0% of manager tasks.
The LLMs have a higher probability of passing the tests when they are given more attempts, but there's not a lot of data showing where the improvement tails off (probably due to how expensive it is to run the tests).
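For reference, that improvement-with-attempts curve is usually summarized with the standard pass@k estimator; this is the generic formula from the code-generation literature, not necessarily the exact bookkeeping used in the SWE-Lancer paper, and the numbers below are purely illustrative:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: probability that at least one of k sampled attempts
        passes, given n total attempts of which c passed."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Illustrative numbers only: a ~21% single-attempt pass rate over 100 samples.
    print(pass_at_k(100, 21, 1))  # ~0.21
    print(pass_at_k(100, 21, 5))  # ~0.70, but the marginal gain shrinks as k grows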
Personally, I have other concerns:
- A human asked to review repeated LLM attempts at a problem is going to review less thoroughly after a few attempts, and over time will let false positives slip through
- An LLM asked to review repeated LLM attempts at a problem is going to convince itself that it is correct, with no regard for the reality of the situation
- LLM use increases code churn in a code base
- Increased code churn is known to be bad for the health of projects
Spreadsheets becoming mainstream made it easy to do computing that once took a lot of manual human labor quite quickly. And it made plenty of jobs and people who do them obsolete. But they didn’t upend society fundamentally or the need for intelligence and they didn’t get rolled out overnight.
This is why even most software projects (built by humans) go through multiple iterations before they work perfectly.
We should consider a few things before asking, "Can AI code like humans?":
- How did AI learn to code? What structured curriculum was used?
- Did AI receive mentoring from an experienced senior who has solved real-life issues that the AI hasn't encountered yet?
- Did the AI learn through hands-on coding or just by reading Stack Overflow?
If we want to model AI as being on par with (or even superior to) human intelligence, don’t we at least need to consider how humans learn these complex skills?
Right now, it's akin to giving a human thousands of coding books to "read" and "understand," but offering no opportunity to test their programs on a computer. That’s essentially what's happening!
Without doing that, I don't think we'll ever be able to determine whether the limitation of current AI is due to its "low intelligence" or because it hasn’t been given a proper opportunity to learn.
Current LLMs will change the world, but it won't be by completing pull requests quickly.
Although a "stargate level" LLM could accelerate pink-plane traversal so much that you don't even need to find the correct use case. LLM scaling will be the computer graphics scaling of this generation. In terms of intelligence, GPT-4-based o3 is but a postage stamp. As LLMs scale, a picture of intelligence will emerge.
Apparently it has to do with overflow anchor or something in React? Idk. I gave up.
What's not mentioned here (I think) is that the tasks in this benchmark are priced, and they sum up to a million dollars.
And current AIs were able to earn nearly half of that.
So while technically they can't solve most problems (yet), they are already perfectly capable of taking about 40% of the food off your plate.
They are not creative at all, but 99% of my job is not creative either.
Yet Claude 3.5 Sonnet "earned" $403,325.00 according to the paper referenced. That is $403k worth of labour potentially replaced.
So… not the same basic tools that a human has when coding?
Transformers with memory would be different story.
But, no memory, no capability to reason. End of story, right?
So it's not ALL bad news.
I also include similar code interview platforms like leetcode, hackerrank and so on.
Ironically it actually refutes Altman’s claims mentioned in the same article. Hard to replace engineers when you create a benchmark you can’t score decently on.
still, chain of thought is great for LeetCode 75
Since interviewers “want to see how you think” (and get the right answer in less time than other candidates on average)
I can now see how you’re supposed to think (and get the right answer in less time than other candidates on average, for now)
I think researchers will find that human coders are unable to solve most coding problems without access to the internet.