February 18th, 2025

SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork

The SWE-Lancer benchmark features over 1,400 freelance software engineering tasks collectively valued at $1 million; current frontier models struggle with most of them, and the benchmark aims to explore AI's economic impact.


The paper titled "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?" introduces a benchmark called SWE-Lancer, which consists of over 1,400 freelance software engineering tasks sourced from Upwork, collectively valued at $1 million. The tasks include independent engineering jobs, such as bug fixes and feature implementations, as well as managerial tasks where models evaluate technical proposals. The independent tasks are assessed through end-to-end tests verified by experienced engineers, while managerial tasks are judged based on the decisions of original engineering managers. The evaluation of model performance reveals that current frontier models struggle to complete most of the tasks. To support future research, the authors have made a unified Docker image and a public evaluation split, named SWE-Lancer Diamond, available for use. The goal of SWE-Lancer is to correlate model performance with economic value, thereby fostering research into the economic implications of AI model advancements.

- SWE-Lancer benchmark includes over 1,400 freelance software engineering tasks valued at $1 million.

- Tasks range from simple bug fixes to complex feature implementations and managerial evaluations.

- Current frontier models have difficulty solving the majority of tasks presented in the benchmark.

- The authors provide resources for future research, including a Docker image and public evaluation split.

- The initiative aims to explore the economic impact of AI model development in the freelance software engineering sector.
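
The grading setup described above can be pictured as a small loop: apply the model's patch inside the task's Docker container, run the end-to-end tests, and credit the task's dollar value only if everything passes. The sketch below is illustrative only; the image names, file layout, and helper functions are assumptions, not the actual SWE-Lancer harness.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    payout_usd: float   # dollar value the freelancer was originally paid
    docker_image: str   # hypothetical per-task image name
    test_cmd: str       # e.g. a pytest or browser-based end-to-end suite

def grade(task: Task, patch_path: str) -> float:
    """Apply the model's patch inside the task container, run the tests,
    and return the dollars 'earned' (all or nothing)."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{patch_path}:/patch.diff:ro",
        task.docker_image,
        "bash", "-c", f"git apply /patch.diff && {task.test_cmd}",
    ]
    result = subprocess.run(cmd, capture_output=True)
    return task.payout_usd if result.returncode == 0 else 0.0

def total_earnings(tasks_and_patches) -> float:
    # Summing payouts over a split is what maps pass rate to economic value.
    return sum(grade(task, patch) for task, patch in tasks_and_patches)
```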

AI: What people are saying
The comments on the SWE-Lancer benchmark reveal a mix of skepticism and analysis regarding its implications and methodology.
  • Some users highlight the performance of specific models, noting that 3.5 Sonnet excels in real-world tasks compared to others.
  • Concerns are raised about the validity of the benchmark, particularly regarding the sourcing of tasks and potential biases in the training data.
  • Several commenters share personal experiences with AI models failing to solve practical software engineering problems.
  • There is a discussion about the economic implications of AI in software engineering roles and how professionals can prepare for changes in the industry.
  • Some comments express confusion about the overall benefit of the research to humanity and its alignment with OpenAI's mission.
12 comments
By @Tiberium - 3 days
The extremely interesting part is that 3.5 Sonnet is above o1 on this benchmark, which again shows that 3.5 Sonnet is a very special model that's best at real-world tasks rather than one-shot scripts or math. And the weirdest part is that they tested the 20240620 snapshot, which is objectively worse at code than the newer 20241022 (the so-called v2).
By @CSMastermind - 3 days
I hire software engineers off Upwork. Part of our process is a 1-hour screening take-home question that we ask people to solve. We always have a main one and an alternate for each role. I've tested all of ours on each of the main models, and none have been able to solve any of the screening questions yet.
By @Snuggly73 - 2 days
First-time commenter - I was so triggered by this benchmark that I just had to come out of lurking.

I've spent time going over the description and the cases, and it's a misrepresented travesty.

The benchmark takes existing cases from Upwork, reintroduces the original problems into the code, and then asks the LLM to fix them, grading against newly written 'comprehensive tests'.

Let's look at some of the cases:

1. The regex zip code validation problem

Looking at the Upwork problem - https://github.com/Expensify/App/issues/14958 - the issue was mainly that a common regex was being used to validate across all countries, so the solution had to introduce country-specific regexes, etc.

The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu... is just taking that new code and adding , to two countries....

2. Room showing empty - 14857

The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu...

Adds code explicitly commented as introducing a "radical bug" and "intentionally returning an empty array"...
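
For contrast, the second example's injected regression, as the commenter describes it, looks roughly like the sketch below. This is an illustrative Python rendering with a made-up function name, not the benchmark's actual code, which is JavaScript and linked above.

```python
def get_room_report_actions(room_id: str) -> list:
    """Illustration of the kind of injected regression described above."""
    # Intentionally returning an empty array (the reintroduced "bug");
    # the giveaway comment alone makes the defect trivial to spot.
    return []
    # The original behavior would be something like:
    # return fetch_report_actions(room_id)
```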

I could go on and on and on...

The "extensive tests" are also laughable :(

I am not sure if OpenAI is actually aware of how great this "benchmark" is, but after so much fanfare - they should be.

By @runako - 3 days
It looks like they sourced tasks via a public GitHub repository, which is possibly part of the training dataset for the LLMs. (It is not clear from my scan whether the actual answers are also in the public corpus.)

Does this work as an experiment if the questions under test were also used to train the LLMs?

By @westurner - 2 days
> By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.

What could be costed into an Upwork or Mechanical Turk task's value?

Task Centrality or Blockingness estimation: precedence edges, tsort topological sort, graph metrics like centrality

Task Complexity estimation: story points, planning poker, relative local complexity scales

Task Value estimation: cost/benefit analysis, marginal revenue
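
As a rough illustration of the centrality/blocking analysis gestured at above, the sketch below builds a hypothetical task-precedence graph with networkx; the task names and edges are made up.

```python
import networkx as nx

# Hypothetical precedence edges: "design" must finish before "api", etc.
edges = [
    ("design", "api"), ("api", "ui"), ("api", "tests"),
    ("ui", "release"), ("tests", "release"),
]
g = nx.DiGraph(edges)

# "Blockingness": how many downstream tasks each task gates.
blocking = {n: len(nx.descendants(g, n)) for n in g.nodes}

# Centrality as another proxy for how pivotal a task is in the plan.
centrality = nx.betweenness_centrality(g)

# A topological order (cf. tsort) gives one valid execution sequence.
order = list(nx.topological_sort(g))

print(order)
print(sorted(blocking.items(), key=lambda kv: -kv[1]))
print(centrality)
```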

By @bufferoverflow - 3 days
And how do you evaluate if the task was completed correctly? There are nearly infinite ways to solve a given software dev problem, if the problem isn't trivial (and I hope they are not benchmarking trivial problems).
By @moralestapia - 3 days
The writing is very clearly on the wall.

On a non-pessimistic note, I don't think the SWE role will disappear, but what's the best one could do to prepare for this?

By @comeonbro - 3 days
Models tested: o1, 4o (August 2024 version), 3.5 Sonnet (June 2024 version)

Notably missing: o3

Consult this graph and extrapolate: https://i.imgur.com/EOKhZpL.png

By @neilv - 3 days
"SWE-Lancer", like, skewering SWEs with a lance?
By @ctoth - 2 days
Gonna lance them SWEs like a boil!
By @colesantiago - 3 days
Can anyone explain how this research benefits humanity, in line with OpenAI's mission?

OpenAI's AGI mission statement

> "By AGI we mean highly autonomous systems that outperform humans at most economically valuable work."

https://openai.com/index/how-should-ai-systems-behave/

I would have to admit some humility as I sort of brought this on myself [1]

> This is a fantastic idea. Perhaps then this should be the next test for these SWE Agents, in the same manner as the 'Will Smith Eats Spaghetti' video tests

https://news.ycombinator.com/item?id=43032191

But curiously the question is still valid.

Related:

Sam Altman: "50¢ of compute of a SWE Agent can yield $500 or $5k of work."

https://news.ycombinator.com/item?id=43032098

https://x.com/vitrupo/status/1889720371072696554