SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork
The SWE-Lancer benchmark comprises over 1,400 freelance software engineering tasks collectively valued at $1 million; current frontier models struggle to complete most of them, and the benchmark aims to tie model performance to economic value.
The paper titled "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?" introduces a benchmark called SWE-Lancer, which consists of over 1,400 freelance software engineering tasks sourced from Upwork, collectively valued at $1 million. The tasks include independent engineering jobs, such as bug fixes and feature implementations, as well as managerial tasks where models evaluate technical proposals. The independent tasks are assessed through end-to-end tests verified by experienced engineers, while managerial tasks are judged against the decisions of the original engineering managers. The evaluation reveals that current frontier models struggle to complete most of the tasks. To support future research, the authors have made a unified Docker image and a public evaluation split, named SWE-Lancer Diamond, available for use. The goal of SWE-Lancer is to correlate model performance with economic value, thereby fostering research into the economic implications of AI model advancements.
- SWE-Lancer benchmark includes over 1,400 freelance software engineering tasks valued at $1 million.
- Tasks range from simple bug fixes to complex feature implementations and managerial evaluations.
- Current frontier models have difficulty solving the majority of tasks presented in the benchmark.
- The authors provide resources for future research, including a Docker image and public evaluation split.
- The initiative aims to explore the economic impact of AI model development in the freelance software engineering sector.
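As a rough illustration of the grading scheme described above (not the actual SWE-Lancer harness; the types, field names, and dollar amounts here are made up for the example), the scoring boils down to: an IC task pays out only if the hidden end-to-end tests pass, and a manager task pays out only if the model picks the proposal the real engineering manager chose.

```python
# Illustrative sketch of the grading scheme, not the real SWE-Lancer code:
# IC tasks earn their payout only if end-to-end tests pass; manager tasks
# earn theirs only if the model matches the original manager's decision.
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    payout_usd: float                     # price the task was originally listed at
    kind: str                             # "ic_swe" or "swe_manager"
    e2e_tests_passed: bool = False        # IC tasks: did the hidden tests pass?
    picked_original_choice: bool = False  # manager tasks: matched the real decision?

def earned(result: TaskResult) -> float:
    """Dollars a single task contributes to the benchmark score."""
    if result.kind == "ic_swe":
        return result.payout_usd if result.e2e_tests_passed else 0.0
    return result.payout_usd if result.picked_original_choice else 0.0

def total_earnings(results: list[TaskResult]) -> float:
    """Headline metric: total dollars earned out of the ~$1M on offer."""
    return sum(earned(r) for r in results)

# Example: one solved bug fix and one missed manager decision (amounts made up).
print(total_earnings([
    TaskResult("bugfix-example", 500.0, "ic_swe", e2e_tests_passed=True),
    TaskResult("manager-example", 1000.0, "swe_manager", picked_original_choice=False),
]))  # -> 500.0
```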
Related
Ask HN: In 2024, is SWE a sustainable career?
The software engineering landscape is changing due to large language models, making job interviews more challenging and raising concerns about task relevance, supply-demand imbalance, and career sustainability for professionals.
FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI
FrontierMath is a new benchmark for evaluating AI's advanced mathematical reasoning, revealing that current models solve under 2% of expert-level problems, highlighting a significant capability gap. Regular evaluations are planned.
New secret math benchmark stumps AI models and PhDs alike
Epoch AI has launched FrontierMath, a challenging benchmark for AI and mathematicians, with models solving under 2% of expert-level problems. It remains unpublished to ensure fair assessments and future evaluations.
We need data engineering benchmarks for LLMs
Specialized benchmarks for data engineering are essential to evaluate large language models effectively, as current frameworks do not address unique challenges, impacting AI adoption and performance in this field.
Researchers created an open rival to OpenAI's o1 'reasoning' model for under $50
Researchers from Stanford and the University of Washington created the s1 AI model, which rivals advanced models in performance, trained for under $50 using a distillation process and a small dataset.
- Some users highlight the performance of specific models, noting that 3.5 Sonnet excels in real-world tasks compared to others.
- Concerns are raised about the validity of the benchmark, particularly regarding the sourcing of tasks and potential biases in the training data.
- Several commenters share personal experiences with AI models failing to solve practical software engineering problems.
- There is a discussion about the economic implications of AI in software engineering roles and how professionals can prepare for changes in the industry.
- Some comments express confusion about the overall benefit of the research to humanity and its alignment with OpenAI's mission.
I've spent time going over the description and the cases, and it's a misrepresented travesty.
The benchmark takes existing cases from Upwork, reintroduces the problems into the code, and then asks the LLM to fix them, grading against newly written 'comprehensive tests'.
Let's look at some of the cases:
1. The regex zip code validation problem
Looking at the Upwork problem - https://github.com/Expensify/App/issues/14958 - the issue was mainly that a single common regex was used to validate zip codes across all countries, so the solution had to introduce country-specific regexes, etc.
The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu... - just takes that new code and adds a "," to two countries....
2. Room showing empty - 14857
The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issu...
Adds code explicitly commented as introducing a "radical bug" and "intentionally returning an empty array"...
I could go on and on and on...
The "extensive tests" are also laughable :(
I am not sure if OpenAI is actually aware of how great this "benchmark" is, but after so much fanfare - they should be.
Does this work as an experiment if the questions under test were also used to train the LLMs?
What could be factored into the value of an Upwork or Mechanical Turk task?
- Task centrality or blockingness estimation: precedence edges, tsort-style topological sort, graph metrics like centrality (sketched below)
- Task complexity estimation: story points, planning poker, relative local complexity scales
- Task value estimation: cost/benefit analysis, marginal revenue
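As a rough sketch of the first idea, here is what blockingness/centrality estimation over a precedence graph could look like, assuming networkx and a made-up task graph (the task names and the exact metrics are illustrative, not a standard):

```python
# Rough sketch: estimate how "blocking" each task is from precedence edges.
# A task that many other tasks transitively depend on, or that sits on many
# dependency paths, is more central and arguably worth more.
import networkx as nx

# Directed edge (a, b) means "a must be finished before b can start".
precedence = [
    ("design-schema", "write-migrations"),
    ("write-migrations", "build-api"),
    ("build-api", "build-ui"),
    ("build-api", "write-api-tests"),
    ("design-schema", "build-ui"),
]
g = nx.DiGraph(precedence)

# tsort-style ordering of the work (raises if the graph has cycles).
order = list(nx.topological_sort(g))

# Blockingness: how many downstream tasks each task transitively blocks.
blocks = {t: len(nx.descendants(g, t)) for t in g.nodes}

# Centrality: how often a task lies on dependency paths between other tasks.
centrality = nx.betweenness_centrality(g)

for task in order:
    print(f"{task:18s} blocks={blocks[task]} centrality={centrality[task]:.2f}")
```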
On a less pessimistic note, I don't think the SWE role will disappear, but what is the best one can do to prepare for this?
Notably missing: o3
Consult this graph and extrapolate: https://i.imgur.com/EOKhZpL.png
OpenAI's AGI mission statement
> "By AGI we mean highly autonomous systems that outperform humans at most economically valuable work."
https://openai.com/index/how-should-ai-systems-behave/
I would have to admit some humility as I sort of brought this on myself [1]
> This is a fantastic idea. Perhaps then this should be the next test for these SWE Agents, in the same manner as the "Will Smith Eats Spaghetti" video tests
https://news.ycombinator.com/item?id=43032191
But curiously the question is still valid.
Related:
Sam Altman: "50¢ of compute of a SWE Agent can yield $500 or $5k of work."