Effects of Gen AI on High Skilled Work: Experiments with Software Developers
A study on generative AI's impact on software developers revealed a 26.08% productivity increase, particularly benefiting less experienced developers, through trials at Microsoft, Accenture, and a Fortune 100 company.
This study investigates the effects of generative AI on the productivity of software developers through three field experiments conducted at Microsoft, Accenture, and a Fortune 100 electronics company. The experiments involved providing a randomly selected group of developers access to GitHub Copilot, an AI tool that offers intelligent code suggestions. The analysis, which included data from 4,867 developers, indicated a significant productivity increase of 26.08% in the number of completed tasks among those using the AI tool, with less experienced developers benefiting the most from its implementation. The findings suggest that generative AI can enhance productivity in high-skilled work environments, particularly for individuals with less experience in software development.
- Generative AI tools like GitHub Copilot can significantly boost software developer productivity.
- A 26.08% increase in task completion was observed among developers using the AI tool.
- Less experienced developers showed higher adoption rates and productivity gains from the AI assistance.
- The study was based on three randomized controlled trials across major companies.
- The findings highlight the potential of AI to transform high-skilled work dynamics.
Related
Ask HN: Will AI make us unemployed?
The author highlights reliance on AI tools like ChatGPT and GitHub Copilot, noting a 30% efficiency boost and concerns about potential job loss due to AI's increasing coding capabilities.
Up to 90% of my code is now generated by AI
A senior full-stack developer discusses the transformative impact of generative AI on programming, emphasizing the importance of creativity, continuous learning, and responsible integration of AI tools in coding practices.
Survey: The AI wave continues to grow on software development teams
A GitHub survey of 2,000 software developers reveals 97% use AI coding tools, with varying organizational support. AI enhances productivity, but companies need strategies for broader adoption and integration.
GitHub Named a Leader in the Gartner First Magic Quadrant for AI Code Assistants
GitHub has been recognized as a Leader in Gartner's inaugural Magic Quadrant for AI Code Assistants, excelling in execution and vision, with plans to enhance AI tools for one billion developers.
- Many experienced developers feel that AI tools like Copilot can be distracting and may not significantly enhance their productivity, as they often already know what to write.
- There are concerns about the long-term effects of AI on developer skills, with some suggesting it may lead to a decline in quality and an increase in technical debt.
- Critics question the validity of using pull requests as a measure of productivity, arguing that it does not account for code quality or the potential for increased bugs.
- Some commenters highlight the disparity in productivity gains between junior and senior developers, suggesting that while juniors may benefit, seniors face challenges due to the quality of junior contributions.
- Several comments express skepticism about the study's objectivity, given its ties to Microsoft, raising concerns about potential bias in the findings.
Developers, Operations, and Security used to be dedicated roles.
Then we made DevOps and some businesses took that to mean they only needed 2/3 of the headcount, rather than integrating those teams.
Then we made DevSecOps, and some businesses took that to mean they only needed 1/3 the original roles, and that devs could just also be their operations and appsec team.
That's not a knock on shift-left and integrated operations models; those are often good ideas. It's just the logical outcome of those models when execs think they can get a bigger bonus by cutting costs by cutting headcounts.
Now you have new devs coming into insanely complex n-microservice environments, being asked to learn the existing codebase, being asked to learn their 5-tool CI/CD pipelines (and that ain't being taught in school), being asked to learn to be DBAs, and also to keep up a steady code release cycle.
Is anyone really surprised they are using ChatGPT to keep up?
This is going to keep happening until IT companies stop cutting headcounts to make line go up (instead of good business strategy).
I’m an experienced engineer. Copilot is worse than useless for me. I spend most of my time understanding the problem space, understanding the constraints and affordances of the environment I’m in, and thinking about the code I’m going to write. When I start typing code, I know what I’m going to write, so a “helpful” Copilot autocomplete is just a distraction for me. It makes my workflow much, much worse.
On the other hand, AI is incredibly useful for all of those steps I do before actually coding. And sometimes getting the first draft of something is as simple as a well-crafted prompt (informed by all the thinking I’ve done prior to starting). After that, pairing with an LLM to get quick answers for all the little unexpected things that come up is extremely helpful.
So, contrary to this report, I think that if experienced developers use AI well, they could benefit MORE than inexperienced developers.
Also, I've personally seen more interest in AI in devs that have little interest in technology, but a big interest in delivering. PMs love them though.
The abstract and the conclusion only give a single percentage figure (26.08% increase in productivity, which probably has too many decimals) as the result. If you go a bit further, they give figures of 27 to 39 percent for juniors and 8 to 13 percent for seniors.
But if you go deeper, it looks like there's a lot of variation not only by seniority, but also by the company. Beside pull requests, results on the other outcome measures (commits, builds, build success rate) don't seem to be statistically significant at Microsoft, from what I can tell. And the PR increases only seem to be statistically significant for Microsoft, not for Accenture. And even then possibly only for juniors, but I'm not sure I can quite figure out if I've understood that correctly.
Of course the abstract and the conclusion have to summarize. But it really looks like the outcomes vary so much depending on the variables that I'm not sure it makes sense to give a single overall number even as a summary. Especially since statistical significance seems a bit hit-and-miss.
My experience is that the LLM isn't just used for "boilerplate" code, but rather called into action when a junior developer is faced with a fairly common task they've still not (fully) understood. The process of experimenting, learning and understanding is then largely replaced by the LLM, and the real skill becomes applying prompt tweaks until it looks like stuff works.
This tracks with my own experience: Copilot is nice for resolving some tedium and freeing up my brain to focus more on deeper questions, but it's not as world-altering as junior devs describe it as. It's also frequently subtly wrong in ways that a newer dev wouldn't catch, which requires me to stop and tweak most things it generates in a way that a less experienced dev probably wouldn't know to. A few years into it I now have a pretty good sense for when to use Copilot and when not to—so I think it's probably a net positive for me now—but it certainly wasn't always that way.
I also wonder if the possibly-decreased 'productivity' for more senior devs stems in part from the increase in 'productivity' from the juniors in the company. If the junior devs are producing more PRs that have more mistakes and take longer to review, this would potentially slow down seniors, reducing their own productivity gains proportionally.
Does it increase the number of things that pass QA?
Do the things done with AI assistance have fewer bugs caught after QA?
Are they easier to extend or modify later? Or do they have rigid and inflexible designs?
A tool that can help turn developers into unknown quality code monkeys is not something I’m looking for. I’m looking for a tool that helps developers find bugs or design flaws in what they’re doing. Or maybe write well designed tests.
Just counting PRs doesn’t tell me anything useful. But it triggers my gut feeling that more code per unit time = lower average quality.
Microsoft: September 2022 to May 3rd, 2023
Accenture: July 2023 to December 2023
Anonymous Company: October 2023 to ?
Copilot _Chat_ update to GPT-4 was Nov 30, 2023: https://github.blog/changelog/label/copilot/
Even so, AI will propose different things at different times and you still need an experienced developer to make the call. In the end it replaces documentation and typing.
The output of these tools today is unsafe to use unless you possess the ability to assess its correctness. The less able you are to perform that assessment, the more likely you are to use these tools.
Only one of many problems with this direction, but gravity sucks, doesn't it.
As such, when I do have to debug problems myself, or dream up ideas for improvements, I can no longer do this properly due to a lack of internal mental state.
Wonder how people who have used genai coding successfully get around this?
I've already fixed a couple of tests like this, where people clearly used AI and didn't think about it, when in reality it was testing something wrong.
Not to mention the rest of the technical debt added... looking at productivity in software development by amount of tasks is so wrong.
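A hypothetical sketch (not taken from the study or from any commenter's actual code) of what such a wrong test can look like: the generated test mirrors the implementation instead of the requirement, so it passes even when the formula is buggy. The `apply_discount` function and its tests here are invented for illustration.

```python
# Hypothetical example of an AI-generated test that asserts the wrong thing.
def apply_discount(price: float, percent: float) -> float:
    return price - price * percent / 100

# Mirrors the implementation: any bug in the formula is reproduced
# in the assertion too, so this test can never fail.
def test_discount_mirrors_implementation():
    price, percent = 100.0, 20.0
    assert apply_discount(price, percent) == price - price * percent / 100

# Pins the externally expected value instead, so a broken formula fails.
def test_discount_expected_value():
    assert apply_discount(100.0, 20.0) == 80.0

test_discount_mirrors_implementation()
test_discount_expected_value()
```

The first test "works" in the sense that it runs green, which is exactly what makes it dangerous to a reviewer who only checks that tests pass.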
Is that a flag we should be watching out for?
I know preprints don't need polish but this is even below the standard of a preprint, imo.
Dev: Hey einpoklum, how do I do XYZ?
Me: Hmm, I think I remember that... you could try AB and then C.
Dev: Ok, but isn't there a better/easier way? Let me ask ChatGPT.
...
Dev: Hey einpoklum, ChatGPT said I should do AB and then C.
Me: Let me have a look at that for a second.
Me: Right, so it's just what I read on StackOverflow about this a couple of years ago.
Sometimes it's even the answer that _I_ wrote on StackOverflow, and then I feel cheated. I think it's a big productivity boost, but there's also a chance that the learning rate is actually significantly slower.
Would love to see it replicated by researchers at a company that does not have a clear financial interest in the outcome (the corresponding author here was working at Microsoft Research during the study period).
> Before moving on, we discuss an additional experiment run at Accenture that was abandoned due to a large layoff affecting 42% of participants
Eek
A minor drawback to that enthusiasm is that a lot of the code I read didn't need to exist in the first place, even before this wave. Lots of it can be attributed to the path dependence of its creation as opposed to what it is trying to do. This should be a rich time to move into security/exploit work - the random search tools are great and the target just keeps getting easier.
What our industry really desperately needed was to drive the quality of implementation right down. It's going to be an exciting time to be alive.
And that is why demand for senior developers is going to go through the roof. Who is going to unfuck the giant balls of mud those inexperienced devs are slinging together? Who’s going to keep the lights on?
Both AI tools came back with...garbage. Loops within loops within loops as they iterated through each day to check if the day is a weekend or not, is a leap year and to account for the extra day, is it a holiday or not, etc.
However, ChatGPT provided a clever division to cut the dataset down to weeks, then process the remainder. I ended up using that portion in my final algorithm.
So, my take on AI coding tools is: "Buyer beware. Your results may vary."
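For illustration only (the commenter's actual code isn't shown): the "cut the dataset down to weeks" idea amounts to counting full weeks arithmetically and looping over only the remainder, instead of iterating day by day. A minimal sketch, ignoring holidays, which would need a separate lookup:

```python
from datetime import date, timedelta

def business_days(start: date, end: date) -> int:
    """Count weekdays in the half-open range [start, end), using
    full-week division plus a small remainder loop instead of
    iterating over every single day."""
    total_days = (end - start).days
    full_weeks, remainder = divmod(total_days, 7)
    count = full_weeks * 5  # every full week contains exactly 5 weekdays
    day = start + timedelta(days=full_weeks * 7)
    for _ in range(remainder):
        if day.weekday() < 5:  # Mon=0 .. Fri=4
            count += 1
        day += timedelta(days=1)
    return count

print(business_days(date(2024, 1, 1), date(2024, 1, 15)))  # prints 10
```

The remainder loop runs at most six times regardless of the range size, which is the whole point of the division trick.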
Because development will become an auction-like activity where the one that accepts more suggestions wins.
No two tasks are the same level of complexity and one task may take 5x longer than another to complete.
When I was using ChatGPT for the qualifiers of a CTF called Hack-A-Sat at DEF CON 31, I could not get anything to work, such as GNU Radio programs.
If you have the ability to debug, then in my experience it is productive; but when you don't understand the output, you run into problems.
However, there's a big question as to whether these are short-term productivity gains or longer-lasting ones. There's a hypothesis that AI-generated code will slowly spaghetti-fy a codebase.
Is 1-2 years long enough to take this into consideration, or to disprove the spaghettification?
For those who are beginners, it can bring their skills up and make them look like better developers than they are.
More insidiously, expert programmers who overuse AI might also regress to the mean as their skills deteriorate.
This is what I’ve seen too. I don't think less experienced developers have gotten better in their understanding of anything, just more exposed and quicker, while I do think more experienced developers have stagnated.
I think this post from the other day adds some important context[0]. In that study, kids with access to GPT did way more practice problems but performed worse on the test. But the most important part was the finding that while GPT usually got the final answer right, its logic was often wrong, meaning the solution as a whole is wrong. This is true for math and for code.
There's the joke: there are two types of 10x devs, those that do 10x work and those who finish 10x Jira tickets. The problem with this study is the assumptions it makes, which are quite common and naive in our industry. It assumes that PRs and commits are measures of productivity, and it assumes that passing review is a good quality metric. These are so variable between teams. Plenty are just "lgtm" reviews.
The issue here is that there's no real solid metric for things like good code. Meeting the goals of a ticket doesn't mean you haven't solved the problem so poorly that you are the reason 10 new tickets will be created. This is the real issue here, and the only real way to measure it is Justice Potter Stewart's test (I know it when I see it), which requires an expert evaluator. In other words, tech debt. Which is something we're seeing a growing rise in - all the fucking enshittification.
So I don't think the study here contradicts [0]; in fact, I think they're aligned. But I suspect people who are poor programmers (or non-programmers) will use this as evidence for what they want to see, believing naive things like lines of code or numbers of commits/PRs are measures of productivity rather than hints of a measure. I'm all for "move fast and break things" as long as there's time set aside to clean up the fucking mess you left behind. But there never is. It's like we have business ADHD. There's so much lost productivity because so much focus is placed on short-term measurements and thinking. I know medium- and long-term thinking are hard, but humans do hard shit every day. We can do a lot better than a shoddy study like this.
Copilot often saves me a lot of typing on a 1-3 line scope, occasionally surprising me with exactly what I was about to write on a 5-10 line scope. It’s really good during rearrangement and early refactoring (as you are building a new thing and changing your mind as you go about code organization).
ChatGPT, or “Jimmy” - as I like to call him - has been great for answering syntax questions, idiom questions, etc. when applying my general skills based on other languages to ones I’m less familiar with.
It has also been good for “discussing” architecture approaches to a problem with respect to a particular toolset.
With proper guidance and very clear prompting, I usually get highly valuable responses.
I would rough guess that these two tools have saved me 2-3 months of solo time this year - nay, since April.
Once I get down into the deep details, I use Jimmy much less often. But when I hit something new, or something I long since forgot, he’s ready to be a relative expert / knowledge base.
If you start low it's easier to get greater growth rates.
The biggest jump is the first step: 0% to 1% is infinite growth.
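The baseline effect is plain arithmetic: the same absolute improvement produces a much larger percentage gain for a low-output group. A toy illustration (the task counts are made up, not from the study):

```python
def relative_gain(before: float, after: float) -> float:
    """Percentage productivity gain; infinite when starting from zero."""
    if before == 0:
        return float("inf")
    return (after - before) / before * 100

# The same +2 tasks/week absolute improvement yields very
# different headline percentages depending on the baseline:
print(relative_gain(5, 7))    # hypothetical junior: prints 40.0
print(relative_gain(20, 22))  # hypothetical senior: prints 10.0
```

So a headline like "juniors gained 27-39%, seniors 8-13%" is consistent with absolute gains that are much closer together than the percentages suggest.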
If an AI tool makes me more productive, I would probably either spend the time won browsing the internet, or use it to attempt different approaches to solve the problem at hand. In the latter case, I would perhaps make more reliable or more flexible software. Which would also be almost impossible to measure in a scientific investigation.
In my experience, the differences in developer productivity are so enormous (depending on existing domain knowledge, motivation, or management approach), that it seems pretty hard to make any scientific claim based on looking at large groups of developers. For now, I prefer the individual success story.
BUT, as I think a lot of people have mentioned, you get code that the person who wrote it does not understand. So the next time you get a bug there, good luck fixing it.
My take so far: AI is great, but only for non-critical, non-core code. Everything done for plotting and scripting is awesome (things that can take days to implement by hand take minutes with AI), but core library functions I wouldn't outsource to the AI right now.
I, for one, only have to decide whether Copilot's productivity increase is worth the $10 it costs per month.
It doesn't really matter whether you're an employer getting a 3–30% increase in productivity or whether you pay for it personally, finish 2 hours faster every week, and log off early. It's easily worth its money. What more is there to consider?