How I use LLMs as a staff engineer
Sean Goedecke discusses the benefits and limitations of large language models in software engineering, highlighting their value in code writing and learning, while remaining cautious about their reliability for complex tasks.
Sean Goedecke, a staff engineer, shares his perspective on the use of large language models (LLMs) in software engineering, highlighting both their benefits and limitations. He notes a divide among engineers regarding LLMs, with some viewing them as revolutionary and others as overhyped. Goedecke finds significant value in LLMs, particularly in tasks such as writing production code, where he uses tools like Copilot for boilerplate code and to make tactical changes in unfamiliar programming languages. He emphasizes the efficiency of LLMs in generating throwaway code for research purposes, claiming they can expedite the process by 2x to 4x. Additionally, he utilizes LLMs as a learning tool, asking questions and receiving feedback on his understanding of new domains. While he occasionally seeks help with bug fixes, he prefers to rely on his own skills, as LLMs often struggle with complex issues. For written communication, he uses LLMs for proofreading and catching typos but does not allow them to draft documents. Overall, he appreciates LLMs for specific tasks but remains cautious about their limitations, particularly in areas where he has expertise.
- Sean Goedecke finds LLMs valuable for tasks like code writing and learning new domains.
- He uses LLMs primarily for boilerplate code, throwaway research code, and as a tutor.
- Goedecke is cautious about relying on LLMs for bug fixes and prefers his own debugging skills.
- He employs LLMs for proofreading but does not let them draft his written communications.
- He acknowledges the divide in the engineering community regarding the utility of LLMs.
Related
In my opinion, using LLMs to write code is a Faustian bargain where you learn terrible practices and come to rely on code quantity, boilerplate, and nondeterministic outputs - all hallmarks of poor software craftsmanship. Until ML can actually go end to end from requirements to product and they fire all of us, you can't cut corners on building intuition as a human by forgoing reading and writing code yourself.
I do think there is a place for LLMs in generating ideas or exploring an untrusted knowledge base of information, but using code generated by an LLM is pure madness unless what you are building is truly going to be thrown away and rewritten from scratch; the same goes for relying on it as a linting or debugging tool or as a source of truth.
> I don’t do this a lot, but sometimes when I’m really stuck on a bug, I’ll attach the entire file or files to Copilot chat, paste the error message, and just ask “can you help?”
The "reasoning" models are MUCH better than this. I've had genuinely fantastic results with this kind of thing against o1 and Gemini Thinking and the new o3-mini - I paste in the whole codebase (usually via my https://github.com/simonw/files-to-prompt tool) and describe the bug or just paste in the error message and the model frequently finds the source, sometimes following the path through several modules to get there.
Here's a slightly older example: https://gist.github.com/simonw/03776d9f80534aa8e5348580dc6a8... - finding a bug in some Django middleware
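For anyone curious what that workflow looks like mechanically, the idea is just concatenating the relevant files, each prefixed with its path, into one big prompt along with the error message. Here's a minimal Python sketch of that idea (an illustration only, not the actual files-to-prompt tool; the directory name and suffix filter are assumptions):

```python
from pathlib import Path

def build_prompt(root: str, suffixes=(".py",)) -> str:
    """Concatenate source files, each prefixed with its path, into one prompt string."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            parts.append(f"--- {path} ---\n{path.read_text()}")
    return "\n\n".join(parts)

if __name__ == "__main__":
    # Example traceback; in practice you'd paste in the real error message.
    error = "TypeError: 'NoneType' object is not subscriptable"
    prompt = build_prompt("src") + "\n\nHere is the error -- can you find the bug?\n" + error
    print(prompt)  # paste the output into o1 / Gemini Thinking / o3-mini
```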
LLMs can absolutely bust out corporate docs crazy fast too... though that's probably a good moment to re-evaluate how much value those docs were adding in the first place.
Ah, now it makes sense.
Then I reflected on how very true that was. In fact, as of writing this there are 138 comments, and I scrolled through what was shown to gauge the negative/neutral/positive split, based on a highly subjective personal assessment: roughly 2/3 were negative, so I decided to stop.
As a profession, it seems many of us have become accustomed to dealing in absolutes when reality is subjective, judging LLMs prematurely with a level of perfectionism not even cast upon fellow humans... or at least, if it were cast upon humans, I'd be glad not to be their colleague.
Honestly, right now I would use this as a litmus test in hiring, and the majority would fail based upon their closed-mindedness and inability to understand how to effectively utilise the tools at their disposal. It won't exist as a signal for much longer, sadly!
See this is what I don't get about the AI Evangelists. Every time I use the technology I am astounded at the amount of incorrect information and straight up fantasy it invents. When someone tells me that they just don't see it, I have to wonder what is motivating them to lie. There is simply no way you're using the same technology as me with such wildly different results.
This is how I use AI at work for maintaining Python projects; Python is a language I'm not really well versed in. Sometimes I might add "this is how I would do it in …, how would I do this in Python?"
I find this extremely helpful and productive, especially as I have to pull the code onto a server to test it.
One thing that is not mentioned -- code review. It is not great at it, often pointing out trivial issues or non-issues. But if it finds 1 area for improvement out of 10 bullet points, that's still worth it -- most human code reviewers don't notice all the issues in the code anyway.
--
I work on Graphite Reviewer (https://graphite.dev/features/reviewer). I'm also partly dyslexic. I lean massively on Grammarly (using it to write this comment) and type-safe compiled languages. When I was an engineer at Airbnb, I caused multiple site outages due to typos in my Ruby code that I didn't see and wasn't able to execute before prod.
The ability of LLMs to proofread code is a godsend. We've tuned Graphite Reviewer to shut up about subjective stylistic comments and focus on real bugs, mistakes, and typos. Fascinatingly, it catches a minor mistake in ~1/5 PRs in prod at real companies (we've run it on a few million PRs now). The issues it catches result in a pre-merge code change 75% of the time, about equal to what a human comment does.
AIs aren't perfect, but I'm thrilled that they work as fancy code spell-checkers :)
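A minimal sketch of that "fancy code spell-checker" idea, for the curious: prompt a model to flag only concrete bugs and typos in a diff and to stay silent on style. This is just an illustration of the general technique, not how Graphite Reviewer actually works; the model name, prompt wording, and use of the OpenAI client are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def review_diff(diff: str) -> str:
    """Ask a chat model to flag only concrete bugs/typos in a diff, not style."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable chat model would do
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a code reviewer. Report only concrete bugs, typos, "
                    "or mistakes introduced by this diff. Do not comment on style "
                    "or naming. If you find nothing, reply with exactly: LGTM."
                ),
            },
            {"role": "user", "content": diff},
        ],
    )
    return response.choices[0].message.content
```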
Copilot is used for simple boilerplate code, and also for autocomplete. It's often a starting point for unit tests (but a thorough review is needed - you can't just accept it; I've seen it misinterpret code). I started experimenting with RA.Aid (https://github.com/ai-christianson/RA.Aid) after seeing a post on it here today. The multi-step actions are very promising. I'm about to try files-to-prompt (https://github.com/simonw/files-to-prompt), mentioned elsewhere in the thread.
For now, LLMs are a level-up in tooling, but not a replacement for developers (at least not yet).
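To make the unit-test point concrete, here's the flavour of scaffold an assistant typically drafts for a simple function. Everything below is hypothetical (the function and tests are made up for illustration); the boundary case is exactly the kind of detail worth verifying rather than accepting blindly.

```python
# Hypothetical function under test (made up for illustration).
def is_adult(age: int) -> bool:
    return age >= 18

# Typical assistant-generated scaffold: quick to accept, but check each
# assertion -- assistants can misread the intended behaviour, especially
# at boundaries.
def test_is_adult_above_threshold():
    assert is_adult(30)

def test_is_adult_below_threshold():
    assert not is_adult(12)

def test_is_adult_at_boundary():
    # Verify against the actual requirement: is exactly 18 considered adult?
    assert is_adult(18)
```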
1. Try to write some code
2. Wonder why my IDE is providing irrelevant, confusing and obnoxious suggestions
3. Realize the AI completion plugin somehow turned itself back on
4. Turn it off
5. Do my job better than everyone that didn't do step 4
The question I keep asking myself is, "Should we be making tools that auto-write code for us, or should we be using this training data to suss out the tools we're missing -- the places where everyone writes the same code 10 times in their careers?"
Such an unnecessary flex.
At the end, I ask it to give me a quiz on everything we talked about and any other insights I might have missed. Instead of typing out the answers, I just use Apple Dictation to transcribe my answers directly.
It's only recently that I thought to take the conversation I just had and have it write a blog post of the insights and a-ha moments I had. It takes a fair bit of curation to get it to do that, however. I can't just say, "write me a blog post on all we talked about". I have to first get it to write an outline with the key insights. And then, based on the outline, write each section. And then I'll use ChatGPT's canvas to guide and fine-tune each section.
However, at no point do I have to specifically write the actual text. I mostly do curation.
I feel ok about doing this, and don't consider it AI slop, because I clearly mark at the top that I didn't write a word of it and that it's the result of a curated conversation with 4o. In addition, I think if most people did this as a result of their own Socratic method with an AI, it'd build up enough training data for the next generation of AI to do a better job of writing pedagogical explanations, posts, and quizzes -- helping people learn topics that are just out of reach but where there haven't been many people able to bridge the gap.
The two I had it write are: Effects as Protocols and Contexts as Agents: https://interjectedfuture.com/effects-as-protocols-and-conte...
How free monads and functors represent syntax for algebraic effects: https://interjectedfuture.com/how-the-free-monad-and-functor...
Coding assistant LLMs have changed how I work in a couple of ways:
1) They make it a lot easier to context switch between e.g. writing kernel code one day and a Pandas notebook the next, because you're no longer handicapped by slightly forgetting the idiosyncrasies of every single language. It's like having smart code search and documentation search built into the autocomplete.
2) They can do simple transformations of existing code really well, like generating a match expression from an enum. They can extrapolate the rest from 2-3 examples of something repetitive, like converting from Rust types into corresponding Arrow types.
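As a concrete illustration of that first kind of transformation: the commenter's examples are in Rust, but this hypothetical Python equivalent shows the shape of the mechanical expansion an assistant fills in.

```python
from enum import Enum, auto

class Status(Enum):
    PENDING = auto()
    ACTIVE = auto()
    CLOSED = auto()

# The kind of expansion an assistant autocompletes from the enum above:
# one case per variant, ready for the human to fill in or adjust.
def describe(status: Status) -> str:
    match status:
        case Status.PENDING:
            return "waiting to start"
        case Status.ACTIVE:
            return "in progress"
        case Status.CLOSED:
            return "finished"
```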
I don't find the other use cases the author brings up realistic. The AI is terrible at code review and I have never seen it spot a logic error I missed. Asking the AI to explain how e.g. Unity works might feel nice, but the answers are at least 40% total bullshit and I think it's easier to just read the documentation.
I still get a lot of use out of Copilot. The speed boost and removal of friction lets me work on more stacks and, consequently, lead a much bigger span of related projects. Instead of explaining how to do something to a junior engineer, I can often just do it myself.
I don't understand how fresh grads can get use out of these things, though. Tools like Copilot need a lot of hand-holding. You can get them to follow simple instructions over a moderate amount of existing code, which works most of the time, or ask them to do something you don't exactly know how to do without looking it up, and then it's a crapshoot.
The main reason I get a lot of mileage out of Copilot is exactly because I have been doing this job for two decades and understand what's happening. People who are starting in the industry today, IMO, should be very judicious with how they use these tools, lest they end up with only a superficial knowledge of computing. Every project is a chance to learn, and by going all trial-and-error with a chatbot you're robbing yourself of that. (Not to mention the resulting code is almost certainly half-broken.)
That's bad because it makes "not training your juniors" the default path for senior people.
I can assign the task to one of my junior engineers and they will take several days of back and forth with me to work out the details--that's annoying but it's how you train the next generation.
Or I can ask the LLM and it will spit back something from its innards that got indexed from Github or StackOverflow. And for a "junior engineer" task it will probably be correct with the occasional hallucination--just like my junior engineers. And all I have to do for the LLM is click a couple of keys.
With all the talk of o1-pro as a superb staff-engineer-level architect, it took me a while to re-parse this headline and understand what the author, apparently a staff engineer, meant.
I stick to a "no copy & paste" rule and that includes autocomplete. Interactions are a conversation but I write all my code myself.
I would be so bored if my job consisted of writing prompts all day long.
- imprecise semantic search
- simple auto-completion (1-5 tokens)
- copying patterns with substitutions
- inserting commonly-used templates
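To illustrate the "copying patterns with substitutions" case with a hypothetical Python example (the field names and parsers are made up): you write the first entry or two by hand and the completion extrapolates the rest of the pattern.

```python
from datetime import datetime

def parse_timestamp(value: str) -> datetime:
    return datetime.fromisoformat(value)

# After the first one or two entries, autocomplete reliably extrapolates
# the remaining (field, parser) pairs from the established pattern.
FIELD_PARSERS = {
    "created_at": parse_timestamp,
    "updated_at": parse_timestamp,
    "deleted_at": parse_timestamp,
    "user_id": int,
    "amount": float,
}

def parse_record(raw: dict) -> dict:
    """Apply the per-field parsers, defaulting to str for unknown fields."""
    return {key: FIELD_PARSERS.get(key, str)(value) for key, value in raw.items()}
```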