Test-Driven Development with an LLM for Fun and Profit
The blog post highlights integrating Test-Driven Development with Large Language Models to improve software practices, emphasizing clear specifications, iterative debugging, modular design, and caution regarding AI advancements.
Read original articleThe blog post discusses the integration of Test-Driven Development (TDD) with Large Language Models (LLMs) to enhance software development practices. The author reflects on the challenges of using AI tools like GitHub Copilot, which often struggle with complex specifications and generating complete solutions. To address these issues, the author proposes a structured approach where developers provide clear specifications and function signatures to the LLM, which then generates unit tests and implementations. This iterative process allows for debugging and refining code based on test results, reducing the cognitive load on developers. The author emphasizes the importance of maintaining a well-organized project structure to facilitate LLM workflows, advocating for a modular design that encourages independent testing. The post concludes with a cautionary note about the unpredictability of AI advancements, suggesting that developers should be mindful of potential changes in LLM capabilities before overhauling existing codebases.
- The integration of LLMs with TDD can improve software development efficiency.
- Clear specifications and structured prompts are essential for effective LLM use.
- An iterative approach allows for debugging and refining code based on test outcomes.
- Maintaining a modular project structure can enhance LLM workflows and reduce cognitive load.
- Developers should remain cautious about the evolving nature of AI technologies.
Related
Self hosting a Copilot replacement: my personal experience
The author shares their experience self-hosting a GitHub Copilot replacement using local Large Language Models (LLMs). Results varied, with none matching Copilot's speed and accuracy. Despite challenges, the author plans to continue using Copilot.
How I Program with LLMs
The author discusses the positive impact of large language models on programming productivity, highlighting their uses in autocomplete, search, and chat-driven programming, while emphasizing the importance of clear objectives.
Cheating Is All You Need
Steve Yegge discusses the transformative potential of Large Language Models in software engineering, emphasizing their productivity benefits, addressing skepticism, and advocating for their adoption to avoid missed opportunities.
- Many users express skepticism about the reliability of LLM-generated outputs, particularly for non-code tasks, emphasizing the importance of human verification.
- There is a debate over the definition and practice of TDD, with some arguing that traditional TDD principles are being overlooked or misapplied in the context of LLMs.
- Several commenters share personal experiences and projects that utilize LLMs for generating tests or code, highlighting both successes and challenges.
- Concerns are raised about the potential for LLMs to produce overly specific or poorly structured code that may not align with best practices.
- Some participants advocate for a careful balance between automation and human oversight in the testing process to ensure quality and comprehensiveness.
1. Coding assistants based on o1 and Sonnet are pretty great at coding with <50k context, but degrade rapidly beyond that.
2. Coding agents do massively better when they have a test-driven reward signal.
3. If a problem can be framed in a way that a coding agent can solve, that speeds up development at least 10x from the base case of human + assistant.
4. From (1)-(3), if you can get all the necessary context into 50k tokens and measure progress via tests, you can speed up development by 10x.
5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.
Sure enough, I see HN projects evolving in that direction.
The code is at least testable and verifiable. For everything else I am left wondering if it's the truth or a hallucination. It incurs more mental burden that I was trying to avoid using LLM in the first place.
Example usage from that README (and the blog post):
% go run main.go \
--spec 'develop a function to take in a large text, recognize and parse any and all ipv4 and ipv6 addresses and CIDRs contained within it (these may be surrounded by random words or symbols like commas), then return them as a list' \
--sig 'func ParseCidrs(input string) ([]*net.IPNet, error)'
The all important prompts it uses are in https://github.com/yfzhou0904/tdd-with-llm-go/blob/main/prom...Is the label "TDD" being hijacked for something new? Did that already happen? Are LLMs now responsible for defining TDD?
No clunky loop needed.
It's gotten me back into TDD.
Automation doesn't seem like a good idea. I feel it's mandatory to carefully guard the LLM, not only to verify that the LLM-generated tests (functions) are as expected, but also to modify some code that, while not affecting the correctness of the function, has low performance or poor readability.
What's the main barrier to doing this all the time? Sounds like a good practice in general.
If you want better tests with more cases exercising your code: write property based tests.
Tests form an executable, informal specification of what your software is supposed to do. It should absolutely be written by hand, by a human, for other humans to use and understand. Natural language is not precise enough for even informal specifications of software modules, let alone software systems.
If using LLM's to help you write the code is your jam, I can't stop you, but at least write the tests. They're more important.
As an aside, I understand how this antipathy towards TDD develops. People write unit tests, after writing the implementation, because they see it as boilerplate code that mirrors what the code they're testing already does. They're missing the point of what makes a good test useful and sufficient. I would not expect generating more tests of this nature is going to improve software much.
Edit added some wording for clarity
That's what the software industry has been trying and failing at for more than a decade.
This doesn't actually work out that well in practice though because the implementations the llm tended to generate were highly specific to pass the tests. There were several times it would cheat and just return hard coded strings that matched the expects of the tests. I'm sure better prompt engineering could help, but it was a fairly funny outcome.
Something I've found more valuable is generating the tests themselves. Obviously you don't wholesale rely on what's generated. Tests can have a certain activation energy just to figure out how to set up correctly (especially if you're in a new project). Having an LLM take a first pass at it and then ensuring it's well structured and testing important codepaths instead of implementation details makes it a lot faster to write tests.
https://gist.github.com/czhu12/b3fe42454f9fdf626baeaf9c83ab3...
It basically starts from some model or controller, and then parses the Ruby code into an AST, and load all the references, and then parses that code into an AST, up to X number of files, and ships them all off to GPT4-o1 for writing a spec.
I found sometimes, without further prompting, the LLM would write specs that were so heavily mocked that it became almost useless like:
``` mock(add_two_numbers).and_return(3) ... expect(add_two_numbers(1, 2)).to_return(3) ``` (Not that bad, just an illustrating example)
But the tests it generates is quite good overall, and sometimes shockingly good.
Did I miss the generated code and test cases? I would like to see how complete it was.
For example, for IPv4 does it only handle quad-dotted IP addresses, or does it also handle decimal and hex formats?
For that matter, should it handle those, and if so, where there clarification of what exactly 'all ipv4 ... addresses' means?
I can think of a lot of tricky cases (like 1.2.3.4.5 and 3::2::1 as invalid cases, or http://[2001:db8:4006:812::200e] to test for "symbols like commas"), and would like to see if the result handles them.
Related
Self hosting a Copilot replacement: my personal experience
The author shares their experience self-hosting a GitHub Copilot replacement using local Large Language Models (LLMs). Results varied, with none matching Copilot's speed and accuracy. Despite challenges, the author plans to continue using Copilot.
How I Program with LLMs
The author discusses the positive impact of large language models on programming productivity, highlighting their uses in autocomplete, search, and chat-driven programming, while emphasizing the importance of clear objectives.
Cheating Is All You Need
Steve Yegge discusses the transformative potential of Large Language Models in software engineering, emphasizing their productivity benefits, addressing skepticism, and advocating for their adoption to avoid missed opportunities.