January 16th, 2025

Test-Driven Development with an LLM for Fun and Profit

The blog post highlights integrating Test-Driven Development with Large Language Models to improve software practices, emphasizing clear specifications, iterative debugging, modular design, and caution regarding AI advancements.

Read original articleLink Icon
CuriositySkepticismEnthusiasm
Test-Driven Development with an LLM for Fun and Profit

The blog post discusses the integration of Test-Driven Development (TDD) with Large Language Models (LLMs) to enhance software development practices. The author reflects on the challenges of using AI tools like GitHub Copilot, which often struggle with complex specifications and generating complete solutions. To address these issues, the author proposes a structured approach where developers provide clear specifications and function signatures to the LLM, which then generates unit tests and implementations. This iterative process allows for debugging and refining code based on test results, reducing the cognitive load on developers. The author emphasizes the importance of maintaining a well-organized project structure to facilitate LLM workflows, advocating for a modular design that encourages independent testing. The post concludes with a cautionary note about the unpredictability of AI advancements, suggesting that developers should be mindful of potential changes in LLM capabilities before overhauling existing codebases.

- The integration of LLMs with TDD can improve software development efficiency.

- Clear specifications and structured prompts are essential for effective LLM use.

- An iterative approach allows for debugging and refining code based on test outcomes.

- Maintaining a modular project structure can enhance LLM workflows and reduce cognitive load.

- Developers should remain cautious about the evolving nature of AI technologies.

AI: What people are saying
The comments reflect a diverse range of opinions on integrating Test-Driven Development (TDD) with Large Language Models (LLMs).
  • Many users express skepticism about the reliability of LLM-generated outputs, particularly for non-code tasks, emphasizing the importance of human verification.
  • There is a debate over the definition and practice of TDD, with some arguing that traditional TDD principles are being overlooked or misapplied in the context of LLMs.
  • Several commenters share personal experiences and projects that utilize LLMs for generating tests or code, highlighting both successes and challenges.
  • Concerns are raised about the potential for LLMs to produce overly specific or poorly structured code that may not align with best practices.
  • Some participants advocate for a careful balance between automation and human oversight in the testing process to ensure quality and comprehensiveness.
Link Icon 17 comments
By @xianshou - 1 day
One trend I've noticed, framed as a logical deduction:

1. Coding assistants based on o1 and Sonnet are pretty great at coding with <50k context, but degrade rapidly beyond that.

2. Coding agents do massively better when they have a test-driven reward signal.

3. If a problem can be framed in a way that a coding agent can solve, that speeds up development at least 10x from the base case of human + assistant.

4. From (1)-(3), if you can get all the necessary context into 50k tokens and measure progress via tests, you can speed up development by 10x.

5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.

Sure enough, I see HN projects evolving in that direction.

By @smusamashah - 1 day
On a similar note, has anyone found themselves absolutely not trusting non-code LLM output?

The code is at least testable and verifiable. For everything else I am left wondering if it's the truth or a hallucination. It incurs more mental burden that I was trying to avoid using LLM in the first place.

By @simonw - 1 day
Here's the Go app described in the post: https://github.com/yfzhou0904/tdd-with-llm-go

Example usage from that README (and the blog post):

  % go run main.go \
  --spec 'develop a function to take in a large text, recognize and parse any and all ipv4 and ipv6 addresses and CIDRs contained within it (these may be surrounded by random words or symbols like commas), then return them as a list' \
  --sig 'func ParseCidrs(input string) ([]*net.IPNet, error)'
The all important prompts it uses are in https://github.com/yfzhou0904/tdd-with-llm-go/blob/main/prom...
By @voiceofunreason - 1 day
I have yet to see an LLM + TDD essay where the author demonstrates any mastery of Test Driven Development.

Is the label "TDD" being hijacked for something new? Did that already happen? Are LLMs now responsible for defining TDD?

By @blopker - 1 day
In Rust, there's a controversial practice around putting unit tests in the same file as the actual code. I was put off by it at first, but I'm finding LLM autocomplete is able to be much more effective just being able to see the tests.

No clunky loop needed.

It's gotten me back into TDD.

By @mmikeff - 1 day
Writing a whole load of tests up front and then coding until all the tests pass is not TDD.
By @suraci - about 7 hours
In my experience, I let the LLM help me produce code and tests. Most of my human effort is dedicated to verifying the tests, and then using the tests to verify the code.

Automation doesn't seem like a good idea. I feel it's mandatory to carefully guard the LLM, not only to verify that the LLM-generated tests (functions) are as expected, but also to modify some code that, while not affecting the correctness of the function, has low performance or poor readability.

By @jacobpedd - 1 day
> For best results, our project structure needs to be set up with LLM workflows in mind. Specifically, we should carefully manage and keep the cognitive load required to understand and contribute code to a project at a minimum.

What's the main barrier to doing this all the time? Sounds like a good practice in general.

By @vydra - 1 day
We implemented something similar for our Java backend project based on my rant here: https://testdriven.com/testdriven-2-0-8354e8ad73d7 Works great! I only look at generated code if it passes the tests. Now, can we use LLMs to generate tests from requirements? Maybe, but tests are mostly declarative and are easier to write than production code most of the time. This approach also allows us to use cheaper models, because the tool will automatically tell the model about compile error and failed tests. Usually, we give it up to five attempts to fix the code.
By @erlapso - about 15 hours
Super interesting approach! We've been working on the opposite - always getting your Unit tests written with every PR. The idea is that you don't have to bother running or writing them, you just get them delivered in your Github repo. You can check it out here https://www.codebeaver.ai
By @agentultra - 1 day
This is not a good idea.

If you want better tests with more cases exercising your code: write property based tests.

Tests form an executable, informal specification of what your software is supposed to do. It should absolutely be written by hand, by a human, for other humans to use and understand. Natural language is not precise enough for even informal specifications of software modules, let alone software systems.

If using LLM's to help you write the code is your jam, I can't stop you, but at least write the tests. They're more important.

As an aside, I understand how this antipathy towards TDD develops. People write unit tests, after writing the implementation, because they see it as boilerplate code that mirrors what the code they're testing already does. They're missing the point of what makes a good test useful and sufficient. I would not expect generating more tests of this nature is going to improve software much.

Edit added some wording for clarity

By @jappgar - about 22 hours
> 5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.

That's what the software industry has been trying and failing at for more than a decade.

By @zephraph - 1 day
Hey, yeah, this is a fun idea. I built a little toy llm-tdd loop as a Saturday morning side project a little while back: https://github.com/zephraph/llm-tdd.

This doesn't actually work out that well in practice though because the implementations the llm tended to generate were highly specific to pass the tests. There were several times it would cheat and just return hard coded strings that matched the expects of the tests. I'm sure better prompt engineering could help, but it was a fairly funny outcome.

Something I've found more valuable is generating the tests themselves. Obviously you don't wholesale rely on what's generated. Tests can have a certain activation energy just to figure out how to set up correctly (especially if you're in a new project). Having an LLM take a first pass at it and then ensuring it's well structured and testing important codepaths instead of implementation details makes it a lot faster to write tests.

By @czhu12 - 1 day
I did something similar for autogenerating RSpec tests in a Rails project.

https://gist.github.com/czhu12/b3fe42454f9fdf626baeaf9c83ab3...

It basically starts from some model or controller, and then parses the Ruby code into an AST, and load all the references, and then parses that code into an AST, up to X number of files, and ships them all off to GPT4-o1 for writing a spec.

I found sometimes, without further prompting, the LLM would write specs that were so heavily mocked that it became almost useless like:

``` mock(add_two_numbers).and_return(3) ... expect(add_two_numbers(1, 2)).to_return(3) ``` (Not that bad, just an illustrating example)

But the tests it generates is quite good overall, and sometimes shockingly good.

By @eesmith - 1 day
> recognize and parse any and all ipv4 and ipv6 addresses and CIDRs contained within it (these may be surrounded by random words or symbols like commas), then return them as a list'

Did I miss the generated code and test cases? I would like to see how complete it was.

For example, for IPv4 does it only handle quad-dotted IP addresses, or does it also handle decimal and hex formats?

For that matter, should it handle those, and if so, where there clarification of what exactly 'all ipv4 ... addresses' means?

I can think of a lot of tricky cases (like 1.2.3.4.5 and 3::2::1 as invalid cases, or http://[2001:db8:4006:812::200e] to test for "symbols like commas"), and would like to see if the result handles them.

By @picografix - 1 day
very few times we are encountered with developing from scratch
By @ComputerGuru - about 9 hours
I’m not going to claim I’ve solved this and figured out “the way” to use LLMs for tests, but I’ve found that copy-and-pasting code + tests and then providing a short essay about my own reasoning of edge cases followed with something along the lines of “your job is to find out what edge cases my reasoning isn’t accounting for, cases that would expose latent properties of the implementation not exposed via its contract, cases tested for by other similar code, domain exceptions I’m not accounting for, cases that test unexplored code paths, cases that align exactly with chunking boundaries or that break chunking assumptions, or any other edge cases I’m neglecting to mention that would be useful both to catch mistakes in the current code and to handle foreseeable mistakes that could arise from refactoring in the future. Try to understand how the existing test cases are defined to catch possibly problematic inputs and extend accordingly. Take into account both the api contract and the underlying implementation and approach this matter from an adversarial perspective where the goal of the tests is to challenge the author’s assumptions and break their code” has been useful.