The sad state of property-based testing libraries
Property-based testing, which originated in the late nineties at Chalmers University, emphasizes generating tests rather than writing them by hand. While most open-source libraries lack advanced features, the closed-source Quviq QuickCheck excels at stateful and parallel testing, setting the industry standard.
Property-based testing has gained popularity in programming communities, with the concept of generating tests instead of writing them becoming mainstream. The history of property-based testing traces back to the late nineties at Chalmers University in Sweden, where researchers John Hughes and Koen Claessen developed the first implementation, QuickCheck for Haskell. This approach allows programs to be tested against formal specifications without exhaustive manual testing or formal proofs.
Despite its potential, a recent survey of property-based testing libraries reveals that many popular implementations across various programming languages lack advanced features such as stateful and parallel testing. While some libraries support stateful testing, parallel testing remains a challenge for most. The closed-source Quviq QuickCheck library, developed by John Hughes and Thomas Arts, stands out for excelling at both, setting a high standard for the industry.
The analysis points to a gap between the current state of property-based testing libraries and the advanced capabilities demonstrated by Quviq QuickCheck. Users have expressed the need for more sophisticated testing features, but progress in implementing stateful and parallel testing functionalities has been slow across many open-source libraries. This discrepancy raises questions about the design and evolution of property-based testing tools in the programming landscape.
Related
https://www.tedinski.com/2018/12/11/fuzzing-and-property-tes...
Like, with the below fuzz test, and the corresponding invariant checks, isn't this all but equivalent to property tests?
https://github.com/ncruces/aa/blob/505cbbf94973042cc7af4d6be...
https://github.com/ncruces/aa/blob/505cbbf94973042cc7af4d6be...
It seems like Hypothesis is unable to handle simple but "large" data sets >>by design<<, where "large" is really not so large. [0] It was such a pain that we ripped out Hypothesis (and generative testing altogether, sadly) completely from our python test suite at work.
[0] https://github.com/HypothesisWorks/hypothesis/issues/3493
For parallel testing I guess it can be useful at times, but often it's easier to just run a bunch of tests in parallel instead.
type Instruction = | Read of stuff | Write of stuff | Seek of stuff | …
And then the property is "given this list of instructions, blah". The `StateModel` formalism requires doing basically the same work! I really don't see that `StateModel` is pulling its weight: it's adding a bunch more framework code to understand, and the benefit is that it's getting rid of what is in my experience a very small amount of test code.
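A minimal sketch of that hand-rolled style in Python with Hypothesis (the key-value store, the instruction encoding, and the property are made up for illustration):

from hypothesis import given, strategies as st

class KvStore:
    """Stand-in system under test."""
    def __init__(self):
        self._d = {}
    def put(self, k, v):
        self._d[k] = v
    def get(self, k):
        return self._d.get(k)
    def delete(self, k):
        self._d.pop(k, None)

keys = st.integers(0, 3)   # tiny key space so instructions collide
instruction = st.one_of(
    st.tuples(st.just("put"), keys, st.integers()),
    st.tuples(st.just("get"), keys),
    st.tuples(st.just("delete"), keys),
)

@given(st.lists(instruction))
def test_store_agrees_with_model(instrs):
    """Given this list of instructions, the store agrees with a plain dict."""
    sut, model = KvStore(), {}
    for ins in instrs:
        if ins[0] == "put":
            _, k, v = ins
            sut.put(k, v)
            model[k] = v
        elif ins[0] == "get":
            assert sut.get(ins[1]) == model.get(ins[1])
        else:
            sut.delete(ins[1])
            model.pop(ins[1], None)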
One of these is coverage guided PBT, check eg. Dan Luu's article: https://danluu.com/testing/
Another one, which I'm biased towards, is making shrinking automatic while keeping all the invariants you created while generating the values. I have a talk on this but the TL;DR is that
- derived QuickCheck-like shrinker functions that work on values (shrink : a -> [a]) have trouble preserving the generators' invariants, which leads people to turn shrinking off rather than deal with the issue
- rose tree "integrated shrinking" (eg. Hedgehog) follows the constraints of the generators, but has issues with monadic bind (using results of generators to dispatch to another generator)
- the only approach that seems to magically "just work" is Hypothesis' "internal shrinking", which adds a layer of indirection and shrinks the lists of random choices instead of the values themselves. The disadvantage is that the generators are now parsers over lists of bytes, parsers that can fail (introducing some inefficiency), and a user can write a crazy generator that the internal shrinker will not be able to shrink perfectly. Nevertheless, it has the best DX of the three approaches, and given that it's a small miracle people are testing at all, it feels to me like the approach most worth building on as a testing-library author.
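To make that last approach concrete, here is a toy Python sketch of the internal-shrinking idea: the generator is a parser over a recorded list of choices, and the shrinker edits that list rather than the generated value, so invariants the generator enforces (here, sortedness) survive shrinking for free. This is only an illustration, not Hypothesis's actual implementation, and every name in it is made up.

import random

class Choices:
    """A generator draws from this; fresh draws are recorded, replays are clamped."""
    def __init__(self, prerecorded=None):
        self.log = list(prerecorded) if prerecorded is not None else []
        self.fresh = prerecorded is None
        self.i = 0

    def draw(self, hi):
        if self.i < len(self.log):
            v = min(self.log[self.i], hi)   # replay a recorded choice
        elif self.fresh:
            v = random.randint(0, hi)       # fresh run: draw and record
            self.log.append(v)
        else:
            v = 0                           # log exhausted while replaying
        self.i += 1
        return v

def gen_sorted_list(c):
    """Example generator with an invariant: always returns a sorted list."""
    n = c.draw(10)
    total, out = 0, []
    for _ in range(n):
        total += c.draw(100)                # non-negative deltas keep it sorted
        out.append(total)
    return out

def prop_no_big_gaps(xs):
    """A deliberately false property, so there is something to shrink."""
    return all(b - a < 50 for a, b in zip(xs, xs[1:]))

def shrink(choices, prop, generate):
    """Greedily delete or halve recorded choices while the property still fails."""
    best = list(choices)
    improved = True
    while improved:
        improved = False
        candidates = [best[:k] + best[k + 1:] for k in range(len(best))]
        candidates += [best[:k] + [best[k] // 2] + best[k + 1:]
                       for k in range(len(best)) if best[k] > 0]
        for cand in candidates:
            if not prop(generate(Choices(cand))):   # still a counterexample?
                best, improved = cand, True
                break
    return generate(Choices(best))

for _ in range(1000):
    c = Choices()
    value = gen_sorted_list(c)
    if not prop_no_big_gaps(value):
        print("failing example:", value)
        print("shrunk example: ", shrink(c.log, prop_no_big_gaps, gen_sorted_list))
        break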
Parallel testing is interesting, but hasn't been a large source of pain yet.
Besides, it would be really cool to have property-based integration or hybrid unit/integration tests, or even property-based E2E tests. Unfortunately, the per-example setup will almost always take too long to run a meaningful number of examples.
For instance: if you have a basic data model (say in SQLAlchemy) and want to write property-based tests (say in Hypothesis), you can relatively quickly derive strategies for the models (but beware of recursion and primary keys). But writing each model instance into the DB to run an example simply takes too long to keep the tests fast.
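As a rough illustration of the "derive strategies for the models" part (the model, its columns, and the property below are made up; assumes SQLAlchemy 1.4+ and Hypothesis):

from hypothesis import given, strategies as st
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)   # leave primary keys to the database
    name = Column(String, nullable=False)
    age = Column(Integer)

# st.builds calls the model constructor with generated keyword arguments.
users = st.builds(
    User,
    name=st.text(min_size=1, max_size=20),
    age=st.integers(min_value=0, max_value=120),
)

@given(users)
def test_user_is_valid(user):
    # A real test would add `user` to a session, commit, and query it back;
    # that per-example round trip to the database is the slow part.
    assert 0 <= user.age <= 120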
Big fan of property testing.
A particularly big fan of Hedgehog.
I tried writing a Hedgehog-inspired library in Rust and realized how complex the underlying domain is.
For example, the RNG is "splittable" [1][2], which means you can take a deterministic seed and split it in two, e.g. so that parallel generators work independently but still deterministically. The effort that has gone into this one feature is a little numbing. I feel an awe similar to what the "fast inverse square root" hack inspires when I see code like this:
-- | A predefined gamma value's needed for initializing the "root" instances of
-- 'Seed'. That is, instances not produced by splitting an already existing
-- instance.
--
-- We choose: the odd integer closest to @2^64/φ@, where @φ = (1 + √5)/2@ is
-- the golden ratio.
--
goldenGamma :: Word64
goldenGamma =
  0x9e3779b97f4a7c15
when realizing that those numbers don't come easily. [3]
[1]: https://hackage.haskell.org/package/hedgehog-1.4/docs/src/He...
[2]: https://gee.cs.oswego.edu/dl/papers/oopsla14.pdf
[3]: https://github.com/hedgehogqa/haskell-hedgehog/issues/191
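For a feel of what splitting buys you, here is a toy Python rendering of the SplitMix64-style idea behind it: the same goldenGamma constant advances the state, and a split derives two child seeds that can be used independently but reproducibly. This is a simplified illustration, not Hedgehog's actual algorithm, which also derives a fresh gamma per child (the mixGamma step described in the paper [2]).

MASK64 = (1 << 64) - 1
GOLDEN_GAMMA = 0x9E3779B97F4A7C15   # odd integer closest to 2^64 / phi

def mix64(z):
    """SplitMix64 finalizer: scrambles a 64-bit state into an output word."""
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

def next_word(seed):
    """Advance the state by the gamma and return (output, new_seed)."""
    seed = (seed + GOLDEN_GAMMA) & MASK64
    return mix64(seed), seed

def split(seed):
    """Derive two independent child seeds from one parent (toy version)."""
    left, seed = next_word(seed)
    right, seed = next_word(seed)
    return left, right

left, right = split(42)
# left and right can now seed two generators running in parallel, and the
# whole tree of values stays reproducible from the root seed 42.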
Wonder why most property-testing libraries don't have features like this?
The libraries require training to use. And they're not that easy to write.
> the current state-of-the-art when it comes to property-based testing is stateful testing via a state machine model and reusing the same sequential state machine model combined with linearisability to achieve parallel testing
Okay, okay. I admit I've never performed property-based stateful testing, nor in parallel. So that may be the coolest feature out there, because it addresses one of the hardest problems in testing.
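For a flavour of what that looks like, here is a toy Python sketch of the parallel part: run commands from two threads against a deliberately racy counter, then ask whether any interleaving that respects each thread's order could have come from the correct sequential model. The counter, the sleep that widens the race window, and the fixed command lists are all made up for illustration; real libraries also generate and shrink the command sequences.

import random
import threading
import time

class RacyCounter:
    """System under test: an unsynchronized counter with a lost-update bug."""
    def __init__(self):
        self.value = 0
    def incr(self):
        v = self.value                       # read
        time.sleep(random.random() * 1e-4)   # widen the race window
        self.value = v + 1                   # write back (may lose an update)
        return self.value

def interleavings(a, b):
    """Every merge of a and b that preserves each list's internal order."""
    if not a: yield list(b); return
    if not b: yield list(a); return
    for rest in interleavings(a[1:], b): yield [a[0]] + rest
    for rest in interleavings(a, b[1:]): yield [b[0]] + rest

def explained_by_model(h1, h2):
    """A correct sequential counter returns 1, 2, ..., n in execution order,
    so the histories are OK iff some order-respecting merge equals that."""
    n = len(h1) + len(h2)
    return any(order == list(range(1, n + 1)) for order in interleavings(h1, h2))

def one_parallel_run(ops=3):
    sut, results = RacyCounter(), [[], []]
    def worker(tid):
        for _ in range(ops):
            results[tid].append(sut.incr())
    threads = [threading.Thread(target=worker, args=(t,)) for t in (0, 1)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results

for run in range(200):                       # repeat; the race is nondeterministic
    h1, h2 = one_parallel_run()
    if not explained_by_model(h1, h2):
        print(f"run {run}: {h1} / {h2} match no sequential interleaving")
        break
else:
    print("no non-linearizable history observed in 200 runs")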
But I think that yet other things have happened with modern property-testing libraries (e.g. Hypothesis, PropEr, Hedgehog, Validity):
Shrinking for free [4], generators for free [5], defining the probability distribution of your sub-generators in a composable way.
Maybe those features are not as significant, but they're equally missing from almost all property-testing libraries.
[4]: Gens N’ Roses: Appetite for Reduction • Jacob Stanley • YOW! 2017 https://www.youtube.com/watch?v=LfD0DHqpeVQ
[5]: https://tech.fpcomplete.com/blog/quickcheck-hedgehog-validit...
One pitfall with non-coverage-driven randomized testing like QuickCheck is that how good your tests are depends a lot on the generator. It may very rarely generate interesting inputs because you biased the generator the wrong way, and depending on how you do the generation, you need to be careful to ensure the generator halts. With coverage-driven fuzzing all of these problems go away: you don't have to be smart about choosing distributions so that interesting cases become more common, because coverage instrumentation automatically discovers new paths in your program and drills down on them.
But isn’t fuzzing about feeding a large program or function random bytestrings as inputs, whereas property-based testing is about testing properties about data structures? It is true that fuzzers operate on bytestrings, but there is no rule that says we can’t use that bytestring to generate a data structure (in a sense, replacing the role of the random seed). And indeed this is what the Arbitrary crate [1] in Rust does, it gives tools and even derive macros to automatically generate values of your data types in the same way that QuickCheck can. The fuzzing community calls this Structure-Aware Fuzzing and there is a chapter about it in the Rust fuzzing book [2]. There are also tools like libprotobuf-mutator [3] that substitute fuzzers’ naive mutation strategies, but even with naive strategies fuzzers can usually get to 100% coverage with appropriate measures (e.g. recomputing checksums after mutation, if the data structure contains checksums).
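A minimal sketch of that in Python, assuming Google's Atheris fuzzer is installed (the decoded structure and the sorted() properties are placeholders; the point is that the coverage-guided fuzzer's bytestring drives the generation, playing the role of the random seed):

import sys
from collections import Counter

import atheris

def TestOneInput(data):
    # Parse the fuzzer-provided bytes into a structured value: a list of ints.
    fdp = atheris.FuzzedDataProvider(data)
    n = fdp.ConsumeIntInRange(0, 64)
    xs = [fdp.ConsumeIntInRange(-1000, 1000) for _ in range(n)]

    # Property-style assertions over the structured value.
    ys = sorted(xs)
    assert all(a <= b for a, b in zip(ys, ys[1:]))   # output is ordered
    assert Counter(ys) == Counter(xs)                # output is a permutation

if __name__ == "__main__":
    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()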
I am using this extensively in my own projects. For example, RCL (a configuration language that is a superset of JSON) contains multiple fuzzers that test various properties [4], such as idempotency of the formatter. In the beginning it used the raw source files as inputs, but I also added a more advanced generator that wastes fewer cycles on inputs that get rejected by the parser. The fuzzer has caught several bugs, and most of them would have had no hope of being found with naive randomized testing, because they required a cascade of unlikely events.
Structure-aware fuzzing is not limited to generating data structures either, you can use it to generate reified commands to test a stateful API, as you describe in the _Stateful property-based testing_ section. The Rust fuzzing book has an example of this [5], and I use this approach to fuzz a tree implementation in Noblit [6].
[1]: https://docs.rs/arbitrary/latest/arbitrary/
[2]: https://rust-fuzz.github.io/book/cargo-fuzz/structure-aware-...
[3]: https://github.com/google/libprotobuf-mutator
[4]: https://docs.ruuda.nl/rcl/testing/#fuzz-tests
[5]: https://rust-fuzz.github.io/book/cargo-fuzz/structure-aware-...
[6]: https://github.com/ruuda/noblit/blob/a0fd1342c4aa6e05f2b1c4e...
First off, the "training" problem is a very real and fundamental blocker. I think academics or less experienced engineers see it as a problem of education, but I believe it's actually an engineering/organizational problem: for a company or even an OSS project, when you use niche, unadopted tools like this you either need to provide training for them or you massively shrink the set of people able to work on your project by requiring or expecting them to learn the tools on their own. This introduces a practical headache bigger than the ones a tool like this solves (since these are just incrementally better ways to test software): you get fewer contributions or need to spend more on getting people onboarded. Note that even if "training" people is free in the sense that there is no paid training material, you still pay a cost in lower productivity for new people and in the time spent training.
Even once people are trained, you now have a process/institutional need to support the property-based tests you wrote. That may mean paying for software licenses, or maintaining whatever infrastructure/integrations you write to get property-based testing working in your developer workflow. And you also now need to rely on people using these tools correctly - the problem with formal verification is that it doesn't verify that you're actually solving the real problem, just that your program operates as specified (i.e. it only verifies that your software is Working as Implemented, not Working as Intended). The more you rely on average joes to wield complex things like this well, the more headaches you introduce - if the bottom 10% of people you hire use it sloppily or wrongly, then once you're past a few dozen regular contributors you're basically going to have constant issues. You see this all the time with even just regular unit and integration tests - less capable developers constantly introduce flakes or write tests so bad/basic that it'd be better for them not to write tests at all.
Even if, after considering all that, it still seems a worthy tradeoff, there's the question of whether property-based testing solves the problem you think it does. As I mentioned, it verifies that software is Working as Implemented more than Working as Intended. But there is the bigger problem that major software projects don't conform to the simple use cases that sell people on the tool. In my experience stateless functions almost never have bugs, and when they do, it's mostly a "requirements discovery" problem you wouldn't catch with any kind of verification tooling.
Stateful verification becomes a problem in the kinds of real systems-programming projects, like the Linux kernel or QEMU, which stand to benefit most from verification. If you're verifying things at the level of a "higher level" abstraction like a scheduler, you almost always have really high dimensionality / a lot of state, perhaps deceptively more than you think, because of composition. And you tend to also have a dev loop that is already very slow - builds take a long time, and running tests of higher-level components can be "expensive" because it may take 100ms-1s for something like a kernel to finish starting in a test environment. For a single test that's nothing, but multiplied across all tests it already makes testing painful; adding a 3x or 100x multiplier by doing property-based testing with new tests, even if pruned before submitting, could be either hugely expensive or way too slow. And similarly, the increased resource requirements of maintaining state machines can be very expensive. TSAN does essentially parallel property-based testing and introduces 128-512 bytes of overhead for every 8 bytes of memory under test: https://github.com/google/sanitizers/wiki/ThreadSanitizerAlg....
Is it cool and underrated? Definitely. But it's not a silver bullet and honestly just isn't worth it for most software. Even when it is worth it, it has its drawbacks.
Today a weird, scary splinter faction of the already terrifying Effective Altruism religion/cult is talking about building digital gods they can’t control and saying with a straight face that this is imminent (absurdly extravagant claim with no evidence more convincing than a logarithmic scale and a ruler) and using this as an argument to redefine everything from the definition of science to the very epistemology of factuality and getting the public to directly or indirectly subsidize it into the bargain.
People who make Ray Kurzweil’s Singularity magical thinking look flat fucking normal, and who frankly sound more like L Ron Hubbard every day, are unilaterally deciding that all research into, ya know, what they claim is the most important event in human history is totally black-box proprietary because, ya know, it's for our own good.
I’ll take the polite Swiss functional programming folks over Anthropic’s Mechanical Interpretability people seven days a week and twice on Sunday.
Can we go back to this please? The new thing sucks.