July 4th, 2024

The sad state of property-based testing libraries

Property-based testing, which originated in the late nineties at Chalmers University, emphasizes generating tests over writing them by hand. While most open-source libraries lack advanced features such as stateful and parallel testing, the closed-source Quviq QuickCheck excels at both, setting the industry standard.


Property-based testing has gained popularity in programming communities, with the concept of generating tests instead of writing them becoming mainstream. The history of property-based testing traces back to the late nineties at Chalmers University in Sweden, where researchers like John Hughes and Koen Claessen developed the first implementation known as QuickCheck for Haskell. This approach allows for testing programs against formal specifications without the need for exhaustive manual testing or formal proofs.

Despite its potential, a recent survey of property-based testing libraries reveals a lack of advanced features like stateful and parallel testing in many popular implementations across various programming languages. While some libraries support stateful testing, parallel testing remains a challenge for most. The Quviq QuickCheck library, developed by John Hughes and Thomas Arts, stands out for its closed-source version that excels in stateful and parallel testing, setting a high standard for the industry.

The analysis points to a gap between the current state of property-based testing libraries and the advanced capabilities demonstrated by Quviq QuickCheck. Users have expressed the need for more sophisticated testing features, but progress in implementing stateful and parallel testing functionalities has been slow across many open-source libraries. This discrepancy raises questions about the design and evolution of property-based testing tools in the programming landscape.

18 comments
By @ncruces - 3 months
With the advent of coverage-based fuzzing, and how well supported it is in Go, what am I missing by not using one of the property-based testing libraries?

https://www.tedinski.com/2018/12/11/fuzzing-and-property-tes...

Like, with the below fuzz test, and the corresponding invariant checks, isn't this all but equivalent to property tests?

https://github.com/ncruces/aa/blob/505cbbf94973042cc7af4d6be...

https://github.com/ncruces/aa/blob/505cbbf94973042cc7af4d6be...
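As a point of comparison, the core of a property test is just the loop below (sketched in Python against `sorted()`, not the Go tree linked above); a coverage-guided fuzzer drives the same loop, but picks inputs from instrumentation feedback rather than blind sampling, and adds input minimization on failure:

```python
import random
from collections import Counter

def check_sort_invariants(xs):
    """Properties that sorted() must satisfy for any input list."""
    ys = sorted(xs)
    assert all(a <= b for a, b in zip(ys, ys[1:]))  # output is ordered
    assert Counter(ys) == Counter(xs)               # output is a permutation of input
    assert sorted(ys) == ys                         # sorting is idempotent

rng = random.Random(0)  # fixed seed keeps the run reproducible
for _ in range(1000):
    n = rng.randint(0, 30)
    check_sort_invariants([rng.randint(-50, 50) for _ in range(n)])
```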

By @epgui - 3 months
Anecdotally, I had a fantastic experience with `clojure.spec.alpha` (with or without `test.check`), and when I went to use python's `hypothesis` it was just... abysmal.

It seems like Hypothesis is unable to handle simple but "large" data sets >>by design<<, where "large" is really not so large. [0] It was such a pain that we ripped Hypothesis (and, sadly, generative testing altogether) out of our Python test suite at work.

[0] https://github.com/HypothesisWorks/hypothesis/issues/3493

By @PeterisP - 3 months
The article asks: "On the other hand one could ask why there isn’t a requirement that published research should be reproducible using open source tools (or at least tools that are freely available to the public and other researchers)?" The simple answer is that the obvious immediate outcome of such a requirement is that papers failing it, like the Quviq QuickCheck papers (which seem to have been useful to the author and others), simply would not get published, and the community would lose out on that gift of information.

By @sunshowers - 3 months
I write stateful property tests with Rust's proptest quite regularly; I just tend to hand-code them, which is quite straightforward. See https://github.com/sunshowers-code/buf-list/blob/main/src/cu... for a nontrivial example, which found 6 bugs.

For parallel testing I guess it can be useful at times, but often it's easier to just run a bunch of tests in parallel instead.

By @Smaug123 - 3 months
I had a quick look through the linked "Testing Telecoms Software with Quviq QuickCheck" paper, but couldn't immediately see anything to answer my question: "why is this stateful stuff not better rolled by hand?". The OP gestures at this with the key-value pair model of a key-value store, but why would one not just write a state machine; why does it require a framework? I literally did this last week at work to test a filesystem interaction; it boiled down to:

type Instruction = | Read of stuff | Write of stuff | Seek of stuff | …

And then the property is "given this list of instructions, blah". The `StateModel` formalism requires doing basically the same work! I really don't see that `StateModel` is pulling its weight: it's adding a bunch more framework code to understand, and the benefit is that it's getting rid of what is in my experience a very small amount of test code.
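For what it's worth, the hand-rolled version the parent describes fits in a few lines of Python. Here is a sketch with a toy key-value store standing in for the real system under test: generate a random instruction list, apply it to both the implementation and a trivial model, and compare observations.

```python
import random

class KVStore:
    """Toy system under test; a stand-in for the real implementation."""
    def __init__(self):
        self._data = {}
    def put(self, k, v): self._data[k] = v
    def get(self, k): return self._data.get(k)
    def delete(self, k): self._data.pop(k, None)

def run_stateful_test(seed, steps=200):
    rng = random.Random(seed)
    sut, model = KVStore(), {}   # model: a plain dict as the specification
    trace = []
    for _ in range(steps):
        op = rng.choice(["put", "get", "delete"])
        k = rng.randint(0, 5)    # tiny key space forces collisions
        if op == "put":
            v = rng.randint(0, 100)
            trace.append((op, k, v)); sut.put(k, v); model[k] = v
        elif op == "delete":
            trace.append((op, k)); sut.delete(k); model.pop(k, None)
        else:
            trace.append((op, k))
            assert sut.get(k) == model.get(k), f"divergence after {trace}"

for seed in range(100):
    run_stateful_test(seed)
```

The frameworks mostly add instruction generation, shrinking of the failing trace, and (in some) parallel scheduling on top of this skeleton.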

By @mjaniczek - 3 months
The author focuses on the state machine and parallel aspect of PBT, but there are other aspects that might have larger effect.

One of these is coverage-guided PBT; see e.g. Dan Luu's article: https://danluu.com/testing/

Another one, which I'm biased towards, is making shrinking automatic while keeping all the invariants you created while generating the values. I have a talk on this but the TL;DR is that

- derived QuickCheck-like shrinker functions that work on values (shrink : a -> [a]) have issues with the constraints, making people turn off shrinking instead of dealing with the issues

- rose tree "integrated shrinking" (eg. Hedgehog) follows the constraints of the generators, but has issues with monadic bind (using results of generators to dispatch to another generator)

- the only approach that seems to magically "just work" is Hypothesis' "internal shrinking", which uses a layer of indirection to shrink lists of random choices instead of the values themselves. The disadvantage is that the generators are now parsers of lists of bytes, and those parsers can fail (introducing some inefficiency); a user can also write a crazy generator that the internal shrinker will not be able to shrink perfectly. Nevertheless, it has the best DX of the three approaches, and given that it's a small miracle people are testing at all, it feels to me like the approach most worth building on as a testing-library author.
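A minimal sketch of that third approach in Python (the function names and the "no element >= 50" property are made up for illustration): the generator is a parser of a list of integer choices, and the shrinker edits the choice list, so anything shrunk is by construction a value the generator could have produced:

```python
def gen_list(choices):
    """Parse a choice sequence into a list: first choice is the length,
    the following choices are the elements (Hypothesis-style indirection)."""
    if not choices:
        return []
    n = choices[0] % 8
    return [c % 100 for c in choices[1:1 + n]]

def shrink(choices, failing):
    """Greedily delete choices and halve them toward 0 while the property
    still fails; the generated value stays valid because it is re-parsed."""
    improved = True
    while improved:
        improved = False
        for i in range(len(choices)):             # try deleting each choice
            cand = choices[:i] + choices[i + 1:]
            if failing(gen_list(cand)):
                choices, improved = cand, True
                break
        for i, c in enumerate(list(choices)):     # try halving each choice
            cand = choices[:i] + [c // 2] + choices[i + 1:]
            if c and failing(gen_list(cand)):
                choices, improved = cand, True
    return choices

def fails(xs):
    # property under test: "no element is >= 50" (deliberately false)
    return any(x >= 50 for x in xs)

bad = [7, 3, 99, 12, 87, 64, 5, 1]   # a failing choice sequence
small = shrink(bad, fails)           # shrinks toward a minimal failing example
```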

By @lmm - 3 months
I've tried to use property based testing but always found it falls between two stools. If I understand a property well enough to write a strict test of it, I can generally push it into the type system and make it true by construction. And if I just want a "smoke test" then a single arbitrary input is easier.
By @rdtsc - 3 months
I wonder if the list is missing the original Quviq Erlang QuickCheck? The full product is proprietary, but there is a freeware version, QuickCheck Mini, available as well: http://www.quviq.com/downloads/
By @arohner - 3 months
Clojure does have a stateful quickcheck library now: https://github.com/griffinbank/test.contract

Parallel testing is interesting, but hasn't been a large source of pain yet.

By @algorithmsRcool - 3 months
For my C#/.NET testing I've been using CsCheck[0] and I've enjoyed it quite a bit. It was a lot more approachable compared to Hedgehog and FsCheck and is also quite fast.

[0] https://github.com/AnthonyLloyd/CsCheck

By @choeger - 3 months
The obvious downside is the number of examples computed. A simple unit test can easily take 100 times as long when property-tested.

Besides, it would be really cool to have property-based integration or hybrid unit/integration tests. Or even property-based E2E tests. Unfortunately, the setup of the example will almost always take too long for a relevant set of runs.

For instance: If you have a basic data model (say in sqlalchemy) and want to write property-based tests (say in hypothesis), you can relatively quickly derive strategies for the models (but beware of recursion and primary keys). But writing that model instance into the DB for running an example just takes too long for swift testing.

By @sshine - 3 months
Amazing survey.

Big fan of property testing.

A particularly big fan of Hedgehog.

I tried writing a Hedgehog-inspired library in Rust and realized how complex the underlying domain is.

For example, the RNG is "splittable" [1][2], which means you can take a deterministic seed and split it in two, e.g. so parallel generators can work independently but still deterministically. The effort that has gone into this one feature is a little numbing. I feel an awe similar to that of the "fast inverse square root" hack when I see code like this:

  -- | A predefined gamma value's needed for initializing the "root" instances of
  --   'Seed'. That is, instances not produced by splitting an already existing
  --   instance.
  --
  --   We choose: the odd integer closest to @2^64/φ@, where @φ = (1 + √5)/2@ is
  --   the golden ratio.
  --
  goldenGamma :: Word64
  goldenGamma =
    0x9e3779b97f4a7c15
and then realize that those numbers don't come easily. [3]

[1]: https://hackage.haskell.org/package/hedgehog-1.4/docs/src/He...

[2]: https://gee.cs.oswego.edu/dl/papers/oopsla14.pdf

[3]: https://github.com/hedgehogqa/haskell-hedgehog/issues/191
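For the curious, the underlying construction (SplitMix64, from the OOPSLA'14 paper linked in [2]) is short enough to sketch in Python. This simplified version omits the paper's extra check against "weak" gammas, and the class and method names are my own:

```python
MASK = (1 << 64) - 1
GOLDEN_GAMMA = 0x9e3779b97f4a7c15  # odd integer closest to 2^64 / phi

def mix64(z):
    # SplitMix64 finalizer: xor-shift/multiply avalanche of the state
    z = ((z ^ (z >> 30)) * 0xbf58476d1ce4e5b9) & MASK
    z = ((z ^ (z >> 27)) * 0x94d049bb133111eb) & MASK
    return z ^ (z >> 31)

class SplitMix64:
    def __init__(self, seed, gamma=GOLDEN_GAMMA):
        self.seed = seed & MASK
        self.gamma = gamma | 1  # gammas must be odd

    def next64(self):
        self.seed = (self.seed + self.gamma) & MASK
        return mix64(self.seed)

    def split(self):
        # the child's seed and gamma are drawn from the parent's stream,
        # so both generators advance independently yet deterministically
        child_seed = self.next64()
        child_gamma = mix64(self.next64()) | 1
        return SplitMix64(child_seed, child_gamma)
```

Two children split from the same parent produce unrelated streams, which is what lets parallel generators stay reproducible from a single root seed.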

Wonder why most property-testing libraries don't have features like this?

The libraries require training to use. And they're not that easy to write.

> the current state-of-the-art when it comes to property-based testing is stateful testing via a state machine model and reusing the same sequential state machine model combined with linearisability to achieve parallel testing

Okay, okay. I admit I've never performed property-based stateful testing, nor in parallel. So that may be the coolest feature out there, because it addresses one of the hardest problems in testing.

But I think that yet other things have happened with modern property-testing libraries (e.g. Hypothesis, PropEr, Hedgehog, Validity):

Shrinking for free [4], generators for free [5], defining the probability distribution of your sub-generators in a composable way.

Maybe those features are not as significant, but they're equally missing from almost all property-testing libraries.

[4]: Gens N’ Roses: Appetite for Reduction • Jacob Stanley • YOW! 2017 https://www.youtube.com/watch?v=LfD0DHqpeVQ

[5]: https://tech.fpcomplete.com/blog/quickcheck-hedgehog-validit...

By @throw156754228 - 3 months
In JS I often want to test that a parameter value does what is expected in the positive case, foo("A"), but also in the negative case, i.e. foo([not "A"]), where [not "A"] is some arbitrary string representing the negative condition, say "ZZ". Can a JS property-based testing library help me generate an arbitrary not-"A" elegantly?
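Most libraries handle this with a filtered generator rather than a dedicated "not" combinator; in fast-check it would be something like `fc.string().filter(s => s !== "A")` (from memory, so treat that as an assumption). The underlying mechanism is just rejection sampling, sketched here in Python:

```python
import random
import string

def gen_not(positive, rng, max_len=5):
    """Rejection-sample an arbitrary string that differs from the positive case."""
    while True:
        n = rng.randint(0, max_len)
        s = "".join(rng.choice(string.ascii_uppercase) for _ in range(n))
        if s != positive:
            return s

rng = random.Random(0)
negatives = [gen_not("A", rng) for _ in range(100)]  # 100 strings, none equal to "A"
```
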
By @ruuda - 3 months
The major omission in this article is fuzzing. Not only is it practical and in wide (and growing) use, it’s also far more advanced than QuickCheck’s approach of generating random inputs, because fuzzing can be _coverage-driven_. Property-based testing came out of academia and fuzzing came out of security research; initially they were not connected. But with the advent of in-process fuzzing (through libFuzzer), which encourages writing small fuzz tests rather than testing entire programs, and structure-aware fuzzing, which enables testing more than just functions that take a bytestring as input, in my view the two techniques have converged. It’s just that the two separate communities haven’t fully realized this yet.

One pitfall with non-coverage-driven randomized testing like QuickCheck is that how good your tests are depends a lot on the generator. It may very rarely generate interesting inputs because you biased the generator in the wrong way, and depending on how you do the generation, you need to be careful to ensure the generator halts. With coverage-driven fuzzing all of these problems go away: you don’t have to be smart about choosing distributions so that interesting cases are more common; coverage instrumentation will automatically discover new paths in your program and drill down on them.

But isn’t fuzzing about feeding a large program or function random bytestrings as inputs, whereas property-based testing is about testing properties about data structures? It is true that fuzzers operate on bytestrings, but there is no rule that says we can’t use that bytestring to generate a data structure (in a sense, replacing the role of the random seed). And indeed this is what the Arbitrary crate [1] in Rust does, it gives tools and even derive macros to automatically generate values of your data types in the same way that QuickCheck can. The fuzzing community calls this Structure-Aware Fuzzing and there is a chapter about it in the Rust fuzzing book [2]. There are also tools like libprotobuf-mutator [3] that substitute fuzzers’ naive mutation strategies, but even with naive strategies fuzzers can usually get to 100% coverage with appropriate measures (e.g. recomputing checksums after mutation, if the data structure contains checksums).
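Hand-rolled, the idea is tiny. Here is a sketch in Python (against the standard library's heapq, not any of the crates above) of a fuzz target that treats its input bytes as a command sequence, which is the trick Arbitrary automates:

```python
import heapq

def fuzz_target(data: bytes):
    """Interpret the fuzzer's bytestring as commands against a heap,
    checking the heap invariant after every operation."""
    heap = []
    it = iter(data)
    for op in it:
        if op % 2 == 0:                      # even byte: push the next byte
            heapq.heappush(heap, next(it, 0))
        elif heap:                           # odd byte: pop and check ordering
            smallest = heapq.heappop(heap)
            assert all(smallest <= x for x in heap)
        # structural invariant: every parent is <= its children
        assert all(heap[(i - 1) // 2] <= heap[i] for i in range(1, len(heap)))

# under a libFuzzer-style harness (e.g. Atheris for Python, cargo-fuzz for
# Rust) this function would be the entry point; here we exercise it directly
fuzz_target(bytes(range(32)))
```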

I am using this extensively in my own projects. For example, RCL (a configuration language that is a superset of json) contains multiple fuzzers that test various properties [4], such as idempotency of the formatter. In the beginning it used the raw source files as inputs but I also added a more advanced generator that wastes less cycles on inputs that get rejected by the parser. The fuzzer has caught several bugs, and most of them would have had no hope of being found with naive randomized testing, because they required a cascade of unlikely events.

Structure-aware fuzzing is not limited to generating data structures either, you can use it to generate reified commands to test a stateful API, as you describe in the _Stateful property-based testing_ section. The Rust fuzzing book has an example of this [5], and I use this approach to fuzz a tree implementation in Noblit [6].

[1]: https://docs.rs/arbitrary/latest/arbitrary/

[2]: https://rust-fuzz.github.io/book/cargo-fuzz/structure-aware-...

[3]: https://github.com/google/libprotobuf-mutator

[4]: https://docs.ruuda.nl/rcl/testing/#fuzz-tests

[5]: https://rust-fuzz.github.io/book/cargo-fuzz/structure-aware-...

[6]: https://github.com/ruuda/noblit/blob/a0fd1342c4aa6e05f2b1c4e...

By @pydry - 3 months
Lack of parallelization really doesn't make me feel that sad. I'm constrained by many things while coding, but CPU horsepower isn't one of them.
By @weitendorf - 3 months
I don't think the problem is primarily a lack of good OSS implementations or UX lacking polish. To me, it's more that there is a mismatch between what programming language/CS enthusiasts find interesting and useful, and what industry software developers find useful.

First off, the "training" problem is a very real and fundamental blocker. I think academics or less experienced engineers see it as a problem of education, but I believe it's actually an engineering/organizational problem: for a company or even OSS project, when you use niche/unadopted tools like this you either need to provide training for it or you massively shrink the set of people able to work on your project by requiring or expecting them to learn it on their own. This introduces a practical headache bigger than the ones a tool like this solves (since these are just incrementally better ways to test software): you get fewer contributions or need to spend more on getting people onboarded. Note that even if "training" people is free in the sense that there is no paid training material, you still pay a cost in lower productivity for new people and the time spent training.

Even once people are trained, you now have a process/institutional need to support the property based tests you wrote. That may mean paying for licenses for software, or maintenance of whatever infrastructure/integrations you write to get property based testing working in your developer workflow. And you also now need to rely on people using these tools correctly. The problem with formal verification is that it doesn't verify that you're actually solving the real problem correctly, just that your program operates as expected (i.e. it only verifies that your software is Working as Implemented, not Working as Intended). The more you rely on average joes to wield complex things like this well, the more headaches you introduce: if the bottom 10% of people you hire use it sloppily or wrongly, once you're past a few dozen regular contributors you're basically going to have constant issues. You see this all the time with even just regular unit and integration tests: less capable developers constantly introduce flakes or write tests so bad/basic that it'd be better for them to not even be writing tests at all.

Even if after considering all that it still seems a worthy tradeoff, there's the question of whether property based testing solves the problem you think it does. As I mentioned, it verifies that software is Working as Implemented more than Working as Intended. But there is the bigger problem of major software projects not conforming to the simpler use cases that sell people on using this tool. In my experience stateless functions almost never have bugs, and when they do, it's mostly a "requirements discovery" problem you wouldn't catch with any kind of verification tooling.

Stateful verification becomes a problem in the kinds of real systems programming projects, like the Linux kernel or QEMU, which stand to benefit most from verification. If you're verifying things at the level of a "higher level" abstraction like a scheduler, you almost always have really high dimensionality/a lot of state, perhaps deceptively more than you think, because of composition. And you tend to also have a dev loop that is already very slow: builds take a long time, and running tests of higher level components can be "expensive" because it may take 100ms-1s for something like a kernel to finish starting in a test environment. For a single test that's nothing, but multiplied across all tests it makes testing painful enough already; adding 3x or 100x by doing property based testing with new tests, even if pruned before submitting, could be either hugely expensive or way too slow. And similarly, the increased resource requirements to maintain state machines can be very expensive. TSAN does essentially parallel property-based testing and introduces 128-512 bytes of overhead for every 8 bytes of memory under test: https://github.com/google/sanitizers/wiki/ThreadSanitizerAlg....

Is it cool and underrated? Definitely. But it's not a silver bullet and honestly just isn't worth it for most software. Even when it is worth it, it has its drawbacks.

By @benreesman - 3 months
Oh man, talk about a walk down memory lane: a polite disagreement about the intersection of the public and private sectors in which commercial control of source code written by a founder who is also a researcher is being sold because the researcher found that to be more effective than opening the code (which he tried and didn’t get much attention as a result). Both reasonable arguments, I can see both sides. I remember when the software business wasn’t a dystopian nightmare.

Today a weird, scary splinter faction of the already terrifying Effective Altruism religion/cult is talking about building digital gods they can’t control and saying with a straight face that this is imminent (absurdly extravagant claim with no evidence more convincing than a logarithmic scale and a ruler) and using this as an argument to redefine everything from the definition of science to the very epistemology of factuality and getting the public to directly or indirectly subsidize it into the bargain.

People who make Ray Kurzweil’s Singularity magical thinking look flat fucking normal and frankly sound like L Ron Hubbard more every day are unilaterally deciding that all research into, ya know, what they claim is the most important event in human history is totally black-box proprietary because, ya know, for our own good.

I’ll take the polite Swiss functional programming folks over Anthropic’s Mechanical Interpretability people seven days a week and twice on Sunday.

Can we go back to this please? The new thing sucks.

By @webprofusion - 3 months
If it's so sad, why not fix it? Why is it someone else's problem?