July 15th, 2024

Automated Test-Case Reduction

Adrian Sampson explains automated test-case reduction techniques using the Shrinkray reducer to debug an interpreter bug. He emphasizes the importance of effective test scripts and occasional manual intervention for efficient bug identification and resolution.

Read original articleLink Icon
Automated Test-Case Reduction

Adrian Sampson from Cornell University discusses automated test-case reduction techniques in a blog post. He explains how automated reducers can quickly and blindly try different approaches to reduce test cases, potentially in parallel, compared to manual methods. The reduction process involves picking code to delete, checking the output to decide whether to backtrack, and determining when to stop reducing. Sampson demonstrates using the Shrinkray reducer to debug an interpreter bug, emphasizing the importance of writing an effective interestingness test script. Tricks like shell negation, including a "not bogus" check, and using grep to search for specific error messages help in crafting robust test scripts. The post highlights the need for occasional manual intervention to guide the reduction process effectively. By combining automated reduction with manual adjustments, users can efficiently identify and address bugs in their code.

Related

Chasing a Bug in a SAT Solver

Chasing a Bug in a SAT Solver

Adolfo Ochagavía and Prefix.dev swiftly resolved a bug in the SAT-based dependency solver, resolvo, with community input. The incident emphasizes open-source collaboration and potential debugging tool enhancements for software quality.

Mix-testing: revealing a new class of compiler bugs

Mix-testing: revealing a new class of compiler bugs

A new "mix testing" approach uncovers compiler bugs by compiling test fragments with different compilers. Examples show issues in x86 and Arm architectures, emphasizing the importance of maintaining instruction ordering. Luke Geeson developed a tool to explore compiler combinations, identifying bugs and highlighting the need for clearer guidelines.

Refined Input, Degraded Output: The Counterintuitive World of Compiler Behavior

Refined Input, Degraded Output: The Counterintuitive World of Compiler Behavior

The study delves into compiler behavior when given extra information for program optimization. Surprisingly, more data can sometimes lead to poorer optimization due to intricate compiler interactions. Testing identified 59 cases in popular compilers, emphasizing the need for better understanding.

The Transformation Priority Premise (2013)

The Transformation Priority Premise (2013)

The Clean Coder Blog discusses Transformations in code development, distinguishing them from Refactorings. It emphasizes maintaining a preferred order of Transformations to enhance the red/green/refactor cycle efficiency. Examples illustrate transitioning from specific to generic code.

Boosting Compiler Testing by Injecting Real-World Code

Boosting Compiler Testing by Injecting Real-World Code

The research introduces a method to enhance compiler testing by using real-world code snippets to create diverse test programs. The approach, implemented in the Creal tool, identified and reported 132 bugs in GCC and LLVM, contributing to compiler testing practices.

Link Icon 7 comments
By @clord - 6 months
I used to do something like this all the time with C/C++ compiler tests. I tried lots of fancy tools and stuff, but I kept going back to: expand all macros and retokenize to one token per line (I made a custom build of the preprocessor that had this built in). Then, have a shell script randomly remove lines, and use another script to check that the resulting test case behaves consistently with the failure. It would run for a few hours (or days, for boost problems), then usually you'd get a really minimal testcase that shows the problem. Often I would use this to find regressions. Just have the shell script check one is good, the other has the problem. The resulting output would then usually point exactly at the regressed feature, and make an amazing unit test.

Before leaving compiler team, I wanted to find a way to do this one the AST level (so parens always go in pairs, etc), but that could also complect with the bug.

I wonder if LLMs could accelerate this by more intelligently removing stuff first, iteratively?

By @crvdgc - 6 months
Very interesting. Previously I've seen two similar ideas:

1. Reduce the test data by property-based tests' auto shrinking a là QuickCheck [1]

2. Reduce the change set to a codebase, a là delta debugging [2]

The reducer is able to do both at the same time in a pretty generic way, with some restrictions on the expressivity. Seems to be a good fit for end-to-end tests.

[1]: https://www.cse.chalmers.se/~rjmh/QuickCheck/

[2]: https://dl.acm.org/doi/10.1145/318774.318946

By @imp0cat - 6 months
Interesting. The techniques are quite similar to what one can do with Hypothesis in Python ( https://hypothesis.readthedocs.io/en/latest/quickstart.html ). Basically, it generates a bunch of test data and then tries to shrink it to the smallest example that fails the test automatically.
By @wyldfire - 6 months
In addition to c-reduce and shrinkray there's also cvise [1] for reducing tests. It's great.

Also, if you happen to be reducing an LLVM case you can use llvm-reduce [2].

[1] https://github.com/marxin/cvise

[2] https://llvm.org/docs/CommandGuide/llvm-reduce.html

By @mspdx22 - 6 months
There was a talk about some novel ways to approach test suite generation and reduction at a PLDI workshop a few weeks ago. That work came from the compilers and programming languages tool domain:

https://pldi24.sigplan.org/home/egraphs-2024#program

Specifically, the EGSTRA paper.

By @jedisct1 - 6 months
Zig has a built-in test-case reduction command ("zig reduce").
By @reruwe - 6 months
I find it odd that test-case reduction is billed as a “research skill”. To me, this is a basic debugging skill for anyone who works with code.

Maybe it’s more critical when you’re dealing with “research-quality” code. This post seems to ignore the underlying problem: it’s rare to see research code with any tests.