Using optimization to make good guesses for test cases
2021-07-28, 19:30–19:40 (UTC), Purple

Some applications seem untestable because they are slow to run and have too many options. One approach chooses tests carefully, using an optimization algorithm to find the smallest set of tests that is likely to exercise all the parts of the code. In this talk, we introduce the UnitTestDesign.jl package for combinatorial testing and show how it integrates with Julia's test framework using Julia's system of artifacts and scratch spaces.


I'm developing the largest inference application I have ever seen. For any population, it estimates morbidity and mortality from disease, cast against a background of mortality, and it does this across years for multiple ages. There are seven web pages of settings, and it can take a day to run. In one way, it's easy to test because it's an inverse problem, so I can create a correct answer, generate data, and see if the application finds the correct answer. What I want is a defensible claim that I've tested the seven pages of settings.

My first approach is to write tests for some important cases. These paradigmatic tests have to pass, and they tell stakeholders that the basics work well. I add to these some tests I know challenge the system. Beyond these two classes of tests is another set of less common techniques that help look for bugs where I don't expect them. These include random testing, concolic testing, and property-based testing. For this problem, let's focus on a simpler technique, combinatorial testing.

Combinatorial testing is a careful selection of test arguments, designed to make good code coverage likely. If we picture a page of code, then any one call to a function will walk through that code, skipping parts of it whenever an if-condition fails. A thorough set of tests should, at least, execute the different branches of each if-condition. There must be some choice of inputs to the application that leads to every branch of the code. Some branches depend on the values of two input arguments taken together. Others may depend on a particular combination of three or four input arguments. It would be helpful to test each value of each option and, somehow, walk through all possible pairs of arguments or all possible triples of arguments, in order to cover all branches.

If we have twenty different options, each of which can take one of four values, we don't have to run twenty-times-four tests to try every value. We can pack them into only a few tests. What if we wanted to try all pairs of values for the first two options? For that pair of options, there are four-times-four, or sixteen, combinations of values, which makes sixteen tests for each pair, and there are twenty-choose-two, or 190, pairs of options, but we can pack these together, too, so that each test case explores a lot of the code.
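The counting in this example can be checked with a few lines of Julia. This is only a sketch of the arithmetic, with variable names made up for illustration. It counts sixteen value combinations (four times four) for each pair of options.

```julia
n_options = 20   # number of options in the example
n_values = 4     # values each option can take

# Trying every combination of every option is out of reach.
full_factorial = n_values^n_options            # 1099511627776

# All-pairs testing only asks that each pair of options
# sees each combination of its two values at least once.
option_pairs = binomial(n_options, 2)          # 190
combos_per_pair = n_values^2                   # 16
targets = option_pairs * combos_per_pair       # 3040

# One test case fixes a value for every option, so it covers
# one value combination for each of the 190 pairs of options.
covered_per_test = option_pairs                # 190

# No suite can be shorter than this, because a single pair
# of options already needs sixteen distinct test cases.
lower_bound = cld(targets, covered_per_test)   # 16

println("targets = ", targets, ", lower bound = ", lower_bound)
```

The gap between sixteen and the number of test cases an algorithm actually finds is what the optimization is trying to close.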

The algorithms in UnitTestDesign.jl use greedy optimization to construct short test suites, packing all-pairs testing into as few test cases as possible. For twenty arguments with four values each, it can pack every possible pair of argument values into thirty-seven test cases. There is some research support that all-pairs testing will do a good job of finding faults in code, but the same package can generate tests with higher coverage, where higher means all triples or quadruples of input values are included in test cases.
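To give a feel for what greedy packing means, here is a minimal, self-contained sketch. It is not the algorithm in UnitTestDesign.jl and does not use that package's API; the function names, the candidate count, and the random seed are all assumptions made for illustration. Each round proposes a handful of random test cases, keeps the one that covers the most still-uncovered pairs, and repeats until nothing is left uncovered.

```julia
using Random

# Every pair of values that test case `t` assigns to two different options,
# written as (option i, value of i, option j, value of j) with i < j.
pairs_in(t) = [(i, t[i], j, t[j]) for i in 1:length(t) for j in (i + 1):length(t)]

# Hypothetical helper: `levels[i]` is the number of values option i can take.
# Returns test cases as vectors of value indices.
function greedy_all_pairs(levels::Vector{Int}; candidates::Int = 30,
                          rng::AbstractRNG = MersenneTwister(1))
    n = length(levels)
    uncovered = Set((i, a, j, b) for i in 1:n for j in (i + 1):n
                    for a in 1:levels[i] for b in 1:levels[j])
    tests = Vector{Vector{Int}}()
    while !isempty(uncovered)
        best, best_gain = Int[], 0
        for _ in 1:candidates
            # Seed the candidate with an uncovered pair so every round makes progress.
            (i, a, j, b) = first(uncovered)
            t = [rand(rng, 1:levels[k]) for k in 1:n]
            t[i], t[j] = a, b
            gain = count(p -> p in uncovered, pairs_in(t))
            if gain > best_gain
                best, best_gain = t, gain
            end
        end
        push!(tests, best)
        setdiff!(uncovered, pairs_in(best))
    end
    return tests
end

tests = greedy_all_pairs(fill(4, 20))
println(length(tests), " test cases cover all pairs of 20 four-valued options")
```

A sketch this crude may need more test cases than a tuned implementation, but the structure (propose, score by new coverage, keep the best) is the greedy idea.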

Most implementations of all-pairs algorithms aren't easy to run in a unit-testing framework because they are web-based or proprietary. There are a few reasons for this. These algorithms need to deal with different argument types. They need to give the tester a way to say that, if a flag is false, then another argument can't take certain values, so they need a little domain-specific language. Julia handles those problems naturally and, further, is efficient at the greedy optimization that determines test cases, which can take time to generate.
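As an illustration of why a host language helps here, a constraint can be written as an ordinary Julia function over arguments of mixed types. The option names and values below are hypothetical, and the sketch filters a small full list of combinations rather than calling any UnitTestDesign.jl function.

```julia
# Hypothetical options of three different types.
use_prior = [true, false]
prior_kind = [:none, :gaussian, :laplace]
n_draws = [10, 100, 1000]

# The constraint: when the flag is false, the prior kind must be :none.
disallowed(flag, kind, draws) = !flag && kind != :none

# Enumerate every combination, then drop the ones the constraint forbids.
candidates = vec(collect(Iterators.product(use_prior, prior_kind, n_draws)))
allowed = filter(c -> !disallowed(c...), candidates)

println(length(allowed), " of ", length(candidates), " combinations are allowed")
```

A test generator that accepts such a function can skip forbidden combinations while it builds test cases, with no separate constraint language.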

For applications with many options, or functions with many arguments, generating a good set of test cases can be computationally intensive, so we rely on the testing framework to help us generate values when needed, save them, and load them later. In Julia, the packages for scratch space and artifacts give us a workflow for testing where we generate combinatorial values, save them to scratch, and upload them as artifacts for others to use.
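The generate-once, load-later step might look like the sketch below. It uses only the Serialization standard library; the helper name `cached`, the file name, and the placeholder generator are assumptions for illustration. In a package, the directory would more likely come from Scratch.jl (for example via `@get_scratch!`), and a temporary directory stands in for it here.

```julia
using Serialization

# Hypothetical helper: return the value stored at `path`, or call `generate`,
# store its result at `path`, and return it.
function cached(generate::Function, path::AbstractString)
    isfile(path) && return deserialize(path)
    value = generate()
    serialize(path, value)
    return value
end

# Stand-in for a scratch directory.
dir = mktempdir()
path = joinpath(dir, "all_pairs.jls")

# Placeholder for a slow combinatorial generator.
slow_generate() = [[1, 1], [1, 2], [2, 1], [2, 2]]

tests = cached(slow_generate, path)   # first call computes and writes the file
again = cached(slow_generate, path)   # second call reads the file back
println(tests == again)
```

Files written by `serialize` are not guaranteed to be readable across Julia versions, so a plain-text format may be a safer choice for anything uploaded as an artifact.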

The resulting approach is to create a set of tests, save them for reuse, and run them many times. Given the challenging problem of testing a large, slow application, we've begun to describe a paradigm from the field of test automation. The general approach is to create a bunch of tests, measure their coverage, select a set to run, and respond to failing tests by refining those tests until we've narrowed the faults down to their source. Parts of this general approach can be seen in random testing, concolic testing, and property-based testing. Compared with these, combinatorial testing is the art of starting with a really good guess.

Andrew Dolgert is a computational scientist at the University of Washington. He has been a high-performance computing consultant for many years, working on diverse projects such as parallelization of molecular dynamics, immersive visualization of fracture mechanics, provenance for the Large Hadron Collider, and time series analysis of global health. His recent work is on continuous-time, discrete-event simulation and on testing of scientific code.