Why is learning so effective in software testing?

1 Mar 2017

The Communications of ACM published
an article
on automata learning in software engineering last February. The techniques
described in the article are used to obtain models for the (input/output)
behaviour of software. Even without access to source code, one can now use
model checking or other bug finding tools on these models. The article shows
many successful applications. Why is this possible at all?
(This is a copy of a post I wrote on
automatonlearning.net.)

Many people I talk to are very surprised that a black box learning technique
is even able to find any bugs at all. When explaining the learning algorithm,
it seems at first that no guarantees can be given. The algorithm keeps
refining some hypothesis and many people ask: “When do you stop learning?”
Luckily, the field of computational learning theory is rather mature, and we
can give very precise theoretical guarantees. In this post I will highlight
some of these results and why they matter to us. Before I do so, let’s recap
the theoretical learning framework which is used in the CACM article. (I
strongly suggest, you also read the CACM paper at some point! Or at least
watch the accompanying video.)

The type of learning we will investigate is called active learning (also
called exact learning, or query learning). We suppose there is some alphabet A
and a unknown language L ⊆ A* (where A* is the set of words). It is the task
of a learning algorithm to infer an automaton accepting the language L. The
learner has access to an oracle which can answer two types of queries:

Membership queries: The learner provides a word, the oracle will answer
whether that sequence is accepted by the language or not.

Equivalence queries: The learning provides a hypothetical automaton for the
language, the oracle will either reply positively (the hypothesis is
correct) or negatively and in addition gives a word (a counterexample) for
which the automaton has a different acceptance than the language L.

In the context of software, acceptance is generalized to arbitrary output (in
particular the membership queries reply with actual output of the software).
So instead of working with DFAs, one works with Moore or Mealy type automata.

Dana Angluin showed that any regular language can be learned in this model,
using only polynomially many queries. As a consequence we can learn the
behaviour of unknown software as long as it has some regular behaviour.
However, there are some objections one might have.

OBJECTION 1

For a black box system we don’t have such an oracle.

ANSWER

The great thing about software is that we can in fact run it. It is
interactive. This means that membership queries can be answered immediately by
the software itself!

For equivalence queries, however, this is a reasonable objection. How can we
know whether the hypothesis is correct w.r.t. a unknown system? Luckily any
efficient learning algorithm with membership queries and equivalence queries
can be transformed into an efficient probably approximately correct (PAC)
algorithm. (This is a fun exercise in the book by Kearns & Vazirani. Their
book provides a very good introduction in PAC theory.) In practice it means
that we can randomly sample test cases (with whatever distribution we seem
fit) and obtain a good approximation. The bounds on accuracy are quantitative
and can be set arbitrarily high (at polynomial cost only).

OBJECTION 2

Since we are doing machine learning we need big data. It takes ages to gather
so many samples.

ANSWER

Indeed, learning neural networks, for example, is very hard in typical
learning settings. In fact, in the PAC framework it is proven to be
intractable (unless some cryptographic assumptions fail). As a consequence, if
you want to increase accuracy you’ll need more than polynomially many more
samples. That is, you need big data.

Our setting, however, is slightly different from the typical machine learning.
We are saved by the membership queries. The learner does not rely on just any
sample, it relies on precisely the samples it chooses to query. This makes all
the difference and gives the polynomial bounds mentioned earlier.

I once asked Falk Howar (one of the developers of
LearnLib) whether he considers himself as a machine
learning scientist. Indeed he does so, “but”, he added, “we do small data, not
big data.”

OBJECTION 3

All this talk on polynomial bounds doesn’t apply in practice.

ANSWER

The celebrated survey paper
P versus NP by Scott Aaronson
already answers this objection. Somehow the gap between P and NP seems to be
the gap between feasible and non-feasible. Partly, because algorithms in P
often have a low- order polynomial as running time.

I experience that learning algorithms are no exception. The algorithm by Dana
Angluin is polynomial (counting number of queries here) with a low-order
polynomial. These asymptotics and also the constants have been improved in
recent years, as mentioned in the CACM article. It’s just very efficient. All
of this is witnessed by the many applications mentioned in the CACM.

A while ago, I posted about the RERS challenge, a challenge where software had
to be analysed. This was a white-box challenge, but we used exclusively black-
box techniques and got very good results. In particular we used a fuzzing and
a learning algorithm. We noticed that their results were very comparable (the
fuzzer was slightly better). However, the fuzzer afl-lop executed a couple
orders of magnitude more traces than the learner. I have no doubts any more
that learning is efficient in practice!

OBJECTION 4

So… When do you stop learning?

ANSWER

One way is to use the PAC framework. Specify the accuracy, compute the number
of samples needed, and run the algorithm. This might not be completely
satisfactory, though. What does it mean when a hypothesis is epsilon-close to
the real software?

There is another way. In testing literature there are many conformance testing
algorithms. Instead of giving probabilistic bounds, these algorithms give more
concrete bounds on the hypothesis. A typical guarantee is: The black box
system is equivalent to the hypothesis, unless it has at least k more states.
The parameter k can be set as high as you want (this time at exponential
cost).

In our work we often combine the two approaches. We have probabilistic
guarantees and additionally, have equivalence up to k states, for a small k.
For many experiments we do, we actually obtain the equivalent automaton.

OBJECTION 5

Software is often Turing complete, and cannot be captured by regular
behaviour.

ANSWER

This is a deeper problem. The CACM article touches on this and already gives
some solutions. There is (at least) two ways to generalize from automata to
richer classes of behaviour.

One is to consider richer automata, still having a regular structure as a
core. Examples are register automata (which add data-flow and predicates on
transitions) and weighted automata (which add some arithmetic to the
transitions). Learning algorithms have been described for such automata.

Another direction is going up to Chomsky hierarchy. It is known that visibly
pushdown automata can be learned in a similar fashion. Currently, we’re not
able to go further up the hierarchy.

However, in the many applications we see that the regular universe is big
enough. This is partly because one can choose an alphabet small enough (or
abstract enough) to exhibit regular behaviour. By choosing such a alphabet we
might get a smaller, more abstract view on the actual behaviour. But it’s
often enough for bug finding, and additionally it gives a model for which we
have very good tooling, theory and interpretability (see
Chris’ post).

OBJECTION 6

Using black box techniques is silly if you actually have source code!

ANSWER

I totally agree! Despite the effectiveness of learning, we can imagine it
being much more successful if we incorporate information from the source code.
Frits Vaandrager likes to call this “grey box methods”. A good example is
found in work by Andreas Zeller. He learns grammars more effectively by
instrumenting the source code, tracking certain values through the execution.
I hope we will see more of this combined techniques, utilizing best of both
worlds.

I hope I gave a nice overview of some of the theoretical guarantees, making
clear that computational learning has a good foundation. Maybe you have some
ideas of how to apply learning in your own application. Or you have some ideas
on how to combine it with white box methods. We are very interested in hearing
these ideas! People contacting us with new applications of the theory always
inspires us!