The last several years have seen a proliferation of static and runtime
analysis tools for finding security violations that are caused by explicit information
flow in programs. Much of this interest has been caused by the
increase in the number of vulnerabilities such as cross-site scripting and SQL
injections. In fact, these explicit information flow vulnerabilities commonly
found inWeb applications now outnumber vulnerabilities such as buffer overruns
common in type-unsafe languages such as C and C++. Tools checking
for these vulnerabilities require a specification to operate. In most cases the
task of providing such a specification is delegated to the user. Moreover, the
efficacy of these tools is only as good as the specification. Unfortunately,
writing a comprehensive specification presents a major challenge: parts of
the specification are easy to miss leading to missed vulnerabilities; similarly,
incorrect specifications may lead to false positives.

This paper proposes Merlin, a new algorithm for automatically inferring
explicit information flow specifications from program code. Such
specifications greatly reduce manual labor, and enhance the quality of results,
while using tools that check for security violations caused by explicit
information flow. Beginning with a data propagation graph, which represents
interprocedural flow of information in the program, Merlin aims
to automatically infer an information flow specification. Merlin models
information flow paths in the propagation graph using probabilistic constraints.
A na¨ýve modeling requires an exponential number of constraints,
one per path in the propagation graph. For scalability, we approximate
these path constraints using constraints on chosen triples of nodes, resulting
in a cubic number of constraints. We characterize this approximation
as a probabilistic abstraction, using the theory of probabilistic refinement
developed by McIver and Morgan. We solve the resulting system of probabilistic
constraints using factor graphs, which are a well-known structure
for performing probabilistic inference.

We experimentally validate the Merlin approach by applying it to 10
large business-critical Web applications that have been analyzed with
Cat.Net, a state-of-the-art static analysis tool for .NET. We find a total
of 167 new confirmed specifications, which result in a total of 302 additional
vulnerabilities across the 10 benchmarks. More accurate specifications also
reduce the false positive rate: in our experiments, Merlin-inferred specifications
result in 13 false positives being removed; this constitutes a 15%
reduction in the Cat.Net false positive rate on these 10 programs. The final
false positive rate for Cat.Net after applying Merlin in our experiments
drops to under 1%.