Introduction

Sawja is a library written in OCaml and built on
Javalib. It provides a high-level representation of
Java byte-code programs. Whereas Javalib is
dedicated to class-by-class loading, Sawja introduces a
notion of program, built using control flow algorithms. For instance,
a program can be loaded using Class
Reachability Analysis (CRA), a variant of the Class
Hierarchy Analysis (CHA) algorithm, or using Rapid Type
Analysis (RTA). For now, RTA offers the best
compromise between loading time and call-graph precision. A
version of XTA is also available and provides a way to refine
the call graph of a program. For more information about these control
flow algorithms and their complexity, see the
paper by Frank Tip and Jens Palsberg [1].
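
To give an intuition of why these algorithms differ in precision, here is a minimal sketch on a toy class hierarchy. It does not use Sawja: the record type, the hierarchy and the function names (cha, rta, targets) are all hypothetical and only model the idea that CHA considers every subclass of the receiver's static type, whereas RTA keeps only the classes that are instantiated.

```ocaml
(* Toy model, not Sawja's API: every name below is hypothetical. *)
module S = Set.Make (String)

(* Class name, optional superclass, names of the methods it defines. *)
type cls = { name : string; super : string option; defines : string list }

let hierarchy = [
  { name = "Shape";  super = None;         defines = ["area"] };
  { name = "Circle"; super = Some "Shape"; defines = ["area"] };
  { name = "Square"; super = Some "Shape"; defines = ["area"] };
]

let find name = List.find (fun c -> c.name = name) hierarchy

(* [is_subclass c d] holds when [c] is [d] or a descendant of [d]. *)
let rec is_subclass c d =
  c = d ||
  (match (find c).super with
   | Some s -> is_subclass s d
   | None -> false)

(* Walk up from [c] to the first class that defines method [m]. *)
let rec implementation c m =
  let k = find c in
  if List.mem m k.defines then Some c
  else
    (match k.super with
     | Some s -> implementation s m
     | None -> None)

(* Possible targets of a virtual call [recv.m()], given the classes
   that are allowed as run-time receivers. *)
let targets candidates recv m =
  List.fold_left
    (fun acc c ->
       if not (is_subclass c recv) then acc
       else
         (match implementation c m with
          | Some d -> S.add d acc
          | None -> acc))
    S.empty candidates

(* CHA: every class of the hierarchy is a candidate receiver. *)
let cha recv m = targets (List.map (fun c -> c.name) hierarchy) recv m

(* RTA: only the classes instantiated in reachable code are candidates. *)
let rta instantiated recv m = targets instantiated recv m
```

With this toy hierarchy, cha "Shape" "area" keeps the three implementations, while rta ["Circle"] "Shape" "area" keeps only the one in Circle, which is the kind of call-graph refinement discussed above.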

In Sawja, classes and interfaces are represented by
interconnected nodes belonging to a common hierarchy. For example,
given a class node, it is easy to access its superclass, its
implemented interfaces or its subclasses. The next chapters
give more information about the node and program data
structures.

Moreover, Sawja provides stack-less intermediate
representations of code, called JBir and A3Bir.
Many analyses (e.g. Live Variable Analysis) can be
built on these representations more naturally than on the byte-code
representation. The
transformation algorithm, common to these representations, has been
formalized and proved to be semantics-preserving [2].

Sawja also provides functions to map a program using a
particular code representation to another.

Global architecture

In this section, we present the different modules of
Sawja and how they interact. While reading the
next sections, we recommend keeping the Sawja
API documentation at hand. All modules of Sawja are sub-modules
of the package module Sawja_pack in order to avoid
possible namespace conflicts.

JProgram module

This module defines:

the types representing the class hierarchy.

the program structure.

some functions to access classes, methods and fields (similar
to Javalib functions).

some functions to browse the class hierarchy.

a large set of program manipulations.

Classes and interfaces are represented by
class_node and interface_node
record types, respectively. These types are parametrized by the
code representation type, like in Javalib. These types are
private and cannot be modified by the user. The only way to create
them is to use the functions make_class_node and
make_interface_node with consistent arguments. In
practice, you will never need to build them because the class
hierarchy is generated automatically when a program is loaded. You
only need read access to these record fields.
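
The effect of a private record type can be illustrated with a small standalone sketch. The module Node and its fields are hypothetical and much simpler than the real class_node type. Only the mechanism is the same: fields can be read, but values must be built through a constructor function.

```ocaml
(* Toy illustration of a private record type; not Sawja's definitions. *)
module Node : sig
  type class_node = private {
    name : string;
    super : class_node option;
    interfaces : string list;
  }
  val make_class_node :
    name:string -> super:class_node option -> interfaces:string list ->
    class_node
end = struct
  type class_node = {
    name : string;
    super : class_node option;
    interfaces : string list;
  }
  (* The constructor is the only place where consistency can be checked. *)
  let make_class_node ~name ~super ~interfaces =
    if name = "" then invalid_arg "make_class_node: empty class name";
    { name; super; interfaces }
end

let obj =
  Node.make_class_node ~name:"java.lang.Object" ~super:None ~interfaces:[]

let str =
  Node.make_class_node ~name:"java.lang.String" ~super:(Some obj)
    ~interfaces:["java.lang.Comparable"]

(* Reading fields of a private record is allowed... *)
let super_name (n : Node.class_node) =
  match n.Node.super with
  | Some s -> Some s.Node.name
  | None -> None

(* ...but building one directly is rejected by the type checker:
   let bad = { Node.name = "X"; super = None; interfaces = [] } *)
```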

The program structure contains:

a map of all the classes referenced in the loaded program.
These classes are linked together through the node structure.

a map of parsed methods. This map depends on the algorithm used
to load the program (CRA, RTA, ...).

a static lookup method. Given the calling class name, the
calling method signature, the invoke kind (virtual, static, ...),
the invoked class name and method signature, it returns a set of
potential couples of (class_name,
method_signature) that may be called.

JCRA, JRTA and JRRTA modules

Each of these modules implements a function
parse_program (the signature varies) which returns
a program parametrized by the Javalib.jcode
representation.

In RTA, the function parse_program
takes at least a class-path string and a program
entry point as parameters. The default_entrypoints value
represents the methods that are always called by the Sun JVM
before any program is launched.

In CRA, the function parse_program
takes at least a class-path string and a list of
classes acting as entry points as parameters. The
default_classes value represents the classes that
are always loaded by the Sun JVM.

JRRTA is a refinement of RTA. It first calls
RTA and then prunes the call graph.

If we compare these algorithms by the precision of
the call graph they build and by their cost (time and memory consumption), we
get the same increasing order for both: CRA < RTA <
RRTA < XTA.

JNativeStubs module

This module lets you define stubs for native methods,
containing information about native method calls and native object
allocations. Stubs can be stored in files, loaded and merged. The
format to describe stubs looks like:

JControlFlow module

JControlFlow provides many functions related to class,
field and method resolution. Static lookup functions for
invokevirtual, invokeinterface,
invokestatic and invokespecial
are also present.

This module also contains an internal module PP
which lets you navigate through the control flow graph of a
program.

JBir, JBirSSA, A3Bir and A3BirSSA modules

Each of these modules declares a type t defining an
intermediate code representation. All these representations are
stack-less. A3Bir looks like a three-address code
representation whereas expressions in JBir can have
arbitrary depths. JBirSSA and A3BirSSA are
variants which respect the Static Single Assignment (SSA) form.
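
The difference between expressions of arbitrary depth and a three-address form can be sketched on a toy expression language. The types expr, operand and tac and the function flatten are hypothetical and far simpler than the real JBir and A3Bir types. They only show how nested sub-expressions end up in temporary variables.

```ocaml
(* Toy expression language; the real JBir/A3Bir types are richer. *)
type expr =
  | Const of int
  | Var of string
  | Binop of string * expr * expr   (* arbitrary nesting, JBir-like *)

(* In three-address code, operands are only constants or variables. *)
type operand = OConst of int | OVar of string

(* [Assign (dst, op, a, b)] stands for  dst := a op b  *)
type tac = Assign of string * string * operand * operand

(* Flatten [e] into three-address instructions, introducing a fresh
   temporary for every binary sub-expression. Returns the instructions
   in execution order and the operand holding the final value. *)
let flatten e =
  let counter = ref 0 in
  let fresh () = incr counter; Printf.sprintf "t%d" !counter in
  let rec go e acc =
    match e with
    | Const n -> (OConst n, acc)
    | Var x -> (OVar x, acc)
    | Binop (op, l, r) ->
      let (a, acc) = go l acc in
      let (b, acc) = go r acc in
      let t = fresh () in
      (OVar t, Assign (t, op, a, b) :: acc)
  in
  let (result, rev_code) = go e [] in
  (List.rev rev_code, result)
```

For instance, the nested expression x + (y * 2) is kept as a single tree in the first form, and becomes two instructions (t1 := y * 2, then t2 := x + t1) in the second.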

Each module defines a function transform which
takes as parameters a concrete method and its associated
JCode.code, and returns a representation of type
t. Combined with
JProgram.map_program2, this function can be used to transform a
whole program, for example one loaded with the RTA algorithm.

JPrintHtml module

This module dumps a program, for a given code representation, into a
set of .html files (one per class)
linked together by the control flow graph. It provides a functor
Make that can be instantiated by a module of signature
PrintInterface. This functor generates a module of
signature HTMLPrinter containing a function
print_program.

This module is internally used by the different Sawja
code representations through a print_program
function to dump a program using their representation.

The printer for JCode, which is a Javalib
module, is defined in JPrintHtml using the functor presented
above. It will be used as an example in the tutorial.

Tutorial

To begin this tutorial, open an OCaml toplevel, for
instance using the Emacs tuareg-mode, and
load the following libraries in the given order:

Don't forget the associated #directory
directives that specify the paths where these
libraries can be found. If you installed Sawja with Findlib, use:

#directory "<package_install_path>extlib"
#directory "<package_install_path>camlzip"
#directory "<package_install_path>ptrees"
#directory "<package_install_path>javalib"
#directory "<package_install_path>sawja"
(* <package_install_path> is given by the command 'ocamlfind printconf'.
   If it is the same path as the standard OCaml library, just replace it by '+'. *)

You can also build a toplevel including all these libraries
using the command make ocaml in the source
repository of Sawja. This command builds an executable
named ocaml which is the result of the
ocamlmktop command.

Loading and printing a program

In this section, we present how to load a program with
Sawja and some basic manipulations we can do on it to
recover interesting information.

In order to test the efficiency of Sawja, we like to
work on huge programs. For instance, we will use Soot, a
Java Optimization Framework written in Java,
which can be found at http://www.sable.mcgill.ca/soot. Once you have
downloaded Soot and its dependencies, make sure that the
$CLASSPATH environment variable contains the
corresponding .jar files, the Java
Runtime rt.jar and the Java Cryptography
Extension jce.jar. The following sample of
code loads the Soot program, given its main entry point:

Warning: Subroutine inlining is handled in JBir and
A3Bir only for non-nested subroutines (the runtime of
JRE1.6_20 contains a few nested subroutines; the next version, 1.7,
contains none). If some transformed code contains such subroutines, the
exception JBir.Subroutine or
A3Bir.Subroutine will be raised, respectively.
However, when transforming a whole program with the above function,
no exception will be raised because of the lazy evaluation
of code.

To see what the JBir representation looks like, we can
pretty-print one class, for instance
java.lang.Object:

You also need to provide a function that may associate names with
method parameters in the signature. Then, when generating the html
instructions, you need to be consistent with those names. In our
implementation JCodePrinter in jPrintHtml.ml, we
use the source variable names when the local variable table exists
in the considered method. If you want to test your printer very
quickly, you can define:

let method_param_names _ _ _ = None

Now, we need to define how to display the instructions in html.
To do so, html elements can be created using
predefined functions in JPrintHtml. These functions are
simple_elem, value_elem,
field_elem, invoke_elem and
method_args_elem. The sample of code below will
help you understand how to use these functions. We
also recommend reading the API documentation.

The html elements have to be concatenated in a list and will be
displayed in the given order. The element returned by
simple_elem is raw text. The element returned by
value_elem refers to an html class file. The
element returned by field_elem is a link to the
field definition in the corresponding html class file. Field
resolution is done by the function resolve_field
of JControlFlow. If more than one field is resolved (it
can happen with interface fields), a list of possible links is
displayed. The element returned by invoke_elem is
a list of links referring to html class file methods that have been
resolved by the static_lookup_method function of
JProgram.program. The element returned by
method_args_elem is a list of
value_elem elements corresponding to the method
parameters. They are separated by commas and encapsulated by
parentheses, ready to be displayed.

If you don't want any html effect, the above function becomes
very simple:

Create an analysis for the Sawja Eclipse Plugin

In this section we will use the live variable
analysis, included in Sawja as a dataflow analysis example, to
create an algorithm that detects unused variable assignments, and
turn it into a component of the Sawja Eclipse Plugin.

We use the result of the live variable analysis
associated with the JBir code representation (Live_bir module in
Sawja), which returns the variables that are live before the execution of an
instruction. For each instruction JBir.AffectVar
(var,expr) we check that the
variable var is live at the next instruction: if not, it is a dead
assignment.
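
The check described above can be sketched on a toy straight-line program. This sketch does not use Live_bir or JBir: the instruction type, the liveness function and dead_assignments are hypothetical, and the liveness computation is a single backward pass that is only valid because the toy code has no branches.

```ocaml
module VS = Set.Make (String)

(* Toy straight-line instructions; not the JBir instruction type. *)
type instr =
  | AffectVar of string * string list  (* target, variables read by the expression *)
  | Return of string list              (* variables read *)

let uses = function AffectVar (_, us) | Return us -> VS.of_list us
let defs = function AffectVar (x, _) -> VS.singleton x | Return _ -> VS.empty

(* live.(i) = variables live just before instruction i.
   Without branches, one backward pass is enough:
   live(i) = uses(i) union (live(i+1) minus defs(i)). *)
let liveness code =
  let n = Array.length code in
  let live = Array.make (n + 1) VS.empty in
  for i = n - 1 downto 0 do
    live.(i) <-
      VS.union (uses code.(i)) (VS.diff live.(i + 1) (defs code.(i)))
  done;
  live

(* An assignment at index i is dead when its target is not live
   before the next instruction. *)
let dead_assignments code =
  let live = liveness code in
  let dead = ref [] in
  Array.iteri
    (fun i ins ->
       match ins with
       | AffectVar (x, _) when not (VS.mem x live.(i + 1)) ->
         dead := (i, x) :: !dead
       | _ -> ())
    code;
  List.rev !dead
```

On the program x := 1; y := x; x := 2; return y, only the second assignment to x (index 2) is reported, since x is never read afterwards.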

The analysis should notify the programmer in case of a dead
variable assignment, as it could be a sign of a bug. As a
consequence we want to put warnings on the dead assignment
instruction and on the method containing it. We also want to give
more verbose information on the result of the analysis, in this
case to indicate, for each instruction, which variables are
live.

In the previous tutorials we used the OCaml toplevel, but
for this one we want to generate an executable: as a consequence we
will use the native OCaml compiler and construct the file
dvad.ml step by step (bottom up).

The head of the dvad.ml file should load the
Javalib and Sawja library packages:

open Javalib_pack
open Sawja_pack

We first parse the arguments of our executable using the module
ArgPlugin, which is a wrapper around the standard Arg
module. It will allow us to add our executable to the
Eclipse plugin directly, just by dropping it in a folder. In order to
automatically analyze all the classes in a project (see the
documentation of ArgPlugin), our code will parse a list of
class names.
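
As an illustration of the kind of parsing involved, here is a minimal sketch written with the standard Arg module only. It does not use ArgPlugin, whose interface is not reproduced here. The option names --classpath and --classes and the comma-separated format are assumptions made for this example.

```ocaml
(* Sketch with the standard Arg module; ArgPlugin's real interface and
   option names may differ. "--classpath" and "--classes" are hypothetical. *)
let parse_args argv =
  let classpath = ref "" in
  let classes = ref [] in
  (* Accept a comma-separated list of class names, ignoring empty items. *)
  let add_classes s =
    let names = List.filter (fun c -> c <> "") (String.split_on_char ',' s) in
    classes := !classes @ names
  in
  let specs = [
    ("--classpath", Arg.Set_string classpath, "Class path of the project");
    ("--classes", Arg.String add_classes, "Comma-separated class names");
  ] in
  (* ~current:(ref 0) makes the function reusable on any argv array. *)
  Arg.parse_argv ~current:(ref 0) argv specs
    (fun anon -> raise (Arg.Bad ("unexpected argument: " ^ anon)))
    "usage: dvad --classpath <path> --classes <c1,c2,...>";
  (!classpath, !classes)
```

In an executable, the function would typically be called as parse_args Sys.argv.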

The implementation of this tutorial is supplied with the
Sawja library (version > 1.2) as the file
dvad-plugin.ml in src/dataflow_analyses.
It also demonstrates how to insert HTML code to display the
information on the variable liveness.

Using formulae to make assertions

Formulae are a new feature of Sawja 1.4. They provide a
way to add special assertions into a Bir
representation (JBir, A3Bir...) using dedicated Java stub methods.
To keep full compatibility with the previous versions of
Sawja, this feature is disabled by default. You can enable
formulae by setting the optional formula
argument.

If you set the formula argument to
true without specifying the
formula_cmd argument, you will be using the
default_formulae. It means that the default Java stub methods are
going to be used to generate the formulae. A Java call to
such a method will be replaced in the generated JBir code
by the formula. Those static methods must return
void and take only a single boolean argument,
otherwise they will not be considered as formulae.
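
The rule above (a registered static method, returning void, with a single boolean argument) can be sketched on a toy instruction type. None of the types below are Sawja's: jtype, static_call, instr and to_formula are hypothetical, and the argument expression is kept as plain text for simplicity.

```ocaml
(* Toy model of the rule stated above; none of these types are Sawja's. *)
type jtype = Void | Boolean | Int

type static_call = {
  cls : string;
  meth : string;
  params : jtype list;
  ret : jtype;
  args : string list;   (* argument expressions, kept as text here *)
}

type instr =
  | InvokeStatic of static_call
  | Formula of (string * string) * string  (* (class, method), boolean expression *)

(* Handlers modelled on the default methods of sawja.Assertions. *)
let default_handlers =
  List.map (fun m -> ("sawja.Assertions", m)) ["assume"; "check"; "invariant"]

(* A call becomes a formula only if it targets a registered handler,
   returns void and takes exactly one boolean argument. Any other
   instruction is left unchanged. *)
let to_formula handlers = function
  | InvokeStatic { cls; meth; params = [Boolean]; ret = Void; args = [e] }
    when List.mem (cls, meth) handlers ->
    Formula ((cls, meth), e)
  | other -> other
```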

The default value offers three Java static methods named
'assume', 'check' and
'invariant', defined in the class
'sawja.Assertions'. The Java source file is
located in runtime/sawja/Assertions.java and is
needed to compile your Java classes that use formulae (the method
bodies are empty since those methods will never actually be called).

Note that the formula directly
stores the boolean expression which was given as argument to the
Java method. This expression can then be directly manipulated in
different analyses to make assumptions. The generated 'nop'
instructions replace the instructions which were used to
create the formula.

In the current implementation, the expression obtained when
working with the A3Bir representation is quite
limited (it is often a simple temporary variable, containing the
result of the expression). It will be completed in a future
release.

You can also create your own formula handler, using the
class and static methods you want (methods must return void and
take only a single boolean argument). You just have to proceed as
follows: