File Dependency Example

One of the most common uses of the graph abstraction in computer
science is to track dependencies. An example of dependency tracking
that we deal with on a day-to-day basis is the compilation
dependencies for files in programs that we write. These dependencies
are used inside programs such as make or in an IDE such as
Visual C++ to minimize the number of files that must be recompiled
after some changes have been made.

Figure 1 shows a graph that has a vertex
for each source file, object file, and library that is used in the
killerapp program. The edges in the graph show which files
are used in creating other files. The choice of which direction to
point the arrows is somewhat arbitrary. As long as we are consistent
in remembering that the arrows mean ``used by'' then things will work
out. The opposite direction would mean ``depends on''.

Figure 1:
A graph representing file dependencies.

A compilation system such as make has to be able to answer a
number of questions:

If we need to compile (or recompile) all of the files, in what order
should that be done?

What files can be compiled in parallel?

If a file is changed, which files must be recompiled?

Are there any cycles in the dependencies? (A cycle means the user
has made a mistake, and an error should be emitted.)

In the following examples we will formulate each of these questions in
terms of the dependency graph, and then find a graph algorithm to
provide the solution. The graph in Figure
1 will be used in all of the following examples. The source code
for this example can be found in the file examples/file_dependencies.cpp.

Graph Setup

Here we show the construction of the graph. First, these are the required
header files:

For simplicity we have
constructed the graph by hand. A compilation system such
as make would instead parse a Makefile to get the
list of files and to set up the dependencies. We use the
adjacency_list class to represent the graph. The
vecS selector means that a std::vector will be used
to represent each edge list, which provides efficient traversal. The
bidirectionalS selector means we want a directed graph with access
to both the edges outgoing from each vertex and the edges incoming to
each vertex. The color_property attaches a color property to each
vertex of the graph; the color property will be used in several of the
algorithms in the following sections.
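
To make the idea of a bidirectional graph concrete, here is a minimal
sketch that uses only the standard library. It is not the BGL
adjacency_list and not the code from examples/file_dependencies.cpp.
The names DependencyGraph, add_edge and make_example_graph, as well as
the four file names in the comments, are hypothetical and chosen only
for illustration.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a bidirectional directed graph.
// An edge (u,v) means "file u is used by file v".
struct DependencyGraph {
    std::vector<std::vector<std::size_t>> out_edges; // u -> files that use u
    std::vector<std::vector<std::size_t>> in_edges;  // v -> files v depends on

    explicit DependencyGraph(std::size_t n) : out_edges(n), in_edges(n) {}

    // Record the edge in both directions so that either end can be
    // traversed without scanning the whole graph.
    void add_edge(std::size_t u, std::size_t v) {
        out_edges[u].push_back(v);
        in_edges[v].push_back(u);
    }
};

// Build a tiny graph "by hand" from an array of "used by" pairs.
// Assumed files: 0 = util.h, 1 = main.cpp, 2 = main.o, 3 = app.
inline DependencyGraph make_example_graph() {
    const std::pair<std::size_t, std::size_t> used_by[] = {
        {0, 1}, {1, 2}, {2, 3}
    };
    DependencyGraph g(4);
    for (const auto& e : used_by) g.add_edge(e.first, e.second);
    return g;
}
```

Storing each edge twice costs extra memory, but it makes the in-degree
of a vertex (the number of files it depends on) available directly,
which the parallel compilation example relies on.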

Compilation Order (All Files)

On the first invocation of make for a particular project, all
of the files must be compiled. Given the dependencies between the
various files, what is the correct order in which to compile and link
them? First we need to formulate this in terms of a graph. Finding a
compilation order is the same as ordering the vertices in the graph.
The constraint on the ordering is the file dependencies which we have
represented as edges. So if there is an edge (u,v) in the graph,
then v must not come before u in the ordering. This
kind of constrained ordering is called a
topological sort. Therefore, answering the question of
compilation order is as easy as calling the BGL algorithm topological_sort(). The
traditional form of the output for topological sort is a linked list
of the sorted vertices. The BGL algorithm instead writes the sorted
vertices to any OutputIterator,
which allows for much more flexibility. Because topological_sort()
emits the vertices in reverse topological order, here we use the
std::front_insert_iterator to create an output iterator that
inserts each vertex on the front of a linked list, which leaves the
list in build order. Other possible
options are writing the output to a file or inserting into a different
STL or custom-made container.
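
The interplay between an output iterator and front insertion can be
sketched without BGL. The following standard-library code is only an
illustration of the technique (a depth-first search that writes each
vertex when it finishes), not the implementation of topological_sort().
The names AdjacencyList, dfs_finish_order, reverse_topological_order
and make_order are hypothetical, and the graph is assumed to be acyclic.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <vector>

// out_edges[u] lists every v with an edge (u,v), i.e. u is "used by" v.
using AdjacencyList = std::vector<std::vector<std::size_t>>;

// Depth-first helper: writes u to `out` only after every vertex
// reachable from u has been written.
template <typename OutputIterator>
void dfs_finish_order(const AdjacencyList& out_edges, std::size_t u,
                      std::vector<bool>& visited, OutputIterator& out) {
    visited[u] = true;
    for (std::size_t v : out_edges[u])
        if (!visited[v]) dfs_finish_order(out_edges, v, visited, out);
    *out++ = u;
}

// Writes all vertices in reverse topological order.
// Assumes the graph has no cycles.
template <typename OutputIterator>
void reverse_topological_order(const AdjacencyList& out_edges,
                               OutputIterator out) {
    std::vector<bool> visited(out_edges.size(), false);
    for (std::size_t u = 0; u < out_edges.size(); ++u)
        if (!visited[u]) dfs_finish_order(out_edges, u, visited, out);
}

// Inserting each vertex at the front of a list reverses the output,
// so the list ends up in a valid build order.
inline std::list<std::size_t> make_order(const AdjacencyList& out_edges) {
    std::list<std::size_t> order;
    reverse_topological_order(out_edges, std::front_inserter(order));
    return order;
}
```

Because the algorithm only requires an output iterator, the same
function could write to a std::ostream_iterator or a std::back_inserter
on a vector; in the latter case the caller would have to reverse the
result to obtain a build order.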

Parallel Compilation

Another question the compilation system might need to answer is: what
files can be compiled simultaneously? This would allow the system to
spawn threads and utilize multiple processors to speed up the build.
This question can also be put in a slightly different way: what is the
earliest time that a file can be built assuming that an unlimited
number of files can be built at the same time? The main criterion for
when a file can be built is that all of the files it depends on must
already be built. To simplify things for this example, we'll assume
that each file takes 1 time unit to build (even header files). For
parallel compilation, we can build all of the files corresponding to
vertices with no dependencies, i.e., those that have
an in-degree of 0, in the first step. For all other files, the
main observation for determining the ``time slot'' for a file is that
the time slot must be one more than the maximum time slot of the files
it depends on.

We start by creating a vector time that will store the
time step at which each file can be built. We initialize every value
with time step zero.

std::vector<int> time(N, 0);

Now, we want to visit the vertices again in topological order,
from those files that need to be built first until those that need
to be built last. However, instead of printing out the order
immediately, we will compute the time step in which each file should
be built based on the time steps of the files it depends on. We
only need to consider those files whose in-degree is greater than
zero, since files with no dependencies keep their initial time step
of zero.
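
As a rough sketch of this computation, assuming the incoming edges of
each vertex and a topological order are already available as plain
vectors, the time steps could be filled in as follows. This is
standard-library code, not the BGL version; AdjacencyList and
compute_time_slots are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// in_edges[v] lists every u with an edge (u,v), i.e. the files v depends on.
using AdjacencyList = std::vector<std::vector<std::size_t>>;

// `order` must be a topological order of the vertices, so every file
// appears after all of the files it depends on.
inline std::vector<int> compute_time_slots(
        const AdjacencyList& in_edges,
        const std::vector<std::size_t>& order) {
    std::vector<int> time(in_edges.size(), 0);
    for (std::size_t v : order) {
        if (in_edges[v].empty()) continue;  // in-degree 0: stays at step 0
        int max_dep = 0;
        for (std::size_t u : in_edges[v])
            max_dep = std::max(max_dep, time[u]);
        time[v] = max_dep + 1;              // one step after the latest dependency
    }
    return time;
}
```

All files that end up with the same time step have no dependencies on
one another that are still unbuilt, so they are candidates for being
compiled in parallel.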

Cyclic Dependencies

Another question the compilation system needs to be able to answer is
whether there are any cycles in the dependencies. If there are cycles,
the system will need to report an error to the user so that the cycles
can be removed. One easy way to detect a cycle is to run a depth-first search, and if the
search runs into an already discovered vertex (of the current search
tree), then there is a cycle. The BGL graph search algorithms (which
include depth_first_search()) are all extensible via the
visitor mechanism. A visitor is similar to a function object,
but it has several methods instead of just the one
operator(). The visitor's methods are called at certain
points within the algorithm, thereby giving the user a way to extend
the functionality of the graph search algorithms. See Section Visitor Concepts
for a detailed description of visitors.

We will create a visitor class and fill in the back_edge()
method, which is the DFSVisitor method
that is called when DFS explores an edge to a vertex that has been
discovered but not yet finished, that is, a vertex still on the
current search path. A call to this method indicates the existence of a
cycle. Inheriting from dfs_visitor<>
provides the visitor with empty versions of the other visitor methods.
Once our visitor is created, we can construct an object and pass it
to the BGL algorithm. Visitor objects are passed by value inside of
the BGL algorithms, so the has_cycle flag is stored by
reference in this visitor. That way, the copies made by the algorithm
still update the caller's flag.
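
The following standard-library sketch imitates this arrangement without
using BGL. The hand-written depth-first search, the three-color marking,
and the names CycleDetector, dfs_visit and graph_has_cycle are
assumptions made for illustration; only the idea of a back_edge()
callback that sets a flag held by reference comes from the text above.

```cpp
#include <cstddef>
#include <vector>

// out_edges[u] lists every v with an edge (u,v).
using AdjacencyList = std::vector<std::vector<std::size_t>>;

// White: undiscovered, Gray: discovered but not finished, Black: finished.
enum class Color { White, Gray, Black };

// Hypothetical visitor-like object. It holds the flag by reference,
// so copies of the visitor all update the same bool.
struct CycleDetector {
    bool& has_cycle;
    explicit CycleDetector(bool& flag) : has_cycle(flag) {}
    void back_edge(std::size_t /*u*/, std::size_t /*v*/) const {
        has_cycle = true;
    }
};

// The visitor is deliberately taken by value, as a generic algorithm might.
inline void dfs_visit(const AdjacencyList& out_edges, std::size_t u,
                      std::vector<Color>& color, CycleDetector vis) {
    color[u] = Color::Gray;
    for (std::size_t v : out_edges[u]) {
        if (color[v] == Color::White)
            dfs_visit(out_edges, v, color, vis);
        else if (color[v] == Color::Gray)
            vis.back_edge(u, v);  // v is still on the current search path
    }
    color[u] = Color::Black;
}

inline bool graph_has_cycle(const AdjacencyList& out_edges) {
    bool found = false;
    CycleDetector vis(found);
    std::vector<Color> color(out_edges.size(), Color::White);
    for (std::size_t u = 0; u < out_edges.size(); ++u)
        if (color[u] == Color::White) dfs_visit(out_edges, u, color, vis);
    return found;
}
```

If the visitor stored the flag as a plain bool member instead, each
by-value copy would set its own private flag and the caller would never
see the result.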