Digraphs, Dags, and Trees in Java

A graph is a collection of nodes connected by edges. Programmers run into graphs fairly regularly because almost any collection of things with binary relationships can be viewed as a graph. As practitioners we need to understand both graph theory (abstract structures) and graph data structures (concrete representations).

Trees: Dags with a single root and where non-roots have one parent. (Technically this is a rooted tree.)

This article looks at digraphs, dags, and trees from a programmer’s perspective. Where do we see them in practice? How can we recognize them? What can we do with them?

This article also includes a Java library for working with digraphs, dags, and trees. The library code is available at https://github.com/stevewedig/blog and is published to Maven for easy usage as a dependency. This is the third in a series of Java libraries I’m sharing on this blog. The first two were:

Parents & Children: We can define the nodes on one side of arcs as “parents” and the nodes on the other side as “children”. Once we’ve settled this definition, then a node’s neighbors are grouped into a parent set and a child set. In cyclic digraphs a node can be its own parent and child.

Ancestors & Descendants: A node’s ancestors include its parents, its parents’ parents, and so on. So ancestors are the transitive version of parents, and descendants are the transitive version of children. In cyclic digraphs a node can be among its own ancestors and descendants. The nodes in a cycle will all have the same set of ancestors and descendants.

Roots (Sources) & Leaves (Sinks): Roots are nodes without any parents. Leaves are nodes without any children. Cyclic digraphs don’t necessarily have any roots or leaves. Parents are non-leaves, and children are non-roots.
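The attributes above can all be derived from a parent map (child id -> parent ids). Here is a minimal sketch of that idea using only the JDK. The class name DigraphAttributes and its methods are hypothetical and are not the library's API. Descendants are omitted because they mirror ancestors with children in place of parents.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: a digraph stored as a parent map (child id -> parent ids).
final class DigraphAttributes {
  private final Map<String, Set<String>> parents = new TreeMap<>();

  DigraphAttributes(Set<String> ids, Map<String, Set<String>> parentMap) {
    for (String id : ids) {
      Set<String> parentIds = new TreeSet<>(parentMap.getOrDefault(id, Set.of()));
      if (!ids.containsAll(parentIds)) {
        throw new IllegalArgumentException("unknown parent id for " + id);
      }
      parents.put(id, parentIds);
    }
  }

  Set<String> parentsOf(String id) {
    return parents.get(id);
  }

  // Children come from reading the parent map in the other direction.
  Set<String> childrenOf(String id) {
    Set<String> children = new TreeSet<>();
    for (Map.Entry<String, Set<String>> entry : parents.entrySet()) {
      if (entry.getValue().contains(id)) children.add(entry.getKey());
    }
    return children;
  }

  // Ancestors are the transitive version of parents. The "seen" set keeps
  // the walk finite when the digraph contains cycles.
  Set<String> ancestorsOf(String id) {
    Set<String> seen = new TreeSet<>();
    Deque<String> todo = new ArrayDeque<>(parents.get(id));
    while (!todo.isEmpty()) {
      String next = todo.pop();
      if (seen.add(next)) todo.addAll(parents.get(next));
    }
    return seen;
  }

  // Roots have no parents.
  Set<String> roots() {
    Set<String> roots = new TreeSet<>();
    for (String id : parents.keySet()) {
      if (parents.get(id).isEmpty()) roots.add(id);
    }
    return roots;
  }

  // Leaves have no children.
  Set<String> leaves() {
    Set<String> leaves = new TreeSet<>();
    for (String id : parents.keySet()) {
      if (childrenOf(id).isEmpty()) leaves.add(id);
    }
    return leaves;
  }

  public static void main(String[] args) {
    // Arcs: a -> b, b -> c, c -> b (a cycle), plus an isolated node d.
    DigraphAttributes graph = new DigraphAttributes(
        Set.of("a", "b", "c", "d"),
        Map.of("b", Set.of("a", "c"), "c", Set.of("b")));
    System.out.println("roots: " + graph.roots());                    // [a, d]
    System.out.println("leaves: " + graph.leaves());                  // [d]
    System.out.println("ancestors of b: " + graph.ancestorsOf("b"));  // [a, b, c]
  }
}
```

In the sample graph, b and c sit on a cycle, so each is among its own ancestors and both share the same ancestor set. The isolated node d is both a root and a leaf.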

Runtime Object Graphs: Objects in memory form an object graph with references or pointers as arcs. Cycles are allowed, as you would find in a doubly linked list. The basic idea behind garbage collection is to dispose of objects that are no longer in use, which means they are no longer reachable, which means they have become disconnected from the main object graph. Under reference counting, cycles can cause memory leaks because objects in a cycle all appear to be used by someone. Tracing garbage collectors, such as the JVM’s, can detect and dispose of these cyclic islands, but with pure reference counting you need to break the cycle, either manually or by using a weak reference.

Static Component Graphs (Cyclic): Components or modules depend on other components or modules. These dependencies are ideally acyclic, but this is not always the case. Cyclic dependencies can be a code smell indicating a big ball of mud. Many component design principles are related to the static component graph.

RDF Graphs: An RDF graph is a directed graph composed of triples (subject + predicate + object). Each triple is an arc representing a fact. SPARQL is a query language for querying RDF graphs.

YAML: YAML is a serialization format that can represent cyclic digraphs, unlike XML or JSON.

Digraphs are everywhere… These Princeton slides have a list of digraph applications that includes finance, transportation, scheduling, synonyms, games, phones, food chain, diseases, control flow, and so on.

2. Dags (directed-acyclic graphs)

Candy Land game board

A dag is a digraph that doesn’t contain any cycles. Dags contain at least one root (source) and leaf (sink), and can be viewed as flowing from their roots to their leaves.

Defining characteristic: Sequencing

Dags emerge when arcs indicate sequencing between nodes. Any kind of sequencing will do:

Temporal: A occurs before B, so B occurs after A.

Subset: A is a superset of B, so B is a subset of A.

Containment: A contains B, so B is contained by A.

Acyclic Dependencies: A is used by B, so B depends on A.

Dag attributes (in addition to digraph attributes)

Parents & Children: Dag nodes cannot be among their own parents or children, otherwise the graph would be cyclic.

Ancestors & Descendants: Dag nodes cannot be among their own ancestors or descendants, otherwise the graph would be cyclic.

Roots (Sources) & Leaves (Sinks): Dags have at least one root and at least one leaf. For every node to have a parent, there would have to be a cycle. Same thing with children.

Tree traversal: Dags support the usual tree traversals (depth first, breadth first, etc.). Start at the root nodes and make sure each node is visited only once.

Partial Node Ordering: Dags define a partial ordering between the nodes. This makes the nodes a partially ordered set (poset), which extends the more familiar total ordering with the addition of “incomparable” as an option:

A < B: A is in B’s ancestors

A > B: B is in A’s ancestors

A = B: A and B are the same node

A and B are incomparable: A and B are different and not among each other’s ancestors

Topological Sort: Topological sort means sorting the nodes by their partial ordering, as defined above. The order of incomparable nodes doesn’t matter. When visiting nodes in topological order, you know all of a node’s ancestors will be visited before the node itself. This is useful for dependency dags where you want to handle a node’s transitive dependencies before handling the node itself. Topological sorting is not possible in cyclic digraphs.
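To make the partial ordering and topological sort concrete, here is a minimal sketch using only the JDK. It assumes the dag is given as a parent map (child id -> parent ids) and uses Kahn's algorithm for the sort, which is one common technique and not necessarily the one the library uses. The class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; parentMap is child id -> parent ids.
final class DagOrderingSketch {

  // True when a < b in the partial order, i.e. a is among b's ancestors.
  static boolean isAncestor(Map<String, Set<String>> parentMap, String a, String b) {
    Set<String> seen = new HashSet<>();
    Deque<String> todo = new ArrayDeque<>(parentMap.getOrDefault(b, Set.of()));
    while (!todo.isEmpty()) {
      String next = todo.pop();
      if (next.equals(a)) return true;
      if (seen.add(next)) todo.addAll(parentMap.getOrDefault(next, Set.of()));
    }
    return false;
  }

  // Different nodes that are not among each other's ancestors.
  static boolean incomparable(Map<String, Set<String>> parentMap, String a, String b) {
    return !a.equals(b) && !isAncestor(parentMap, a, b) && !isAncestor(parentMap, b, a);
  }

  // Kahn's algorithm: repeatedly emit a node whose parents have all been emitted.
  static List<String> topologicalSort(Map<String, Set<String>> parentMap) {
    Map<String, Integer> pendingParents = new TreeMap<>();
    Map<String, List<String>> children = new TreeMap<>();
    for (Map.Entry<String, Set<String>> entry : parentMap.entrySet()) {
      String child = entry.getKey();
      pendingParents.merge(child, entry.getValue().size(), Integer::sum);
      for (String parent : entry.getValue()) {
        pendingParents.putIfAbsent(parent, 0);
        children.computeIfAbsent(parent, key -> new ArrayList<>()).add(child);
      }
    }
    Deque<String> ready = new ArrayDeque<>();
    for (Map.Entry<String, Integer> entry : pendingParents.entrySet()) {
      if (entry.getValue() == 0) ready.add(entry.getKey());
    }
    List<String> sorted = new ArrayList<>();
    while (!ready.isEmpty()) {
      String id = ready.remove();
      sorted.add(id);
      for (String child : children.getOrDefault(id, List.of())) {
        if (pendingParents.merge(child, -1, Integer::sum) == 0) ready.add(child);
      }
    }
    // Nodes on a cycle never reach zero pending parents, so they are never emitted.
    if (sorted.size() != pendingParents.size()) {
      throw new IllegalArgumentException("cycle detected: no topological order exists");
    }
    return sorted;
  }

  public static void main(String[] args) {
    // Dependency dag: parents are the modules a module depends on.
    Map<String, Set<String>> deps = new TreeMap<>();
    deps.put("util", Set.of());
    deps.put("lib", Set.of("util"));
    deps.put("app", Set.of("lib", "util"));
    deps.put("test", Set.of("lib"));
    System.out.println(topologicalSort(deps));
    System.out.println("app vs test incomparable: " + incomparable(deps, "app", "test"));
  }
}
```

In the sample dag, util always comes before lib, and lib before both app and test. The relative order of app and test is unspecified because they are incomparable.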

Dag examples that are not also trees

Citation Graph: Unlike web links, the links between academic paper citations form a dag. This is because citations involve temporal sequencing: the cited paper is published before the citing paper.

Nested Subsets: Multiple inheritance is a specific example of a dag where we have atoms as leaves, the set of all atoms as the root, and various smaller subsets in the middle of a dag. In this dag parents are supersets, and children are subsets. (Apparently if such a dag contains all subsets it is a power set.)

Nested Tags: Folder systems organize every item into a single folder. In contrast tagging systems (sometimes called categories or labels) can associate an item with multiple tags. Folder systems are often nested, allowing folders to contain folders. The same can be done with tagging systems, allowing tags to contain tags. Nested folder systems form a tree, and nested tag systems form a dag.

Sexual Reproduction: A family tree is a dag where every node has two parents.

Causal Dependencies: Bayesian networks are dags where nodes are probabilistic events that conditionally depend on their parent events. Changes in an event’s probability propagate through the graph, affecting the probability of related events. Arcs usually represent causality.

Task Dependencies: Workflow systems manage a set of tasks which depend on other tasks being completed first. Tasks are often asynchronous.

Composition Tree: The composite pattern leads to trees. Nodes are either atomic pieces or they are composites containing nested nodes (which themselves may be atoms or other composites). This is similar to assembling Ikea furniture: you start with the atoms, then you build up intermediate composite pieces, and eventually you have assembled the entire object (the root composite node).

Speciation: Speciation is the evolutionary process that creates new species. Separate groups within a single species evolve in different directions until the species has been split into multiple species. Speciation forms the Tree of Life.

Asexual reproduction: Asexual reproduction is when offspring arise from a single organism, such as an amoeba splitting in two. This forms an ancestry tree where every organism has one parent.

XML: XML documents are trees. XML trees are ordered, meaning a node’s children are ordered. XPath is a query language for querying XML documents.
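As an illustration of the composition tree example above, here is a minimal sketch of the composite pattern in plain Java. All of the names (Part, Atom, Composite, CompositeDemo) are hypothetical and only show the shape of the pattern: a composite holds nested parts, and a query on the root recurses down to the atoms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this only illustrates the shape of the pattern.
interface Part {
  String name();

  int atomCount(); // number of atomic pieces in this subtree
}

final class Atom implements Part {
  private final String name;

  Atom(String name) {
    this.name = name;
  }

  public String name() {
    return name;
  }

  public int atomCount() {
    return 1;
  }
}

final class Composite implements Part {
  private final String name;
  private final List<Part> children = new ArrayList<>();

  Composite(String name, Part... children) {
    this.name = name;
    this.children.addAll(List.of(children));
  }

  public String name() {
    return name;
  }

  // A composite answers by asking its children, which may be atoms or composites.
  public int atomCount() {
    int total = 0;
    for (Part child : children) total += child.atomCount();
    return total;
  }
}

final class CompositeDemo {
  public static void main(String[] args) {
    Part frame = new Composite("frame",
        new Atom("left side"), new Atom("right side"), new Atom("back panel"));
    Part bookshelf = new Composite("bookshelf", frame, new Atom("shelf"), new Atom("shelf"));
    // prints "bookshelf has 5 atomic pieces"
    System.out.println(bookshelf.name() + " has " + bookshelf.atomCount() + " atomic pieces");
  }
}
```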

When designing tools there is a fundamental tradeoff between simplicity and generality. Think steak knife vs. Swiss Army knife. Two quality “Swiss Army” graph libraries for Java are JGraphT and JUNG (discussion on StackOverflow). If you are building a graph-centric application you may want such a generic library. But what if your application just happens to have a digraph or two? Could a less generic tool be simpler and easier to use? I think so, and that’s the motivation for this article’s library. Here’s what makes it easy to use:

Focus: The library only supports immutable, directed, unordered, and unweighted digraphs, dags, and trees. This family can solve a wide range of graph problems, but not all of them. A narrower scope means the library’s interfaces don’t have to incorporate graphs that are mutable, undirected, ordered, weighted, hyper, etc.

Node Centric: The library supports node centric usage. Many digraph problems are naturally modeled as recursive node structures, with nodes pointing to their parents or children. However generic graph libraries tend to separate nodes (vertices) from edges (arcs). This effectively makes edges first class concepts, if not actual objects. Such a design makes perfect sense for a generic graph library, but it can be less convenient when you’re thinking in terms of a recursive node structure.

Immutable/Functional: The library’s graphs are all immutable value objects, similar to Guava’s immutable collections. Immutable value objects have many benefits: they are referentially transparent, they can be validated once upon creation, they can be safely shared, their properties can be safely cached, and they are easier to reason about than mutable objects. The tradeoff is that immutable objects require a clean separation between creation and usage, which for some use cases isn’t practical or efficient.

Guava: The library’s only dependency is Google Guava. This makes it easy to drop into a project and also makes it compatible with Google Web Toolkit. The library also fully embraces Guava, making liberal use of the immutable collections: ImmutableList, ImmutableSet, ImmutableSetMultimap, and ImmutableBiMap.

Update: An author of JUNG mentioned that he is going to be releasing a new graph library that also has some of these properties.

Overview of library concepts

Nodes: A digraph consists of nodes that are connected by directed arcs. Nodes can be any type of object.

Ids: Nodes are referenced by id. These ids can be any type of object. A common choice is to use string names.

Id Graphs: Id graphs are digraphs without node objects. The IdGraph<Id> interface is extended by IdDag<Id> which is extended by IdTree<Id>. Id graphs contain a graph’s id structure, so their state is a set of ids and a mapping from child id to parent ids, stored in ImmutableSet<Id> and ImmutableSetMultimap<Id, Id> respectively. If your use case doesn’t require node objects, id graphs can be used on their own.

Node Graphs: Node graphs are digraphs with node objects. The Graph<Id, Node> interface is extended by Dag<Id, Node> which is extended by Tree<Id, Node>. A node graph’s state is an IdGraph<Id> and a mapping from id to node, stored in ImmutableBiMap<Id, Node>. Node graphs implement Set<Node>. They also implement IdGraph<Id> by delegating to their nested id graph.

Partial Node Graphs: Partial node graphs are node graphs that don’t necessarily have a node associated with every id. Such graphs with unbound ids are occasionally useful.

Up and Down Nodes: Node graphs place no constraints on the type of node objects they can contain. However it is often convenient to have nodes that know their own id and either their parent ids or child ids. These interfaces are UpNode<Id> and DownNode<Id> respectively.

An unintended benefit of the interface hierarchy is that all of the library’s methods are listed in the Javadoc page for the Tree interface.

Id graphs

Id graphs are digraphs without node objects. The interface IdGraph is extended by IdDag which is extended by IdTree. These interfaces are demonstrated using sample graphs:

When reading the tests, note that parseList(), parseSet(), and parseMultimap() are just helpers for building Guava’s ImmutableList<String>, ImmutableSet<String>, and ImmutableSetMultimap<String, String>.

The id graph libraries (IdGraphLib, IdDagLib, IdTreeLib) enable you to create graphs from a parent map (child id -> parent ids) or a child map (parent id -> child ids). If all nodes have an arc, then you can just pass in a parent map or child map without also providing the set of ids.
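The relationship between a parent map and a child map, and the shortcut of deriving the id set from the arcs, can be sketched with plain JDK collections. This is an illustration under assumed names (ArcMapSketch, invert, idsOf). It does not use or reproduce the library's factory methods.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch; not the library's API.
final class ArcMapSketch {

  // Flips a child map (parent id -> child ids) into a parent map
  // (child id -> parent ids), or the other way around.
  static Map<String, Set<String>> invert(Map<String, Set<String>> arcs) {
    Map<String, Set<String>> inverted = new TreeMap<>();
    for (Map.Entry<String, Set<String>> entry : arcs.entrySet()) {
      for (String target : entry.getValue()) {
        inverted.computeIfAbsent(target, key -> new TreeSet<>()).add(entry.getKey());
      }
    }
    return inverted;
  }

  // When every node takes part in at least one arc, the id set is simply
  // every id mentioned on either side of the arc map.
  static Set<String> idsOf(Map<String, Set<String>> arcs) {
    Set<String> ids = new TreeSet<>();
    for (Map.Entry<String, Set<String>> entry : arcs.entrySet()) {
      ids.add(entry.getKey());
      ids.addAll(entry.getValue());
    }
    return ids;
  }

  public static void main(String[] args) {
    Map<String, Set<String>> childMap =
        Map.of("root", Set.of("a", "b"), "a", Set.of("c"));
    System.out.println(invert(childMap)); // {a=[root], b=[root], c=[a]}
    System.out.println(idsOf(childMap));  // [a, b, c, root]
  }
}
```

A node with no arcs at all never appears in either map, which is why an isolated node requires passing the id set explicitly.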

Graphs are validated upon creation, throwing the errors defined in the digraph.errors package:

IdGraphClass throws GraphHadUnexpectedIds if the arc mapping contains ids not in the id set.
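Validating once at creation works well with immutable graphs, because a graph that was valid when built stays valid. Here is a minimal sketch of the unexpected-ids check using only the JDK. UnexpectedIdsException and ValidatedIdGraph are hypothetical stand-ins, not the library's GraphHadUnexpectedIds or IdGraphClass.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for the library's GraphHadUnexpectedIds error.
final class UnexpectedIdsException extends RuntimeException {
  final Set<String> unexpectedIds;

  UnexpectedIdsException(Set<String> unexpectedIds) {
    super("arc mapping mentions ids outside the id set: " + unexpectedIds);
    this.unexpectedIds = unexpectedIds;
  }
}

// Hypothetical immutable id graph that is validated once, when it is created.
final class ValidatedIdGraph {
  private final Set<String> ids;
  private final Map<String, Set<String>> parentMap; // child id -> parent ids

  private ValidatedIdGraph(Set<String> ids, Map<String, Set<String>> parentMap) {
    this.ids = ids;
    this.parentMap = parentMap;
  }

  static ValidatedIdGraph create(Set<String> ids, Map<String, Set<String>> parentMap) {
    Set<String> unexpected = new TreeSet<>();
    Map<String, Set<String>> copy = new TreeMap<>();
    for (Map.Entry<String, Set<String>> entry : parentMap.entrySet()) {
      unexpected.add(entry.getKey());
      unexpected.addAll(entry.getValue());
      // Defensive copies keep later changes by the caller out of the graph.
      copy.put(entry.getKey(), Collections.unmodifiableSet(new TreeSet<>(entry.getValue())));
    }
    unexpected.removeAll(ids);
    if (!unexpected.isEmpty()) throw new UnexpectedIdsException(unexpected);
    return new ValidatedIdGraph(
        Collections.unmodifiableSet(new TreeSet<>(ids)), Collections.unmodifiableMap(copy));
  }

  Set<String> ids() {
    return ids;
  }

  Set<String> parentIds(String id) {
    return parentMap.getOrDefault(id, Collections.emptySet());
  }

  public static void main(String[] args) {
    ValidatedIdGraph ok = create(Set.of("a", "b"), Map.of("b", Set.of("a")));
    System.out.println("parents of b: " + ok.parentIds("b"));
    try {
      create(Set.of("a", "b"), Map.of("b", Set.of("a", "ghost")));
    } catch (UnexpectedIdsException e) {
      System.out.println(e.getMessage());
    }
  }
}
```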

The attributes on id graphs are mostly self-explanatory. Properties that return collections are lazily computed and most are cached. To trade off efficiency against convenience, some methods have two versions: one returning an Iterable and another returning a Collection. As you move from IdGraph to IdDag to IdTree you get more methods that are specific to those types of digraphs (such as rootId() on IdTree). The generic traversal methods are described in the Custom traversals section below.

If you don’t need node objects then id graphs can be used on their own. TestExampleTwitterGraph and TestExampleCategoryTree show example applications that directly use id graphs. If you aren’t using nodes but you do want to associate a payload with each id, then you can just keep around a payload map stored as Map<Id, Payload>.

Node graphs

Node graphs are digraphs with node objects. The interface Graph is extended by Dag which is extended by Tree. These interfaces are demonstrated using the same sample graphs:

The node graph libraries (GraphLib, DagLib, TreeLib) have methods to create graphs from the underlying state: an id graph and a node map. This enables you to provide any type of node objects. However it’s usually more convenient to work with nodes that know their own id and either their parent ids or child ids. These interfaces are UpNode<Id> and DownNode<Id>. The graph libraries have up() and down() methods for creating graphs from collections of UpNodes or DownNodes. These methods can also be used for graph unions because they accept collections of Set<Node>, and because node graphs implement Set<Node>.

The attributes on node graphs are mostly self-explanatory. Properties that return collections are lazily computed and most are cached. To trade off efficiency against convenience, some methods have two versions: one returning an Iterable and another returning a Collection. As you move from Graph to Dag to Tree you get more methods that are specific to those types of digraphs (such as rootNode() on Tree). The generic traversal methods are described in the Custom traversals section below.

Node graphs can be used as id graphs because they implement the id graph interfaces (by delegating to their nested id graph). So in addition to the node graph methods with “Node” in the name, you have direct access to all the id graph methods with “Id” in their name. You can also access the underlying id graph via the idGraph() method.

Normal node graphs have a node associated with every id. However for a few applications it makes sense to allow unbound ids: ids without nodes. Partial graphs use the same underlying graph classes as the full graphs, however their interfaces exclude node traversals or other attributes that are undefined in the presence of unbound ids.

TestExampleDependencyDag is an example application of partial graphs, where nodes are modules, dags represent the dependency structure between modules, and unbound ids are the external dependencies that the environment is expected to provide.
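The idea of unbound ids can be shown without the library: they are the ids that appear as someone's dependency but have no node of their own. Here is a minimal sketch with a hypothetical helper, assuming modules are described by a map from module id to the ids it depends on (its parent ids).

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; not the library's partial graph API.
final class UnboundIdsSketch {

  // dependencies: module id -> ids of the modules it depends on (its parents).
  // An id is unbound when it is depended upon but is not itself a known module.
  static Set<String> unboundIds(Map<String, Set<String>> dependencies) {
    Set<String> unbound = new TreeSet<>();
    for (Set<String> parentIds : dependencies.values()) unbound.addAll(parentIds);
    unbound.removeAll(dependencies.keySet());
    return unbound;
  }

  public static void main(String[] args) {
    Map<String, Set<String>> dependencies = Map.of(
        "app", Set.of("core", "logging-api"),
        "core", Set.of("logging-api", "jdk"));
    // prints "[jdk, logging-api]": the environment is expected to provide these
    System.out.println(unboundIds(dependencies));
  }
}
```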

Custom node classes

When you already have domain objects corresponding to nodes, it is convenient to have them directly implement either UpNode or DownNode. You just have to implement id() and either parentIds() or childIds(). You can subclass UpNodeClass and DownNodeClass if you want, but it’s often easier not to. Once your nodes implement UpNode or DownNode, they work with the up() and down() methods in the node graph libraries: GraphLib, DagLib, TreeLib.
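Here is a minimal sketch of a domain object acting as an up node. To stay self-contained it declares its own SketchUpNode interface instead of using the library's UpNode. The method names id() and parentIds() come from the text above, but the Set<Id> return type, the Employee class, and the parentMap() helper are assumptions made for illustration.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical local stand-in for the library's UpNode<Id> interface.
interface SketchUpNode<Id> {
  Id id();

  Set<Id> parentIds();
}

// A domain object that knows its own id and its parent id (its manager).
final class Employee implements SketchUpNode<String> {
  private final String name;
  private final String managerName; // null for the person at the top

  Employee(String name, String managerName) {
    this.name = name;
    this.managerName = managerName;
  }

  public String id() {
    return name;
  }

  public Set<String> parentIds() {
    return managerName == null
        ? Collections.<String>emptySet()
        : Collections.singleton(managerName);
  }
}

final class UpNodeSketch {
  // Collects the arcs that a set of up nodes implies: child id -> parent ids.
  static <Id> Map<Id, Set<Id>> parentMap(Collection<? extends SketchUpNode<Id>> nodes) {
    Map<Id, Set<Id>> result = new LinkedHashMap<>();
    for (SketchUpNode<Id> node : nodes) result.put(node.id(), node.parentIds());
    return result;
  }

  public static void main(String[] args) {
    List<Employee> staff = List.of(
        new Employee("ada", null), new Employee("bob", "ada"), new Employee("cy", "ada"));
    System.out.println(parentMap(staff)); // {ada=[], bob=[ada], cy=[ada]}
  }
}
```

Because each node carries its own arcs, a collection of nodes is enough to reconstruct the graph structure, which is what makes the node centric style convenient.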

The venerable Design Patterns book includes the good advice to prefer composition over inheritance. This advice suggests that we should wrap graphs with our own objects instead of subclassing graphs. Perhaps we can even hide the existence of a graph object from our clients. Composition is a good starting point, however sometimes we do want to provide a full graph interface to our clients, along with additional problem-specific functionality. In this case it can make sense to subclass one of the library’s graphs.

TestExampleFileTree is an example application with a custom graph class. It extends TreeClass to add a custom traversal and domain specific node sets.

Custom traversals

The graphs’ built-in traversals include depth first and breadth first orderings, as well as traversing through a node’s ancestors or descendants. You can define your own custom traversals using the generic traverseIdIterable() or traverseNodeIterable() methods:

The library’s graphs are unordered, meaning there is no ordering between a node’s children or parents. However custom traversals can apply such an ordering inside the expand() function. This is why expand() returns List<Id> instead of Set<Id>. TestExampleFileTree shows an example of this, using traverseNodeIterable() to visit child nodes in alphabetical order.
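To show how an expand() function can impose an ordering, here is a minimal sketch of a generic traversal written against plain JDK types. It is not the library's traverseIdIterable(): the traverse() signature is an assumption, and unlike an Iterable-based version it eagerly returns a list. A single loop handles both depth first and breadth first by choosing which end of the deque receives the expanded ids.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch; not the library's TraverseLib.
final class TraversalSketch {

  // Visits each reachable id once. expand returns a List so that the caller
  // controls the order in which a node's neighbors are visited.
  static <Id> List<Id> traverse(
      boolean depthFirst, List<Id> startIds, Function<Id, List<Id>> expand) {
    Deque<Id> todo = new ArrayDeque<>(startIds);
    Set<Id> seen = new HashSet<>();
    List<Id> visited = new ArrayList<>();
    while (!todo.isEmpty()) {
      Id id = todo.removeFirst();
      if (!seen.add(id)) continue; // already visited via another path
      visited.add(id);
      List<Id> next = expand.apply(id);
      if (depthFirst) {
        // Push in reverse so the first expanded id is the next one visited.
        for (int i = next.size() - 1; i >= 0; i--) todo.addFirst(next.get(i));
      } else {
        todo.addAll(next); // queue behavior: visit level by level
      }
    }
    return visited;
  }

  // An expand function that orders an unordered child map alphabetically.
  static List<String> sortedChildren(Map<String, List<String>> childMap, String id) {
    List<String> children = new ArrayList<>(childMap.getOrDefault(id, List.of()));
    Collections.sort(children);
    return children;
  }

  public static void main(String[] args) {
    Map<String, List<String>> files = new HashMap<>();
    files.put("root", List.of("src", "docs"));
    files.put("src", List.of("Main.java", "App.java"));
    files.put("docs", List.of("readme.md"));
    Function<String, List<String>> expand = id -> sortedChildren(files, id);
    // [root, docs, readme.md, src, App.java, Main.java]
    System.out.println(traverse(true, List.of("root"), expand));
    // [root, docs, src, readme.md, App.java, Main.java]
    System.out.println(traverse(false, List.of("root"), expand));
  }
}
```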

Digraph: TestExampleTwitterGraph sketches an example application of digraphs. It is modeled after Twitter accounts with a following/follower digraph. An account’s authority score is derived from the number of transitive followers the account has. This example doesn’t use nodes, so it directly employs an IdGraph.

Dag: TestExampleDependencyDag sketches an example application of dags. It uses a PartialDag with a custom node class that implements UpNode<Id>. Nodes are modules, dags represent the dependency structure between modules, and unbound ids are the external dependencies that the environment is expected to provide.

Tree: TestExampleFileTree sketches an example application of trees. It uses a custom tree class and a custom node class that implements UpNode<Id>.

Tree: TestExampleCategoryTree sketches another example application of trees, adding a tree structure to an enum of animal categories. This example doesn’t use nodes, so it directly employs an IdTree.

Reusable algorithms

The few graph algorithms used are decoupled from this library’s graph interfaces, which makes them reusable in other digraph contexts.

Generic Graph Traversal: TraverseLib provides the implementation for the generic traversal methods. I find it interesting that a single iterator class can support any depth first or breadth first traversal, providing both node and id traversals.