Abstract

We present some techniques that may be used to perform computations on general trees implemented in a relational database as a self-joined table in the widely used child-parent format, and we prove the validity of this technique under very general hypothesis. Another, much better actually format.

Introduction

Relational databases are not making much progress lately in the path finding problem in a persisted tree. Vendor specific SQL extensions, such as the CONNECT BY clause in Oracle, are a further step in this area. But the casual developer still finds their performance unacceptable for most real life situations, and it's also not recommended to tight its application to a vendor specific database. It is acceptable to implement different versions of stored procedures or SQLs for different databases, but when all the application logic is based on tree facilities only provided by a specific vendor, it gets much harder to port the application to other database systems. And it's usually true that the clients have not bought the most expensive commercial database systems.

We propose some solutions to the path finding problem, that are not database specific while still giving better performance than some of the database-specific extensions supporting trees.

Changes to the child-parent format to keep the entire path for every node

Database structure

The reader should note that the method presented in this section is efficient especially for read-only access to such persisted trees. The write operations complexity is roughly presented in this paper. Also, the alternate format is what you probably are looking for.

For performance reasons, one could try to add to every node the information representing the path to that node from the root of the tree. I don't actually know any applications that could benefit from this, and if I can't find such stuff in a short time, I'll remove this section.

We add a column named Path to the table containing the child to parent links. This column must not be too big, so we'll analyze it's space complexity first. The time needed to compute the Path column values is less important because we primarily discuss here read-only trees.

The Path column should be some kind of bit sequence. Some examples are RAW in Oracle or Binary in SQL Server. We define the Path value recursively (trees are to be processed recursively, isn't it ?):

Path(RootNode) = Null

For nodes other than the root,

Path(Node) = Path(Parent(Node)) + Index(Node, Parent(Node))

where + means concatenation of bit sequences, and Index is the child ordinal in the domain of the parent nodes children list.

Note that the Path will uniquely identify every node in the tree. An unique index on this column might be ok to use, if needed in case a single tree is stored in the table.

All the nodes at a given level will have the Index represented at the same bit size. Let the node count in a tree be denoted by N, let the levels be 0,1,..., L and the max(Index(i)) be Ci, where 0 <= i <= L – 1. If we want we can also define max(Index(CL)) = 0, because the nodes on the highest level have no children. It becomes clear that the bit size of the Index for nodes on level i is equal to ceil(log2(Ci)), where ceil is an integer value equal to the smallest integer greater than or equal to its numeric argument, and log2 is the base 2 logarithm of a specified number.

Having this Path format, we can easily select an entire subtree when given the subroot, searching for nodes where Path(Node).startsWith(Path(SubRoot)) and we also have at each node the information needed to identify the path to the root. But how long must be the Path column to contain this plenty of information?

If the tree is really static (read-only), that the Path size is calculated specifically for it. One can use this approach if he has strong guarantees that there is no need to change the tree structure, or that modifying a column length is an acceptable operation (especially at run-time). Another approach would be to first compute the maximum Path size for a given tree node count. Even if the first approach does not need knowledge about the maximum path size for trees of N nodes, it is still needed to mathematically prove that both approaches are appropriate for real life usage.

Path size bounds

For any given node count N, we find out the worst case for the required Path size, and also compute that maximum Path size value. If we allocate that maximum size for the Path column, then we are guaranteed that no matter how “strange” a tree with N nodes or less looks, it “fits” into our table.

The path size for any given tree TN with N nodes and L levels is equal to

Note that C0 + ... + CL - 1 <= N – 1, the equality holding when every node in the tree has at most one child with children on his own. We find an upper bound for PathSize(TN) in the domain denoted by the previous inequality.

Update behavior

In order to optimize update behavior, a Path size greater than the actual needed value can be allocated, but it should always be less than or equal to the maximum Path size computed for the maximum number of nodes accepted by the application to reduce wasted space. The format of the Path field need not be exactly the one described previously. Space can be inserted between fields as necessary in order to allow growth of some levels. For example, if inserts are more likely to appear at some levels of the tree, than more free space should be allocated to those parts of the Path field. If a node is inserted into a level and the part of the parent's Path field cannot hold the new children count, than all the Path values must be changed to accommodate a new bit corresponding to the parent's level.

The most important thing to take care of is avoiding node inserts that cause some Paths not to fit anymore into the Path field. As proved before, this can be avoided computing the maximum path size according to the formula 2(N – 1) / (E * ln(2)). As long as the node count is less than or equal to N, this path size will accommodate any updates of the tree, be them inserts or removes.

If a node is removed or inserted, all the Path values of its subtree must be updated accordingly. So leaves can always be removed in O(1) time, and leaves can be added in O(1) time only if the Path part of the parent's level can accommodate an incremented value. Otherwise, we have an O(n) time complexity.

Numeric values

If N = 10,000 than the maximum Path size is less than 1.3 Ko, while for N = 1,000,000 we get the 130 Ko bound.

Oracle provides support for bit sequences in the UTL_RAW package, while SQL Server seems to have more limited functionality related to them.

In order to ask queries, for instance, of the form WHERE Path starts With '11011', one could cast the Path to a hex string, and check if the Path is lexicographically between D8 and DF, in case the database in use does not provide enough support for bit sequences manipulation.

Further study

First of all, some averages must be computed, especially for the Path size.

A few transformations from a general tree T to a binary tree B with the property that, whenever x is a child of y in T, x remains a descendent of y in B are evident. These are based on adding new, fake nodes, into the tree to make it binary in case a node has more than 2 children. Some more sophisticated transformations could try to keep the fake nodes grouped into complete binary subtrees.

One method would be to associate the 0 bit to left edges, and the 1 bit to right edges.

Another idea would be to overlay the binary tree on top of the Stern-Brocot tree, and use the fraction m/n as the Path column (actually, there could be 2 integer columns). If the binary tree has L levels, than m and n could get equal to FL, where F is the Fibonacci sequence. One advantage of this structure is that if one wants to find a rational number between other two (reduced) rational numbers, it only has to look “between” them.

If one wants to limit the values of m and n, the Farey series could be used instead. This one is actually a part of the Stern-Brocot tree. It could also be interesting to study how changes of the initial tree are reflected in the Stern-Brocot tree.

The foreign key RightChildIndex referring to NodeIndex has to be a deferred one if the delete operation is to be implemented this way. That means that it has to be only enforced at the end of the transaction, and not during it.

Further study

We could also want to store the Breadth-First Traversal of the tree, but I haven't found any useful applications yet.

Tree archives or composites

The need to “archive” tree structures arises in some applications, especially those provide traceability for some tree structures. If a tree version that existed at some point in time is not frequently accessed, then some shorter encoding for the tree can be used to store in the database. I recommend the Prüfer-Like Codes. With most of these codes, the structure of a tree with N nodes is encoded in N – 2 integers in linear time. Be aware that these encodings only store the tree structure, and not the information in the nodes. That information has to be kept separately anyway.

and to compute the diameter and radius of a tree directly from it's code, without first reconstructing the tree into memory.

Conclusions

A broad set of operations on trees can be efficiently implemented using only basic tree traversal algorithms. It would be interesting to further study how the known tree encodings can be efficiently applied in relational databases.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Comments and Discussions

I may be wrong, but isn't there an error in the alternate format (nodeindex, rightchildindex) SQL? If I define a tree where the rightmost child of the root (0) itself as a child, I think the subtree query will not work. As an example, if you ad a node 7 to your tree below 6. The then 7 will not be picked up.

RightChildIndex is not the proper name for this situation. I should have called it "RightmostDescendent" or even better, "LargestDescendent". In the case you describe, the RightmostDescendent(0) will be 7.