A Practical Suffix-Tree Implementation for String Searches


Suffix trees are used for string searches. Our authors describe how to build a generalized suffix tree data structure using as few hardware resources as possible while still approaching the time complexity derived in theory.

Jul00: Algorithm Alley

Bogdan is a graduate assistant and Craig is a professor in the computer-science department at Rutgers University. They can be contacted at dbogdan@caip.rutgers.edu and nevill@cs.rutgers.edu, respectively.

Suffix trees are used for string searching in bioinformatics -- DNA sequencing, protein sequence pattern identification, and the like. In this context, a pattern is a string with an associated score. The score is computed as the sum of fixed values for each character in the string, weighted by factors that vary with the index of each character in the string. Since searches are usually performed on a large number of sequences, they require considerable hardware resources.

Suffix trees are not new to DDJ. Mark Nelson also examined them in his article "Suffix Trees" (DDJ, August 1996). In this article, we will build on Mark's discussion and address a practical problem: how to build a generalized suffix-tree data structure using as few hardware resources as possible while still approaching the time complexity derived in theory. We will also provide a suffix-tree implementation (written in Java) and test data that you can use. The complete source code is available electronically (see "Resource Center," page 5).

Searching in protein sequences is more complicated than searching in DNA sequences because the alphabet is larger (24 characters instead of 4) and the sequences are relatively long. The same problem appears when searching text. For a protein, a sequence is the string of characters representing amino acids and special symbols. For a sentence in text, a sequence may be the array of words in the sentence. As such, a text can be seen as a multisequence of words.

When dealing with many sequences, the time required per sequence for detecting a pattern improves (for example, by more than 40 percent for protein identification) if we use a generalized suffix tree built from the multisequence consisting of all the sequences concatenated and separated by delimiters. Unfortunately, multisequences are usually very long, and suffix-tree implementations that follow the theoretical description closely require too much time and space when the alphabet is large and the number of sequences is large.

A classic suffix-tree implementation requires a huge storage space when the original sequence is very long, because of the information needed to build the tree in linear time. If the core implementation is object oriented, we also experience the undesired effects of memory fragmentation. Because each edge, node, or suffix represents an object, the program spends most of its execution dealing with memory management. This process takes not only more space (more information is needed to handle the objects and it has to be stored somewhere), but also more time. The time is no longer linear, due to the memory management.

The problem here involves too many objects and too much precomputed information. The former is a classic programming problem not related to suffix trees; the solution is to use primitive data structures. We can obtain the same tree by representing the node objects as subsequent cells in a long array of integers.

What do we do when we insert a node in the tree? How do we move edges, nodes, and suffixes, or split edges in the array? We can avoid those by implementing the tree as if we do not know its special data structure -- treating it as a classic multinode tree. Therefore, the information is not stored in the edges.

Once all the information has been moved into the nodes, we still have a problem with the node size. Whenever a child node is inserted, its parent's size increases by the space occupied by the link to the child. The parent node is stored as a sequence of cells in an array, and other nodes may be stored after it. Increasing the parent's size means shifting all cells of the nodes that follow it in the array towards the end of the array. This makes managing the nodes in a long array a time-consuming process.

Fortunately, we can represent the tree such that all the nodes have the same size, no matter how many children they may have. Instead of having a variable number of links to other nodes (besides other useful information), each node can contain only two links -- the first child and the right sibling. Figure 1 is a suffix trie for the sequence ANANAS (constructed by inserting the suffixes from sequence ANANAS in their index order: ANANAS, NANAS, ANAS, NAS, AS, and S) and the transformed suffix tree.

Figure 2 is a suffix tree for the same sequence and the transformed suffix tree. The objects keeping data about suffixes in the classic tree and the leaves in our tree are not shown.

There is a solution for the space problem: To decrease memory usage and avoid fragmentation, the data should be represented with primitive types (no objects) and all the nodes should have the same size, regardless of how many children they may have.

Each node has three fields:

Index = index(node), the starting index of a substring in the multisequence.

Child = child(node), the index of the first child in the node array, stored as a positive value, or a suffix value stored as a zero or negative value.

Sibling = sibling(node), the index of the right sibling in the node array.

There is no information stored in edges. This technique allows each node to represent a substring from a valid suffix in the multisequence, which starts at the index index(node) and ends right before the index index(child(node)). We call the length of that substring the "node length"; its value is length(node) = index(child(node)) - index(node). The same technique ensures that, during tree construction, nodes are only created (appended at the end of the node array) and their fields updated; nodes are never deleted. Therefore, the node fields can be compactly represented in arrays.
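As a rough illustration of this layout, the following Java sketch keeps the three fields in parallel int arrays. The class name NodeArrays, its method names, the assumption that the root is node 0, and the use of 0 as the "no sibling" marker are choices made for this example only; they are not taken from the accompanying source code.

```java
// Hypothetical sketch: three parallel int arrays hold the node fields.
final class NodeArrays {
    final int[] index;    // start of the node's substring in the multisequence
    final int[] child;    // > 0: first child; <= 0: leaf, suffix starts at -child
    final int[] sibling;  // right sibling, or 0 for "none" (assumed marker)
    int size;             // nodes are only appended, never deleted

    NodeArrays(int capacity) {
        index = new int[capacity];
        child = new int[capacity];
        sibling = new int[capacity];
    }

    // Appends a node at the end of the arrays and returns its position.
    int add(int idx, int firstChild, int rightSibling) {
        index[size] = idx;
        child[size] = firstChild;
        sibling[size] = rightSibling;
        return size++;
    }

    boolean isLeaf(int n) { return child[n] <= 0; }

    // Node length as defined in the text; a leaf carries no characters.
    int length(int n) { return isLeaf(n) ? 0 : index[child[n]] - index[n]; }

    // Counts children by following the sibling links from the first child.
    int childCount(int n) {
        if (isLeaf(n)) return 0;
        int count = 0;
        for (int c = child[n]; c != 0; c = sibling[c]) count++;
        return count;
    }
}
```

Because node 0 is assumed to be the root, and the root is never anybody's sibling, the value 0 is free to mean "no sibling" here. Every node occupies exactly three integers, however many children it has.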

What about inserting a suffix in the middle of the tree when constructing the tree? We avoid that by first constructing a suffix array that contains the array of the suffixes sorted in reverse order using a version of Quicksort. Because the Quicksort algorithm sorts in place, there is no overhead with memory management, and creating the suffix array takes O[N×log(N)] time (where N is the number of characters in the input sequence) multiplied by a constant given by the average common prefix between two suffixes. The necessary time is almost linear in practice due to the slowly increasing curve of the log(N) function. The suffix array (in reverse lexicographic order) constructed from the sequence ANANAS (with the suffixes ANANAS, NANAS, ANAS, NAS, AS, and S) is shown in Figure 3. We then create the suffix tree by inserting the suffixes in the order given by the suffix array, and we obtain a tree with suffixes in direct order. Finally, we free the memory occupied by the suffix array while constructing the suffix tree.
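The sorting step can be sketched in Java as follows. This is only an illustration: the article says "a version of Quicksort" without giving details, so the plain Lomuto partition used here, the class name ReverseSuffixArray, and the treatment of the end of the string as the end of every suffix are all assumptions.

```java
// Hypothetical sketch: suffix start indices sorted in reverse lexicographic order.
final class ReverseSuffixArray {
    // Negative if suffix a sorts before suffix b, positive if after.
    static int compare(String t, int a, int b) {
        int n = t.length();
        while (a < n && b < n) {
            int d = t.charAt(a) - t.charAt(b);
            if (d != 0) return d;
            a++;
            b++;
        }
        return b - a; // the suffix that ran out first is a prefix, so it is smaller
    }

    // In-place Quicksort (Lomuto partition), larger suffixes first.
    static void sort(String t, int[] sa, int lo, int hi) {
        if (lo >= hi) return;
        int pivot = sa[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (compare(t, sa[j], pivot) > 0) {
                int tmp = sa[i]; sa[i] = sa[j]; sa[j] = tmp;
                i++;
            }
        }
        int tmp = sa[i]; sa[i] = sa[hi]; sa[hi] = tmp;
        sort(t, sa, lo, i - 1);
        sort(t, sa, i + 1, hi);
    }

    static int[] build(String t) {
        int[] sa = new int[t.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        sort(t, sa, 0, sa.length - 1);
        return sa;
    }
}
```

For ANANAS, build returns the start indices 5, 3, 1, 4, 2, 0, that is, S, NAS, NANAS, AS, ANAS, and ANANAS. A production version would need a better pivot choice and a bounded recursion depth, since this simple variant degrades on unfavorable inputs.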

Because we build the tree by inserting suffixes sorted in reverse lexicographic order, the node insertion always takes place either as a child of the tree root, as a node with a leaf, or a leaf on the path visited during the previous insertion. We call that path the "insertion path," and it is always the path starting from the root node and following the child(node) link in each node except the last node (a leaf). On the current insertion path, all the nodes from the root to the insertion point will have their index field updated so that they point to the same valid suffix (that is, the one that is inserted).

This technique lets us always insert a new suffix on the path constructed by the links to the first children and split a node from that path if necessary. Therefore, we never traverse the (horizontal) links amongst siblings, and the number of memory accesses is minimal during the node insertion. We only create new nodes. Splitting an edge in the classic suffix tree is equivalent to splitting a node in our tree, which means creating one or two new nodes and updating the link to the first child of the split node.

The time required to insert all the suffixes is O[N^2] (where N is the number of suffixes or characters in the sequence). Consider the worst case -- the suffix tree built from the sequence AAAA...A composed of N identical characters. When we insert the suffixes AAAA...A, ..., AAA, AA, A from the suffix array into the suffix tree, the average common prefix between two successive suffixes is ((N-1)+(N-2)+...+1)/N = (N-1)/2, which is O[N] (linear). Computing a common prefix and inserting the corresponding new node is linear in the length of that prefix. Because we insert N suffixes, the total time is theoretically O[N^2] (quadratic).
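The quantity that drives this bound is easy to measure. The Java sketch below sums the common prefixes between successive suffixes in a given insertion order; the class and method names are hypothetical, and the end of the string is assumed to end every suffix.

```java
// Hypothetical sketch: total common prefix between successive suffixes.
final class CommonPrefixStats {
    // Length of the longest common prefix of the suffixes starting at a and b.
    static int commonPrefix(String t, int a, int b) {
        int n = t.length();
        int l = 0;
        while (a + l < n && b + l < n && t.charAt(a + l) == t.charAt(b + l)) l++;
        return l;
    }

    // Sum of the common prefixes between each suffix and its predecessor.
    static long totalCommonPrefix(String t, int[] order) {
        long total = 0;
        for (int i = 1; i < order.length; i++) {
            total += commonPrefix(t, order[i - 1], order[i]);
        }
        return total;
    }
}
```

For AAAA inserted in the order AAAA, AAA, AA, A (start indices 0, 1, 2, 3), the total is 3+2+1 = 6, which is N(N-1)/2 for N = 4. For ANANAS in reverse lexicographic order (5, 3, 1, 4, 2, 0), the total is also 6 over six suffixes, an average of one character per insertion.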

The quadratic time complexity seems to be a problem because there are theoretical algorithms that can build a suffix tree in linear time. Fortunately, the necessary time is O[N] (linear) in practice because, statistically, the average common prefix is very small and depends not on N but on the statistical distribution of characters in the sequence. Moreover, if we concatenate the sequences into a long multisequence, the average common prefix varies very little with N. For example, with the protein sequences from the SwissProt database (http://www.expasy.ch/sprot/sprot-top.html), the average common prefix decreases from 24 to 18 when we increase the number of sequences from 1000 (0.4M input characters) to 59,000 (21M input characters). For real applications, therefore, its value can be regarded as a relatively small constant. As mentioned earlier, we are dealing with practical problems (a large number of sequences), and this helps us here. Listing One is the pseudocode for inserting a node.

Figure 4 illustrates the steps for creating the suffix tree for ANANAS (by inserting the suffixes from the suffix array in reverse lexicographic order: S, NAS, NANAS, AS, ANAS, and ANANAS). The tree structure is good for pattern searches because the entire tree structure must be traversed when a pattern is searched. As Listing Two shows, visiting the tree is straightforward.

A suffix tree has to store information about the starting index of suffixes, too. Consider that information as being stored in leaves to access it faster. As an index of the first child, a leaf has a negative integer, which represents not an index of a node, but a suffix in the original sequence. The minus sign discriminates between inner nodes and leaves. Therefore, no supplementary information is needed to indicate the type of a node -- inner node or leaf.

A suffix tree constructed from an N-character sequence has at most 2N nodes. Because there are N suffixes in the sequence and we decided to keep the information about suffixes in the tree as well as leaves, the final tree will have at most 3N nodes. This is not a deviation from theory, because the inner nodes of our tree (at most 2N) represent the real suffix tree. We have only organized the data in an easy-to-access way. The three node fields are represented with successive cells in three arrays of integers, and the entire tree needs at most 9N integers for storage (and no additional information). Therefore, the space requirement is O[N].

This is not an increase in memory usage compared with the implementation described by Mark Nelson. His implementation has objects with information about suffixes pointed to from each leaf node. It therefore has at most 2N nodes (one integer each) with information about edges and children, at most 2N edges (four integers each), and N so-called "suffix objects" (three integers each) with information about suffixes. Overall, it uses at most 5N objects represented by about (2N×1 + 2N×4 + N×3) = 13N integers, in addition to the other data structures (such as a hash table of size 2N for edges and a fixed-size array of 2N links to nodes) needed to manage the tree.

For comparison, we took a sequence of N=21M characters (or suffixes) and built a suffix tree with 51M nodes. Our tree requires five minutes for construction and 580 MB for storage. Mark's implementation requires much more time and 850 MB for the edges, nodes, and suffixes, not including the hash table for edges and the overhead for object management in memory (overall, it goes over 1 GB).
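The storage figure for our tree can be checked with simple arithmetic. The Java sketch below assumes 4-byte ints, three integer fields per node, and that "M" means 10^6; the helper names are hypothetical.

```java
// Hypothetical sketch: storage needed for a tree kept in three int arrays.
final class TreeMemoryEstimate {
    static final int FIELDS_PER_NODE = 3; // index, child, sibling

    // Bytes used by the given number of nodes.
    static long bytesForNodes(long nodes) {
        return nodes * FIELDS_PER_NODE * Integer.BYTES;
    }

    // Upper bound for an N-character input: at most 3N nodes, that is, 9N integers.
    static long worstCaseBytes(long n) {
        return bytesForNodes(3 * n);
    }
}
```

For 51M nodes, bytesForNodes gives 612,000,000 bytes, or about 584 MB if a megabyte is taken as 2^20 bytes, which is in line with the 580 MB reported above. The 9N bound for N=21M works out to 756,000,000 bytes.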

There is always a price to pay depending on where we place an implementation between theory and practice, and between obtaining a linear-time or a linear-space requirement, and the type of application for which we intend to use the suffix tree.

Having solved the main problems related to suffix-tree construction from a single long sequence, we continue with a generalized suffix tree; that is, a suffix tree constructed from an array of sequences (a multisequence). In theory, a multisequence is built by concatenating the sequences and separating them with unique delimiters.

If we start with M sequences, we need M distinct delimiters that must not be in the alphabet from which the sequences are composed. When M is large, each delimiter has to be represented in several bytes (more than one character), and storing and managing them creates a noticeable time overhead. Also, a suffix of the multisequence that contains delimiters inside it is not entirely useful, because the valuable part of it extends only from its beginning to the first delimiter.

Let us consider two sequences -- BMBK and BK. The multisequence constructed from them is BMBK<*1>BK<*2>, where <*1> and <*2> are distinct sequence delimiters for this example. From the second suffix of the multisequence, MBK<*1>BK<*2>, only the substring MBK is useful (valid) because it is a suffix in an original sequence (BMBK) of the multisequence.

Therefore, we defined the concepts of valid suffix, valid suffix array, and valid suffix tree. A valid suffix is a substring in the multisequence that represents a suffix in a sequence included in the multisequence. The starting indices of the valid suffixes have to be stored in leaf nodes in the suffix tree because the same valid suffix may appear more than once in a multisequence.

The valid suffix array is a suffix array constructed with the valid suffixes of a multisequence. A valid suffix tree is a suffix tree constructed from a valid suffix array.

This approach has some advantages. The delimiters in the multisequence can have the same value. No test is needed for sequence delimiters during tree traversal for a pattern search (unlike in a classic suffix-tree implementation). A path in the tree represents and contains all the necessary information about a valid suffix, not a multisequence suffix. Paths in our tree are shorter than or equal to the paths in a classic suffix tree. Our tree has fewer nodes and paths than a classic suffix tree because it contains no multisequence suffix starting with a delimiter (like <*1>BK<*2>).

The valid suffixes of the multisequence BMBK*BK* (* is the common sequence delimiter) are exactly the suffixes of the original sequences: BMBK, MBK, BK, K (from the first sequence BMBK), BK, and K (from the second sequence BK). The valid suffix array constructed from the multisequence is in Figure 5.
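Enumerating the valid suffixes is straightforward when all delimiters share one value. The following Java sketch assumes that '*' is the delimiter and that the multisequence ends with a delimiter; the class and method names are hypothetical.

```java
// Hypothetical sketch: valid suffixes of a multisequence with a shared delimiter.
final class ValidSuffixes {
    static final char DELIM = '*'; // assumed common delimiter

    // Start indices of all valid suffixes: every position that is not a delimiter.
    static int[] starts(String multi) {
        int count = 0;
        for (int i = 0; i < multi.length(); i++) {
            if (multi.charAt(i) != DELIM) count++;
        }
        int[] result = new int[count];
        int k = 0;
        for (int i = 0; i < multi.length(); i++) {
            if (multi.charAt(i) != DELIM) result[k++] = i;
        }
        return result;
    }

    // A valid suffix ends right before the next delimiter.
    static int end(String multi, int start) {
        return multi.indexOf(DELIM, start);
    }

    // The characters of the valid suffix, without the delimiter.
    static String text(String multi, int start) {
        return multi.substring(start, end(multi, start));
    }
}
```

For BMBK*BK*, starts returns 0, 1, 2, 3, 5, and 6, and the suffix at index 1 is MBK rather than MBK*BK*. The two occurrences of BK (indices 2 and 5) are distinct valid suffixes with the same text, which is why their starting indices must be kept in the leaves.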

The valid suffix tree built from the array (drawn along with the leaves with information about suffixes) is in Figure 6. There is no memory used to store information about the multisequence suffixes that start with the * delimiter.

The suffix-tree implementation we present here was designed for representing multiple sequences and searching strings or patterns in a multisequence. In practice, it requires almost linear time and linear space to build and is free of memory-management overhead. Table 1 presents some experimental results for suffix trees of different sizes built from multisequences of increasing size (the sequences are protein sequences). All the results were obtained on a single-processor 500-MHz Pentium PC with 1 GB of memory.

Listing One

If the inserted suffix is the first suffix in the suffix array then
    Create the root and add as child a node with a leaf.
Else
    Compute l = longest common prefix between the current suffix and
        the previous suffix in the suffix array.
    If l = 0
        Add as child to root a node with a leaf.
    Else
        Let s = starting index of the inserted suffix. Follow each node
        n on the insertion path, updating index(n) with s so that
        it points into the currently inserted suffix, then advancing s
        and decreasing l by the value length(n), until (l == 0) or
        (l < length(n)). Then add a leaf to the node n if (l == 0), or
        split the node n otherwise (after splitting, the updated node n
        will have two children: 1) child(n), which is a leaf or a node
        with a leaf and continues the newly inserted path, and 2)
        sibling(child(n)), which is a node with all the children
        of the split node).
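The insertion pseudocode can be turned into a compact Java sketch. This is one possible reading of it, not the accompanying source code: it handles a single sequence whose suffixes end at the end of the string, takes the reverse-sorted suffix start indices as input, assumes the root is node 0 and that 0 marks "no sibling", and all class and method names are hypothetical.

```java
// Hypothetical sketch of the insertion step for a single sequence.
final class SuffixTreeSketch {
    final String text;
    final int[] index, child, sibling; // one int array per node field
    int size = 0;                      // number of nodes created so far

    SuffixTreeSketch(String text, int[] reverseSortedStarts) {
        this.text = text;
        int cap = 3 * text.length() + 1; // at most 3 new nodes per suffix, plus the root
        index = new int[cap];
        child = new int[cap];
        sibling = new int[cap];
        int prev = -1;
        for (int s : reverseSortedStarts) {
            insert(s, prev < 0 ? 0 : commonPrefix(prev, s));
            prev = s;
        }
    }

    int commonPrefix(int a, int b) {
        int l = 0, n = text.length();
        while (a + l < n && b + l < n && text.charAt(a + l) == text.charAt(b + l)) l++;
        return l;
    }

    int newNode(int idx, int ch, int sib) {
        index[size] = idx; child[size] = ch; sibling[size] = sib;
        return size++;
    }

    // Creates the rest of suffix s from position pos: a bare leaf if nothing
    // is left, otherwise a node with a leaf. Returns the top node.
    int newBranch(int s, int pos, int rightSibling) {
        int end = text.length();
        if (pos == end) return newNode(end, -s, rightSibling);
        int leaf = newNode(end, -s, 0);
        return newNode(pos, leaf, rightSibling);
    }

    // l is the common prefix with the previously inserted suffix.
    void insert(int s, int l) {
        if (size == 0) {                       // first suffix: create the root
            int root = newNode(s, 0, 0);
            child[root] = newBranch(s, s, 0);
            return;
        }
        int n = 0, pos = s;
        while (true) {
            int len = index[child[n]] - index[n]; // node length, before the update
            if (l < len) {                        // split node n after l characters
                int tail = newNode(index[n] + l, child[n], 0);
                index[n] = pos;
                child[n] = newBranch(s, pos + l, tail);
                return;
            }
            index[n] = pos;                       // repoint n into the new suffix
            pos += len;
            l -= len;
            if (l == 0) {                         // new branch becomes the first child
                child[n] = newBranch(s, pos, child[n]);
                return;
            }
            n = child[n];                         // continue along the insertion path
        }
    }

    // Collects suffix start indices below node n, first child first.
    int collect(int n, int[] out, int k) {
        if (child[n] <= 0) { out[k] = -child[n]; return k + 1; }
        for (int c = child[n]; c != 0; c = sibling[c]) k = collect(c, out, k);
        return k;
    }

    int[] suffixesInOrder() {
        int[] out = new int[text.length()];
        collect(0, out, 0);
        return out;
    }
}
```

Building the tree for ANANAS from the order 5, 3, 1, 4, 2, 0 and calling suffixesInOrder yields 0, 2, 4, 1, 3, 5: the suffixes ANANAS, ANAS, AS, NANAS, NAS, and S in direct lexicographic order. The walk only ever follows first-child links, and the recursive collect is used here for brevity; a very deep tree would call for an explicit stack.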
