I am trying to solve the problem of largest common substring between 2 Strings. I will reduce my problem to the following: I created a general suffix tree and as per my understanding the largest common substring is the deepest path consisting of nodes that belongs to both strings.

My test input is:

String1 = xabc
String2 = abc

It seems that the tree I build is correct but my problem is the following method (I pass the root of the tree initially):

What I was aiming to do is, in order to find the deepest path with nodes that belong to both strings, I would traverse the tree (pre-order) and add nodes that belong to both strings in a list (current). Once I am in a node that is not part of both I update max list if current is bigger.

But the code is erroneous. And I am confused on how to implement this, since I haven't written code for general (non-binary) trees in ages.

Could you help me figure this out?

Update:
Modified as per @templatetypedef. Could not make this work either.

4 Answers
4

One issue I see here is that the depth of a node in the suffix tree is not the same as the length of the string along that path. Each edge in a suffix tree is annotated with a range of characters, so a string encoded by a series of nodes of depth five might have shorter length than a string encoded at depth two. You will probably need to adjust your algorithm to handle this by tracking the effective length of the string that you've built up so far, rather than the number of nodes in the path that you've traced out up to this point.

A second issue I just noticed is that you seem to only have one current variable that is getting shared across all the different branches of the recursion. This probably is messing up your state across recursive calls. For example, suppose that you're at a node and have a path of length three, and that there are two children - the first of which only ends in a suffix of the first string, and the second of which ends in a suffix of both strings. In that case, if you make the recursive call on the first string, you will end up executing the line

current = new ArrayList<SuffixNode>();

in the recursive call. This will then clear all your history, so when you return from this call back up to the original node and descend into the second child node, you will act as if there is no list of nodes built up so far, instead of continuing on from the three nodes you found so far.

To fix this, I'd suggest making current a parameter to the function and then creating a new ArrayList when needed, rather than wiping out the existing ArrayList. Some of the other logic might have to change as well to account for this.

Given this, I would suggest writing the function in pseudocode like this (since I don't know the particulars of your suffix tree implementations):

If the current node is null, return 0.

If the current node doesn't come from both strings, return 0.

Otherwise:

Set maxLen = 0.

For each child node:

Compute the length of the longest common substring rooted at that node.

Add to that length the number of characters along the edge to that child.

I have updated the OP with the suffix tree for the example.The result should be {a1,2},{b1,2},{c1,2}.Right? So I am a little confused on your distinction between depth of node and length of string
–
CratylusJan 3 '13 at 21:01

What you've posted above isn't a suffix tree - it's a suffix trie (note that there are no endmarkers on the strings and that there are nodes with exactly one child). In that case, there is no distinction between node depth and string length. That probably means that the other bug I've mentioned might be at issue. Have you investigated this?
–
templatetypedefJan 3 '13 at 21:06

But isn't a suffix tree just a trie with all the suffixes?Does it have to be compressed?I will test your bug with the parameter and let you know asap
–
CratylusJan 3 '13 at 21:09

@Cratylus- Oh, I just realized there's another bug - you should be updating the longest match you've found so far at every iteration, not just when you find a node that doesn't belong to both strings. Look at your above trie - in your case, the longest substring that works ends up terminating at a leaf node. Your recursion won't try to update the maximum substring in that case. Try moving the updating code to the top of the recursion, just below the null check.
–
templatetypedefJan 3 '13 at 21:09

@Cratylus- A suffix tree is a compressed suffix trie in which any node with just one child is merged with that child. There's also the addition of special "endmarkers" to the string to denote where the string ends. They're much more space efficient than the trie, but dramatically harder to construct.
–
templatetypedefJan 3 '13 at 21:10

I have not run this...but I will. Basically, this will start with the smallest of the two strings and will attempt to find a common string between the two strings using IndexOf and substring. then if it does it will check again but this time check by adding one more character from the smaller string to the check. It will only store the string in the commonStr variable if the string found (foundCommonStr) is larger than the commonStr. If it doesn't find a match then it has already stored the largest commonStr for being returned.

The suffix tree approach makes it possible to find the longest repeated substring in time O(n + m), where n and m are the lengths of the two strings. Your approach runs in time O(mn), which is very slow for sufficiently long strings.
–
templatetypedefJan 3 '13 at 21:07

Understood. Makes sense. Plus what I had there wouldn't work "as is" anyhow. I just tried to run it. :-)
–
user1901482Jan 3 '13 at 22:03