Finding the Longest Palindromic Substring in Linear Time

November 28, 2007

Another interesting problem I stumbled across on reddit is
finding the longest substring of a given string that is a palindrome.
I found the explanation on Johan Jeuring's blog somewhat
confusing and I had to spend some time poring over the Haskell code
(eventually rewriting it in Python) and walking through examples
before it "clicked." I haven't found any other explanations of the
same approach so hopefully my explanation below will help the next
person who is curious about this problem.

Of course, the most naive solution would be to exhaustively examine
all \(n \choose 2\) substrings of the given \(n\)-length string, test
whether each one is a palindrome, and keep track of the longest one seen
so far. Since there are \(O(n^2)\) substrings and each test takes
\(O(n)\) time, this has complexity \(O(n^3)\). We can easily do better by
realizing that a palindrome is centered on either a letter (for
odd-length palindromes) or a space between letters (for even-length
palindromes). Therefore we can examine all \(2n + 1\) possible centers
and find the longest palindrome for each center, keeping track of the
overall longest palindrome. Expanding around one center takes \(O(n)\)
time, so this has complexity \(O(n^2)\).
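The quadratic approach can be sketched in a few lines of Python. This is only an illustration: the function name and the numbering of centers from 0 to \(2n\) (odd numbers for letters, even numbers for spaces) are my own assumptions, not something taken from Johan's post.

```python
def longest_palindrome_quadratic(s):
    """Return the first longest palindromic substring of s in O(n^2) time."""
    n = len(s)
    best_start, best_len = 0, 0
    # Centers 0..2n: odd numbers are letters, even numbers are the
    # spaces between letters (including the two ends of the string).
    for center in range(2 * n + 1):
        length = center % 2              # 1 for a letter, 0 for a space
        lo = (center - length) // 2 - 1  # next index to try on the left
        hi = (center + length) // 2      # next index to try on the right
        # Expand to the left and right simultaneously.
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo -= 1
            hi += 1
        length = hi - lo - 1
        if length > best_len:
            best_start, best_len = lo + 1, length
    return s[best_start:best_start + best_len]


print(longest_palindrome_quadratic("abababa"))  # prints abababa
```

When several palindromes tie for the longest, this sketch returns the leftmost one.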

It is not immediately clear that we can do better but if we're told
that a \(\Theta(n)\) algorithm exists we can infer that the algorithm
is most likely structured as an iteration through all possible
centers. As an off-the-cuff first attempt, we can adapt the above
algorithm by keeping track of the current center and expanding until
we find the longest palindrome around that center, at which point we
consider the last letter (or space) of that palindrome as the new
center. The algorithm (which isn't correct) looks like this
informally:

1. Set the current center to the first letter.

2. Loop while the current center is valid:

   a. Expand to the left and right simultaneously until we find
      the largest palindrome around this center.

   b. If the current palindrome is bigger than the stored maximum
      one, store the current one as the maximum one.

   c. Set the space following the current palindrome as the
      current center unless the two letters immediately surrounding
      it are different, in which case set the last letter of the
      current palindrome as the current center.

3. Return the stored maximum palindrome.

This seems to work but it doesn't handle all cases: consider the
string "abababa", with a vertical line marking our current position
just past the current palindrome. The first non-trivial palindrome we
see is "a|bababa", followed by "aba|baba". Considering the current
space as the center doesn't get us anywhere but considering the
preceding letter (the second 'a') as the center, we can expand to get
"ababa|ba". From this state, considering the current space again
doesn't get us anywhere but considering the preceding letter as the
center, we can expand to get "abababa|", where the current palindrome
is only the five-letter suffix "ababa". This is incorrect, as the
longest palindrome is actually the entire string! We can remedy this
case by changing the algorithm to try to set the new center to be one
before the end of the last palindrome, but it is clear that having a
fixed "lookbehind" doesn't solve the general case and anything more
than that will probably bump us back up to quadratic time.

The key question is this: given the state from the example above,
"ababa|ba", what makes the second 'b' so
special that it should be the new center? To use another example, in
"abcbabcba|bcba", what makes the second
'c' so special that it should be the new center? Hopefully, the
answer to this question will lead to the answer to the more important
question: once we stop expanding the palindrome around the current
center, how do we pick the next center? To answer the first question,
first notice that the current palindromes in the above examples
themselves contain smaller non-trivial palindromes: "ababa" contains
"aba" and "abcbabcba" contains "abcba" which also contains "bcb".
Then, notice that if we expand around the "special" letters, we get a
palindrome which shares a right edge with the current palindrome; that
is, the longest palindromes around the special letters are proper
suffixes of the current palindrome. With a little thought, we
can then answer the second question: to pick the next center, take
the center of the longest palindromic proper suffix of the current
palindrome. Our algorithm then looks like this:

1. Set the current center to the first letter.

2. Loop while the current center is valid:

   a. Expand to the left and right simultaneously until we find
      the largest palindrome around this center.

   b. If the current palindrome is bigger than the stored maximum
      one, store the current one as the maximum one.

   c. Find the maximal palindromic proper suffix of the current
      palindrome.

   d. Set the center of the suffix from step c as the current center
      and start expanding from the suffix as it is palindromic.

3. Return the stored maximum palindrome.
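To make the steps above concrete, here is a rough Python sketch. It finds the longest palindromic proper suffix by brute force, so it is not linear time; the function names and the decision to track the current palindrome as a position plus a length are my own assumptions for illustration.

```python
def is_palindrome(s):
    return s == s[::-1]


def longest_palindrome_via_suffixes(s):
    n = len(s)
    best = ""
    i = 0        # current position: index just past the current palindrome
    pal_len = 0  # the current palindrome is s[i - pal_len:i]
    while True:
        # Expand around the (fixed) current center.
        while i < n and i - pal_len - 1 >= 0 and s[i - pal_len - 1] == s[i]:
            pal_len += 2
            i += 1
        # Remember the longest palindrome seen so far.
        if pal_len > len(best):
            best = s[i - pal_len:i]
        if i == n:
            # The current palindrome is a suffix of the whole string, so
            # any later center can only give something shorter.
            break
        if pal_len == 0:
            # An empty palindrome has no proper suffix; move on to the
            # one-letter palindrome centered on s[i].
            pal_len = 1
            i += 1
            continue
        # Brute-force search for the longest palindromic proper suffix
        # (possibly empty, which always succeeds at k == 0).
        for k in range(pal_len - 1, -1, -1):
            if is_palindrome(s[i - k:i]):
                break
        # Continue expanding from that suffix.
        pal_len = k
    return best


print(longest_palindrome_via_suffixes("abababa"))  # prints abababa
```

The centers skipped between the current center and the suffix's center are safe to ignore: a palindrome around any of them would have to end before the current position, which keeps it strictly inside the current palindrome.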

However, unless step 2c can be done efficiently, it will cause the
algorithm to be superlinear. Doing step 2c efficiently seems
impossible since we have to examine the entire current palindrome to
find the longest palindromic suffix unless we somehow keep track of
extra state as we progress through the input string. Notice that the
longest palindromic suffix would by definition also be a palindrome of
the input string so it might suffice to keep track of every palindrome
that we see as we move through the string and hopefully, by the time
we finish expanding around a given center, we would know where all the
palindromes with centers lying to the left of the current one are.
However, if the longest palindromic suffix has a center to the right
of the current center, we would not know about it. But we also have
at our disposal the very useful fact that a palindromic proper
suffix of a palindrome has a corresponding dual palindromic proper
prefix. For example, in one of our examples above, "abcbabcba",
notice that "abcba" appears twice: once as a prefix and once as a
suffix. Therefore, while we wouldn't know about all the palindromic
suffixes of our current palindrome, for each one we would know about
either the suffix itself or its dual prefix.
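The duality between palindromic suffixes and prefixes is easy to check by brute force. The helper names below are hypothetical and exist only for this illustration.

```python
def is_palindrome(s):
    return s == s[::-1]


def palindromic_proper_suffixes(p):
    """Non-empty palindromic proper suffixes of p, longest first."""
    return [p[k:] for k in range(1, len(p)) if is_palindrome(p[k:])]


current = "abcbabcba"
suffixes = palindromic_proper_suffixes(current)
print(suffixes)  # prints ['abcba', 'a']

# Every palindromic suffix of a palindrome is also a prefix of it.
for suffix in suffixes:
    assert current.startswith(suffix)
```

The reason is short: reversing the whole palindrome leaves it unchanged but turns each suffix into a prefix, and a palindromic suffix is itself unchanged by the reversal.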

Another crucial realization is the fact that we don't have to keep
track of all the palindromes we've seen. To use the example
"abcbabcba" again, we don't really care about "bcb" that much, since
it's already contained in the palindrome "abcba". In fact, we only
really care about keeping track of the longest palindromes for a given
center or equivalently, the length of the longest palindrome for a
given center. But this is simply a more general version of our
original problem, which is to find the longest palindrome around
any center! Thus, if we can keep track of this state
efficiently, maybe by taking advantage of the properties of
palindromes, we don't have to keep track of the maximal palindrome and
can instead figure it out at the very end.

Unfortunately, we seem to be back where we started; the second
naive algorithm that we have is simply to loop through all possible
centers and for each one find the longest palindrome around that
center. But our discussion has led us to a different incremental
formulation: given a current center, the longest palindrome around
that center, and a list of the lengths of the longest palindromes
around the centers to the left of the current center, can we figure
out the new center to consider and extend the list of longest
palindrome lengths up to that center efficiently? For example, if we
have the state:

<"ababa|??", [0, 1, 0, 3, 0, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?]>

where the current center is the middle letter of the text read so far
(the second 'a'), the vertical line is our current position, the
question marks represent unread characters or unknown quantities, and
the array represents the list of longest palindrome lengths by center,
can we get to the state (where the center has moved to the second 'b'):

<"ababa|??", [0, 1, 0, 3, 0, 5, 0, ?, ?, ?, ?, ?, ?, ?, ?]>

and then to:

<"abababa|", [0, 1, 0, 3, 0, 5, 0, 7, 0, 5, 0, 3, 0, 1, 0]>

efficiently? The crucial thing to notice is that the longest
palindrome lengths array (we'll call it simply the lengths array) in
the final state is palindromic since the original string is
palindromic. In fact, the lengths array obeys a more general
property: the longest palindrome d places to the right
of the current center (the d-right palindrome) is at least
as long as the longest palindrome d places to the left of the current
center (the d-left palindrome) if the d-left
palindrome is completely contained in the longest palindrome around
the current center (the center palindrome), and it is of equal length
if the d-left palindrome is not a prefix of the center
palindrome or if the center palindrome is a suffix of the entire
string. This then implies that we can more or less fill in the
values to the right of the current center from the values to the left
of the current center. For example, from [0, 1, 0, 3, 0, 5, ?, ?, ?,
?, ?, ?, ?, ?, ?] we can get to [0, 1, 0, 3, 0, 5, 0, ≥3?, 0,
≥1?, 0, ?, ?, ?, ?]. This also implies that the first unknown
entry (in this case, ≥3?) should be the new center because it
means that the center palindrome is not a suffix of the input string
(i.e., we're not done) and that the d-left palindrome is a
prefix of the center palindrome.
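The property can be restated in terms of indices into the lengths array and checked by brute force. This is a sketch under my own conventions: `naive_lengths` and `check_mirror_property` are hypothetical helpers, d is measured in array entries, and `bound` is the longest length a palindrome d entries from the center can have without leaving the center palindrome.

```python
def naive_lengths(s):
    """Longest palindrome length for each of the 2n + 1 centers, in O(n^2)."""
    n = len(s)
    lengths = []
    for center in range(2 * n + 1):
        length = center % 2              # 1 for a letter, 0 for a space
        lo = (center - length) // 2 - 1
        hi = (center + length) // 2
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo -= 1
            hi += 1
            length += 2
        lengths.append(length)
    return lengths


def check_mirror_property(s):
    lengths = naive_lengths(s)
    last = len(lengths) - 1
    for c, center_len in enumerate(lengths):
        for d in range(1, center_len + 1):
            left, right = lengths[c - d], lengths[c + d]
            bound = center_len - d
            if left != bound:
                # The d-left palindrome is not a prefix of the center
                # palindrome: the right value is fully determined.
                assert right == min(left, bound)
            elif c + center_len == last:
                # The center palindrome is a suffix of the whole string.
                assert right == bound
            else:
                # The d-left palindrome is a prefix: only a lower bound.
                assert right >= bound
    return True


print(naive_lengths("abababa"))
print(check_mirror_property("abababa"))
```

The `min` in the first case covers d-left palindromes that stick out past the left edge of the center palindrome; their counterparts on the right are cut off at the right edge.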

From these observations we can construct our final algorithm which
returns the lengths array, and from which it is easy to find the
longest palindromic substring:

1. Initialize the lengths array with one (initially unknown) entry
   for each possible center.

2. Set the current center to the first center.

3. Loop while the current center is valid:

   a. Expand to the left and right simultaneously until we find
      the largest palindrome around this center.

   b. Fill in the appropriate entry in the longest palindrome
      lengths array.

   c. Iterate through the longest palindrome lengths array
      backwards and fill in the corresponding values to the right of
      the entry for the current center until an unknown value (as
      described above) is encountered.

   d. Set the new center to the index of this unknown value.

4. Return the lengths array.
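One way to turn these steps into code is sketched below. Rather than jumping straight to the next unknown entry, it visits every center in order and seeds each one from its mirror image, which is how this idea is usually written when it is presented as Manacher's algorithm. The names and the exact loop structure are my own assumptions, so treat it as an illustration rather than a transcription of the steps.

```python
def palindrome_lengths(seq):
    """Longest palindrome length around each of the 2n + 1 centers."""
    n = len(seq)
    lengths = [0] * (2 * n + 1)
    c = 0  # center whose palindrome reaches furthest to the right
    for i in range(2 * n + 1):
        right_edge = c + lengths[c]
        if i < right_edge:
            # Copy from the mirror image of i around c, bounded by the
            # distance to the right edge of the palindrome around c.
            length = min(lengths[2 * c - i], right_edge - i)
        else:
            length = i % 2  # 1 for a letter, 0 for a space
        # Expand around center i starting from what we already know.
        lo = (i - length) // 2 - 1
        hi = (i + length) // 2
        while lo >= 0 and hi < n and seq[lo] == seq[hi]:
            lo -= 1
            hi += 1
            length += 2
        lengths[i] = length
        if i + length > right_edge:
            c = i
    return lengths


print(palindrome_lengths("abababa"))
# prints [0, 1, 0, 3, 0, 5, 0, 7, 0, 5, 0, 3, 0, 1, 0]
```

Every successful comparison in the inner loop pushes the furthest right edge forward, which is what keeps the total work linear.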

Note that at each step of the algorithm we're either incrementing
our current position in the input string or filling in an entry in the
lengths array. Since the lengths array has size linear in the size of
the input string, the algorithm has worst-case linear running time.
Since given the lengths array we can find and return the longest
palindromic substring in linear time, a linear-time algorithm to find
the longest palindromic substring is the composition of these two
operations.
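The second of these two operations can be sketched as follows, assuming the lengths array is indexed by center from 0 to \(2n\) as in the examples above. The function name is hypothetical.

```python
def longest_from_lengths(seq, lengths):
    """Recover the (leftmost) longest palindromic substring in linear time."""
    best_len = max(lengths)
    center = lengths.index(best_len)
    # A palindrome of length best_len around this center starts at this
    # string index; center and best_len always have the same parity.
    start = (center - best_len) // 2
    return seq[start:start + best_len]


lengths = [0, 1, 0, 3, 0, 5, 0, 7, 0, 5, 0, 3, 0, 1, 0]
print(longest_from_lengths("abababa", lengths))  # prints abababa
```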

Here is Python code that implements the above algorithm (although
it is closer to Johan Jeuring's Haskell implementation than to the
above description):

    def fastLongestPalindromes(seq):
        """
        Behaves identically to naiveLongestPalindrome (see below), but
        runs in linear time.
        """
        seqLen = len(seq)
        l = []
        i = 0
        palLen = 0
        # Loop invariant: seq[(i - palLen):i] is a palindrome.
        # Loop invariant: len(l) >= 2 * i - palLen. The code path that
        # increments palLen skips the l-filling inner-loop.
        # Loop invariant: len(l) < 2 * i + 1. Any code path that
        # increments i past seqLen - 1 exits the loop early and so skips
        # the l-filling inner loop.
        while i < seqLen:
            # First, see if we can extend the current palindrome. Note
            # that the center of the palindrome remains fixed.
            if i > palLen and seq[i - palLen - 1] == seq[i]:
                palLen += 2
                i += 1
                continue
            # The current palindrome is as large as it gets, so we append
            # it.
            l.append(palLen)
            # Now to make further progress, we look for a smaller
            # palindrome sharing the right edge with the current
            # palindrome. If we find one, we can try to expand it and see
            # where that takes us. At the same time, we can fill the
            # values for l that we neglected during the loop above. We
            # make use of our knowledge of the length of the previous
            # palindrome (palLen) and the fact that the values of l for
            # positions on the right half of the palindrome are closely
            # related to the values of the corresponding positions on the
            # left half of the palindrome.
            # Traverse backwards starting from the second-to-last index up
            # to the edge of the last palindrome.
            s = len(l) - 2
            e = s - palLen
            for j in range(s, e, -1):
                # d is the value l[j] must have in order for the
                # palindrome centered there to share the left edge with
                # the last palindrome. (Drawing it out is helpful to
                # understanding why the - 1 is there.)
                d = j - e - 1
                # We check to see if the palindrome at l[j] shares a left
                # edge with the last palindrome. If so, the corresponding
                # palindrome on the right half must share the right edge
                # with the last palindrome, and so we have a new value for
                # palLen.
                #
                # An exercise for the reader: in this place in the code you
                # might think that you can replace the == with >= to improve
                # performance. This does not change the correctness of the
                # algorithm but it does hurt performance, contrary to
                # expectations. Why?
                if l[j] == d:
                    palLen = d
                    # We actually want to go to the beginning of the outer
                    # loop, but Python doesn't have loop labels. Instead,
                    # we use an else block corresponding to the inner
                    # loop, which gets executed only when the for loop
                    # exits normally (i.e., not via break).
                    break
                # Otherwise, we just copy the value over to the right
                # side. We have to bound l[j] because palindromes on the
                # left side could extend past the left edge of the last
                # palindrome, whereas their counterparts won't extend past
                # the right edge.
                l.append(min(d, l[j]))
            else:
                # This code is executed in two cases: when the for loop
                # isn't taken at all (palLen == 0) or the inner loop was
                # unable to find a palindrome sharing the left edge with
                # the last palindrome. In either case, we're free to
                # consider the palindrome centered at seq[i].
                palLen = 1
                i += 1
        # We know from the loop invariant that len(l) < 2 * seqLen + 1, so
        # we must fill in the remaining values of l.
        # Obviously, the last palindrome we're looking at can't grow any
        # more.
        l.append(palLen)
        # Traverse backwards starting from the second-to-last index up
        # until we get l to size 2 * seqLen + 1. We can deduce from the
        # loop invariants we have enough elements.
        lLen = len(l)
        s = lLen - 2
        e = s - (2 * seqLen + 1 - lLen)
        for i in range(s, e, -1):
            # The d here uses the same formula as the d in the inner loop
            # above. (Computes distance to left edge of the last
            # palindrome.)
            d = i - e - 1
            # We bound l[i] with min for the same reason as in the inner
            # loop above.
            l.append(min(d, l[i]))
        return l

Note that this is not the only efficient solution to this problem.
Building a suffix tree is linear in the length of the input string and
you can use one to solve this problem, but as Johan also mentions,
that is a much less direct and less efficient solution than this
one.