Friday, 10 August 2007

Finding palindromes

--------------------------------------------------------------------------------
This blog post receives thousands of hits per month. I do not want to change the blog post from 2007, since it is part of a larger context.

Visit Finding palindromes for a more extensive description of the algorithm and the underlying concepts and ideas.

Visit http://www.jeuring.net/homepage/palindromes/ to obtain the software for finding palindromes.
--------------------------------------------------------------------------------
A palindrome is a number, verse, word or a phrase that is the same whether you read it backwards or forwards, for example the word `refer'.

When I started as a PhD student in Computer Science in 1988, my professor gave the following problem assignment to me. Construct an algorithm that, when given a string, finds the longest palindrome substring occurring in the string. For example, given the string "yabadabadoo", the algorithm should return the substring "abadaba". The algorithm should not only return the longest palindrome substring, it should also return it fast. The requirement was that the algorithm should only use a number of steps linear in the length of the argument string. This was a difficult problem, which gave me severe headaches in the months to follow. I spent four months on the problem, and solved it. The one comment I got from my professor was "The best thing about this problem and its solution is that they have absolutely no practical relevance."

He was wrong.

Since 1988 I have found that palindromes play an important role in the world around us. Palindromes appear in music, in genomes, in art, architecture and design, mathematics, physics, chemistry, etc.

Particularly interesting is the fact that palindromes occur in the male genome: of the about 20,000,000 characters representing the male-specific region of the Y chromosome, 5,700,000 characters form together eight large palindromes (off by a couple of characters). I don't think there is any proven explanation for the presence of palindromes in the male genome, but I suspect it has something to do with repairing genomes. If you know more about it: please let me know.

As you may or may not have noticed when working on repairing Endo's DNA, the cloud is corrupted. It turns out that the cloud's genome also is a palindrome (off by a couple of characters), and that the corruption can be repaired by taking the mirror image of the palindrome. To do so, you have to find the palindrome. In the case of Endo, this is pretty easy: `duolc' follows cloud immediately in the symbol table. Had this information not been in the symbol table, you would have to find this palindrome in the DNA by means of an algorithm for finding palindromes. Actually, the DNA for the cloud is only 6,500 acids long, and using a naive quadratic-time algorithm for finding palindromes is not a real problem if you have a sufficiently fast machine. A quadratic-time algorithm for determining palindromes in human male DNA isn't going to work: it would take a long time to find the palindromes.

This blog message explains how to find palindrome substrings in linear time. I will develop a program in Haskell for finding exact palindromes. The problem of finding approximate palindromes is left as an exercise to the reader :-) Interestingly, the algorithms for finding palindromes developed by computational biologists generally use suffix trees, and both space and time consumption seem to be worse compared with the algorithm I give in this blog message. But descriptions of the algorithm given here have been published in the eighties (by Galil and others, described in RAM-code) and nineties (by me, described in Squiggol, which is probably even harder to decypher than RAM-code) of the previous century, which is a long time ago of course.

This blog message is a literate Haskell file, which can be interpreted with ghci.

The type of a function for finding palindromes

I want to find the longest palindrome substring in a string. The first attempt at a type for a function longestPalindrome is hence:

longestPalindrome :: String -> String

To determine whether or not a string is a palindrome, I have to compare characters at different positions in a list. It would be helpful if I had random access into the string. For that purpose, I'm going to change the input type of the longestPalindrome function to an array. Furthermore, the longestPalindrome algorithm can also be used to find the longest palindrome in a list of integers, or any other type on which I can define an equality function. So here is the second attempt at a type for a function for finding the longest palindrome:

longestPalindrome :: Eq a => Array Int a -> Array Int a

To find the longest Palindrome in an array, I have to calculate the longest palindrome around each position of the array, where a position in an array is either on an element, or before or after an element. For example, the array corresponding to [1,2,3] has 7 positions: before the 1, on the 1, before the 2, ..., until after the 3. So I want to express the function longestPalindrome in terms of a function longestPalindromes of type

longestPalindromes :: Eq a => Array Int a -> [Int]

where the result list is the list of the lengths of the longest palindrome around each position in the argument array. For example,

?longestPalindromes (arrayList (0,2) [1,2,2])[0,1,0,1,2,1,0]

I will omit the definition of longestPalindrome in terms of longestpalindromes, and define function longestPalindromes below.

A naive algorithm for finding palindromes

A naive algorithm for finding palindromes calculates all positions of an input array, and for each of those positions, calculates the length of the longest palindrome around that position.

For each position, this function may take an amount of steps linear in the length of the array, so this is a quadratic-time algorithm.

A linear-time algorithm for finding palindromes

I now describe a linear-time algorithm for finding palindromes. Although the program is only about 15 lines long, it is rather intricate. I guess that you need to experiment a bit with to find out how and why it works.

The algorithm processes the input array from left to right. It maintains the length of the current longest tail palindrome, and a list of lengths of longest palindromes around positions before the center of the current longest tail palindrome, in reverse order. Function longestPalindromes is expressed in terms of function extendTail.

Function extendTail takes an array as argument, the current position in the array, the length of the current longest tail, and a list of lengths of longest palindromes around positions before the center of the current longest tail palindrome, in reverse order. There are four cases to be considered in function extendTail. If the current position is after the end of the array, we only have to add the final palindromes in the array to the list of longest palindromes, for which we use the function finalCentres. If the current longest tail palindrome extends to the start of the array, we extend the list of lengths of longest palindromes by means of function extendCentres. If the element at the current position in the array equals the element before the longest tail palindrome we extend the longest tail palindrome. If these elements are not equal, we extend the list of length of longest palindromes.

Function extendCentres adds palindromes to the list of lengths of longest palindromes. It takes the array as argument, the current position in the array, the list of palindromes to be extended, and the list of palindromes around centres before the centre of the current longest palindrome tail. It uses the mirror property of palindromes to calculate longest palindromes around centres after the centre of the current longest palindrome tail. If the last centre is on the last element, we call extendTail with a longest tail palindrome of length 1, and the position moved tot he right. If the previous element in the centre list reaches exactly to the end of the last tail palindrome, use the mirror property of palindromes to find the longest tail palindrome. In the other cases, we've found the longest palindrome around a centre, and add that to the list of length of longest palindromes. We proceed by moving the centres one position, and calling extendCentres again.

Function finalCentres calculates the lengths of the longest palindromes around the centres that come after the centre of the longest tail palindrome of the array. These palindromes are obtained by using the mirror property of palindromes.

At each step in this algorithm, either the longest tail palindrome is extended, and the current position in the array is moved, or the list of lengths of longest palindromes is extended. Since both the array, and the list of lengths of longest palindromes are linear in the length of the array, this is a linear-time algorithm.

32 comments:

Jon Bentley's 'Programming Pearls' has a neat related example, finding the largest repeated substring of a string.

His solution is n log n, but given that repetitions are nonlocal this seems reasonable.

Spoiler: He creates a list of pointers into the string, and sorts them by alphabetical order of the text they point to. It's then guaranteed that any repeated substring will be next to its duplicate, and the longest can be found by scanning the list.

Just came back here for a look (after a few days absence) to see if this blog had anything new.

Found you have written about Palindromes (on the 10th), so I checked my notes (that I am making as I plough very slowly through the processing of learning about Endo's DNA) and in another odd coincidence I found that the 10th was also the exact day that I first spotted the cloud/duolc thing and also played with finding other exact palindromes.

Back then I also replaced "cloud" with the mirror of "duolc" and ran the integrity checking for the relevant gene page again but still got a "Fail", and assumed there must be errors in both parts. At least I thought I got a "Fail" - because I've just repeated the exercise and it PASSed this time. I guess I must have made a silly typo somewhere the first time. Again!

So thanks for "helping me" get that little detail sorted out!

All I need now are some pointers with the (apparently) ambiguous specs for pelcgb nytbevguz O! I don't want to be accused of laziness but perhaps you can tell me when your next blog entry will be posted? :-)

Pelcgb nytbevguz EP4 eh? Well, I never knew that. I did eventually (15 August) work out what was required but only by painstakingly stepping through the Fuun implementation and comparing intermediate results to those from mine to work out what I was doing wrong.

I'm afraid that the error in 5B (presumably deliberate?) in the specs on page 4405829 was too much for me - perhaps I was supposed to recognise EP4 from what was given? Maybe I will next time. :) I must have tried about a million variations (or so it seemed!) in a hit and miss approach to getting it to work properly before finally resorting to stepping through the DNA.

Anyway, after finally managing a correct implementation of EP4, I then spent many more hours, perhaps even days before finally (19 August) realising that page 23 needed nothing more than EBG2. Arrgghh!

Clearly I needed more like 72 days rather than 72 hours for this year's contest!!

That is, functions longestPalindromes and longestPalindromesQ should return the same result. This property passes 1000 tests.

And indeed, function longestPalindromes returns the list of lengths of palindromes in reverse. This is easily reversed of course, but requires some tricks in order to make the algorithm still `on-line' (a result is printed whenever it has been computed).

Ok so I thought about this and wrote this code:_______________________

string in; //load in as the input stringvector(int) a; //integer array //initialized to 1 for(int i=1;i< a . size();i++) { if(in[i]==in[i-1]) a[i]=2; if(i-a[i-1]-1 >=0) if(in[i]==in[i-a[i-1]-1]) a[i]=a[i-1]+2; }ans=max(a);It seems to work for me. So why do you need to maintain anything except the palindrome number for the ith element?

as expected. (Here I use the software available via http://www.jeuring.net/Palindromes/.) I did not change anything in the software for calculating palindromes, so it should work for the code posted in this blog as well. Can you maybe contact me with the exact details?

Interesting. Both ideas (using a zipper-like structure and returning results lazily) make sense. I'll have a look at your implementation, and will try to find out how the efficiency compares with my implementation.

To Noah: I've now compared your lazy solution with my random-access solution using Criterion on the text of the King James' Bible. It turns out that the random access solution is about 1.7 times faster under unoptimized compilation. Using -O2 makes both solutions go slower.

To rishi: the theoretical analysis of the algorithm shows it is linear-time (O(n)). My implementation might be worse than linear time, but not because of the algorithm used. It might be caused by printing, reading, etc.

Consider that you're trying to find the longest palindrome for the string "aaaaa". We will be performing O(n^2) operations in this case because every time the current center of the palindrome moves to the right by only one place till it reaches the center of the string (at which a time we stop checking because of the mirror property). Hence, we have (2n+1)/2 movements of the center of the palindrome and in the worst case the number of characters it has to check to the left and right of the center to see if it's a palindrome is O(n). Hence, O(n^2) in the worst case. Am I missing something?

Yes, I'm afraid you are missing something. I don't check naively for the longest palindrome around each centre. I calculate this list of palindromes around each centre, together with the longest tail palindrome, in one sweep over the list. At each step I either move the centre, or I extend the longest tail palindrome, and there are only a linear amount of such steps available.

I'm really not counting the time taken or adding complexities due to reading or printing or addition / comparison. I'm simply counting the no of iterations that are happening in the loop. I've tried this algorithm in Java / C / python / JS / PHP and I simple counted the no of iterations that are practically happening. Usually, Its has slight variances but always larger than O(n). As I said before, its still the most efficient algo I've come across but it doesn't seem to have a practical complexity of O(n). I will still digg a bit deeper and check if I've gone wrong somewhere or may be my understanding of the algorithm is still not as correct as it should be.