Input: a set of \$M\$ strings \$S=\{s_i\}\$ of length \$N\$ with alphabet \$\Sigma\$

Make an index of each string by chars.
Maintain a position \$p=(p_i)\$ where \$p_i\$ is an index into string \$s_i\$.
Set the initial position such that \$\forall i: p_i = |s_i|+1\$
(1-based indexing).

While chars left in strings do:

Select the next point among matches (same char in every string) by looping through the alphabet shared by the strings and picking the point with the minimal m-dimensional euclidean distance to the current point.
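The steps above can be sketched in Python roughly as follows. This is only a minimal sketch of the described heuristic, not the original implementation: the function name `greedy_common_subsequence`, the 0-based positions, and the alphabetical tie-breaking are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def greedy_common_subsequence(strings):
    """Greedy heuristic: build a common subsequence from right to left."""
    # indexes[i][letter] is the sorted list of positions of letter in strings[i].
    indexes = []
    for s in strings:
        table = defaultdict(list)
        for i, letter in enumerate(s):
            table[letter].append(i)
        indexes.append(table)

    # Only letters that occur in every string can produce a match.
    alphabet = set(strings[0]).intersection(*strings[1:])

    # Start one past the end of each string (0-based indexing here).
    pos = [len(s) for s in strings]
    result = []
    while True:
        best = None
        for letter in sorted(alphabet):  # sorted only to make ties deterministic
            candidate = []
            for i, p in enumerate(pos):
                occurrences = indexes[i][letter]
                k = bisect_left(occurrences, p)  # occurrences strictly before p
                if k == 0:
                    break  # letter no longer available in strings[i]
                candidate.append(occurrences[k - 1])
            else:
                # Squared Euclidean distance from the current point.
                d = sum((cur - new) ** 2 for cur, new in zip(pos, candidate))
                if best is None or d < best[0]:
                    best = (d, letter, candidate)
        if best is None:
            break  # no letter matches in every string any more
        _, letter, pos = best
        result.append(letter)
    return ''.join(reversed(result))
```

For example, `greedy_common_subsequence(['abcab', 'bcab'])` returns `'bcab'`. Being greedy, the sketch always returns a common subsequence but not necessarily a longest one.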

@janos I typed it up but it's not one-to-one with the text in the image. Image removed for clarity.
– user72935, May 8 '15 at 19:16

Here are strings for which the result is sub-optimal, at least with this implementation: strings = ['cggbcgeefcffaafbcgdbdadgfbdegbbcgbgefebeeegbfegfeedfdbccgfceebfefbgbfeddeefeacbcffacbbceccgfcfdecdgfefgfdfgccfeccbecdfadgagdegdb', 'gbggadbeedfbfcfdddfaggbcfcegggefebebacdddcabaedegfegfeedfbgacfebabedbfeafgafdeebbefbcedffegfafebbgdadfbbecbaccfdgffgfbeefggadebg', 'cbgggdcceedfaafffbcgdgafedbgcbbggefebecaecegfeccgafecccebdddfgfcebefabefgdaadfbeeefeffeagcdffegacfbecbccbfgcffgfdedbefgbgcbbebag']
– user72935, May 9 '15 at 8:09

Another definition of the distance to consider is $\sum_{i=1}^{|S|} \left(|s_i| - \mathrm{ind}(i)\right)$, that is, the sum of distances along each string, with one past the end as the reference point.
– user72935, May 15 '15 at 11:31

2 Answers

1. Analysis

This is a greedy algorithm: at each step it prefixes one letter to the result, choosing the letter that minimizes the change in the indexes into the strings.

The main loop of the algorithm adds one letter to the result, so it can execute at most \$ N \$ times. Each iteration of the loop considers each letter in the alphabet \$ Σ \$ and each of the \$ M \$ strings, and searches (by bisection) the list of occurrences of that letter in that string, which takes \$ O(\log N) \$ time. Thus the overall runtime is \$ O(\left|Σ\right|NM\log N) \$. (You missed a factor of \$ N \$.)

The code takes the union of the letters in the strings. But the only letters that can appear in the longest common subsequence are those that appear in all the strings, so the intersection is needed here, not the union.
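To illustrate the difference, here is a small sketch with made-up input; the variable names `union` and `common` are illustrative and not taken from the original code.

```python
strings = ['abcx', 'bcay', 'cabz']

# Union: every letter that occurs in at least one string.
union = set().union(*strings)

# Intersection: only the letters that occur in every string.
common = set(strings[0]).intersection(*strings[1:])

print(sorted(union))
print(sorted(common))
```

Here `x`, `y` and `z` are in the union but can never be part of a common subsequence, so iterating over `common` avoids wasted searches.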

The code accumulates a list by repeated addition:

for i in range(len(x)):
    tx[x[i]] += [i]

This wastes effort because Python has to build a temporary one-element list on every iteration just to extend the existing list. It's simpler and more efficient to accumulate a list using the append method.

Iteration over the indexes of a sequence can often be simplified using enumerate; for example, the loop above can be written:

for i, letter in enumerate(x):
    tx[letter].append(i)

pos is a dictionary mapping input string to the current position in that string in the search. Since strings is a list of strings, it would make more sense for pos to be a list of positions, so that pos[i] was the current position in strings[i]. Then match simplifies to:

Just as I recommended making pos into a list, I also recommend making ind into a list. If you did, then distance would become:

def distance(v, w):
    return sum((i - j)**2 for i, j in zip(v, w))

and since this is only called from one place, you could easily inline it.

Similarly, indexes would be more conveniently organized so that indexes[letter][i] is a list of the indexes of the occurrences of letter in strings[i].
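One possible way to build such a structure is sketched below. The helper name `build_indexes` is hypothetical, and the sketch assumes that `strings` is a list of strings as above.

```python
from collections import defaultdict


def build_indexes(strings):
    """Return {letter: [positions in strings[0], positions in strings[1], ...]}."""
    # Each letter maps to one (initially empty) position list per string.
    indexes = defaultdict(lambda: [[] for _ in strings])
    for i, s in enumerate(strings):
        for j, letter in enumerate(s):
            indexes[letter][i].append(j)
    return dict(indexes)


indexes = build_indexes(['abca', 'bac'])
print(indexes['a'])  # [[0, 3], [1]]
```

With this layout, a letter is usable as a match only if every inner list still contains an occurrence before the current position.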

A magic number like this:

dr = 12777216

needs to be explained. It looks to me as though this needs to be some number larger than the biggest possible distance between pos and ind. But for very long strings (tens of thousands of letters), this won't be the case. It would be more reliable to start at infinity:

min_distance = float('inf')

But it would be better to reorganize the code to call the built-in min function, and catch the ValueError that is raised when there are no elements. See the revised code in §3 below for how this is done.
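As a rough illustration of that pattern (not the revised code itself), the sketch below picks the nearest candidate with min and treats an empty candidate list as "no match". The function name `nearest_candidate` and the representation of candidates as lists of indexes are assumptions.

```python
def nearest_candidate(pos, candidates):
    """Return the candidate closest to pos, or None if there are none."""
    try:
        return min(candidates,
                   key=lambda c: sum((p - q) ** 2 for p, q in zip(pos, c)))
    except ValueError:
        # min() raises ValueError when given an empty iterable.
        return None


print(nearest_candidate([5, 4], [[3, 2], [4, 3], [2, 1]]))  # [4, 3]
print(nearest_candidate([5, 4], []))  # None
```

No sentinel distance is needed this way, so the result does not depend on how long the strings are. Passing `default=None` to min is an alternative to catching the exception.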

Similarly, the magic number -128 needs explanation. The idea seems to be that this is so far away from the index of any character in the string that the distance will be large and the candidate will be rejected. But is this really true? If the strings are longer than 128 characters then a candidate might be included erroneously. It would be better to reject candidates where bisect_right returns 0.

The handling of the case where bisect_right returns 0 seems incorrect to me. In this case find returns -1, and then this leads to the assignment ind[x] = indxc[-1] which succeeds (getting the last element in indxc) but this is wrong.
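A lookup that rejects the "no earlier occurrence" case explicitly could look something like this. It is a sketch under assumptions: the name `last_occurrence_before` and its signature are hypothetical and do not come from the original find.

```python
from bisect import bisect_right


def last_occurrence_before(occurrences, limit):
    """Return the largest index in the sorted list `occurrences` that is
    strictly less than `limit`, or None if there is no such index."""
    k = bisect_right(occurrences, limit - 1)
    if k == 0:
        # Nothing before `limit`; indexing with k - 1 == -1 would silently
        # return the last element of the list instead.
        return None
    return occurrences[k - 1]


print(last_occurrence_before([2, 5, 9], 6))  # 5
print(last_occurrence_before([2, 5, 9], 2))  # None
```

The caller can then skip any letter for which the lookup returns None in at least one string, instead of relying on a sentinel value such as -128.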

The question has been answered above, but here are some further thoughts:

The problem with solving LCS exactly comes from the fact that there are an exponential number of matches in the worst case, $O(N^M)$. This is the case when no deduplication is used and all the strings are equal.
The LCS grid is a DAG, where each grid cell is a vertex, so a longest path, and hence a longest common subsequence, could be found in time linear in the number of matches. But for this to be efficient (polynomial time), only a polynomial number of matches can be considered, which may lead to an inexact solution.

The number of matches is exactly $\sum_{k}\prod_{i}|s_i|_{a_k}$, where $a_k$ is the $k$th letter of the (common) alphabet, $s_i$ is the $i$th string, and $|s_i|_{a_k}$ is the number of occurrences of $a_k$ in $s_i$. This is exponential in the number of strings.
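The count can be checked numerically with a short sketch like the one below; the function name `count_matches` is made up for illustration, and `math.prod` requires Python 3.8 or later.

```python
from collections import Counter
from math import prod


def count_matches(strings):
    """Sum, over the common letters, of the product of per-string counts."""
    counts = [Counter(s) for s in strings]
    common = set(strings[0]).intersection(*strings[1:])
    return sum(prod(c[letter] for c in counts) for letter in common)


print(count_matches(['aaa', 'aaa']))         # 9
print(count_matches(['aaa', 'aaa', 'aaa']))  # 27
```

With $M$ identical strings consisting of a single repeated letter, the count is $N^M$, which matches the worst case mentioned above.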