If this is your first visit, be sure to
check out the FAQ by clicking the
link above. You may have to register or Login
before you can post: click the register link above to proceed. To start viewing messages,
select the forum that you want to visit from the selection below.

I would like an algorithm that converts a sequence to the minimal possible sequence of its kind (minimal when reading left to right,
for example 1 1 2 3 2 3 < 1 2 3 1 2 3, since the 2nd letter is 1 in the first seq. but 2 in the 2nd).

Of course it is possible to do it in quadratic time, since we can just loop through all starting positions and renumerate so the first occurence of a letter is 1, the second is 2, ...

Since I have millions of such sequences (all about 20 - 30 letters long) I would like a more efficient one (say running in linear or n ln(n) time).
Don't need the code, just the idea.

Re: Minimize sequence with rules

Originally Posted by nuzzle

I may be missing something but finding the "minimal" sequence seems trivial. It's just to order the items. With your example it would be,

1 1 2 2 3 3 4 4

What cuber64 actually means is that he/she wants an algorithm to find the canonical form for a given sequence, under the equivalence relation whereby permutations and circular re-orderings are equivalent.

@cuber64: I think the best you could get for this is O(n), as any algorithm will likely have to look at all the letters of the sequence to determine the canonical form. Or at least all except the last one or two letters, as these can be determined by what is left over after looking at the rest.

Not much help, I know, but sometimes having a bound on what you can likely achieve will help you know when to stop looking for a better approach

Re: Minimize sequence with rules

Originally Posted by Peter_B

What cuber64 actually means is that he/she wants an algorithm to find the canonical form for a given sequence, under the equivalence relation whereby permutations and circular re-orderings are equivalent.

To me it looks like he's defined the "canonical form" to be the ordered sequence. It can be reached for any sequence by sorting it. It involves a series of permutations and can be achieved in O(N*lnN) time.

Re: Minimize sequence with rules

Well I described the equivalence relation already.

two sequences are equal if they differ by a circular shift, or if they differ by a permutation of lettes (not permutation of position of letters).
the permutation of letters perhaps would be better described as renumeration of elements.

Re: Minimize sequence with rules

Originally Posted by nuzzle

Or rather how do you define it in this case?

two sequences are equivalent if one can be obtained from the other by rotations or permutation of the labels ( note that both form a proper subgroup of the group of permutation of order 2n; but given a sequence its sorting permutations may not belong to it, in general ).

to the OP, if no more info is given ( like some statistical regularity of the input data ) I bet you can hardly do better: consider a sequence for big n with uniformly distributed labels: it's ordered canonical form will start with 123...k-1k... with high probability with k growing with n ( presumably linearly ); if this is the case, then any algorithm scanning the wanted form incrementally ( like the one you suggested ) will be n^2.

anyway, I think you can boost the algorithm runtime by finding invariants. For example, the histogram of distances between each label on the circle ( and more generally, any rotationally invariant statistics on such ordered list of distances, like higher order moments ) clearly is such an invariant . Now, I conjecture that given sensible distributions of the input data the probabilty that two inequivalent sequences have the same such invariant goes to 0 for big n. So, combining your o(n^2) algorithm with a prelimary check of a sub-O(n^2) invariant may give an overall sub-O(n^2) algorithm on avarage.

BTW, given that your sequences are bounded in the range 20-30, trying to reduce that algorithm complexity could be ineffective. I would rather focus on the global problem ( the millions of sequences ) instead. One obvious thing to do would be parallelizing it ...

Re: Minimize sequence with rules

Well, two sequences are considered equal if you can rotate and renumerate them into each other.

I have a suggestion based on the observation that these two operations don't change the relative distances between equal letters (since the sequence is circular there are two distances really so lets use the shortest).

If you look at this sequence,

1 1 2 4 4 3 3 2

there are 4 letters and thus 4 distances which are 0, 0, 0, 2. Note that it doesn't matter which letter has which distance so they're presented in sorted order. That would be the canonical form of that sequence. Maybe characteristic form would be a better name so I use that instead. Regardless of how the sequence is rotated and/or renumerated like,

4 4 3 3 2 1 1 2 (rotated)
4 4 3 1 1 2 2 3 (renumerated)

the characteristic form stays 0,0,0,2.

Another example is 1 1 2 2 and 1 2 1 2. They have the characteristic form of 0,0 and 1,1 respectively which are different meaning they're not equivalent.

If my suggestion holds it would be possible to get an O(N) algorithm I think.

Re: Minimize sequence with rules

Originally Posted by nuzzle

Well, two sequences are considered equal if you can rotate and renumerate them into each other.

I have a suggestion based on the observation that these two operations don't change the relative distances between equal letters (since the sequence is circular there are two distances really so lets use the shortest).

If you look at this sequence,

1 1 2 4 4 3 3 2

there are 4 letters and thus 4 distances which are 0, 0, 0, 2. Note that it doesn't matter which letter has which distance so they're presented in sorted order. That would be the canonical form of that sequence. Maybe characteristic form would be a better name so I use that instead. Regardless of how the sequence is rotated and/or renumerated like,

4 4 3 3 2 1 1 2 (rotated)
4 4 3 1 1 2 2 3 (renumerated)

the characteristic form stays 0,0,0,2.

Another example is 1 1 2 2 and 1 2 1 2. They have the characteristic form of 0,0 and 1,1 respectively which are different meaning they're not equivalent.

If my suggestion holds it would be possible to get an O(N) algorithm I think.

While this suggestion of using a sorted list of distances (i.e. 'characteristic form') will be invariant under rotations and renumerations, it is possible for two sequences which are not equivalent according to the OP's criteria to have identical 'characteristic forms'.

So both have same characteristic form using the sorted distances, but the first sequence has two consecutive pairs of letters (3344) whereas the second one doesn't (it has two individual pairs - 33 and 66).

Re: Minimize sequence with rules

Originally Posted by Peter_B

it is possible for two sequences which are not equivalent according to the OP's criteria to have identical 'characteristic forms'.

Well, it was too good to be true I guess .

Still, to be equal under the OP rules, two sequences must have equal 'characteristic forms'. So maybe it could be useful as a cheap pre-screening of equality before a conclusive but more expensive equality test is applied. If pre-screening is efficient it improves complexity in amortized time. An often used example of this is when strings are compared. First the lengths are compared and only if they're equally long a char by char comparision needs be done.

Thanks for checking this out. If I have time I'll have another look at the 'characteristic form' and try to figure out why it's not stringent enougth. I suspect it allows sub-rotations of sequences to slip through. But maybe there's a simple symmetry criterion that can be added to the 'characteristic form' to make it conclusive.

Re: Minimize sequence with rules

Originally Posted by nuzzle

I'll have another look at the 'characteristic form'

I redefine the 'characteristic form' somewhat. It's still based on the (shortest) distance between equal letters but the differences are calculated from left to right and kept in the order they appear. So for example

1 1 2 4 4 3 3 2

becomes

0,2,0,0

and

4 4 3 3 2 1 1 2 (rotated)
4 4 3 1 1 2 2 3 (renumerated)

become

0,0,2,0
0,2,0,0

The 'characteristic form' can be calculated in one sweep of a sequence that is in O(N).

The 'characteristic form' takes care of the renumeration operation (which is part of the equivalence relation defined by the OP). That is any renumerated sequence has the same 'characteristic form' as the original sequence.

The other part of the equivalence relation is rotation. It states that all rotations (circular shifts) of a sequence should be considered equal to the original sequence. This cannot be handled by the 'characteristic form'. Instead all rotations of the 'characteristic forms' of two sequences must be considered to establish equality. If any rotation of the 'characteristic form' of one sequence equals the 'chacateristic form' of another sequence the two sequences are considered equal.

If there are N distances in a 'characteristic form' there are N possible rotations. And to compare two 'characteristic forms' means N distances must be compared. This gives a combined maximum complexity of O(N^2). But this is highly pessimistic. Non-equality should manifest itself after far fewer compares on average. I'd say the complexity is likely to be much closer to O(N).

If my assumptions hold, using the 'characteristic forms' of sequences to establish equivalence (under the OP rules of rotation and renumeration) is an O(N) operation (amortized).

In Peter_B's post these sequences were considered,

121233445656
121233454566

The 'characteristic forms', are,

110011
110110

and they cannot be rotated into each other so the sequences are not equal.