johnnywang has asked for the
wisdom of the Perl Monks concerning the following question:

Seeking wisdom: I have a set of number sequences. Given a length parameter n and a minimum occurrence parameter m, I'd like to find common continuous subsequences of length n that occur at least m times. Here's an example:

Maybe this will be fast enough for you. Maybe you need to be a bit more specific with your spec. This will not scale very well. The 27 element data set consumes 120 loops in the generation phase and 34 in the output phase for a total of 154. It is roughly O(n^2) but it is quite data dependent.

I wrote Algorithm::LCSS, which is based on Algorithm::Diff and may be a better option depending on the real task. The problem with that approach is that it is a one-to-one comparison, not the many-to-many comparison you seem to want.

"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail
"Time is a poor substitute for thought"--theorbtwo
"Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

As I reread his question, you may be correct. The problem is that his output does not match his stated conditions. According to those conditions, in the example he gives, he wants subsequences of length 2. However, his output includes a subsequence of length 3. For this reason, I am assuming that he misstated his conditions and means to say "I'd like to find common continuous subsequences of AT LEAST length n." If I am correct, then the time to solve an input of thousands of sequences of thousands of numbers would be prohibitively long, especially if n is small. If I am not correct, then you are closer to the actual runtime.

If I understand the problem correctly, then from an algorithmic standpoint, what you are looking for is unrealistic, at least if it is to be solved in any reasonable amount of time. The problem is this: the only way to guarantee finding every common continuous subsequence of any length is to do it by brute force. You have to check every subsequence from size n (the minimum length) up to the size of the smallest sequence.

In the example you give, you have to check every subsequence from size 2 to size 5. I did a quick math check (so the exact number might be wrong), but the number of checks would be in the neighborhood of 600. And this is with only 3 sequences of sizes 11, 11, and 5. Just adding 1 number to each of the 3 sequences would add hundreds more required checks. The growth of the problem with every additional sequence is polynomial.
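The rough estimate above can be tallied with a short script. This is only a sketch, written in Python rather than the thread's Perl, and it rests on an assumption the post does not spell out: that a "check" means comparing two windows of the same size. The function name `window_counts` is made up for illustration.

```python
from math import comb

def window_counts(lengths, min_len, max_len):
    """Map each window size k to the number of contiguous windows of that size."""
    return {
        k: sum(max(0, n - k + 1) for n in lengths)
        for k in range(min_len, max_len + 1)
    }

lengths = [11, 11, 5]                  # sequence sizes quoted in the post
counts = window_counts(lengths, 2, 5)  # window sizes 2 through 5
total_windows = sum(counts.values())

# Assumption: one "check" compares two windows of the same size.
pairwise_checks = sum(comb(c, 2) for c in counts.values())

print(counts)           # {2: 24, 3: 21, 4: 18, 5: 15}
print(total_windows)    # 78
print(pairwise_checks)  # 744
```

Under that assumption the tally comes to 744 pairwise comparisons over 78 windows, which is the same order of magnitude as the quick figure of about 600.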

Checking thousands of sequences that may each be thousands of numbers long would require billions of checks, which would take much longer than I'm sure you are willing to accept.
This is one of those NP problems that CS majors learn about in algorithms classes (traveling salesman, graph coloring, Hamiltonian circuit, etc.).

If someone can come up with an algorithm that is not brute force, thus making the problem not NP, I would love to see it.

I suspect this isn't as fast as you'd need: on my machine, it took about 30 seconds to search an array of 100 x 100 random numbers for $m = 3, $n = 4, and the runtime will be O(N^2), where N is the total number of integers in your list of sequences.

Also this finds only common subsequences of the exact length specified, so in this case it returns (5,10), (10,5), (6,21), (21,5). I can't see a way to adapt this to return directly only the longest common sequences, but you could maybe save all the length-2 sequences, then search for length-3 sequences and discard their subsequences, iteratively until you've found the longest subsequence.
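The idea of searching one exact length at a time and then discarding shorter hits can be sketched as follows. This is not the code from this post; it is a minimal illustration in Python (not the thread's Perl) that counts windows in a hash instead of comparing them pairwise. The names `common_windows`, `contains`, and `longest_common` are hypothetical, and the sample list is the one quoted elsewhere in this thread.

```python
from collections import Counter

def common_windows(seqs, n, m):
    """Return {window: count} for length-n contiguous windows seen >= m times."""
    counts = Counter(
        tuple(seq[i:i + n]) for seq in seqs for i in range(len(seq) - n + 1)
    )
    return {w: c for w, c in counts.items() if c >= m}

def contains(longer, shorter):
    """True if `shorter` appears as a contiguous run inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))

def longest_common(seqs, n, m):
    """Grow the window length from n, dropping windows covered by a longer hit."""
    kept = []
    length = n
    while True:
        found = common_windows(seqs, length, m)
        if not found:
            break  # no window of this length is frequent, so none longer can be
        # Discard shorter windows that are part of a longer frequent window.
        kept = [w for w in kept if not any(contains(f, w) for f in found)]
        kept.extend(found)
        length += 1
    return kept

seqs = [[1, 2, 3, 1, 2, 4, 1, 2, 4]]
print(common_windows(seqs, 2, 2))  # {(1, 2): 3, (2, 4): 2}
print(longest_common(seqs, 2, 2))  # [(1, 2, 4)]
```

Note that the discard step drops (1,2) even though one of its three occurrences lies outside (1,2,4); whether that is the desired behavior depends on how the original question is read.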

So you mean subsequences of at least length n? Do the m occurrences have to be in different elements of @a? If so, do the elements have to be adjacent? (I'm trying to figure out what you mean by 'continuous'.)
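The two readings of "occur at least m times" give different counts, as this small sketch shows. It is written in Python rather than the thread's Perl, the function name `count_windows` and its `distinct` flag are made up for illustration, and the data is hypothetical.

```python
from collections import Counter

def count_windows(seqs, n, distinct=False):
    """Count length-n contiguous windows.

    distinct=False counts every occurrence; distinct=True counts each
    window at most once per sequence, i.e. the number of sequences
    that contain it.
    """
    counts = Counter()
    for seq in seqs:
        windows = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
        counts.update(set(windows) if distinct else windows)
    return counts

seqs = [[1, 2, 1, 2], [3, 1, 2], [4, 5]]  # hypothetical data
total = count_windows(seqs, 2)
per_seq = count_windows(seqs, 2, distinct=True)

print(total[(1, 2)])    # 3 occurrences overall
print(per_seq[(1, 2)])  # 2 sequences contain it
```

With m = 3, the window (1,2) qualifies under the first reading but not under the second.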

Update: If you really meant at least, would you want (1,2,3,1,2,4,1,2,4) to show (1,2,4) and (1,2) or just (1,2,4)?

Update: assuming "continuous" only meant not considering (1,3,5) to be a subsequence of (1,2,3,4,5), and that you were just abbreviating the fact that there were matching (6,21) and (21,5) by saying (6,21,5), and assuming all your data are integers > 0:

I cannot add much now to the interesting responses above, but would like to note that your problem bears some resemblance to the problem of finding matching sequences in genomes.

Also, I found a nice paper of academic interest about sequence searching (not exactly your problem, but I couldn't resist, as it has a nice hashing of multidimensional tables at around section 4.3). It involves sliding a window over the target and storing the slopes of segments in very big hashes (if I understood what I read). I'd like to mess with this more, but it's 3am here.

If you do not need an exhaustive list of all patterns but just the most interesting ones, or the statistically significant ones allowing for a number of sequence errors, biological code may be more interesting for you. You could look, for example, at how they do RepeatFinder at TIGR if curious.

You could also search for "hidden Markov", "interpolated Markov", or "Viterbi", which are often used to find hidden sequences or to predict what will come next in a sequence.