internal repeats

In article <1survlINNdfl at phage.cshl.org> wli at phage.cshl.org ( CSHL) writes:
>>i'm looking for a C code which lists *all* internal repeats
>in a sequence. no restriction on the repeating unit size,
>the distance between the repeat units, or anything. this
>list has to be complete.
>>is it a trivial task or a tricky one? i'm new to this topic,
>so i have no idea. any pointer is appreciated...( public
>domain code, reference... or your own program-- i don't need
>a complete code, only the subroutine dealing with this task
>would be fine).
>>thanks in advance! -- w. li
>>wli at cshl.org>===================================
This is a trivial task. Consider your sequence with n
amino acids as n sequences, one starting at pos 1, one
at pos 2, etc. until one starting at pos n. This is
done without any copying, just pointing to each position.
Now sort the pointers in lexicographical order by the pointed
text. All the information that you want will be directly
available from the sorted list.
An example helps.
If the sequence is: MARRARRPRGRFYAFRRG (20 aa long)
these are the 20 subsequences
MARRARRPRGRFYAFRRGRW
ARRARRPRGRFYAFRRGRW
RRARRPRGRFYAFRRGRW
RARRPRGRFYAFRRGRW
ARRPRGRFYAFRRGRW
RRPRGRFYAFRRGRW
RPRGRFYAFRRGRW
PRGRFYAFRRGRW
RGRFYAFRRGRW
GRFYAFRRGRW
RFYAFRRGRW
FYAFRRGRW
YAFRRGRW
AFRRGRW
FRRGRW
RRGRW
RGRW
GRW
RW
W
These are the above strings sorted in lexicographical order
AFRRGRW
ARRARRPRGRFYAFRRGRW
ARRPRGRFYAFRRGRW
FRRGRW
FYAFRRGRW
GRFYAFRRGRW
GRW
MARRARRPRGRFYAFRRGRW
PRGRFYAFRRGRW
RARRPRGRFYAFRRGRW
RFYAFRRGRW
RGRFYAFRRGRW
RGRW
RPRGRFYAFRRGRW
RRARRPRGRFYAFRRGRW
RRGRW
RRPRGRFYAFRRGRW
RW
W
YAFRRGRW
Notice that all the repeats are next to each other. For example,
both instances of ARR, three instances of RR, nine instances of R, etc.
Gaston H. Gonnet, ETH Zurich.