Most systems today have lots of RAM and typically run only one "big" program at a time, so there is not much contention for that RAM. It is therefore reasonable to use virtual memory as what it really is: a large disk file. Your design does need to be mindful of the fact that, with virtual memory, everything works very well until a so-called "thrash point" is reached, at which time performance abruptly starts to go to hell fast.

As has been said, the largest piece of data is probably the input dataset(s), so you want to process those one record at a time. My further suggestion is that perhaps you can arrange things so that the output is effectively generated in "batches," such that, at some particular turning point, the program is able to release some of the memory it has been holding up to that time. Let me explain specifically what I mean and how you might accomplish this.

I am not familiar with the task of DNA sequencing, but let me suggest that perhaps it would be possible, and therefore advantageous, to design your algorithm so that it assumes (and requires, and checks) that the input file is sorted. (Caution: any program which relies upon this assumption must check the sort order and die() outright if a violation is found.) When a file is known to be sorted, the data is quite naturally grouped, and the boundaries of each group can be determined without searching. All of the sequences beginning with, say, AGAGTTT (or however many leading characters you choose) will occur together, so when you observe that this leading sequence has changed, you know that this group of records has ended and that there will be no more. (Likewise, at each gap in the sequence you know, again without searching, that the gap is empty.) The payback comes if such a boundary is a turning point at which the in-memory data structures for the group just ended can be released. The capacity of the program is then no longer bounded in any way by the worst-case size of the file, but by the worst-case size of each group.
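A minimal sketch of this group-at-a-time pattern might look like the following. The 7-character prefix, the per-group hash, and the process_group() placeholder are all illustrative assumptions, not anything specific to DNA sequencing:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: assumes sorted input on STDIN, one sequence per line.
my $PREFIX_LEN = 7;     # illustrative group-key length, e.g. 'AGAGTTT'
my $prev = '';          # previous record; '' sorts before any sequence
my %group;              # in-memory data for the CURRENT group only

while ( my $seq = <STDIN> ) {
    chomp $seq;

    # Any program relying on sorted input must verify it and die() on error.
    die "input not sorted at line $.: '$seq' after '$prev'\n"
        if $seq lt $prev;

    my $key = substr( $seq, 0, $PREFIX_LEN );
    if ( $prev ne '' && $key ne substr( $prev, 0, $PREFIX_LEN ) ) {
        process_group( \%group );    # the group just ended: emit its batch...
        %group = ();                 # ...and release its memory (the turning point)
    }
    $group{$seq}++;                  # accumulate within the current group
    $prev = $seq;
}
process_group( \%group ) if %group;  # flush the final group

sub process_group {
    my ($g) = @_;                    # placeholder for the real per-group work
    printf "%s: %d distinct sequences\n",
        substr( ( keys %$g )[0], 0, $PREFIX_LEN ), scalar keys %$g;
}
```

Memory held at any moment is proportional to one group, not to the whole file, and the die() guard makes the sorted-input assumption safe rather than silent.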

On-disk sorting is one of the most-studied problems in all of computer science. (Before disks, it was done with tapes; before that, and before computers, it was done with punched cards.) Even files consisting of millions of records can be sorted in seconds. The time required to sort the data, then, is pragmatically inconsequential. Meanwhile, the operational paybacks of designing algorithms which require identically sorted input streams can be huge.
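You do not even have to write the sort yourself: sort(1) performs an external (on-disk) merge sort and copes with files far larger than RAM. A hedged sketch of delegating that step from Perl, with illustrative file names:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative file names; substitute your real input and output paths.
my ( $in, $out ) = ( 'sequences.txt', 'sequences.sorted.txt' );

# Force byte-wise collation so sort(1)'s order matches Perl's 'lt'
# comparison; otherwise locale rules could disagree with your checks.
local $ENV{LC_ALL} = 'C';

# sort(1) spills runs to temporary files and merges them, so memory
# stays bounded no matter how large the input is.
system( 'sort', '-o', $out, $in ) == 0
    or die "sort failed: exit status $?\n";
```

The LC_ALL=C setting matters: the whole scheme depends on the pre-sorting pass and the group-reading pass agreeing on one identical ordering.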

I am not a Perl expert, and have learned from many authors on PerlMonks, including both BrowserUK and sundialsvc4.

If the comments by sundialsvc4 are inaccurate or not appropriate for the OP's problem, please tell us why. Pre-sorting the data seems like good advice to me, but I don't have your experience and knowledge of Perl. Please enlighten.