You mentioned somewhere that hashes are not useful as there are only 24 keys. You also mentioned about using a hash of arrays, but dismissed it.

That's premature. If you load your main file into a hash of arrays, then you only need compare each of your annotated records against 1/24th of the main records. And that reduces your overall workload and runtime to 1/24th of what it would be.

As others have pointed out, your range checking is far too complex and computationally expensive. Although there are many (7?) permutations of relative positioning of the starts and ends of the ranges, there are only 2 that do not overlap:

If both your files are already in sorted order (if they are not, (learn to) use your system sort utility and sort them ascending by the tag, start, end), then when Main Start moves greater than Annot End, then you need process no further main records for this tag, so we can skip to the next Annot record. (Outer loop)

Equally, whilst Main End is less than Annot Start, you can skip to the next Main record. (Inner loop).

And actually, that covers both (all) conditions when we do not want to print output, so no further range testing is required.

Finally, there is one further optimisation possible. As both files are sorted, when we start processing the second Annot record for a given tag, then we know it cannot be earlier than the last Annot record for this tag.

Therefore, if we rejected and skipped the first N Main records for this tag for the last Annot record, we know we can skip at least that many for this record. This complicates the code a little, but yields a huge saving in processing which make it worth while.

The final cut of the loops code looks like this:

my( $first, $lastTag ) = ( 0, '' );
OUTER: while( <ANNOT> ) {
chomp;
## ((simplified for testing)
my( $atag, $astart, $aend, $mess ) = split "\t";
## If the tag changed, reset the start position for the Main recor+ds
## and remember the new tag
$first = 0, $lastTag = $atag if $atag ne $lastTag;
## Skip completely if this Annot tag has no corresponding Main tag
next unless exists $main{ $atag };
## For each Main record with this tag,
## But skipping those we rejected last time
for my $md ( @{ $main{ $atag } }[ $first .. $#{ $main{ $atag } } ]+ ) {
## (simplified for testing)
my( $mtag, $mstart, $mend ) = split "\t", $md;
## Increment the skip and goto the next Main record
## If the end of this Main record is less than
## the start of the Annot record
++$first, next if $mend < $astart;
## Skip the rest of these Main record
## if the start of this Main record is greater than
## the end of this Annot record
next OUTER if $mstart > $aend;
## We have an overlap, so output the info
print OUTPUT "$atag\t$mess";
}
}

My best guess is this 3 if not 4 orders of magnitude quicker. Let's see anyone come close to that using an RDBMS.

Full test code:

#! perl -slw
use strict;
my %main;
## Get the filename for and open the main file
my $mfile = prompt( "Main file?" );
open MAIN, '<', $mfile or die "$mfile: $!";
## Build a HoAs keys by the tag (chromosome?)
## Don't split the records yet to save ram.
m[^([^\t]+)\t] and push @{ $main{ $1 } }, $_
or die "Bad record in $mfile @ $." while <MAIN>;
close MAIN;
chomp @$_ for values %main; ## Chomp the arrays
## Get the filename for and open the annotations file
my $afile = prompt( "Annotation file?" );
open ANNOT, '<', $afile or die "$afile: $!";
## Open the output file
open OUTPUT, '>', 'output.txt' or die $!;
## Record our start time
print scalar localtime;
## Variables to remember the llast Annot tag we processed
## And how many Main records we skipped before we found an overlap
my( $first, $lastTag ) = ( 0, '' );
OUTER: while( <ANNOT> ) {
chomp;
## Read and split the Annot record
my( $atag, $astart, $aend, $mess ) = split "\t";
## If the tag changed, reset the start porsition for the Main reco+rds
## and remeber the new tag
$first = 0, $lastTag = $atag if $atag ne $lastTag;
## Skip completely if this Annot tag has no corresponding Main tag
next unless exists $main{ $atag };
## For each Main record with this tag,
## But skipping those we rejected last time
for my $md ( @{ $main{ $atag } }[ $first .. $#{ $main{ $atag } } ]+ ) {
## Split the Main record. We do it over and over, but it's not+ hugely expensive.
## Split when building the HoAs if you have enough memory.
my( $mtag, $mstart, $mend ) = split "\t", $md;
## Increment the skip and goto the next Main record
## If the end of this Main record is less than
## the start of the Annot record
++$first, next if $mend < $astart;
## Skip the rest of these Main record
## if the start of this Main record is greater than
## the end of this Annot record
next OUTER if $mstart > $aend;
## We have an overlap, so output the info
print OUTPUT "$atag\t$mess";
}
}
close ANNOT;
close OUTPUT;
print scalar localtime;
sub prompt {
printf "%s", $_[ 0 ];
chomp( my $in = <STDIN> );
return $in;
}
__END__
C:\test>664760-gen > 664760.main
C:\test>wc -l 664760.main
300000 664760.main
C:\test>664760-gen > 664760.annot
C:\test>wc -l 664760.annot
300000 664760.annot
C:\test>664760-b
Main file?664760.main
Annotation file?664760.annot
Tue Jan 29 06:36:34 2008
Tue Jan 29 06:42:12 2008
C:\test>wc -l output.txt
112805 output.txt

Went to join the gridlock to see it
Held an eclipse party
Watched a live feed
I cn"t see tge kwubosd to amswr thus
I tried to see it, but 8000 miles of rock got in the way
What eclipse?
Wanted to see it, but they wouldn't reschedule it
Read the book instead