Optimizing Searches In Perl

So, I needed something to search for a string in one text file, and compare it with a string in a second text file. If there is a match, I need to print the results to a new text file. Because I'm new to Perl, but have a programming background, I figured this would be fairly easy, even if I'm not fluent in the language. Rather than get into the nitty-gritty details, I'll just show you what I did starting off:
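(The original listing didn't survive the blog's formatting, so here's a sketch of what that first attempt looks like, not the exact code: a nested loop that re-reads the second file for every line of the first. The filenames come from the post; the `naive_match` name and the `'@0 A10'` unpack layouts are placeholders, since the real field offsets differ per file.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naive approach: for every line of the first file, re-read the ENTIRE
# second file looking for a matching field.  With n lines in the first
# file and m in the second, that is O(n*m) comparisons.
sub naive_match {
    my ( $file1, $file2, $layout1, $layout2 ) = @_;
    my $matches = 0;

    open my $fh1, '<', $file1 or die "Unable to open $file1: $!\n";
    while ( my $line1 = <$fh1> ) {
        chomp $line1;
        my $field1 = unpack( $layout1, $line1 );

        # The inner loop re-opens and re-scans the second file for
        # every single line of the first -- the source of the O(n*m).
        open my $fh2, '<', $file2 or die "Unable to open $file2: $!\n";
        while ( my $line2 = <$fh2> ) {
            chomp $line2;
            $matches++ if $field1 eq unpack( $layout2, $line2 );
        }
        close $fh2;
    }
    close $fh1;
    return $matches;
}
```

Something like `naive_match( 'phones.dat', 'P1514.FIN', '@1384 A10', '@0 A10' )` would then report the matches, with the layouts quoted from the comments below.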

As you can see, this is a very inefficient algorithm. If "phones.dat" is n lines long, and "P1514.FIN" is m lines long, then for every n-record, I am looking at EVERY m-record. This means my search algorithm is O(n*m). Now, I'll grant that this works for small files. With "phones.dat" at only 330 lines, and "P1514.FIN" at only 500 lines, what am I worried about? It takes less than a second to execute the code and spit out the result. Heh. If only every file I dealt with was that small. Today, I was looking at two files that had about 12,000 lines each. It took ~20 seconds to finish the script. Still not a runtime I really have to be concerned about, but I can make it better.

First, thanks go to Levi Pearson and Jason Hall for helping me realize the error of my ways. I can increase the efficiency of the algorithm greatly by using hashes. In fact, I can take the order of the algorithm to O(n+m) by traversing each file only once. Here's the code I came up with for that:
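(This listing was also lost to the blog's formatting; the following is a sketch of the hash-based version rather than the author's exact code. It stores the whole line as the hash value, since the author mentions in the comments that he uses the line later. The `hash_match` name and the layouts are, again, placeholders.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hash-based approach: one pass over each file.  Pass 1 stores each
# key field, keeping the full line as the value.  Pass 2 does an
# O(1) hash lookup per line, for O(n+m) overall.
sub hash_match {
    my ( $file1, $file2, $layout1, $layout2 ) = @_;
    my %phone;
    my $matches = 0;

    # Pass 1: load the first file's key fields into the hash.
    open my $fh1, '<', $file1 or die "Unable to open $file1: $!\n";
    while ( my $line = <$fh1> ) {
        chomp $line;
        $phone{ unpack( $layout1, $line ) } = $line;
    }
    close $fh1;

    # Pass 2: test each line of the second file against the hash.
    open my $fh2, '<', $file2 or die "Unable to open $file2: $!\n";
    while ( my $line = <$fh2> ) {
        chomp $line;
        $matches++ if exists $phone{ unpack( $layout2, $line ) };
    }
    close $fh2;
    return $matches;
}
```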

In this case, "P1514.FIN" still has the 500 lines it used to have, and "p1514.afo" has well over 12,000. Executing the code is actually faster than before. We should expect this: rather than traversing the m-line file n times to make my comparison, I traverse each file only once. The first pass stores the key and value in a hash, and the second pass makes the comparison. The only thing that might make the program a bit more efficient would be to store the smaller of the two files in the hash, and use the larger of the two to compare against the hash. But in that case, we're really just concerned about memory, which isn't a worry for me.
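That memory tweak could look something like this sketch: compare the two files' sizes on disk with the `-s` file test and hash whichever is smaller. The `order_by_size` name is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the memory tweak: build the hash from whichever file is
# smaller on disk, and stream the larger one past it.  The -s file
# test returns the file size in bytes.
sub order_by_size {
    my ( $file_a, $file_b ) = @_;
    return ( -s $file_a ) <= ( -s $file_b )
        ? ( $file_a, $file_b )    # hash the first, scan the second
        : ( $file_b, $file_a );
}
```

Then `my ( $hash_me, $scan_me ) = order_by_size( 'P1514.FIN', 'p1514.afo' );` picks which file to load into the hash and which to loop over.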

The funny thing about this little programming example? I'm applying what I've learned in Discrete Math to a very real everyday problem. And who'd-a-thunk that any of the stuff was important?!

Posted by Aaron Toponce on Monday, December 18, 2006, at 1:51 pm. Filed under Perl.

{ 7 } Comments

# I realize a lot of these are not
# specifically related to your problem
# but I'm pedantic.

# First, filehandles in perl are
# idiomatically uppercase, just to
# help distinguish them.
#
# Second, use variable filehandles. It
# makes error handling easier.
#
# Third, predeclare as much as possible.
# It'll save a little bit of time. Right
# now you're declaring a variable every
# time through both loops, but it's not
# changing.

# I don't use (un)pack that often so I
# can't comment on that.

my %phone;

my $layout1 = '@1384 A10';
my $layout2 = '@0 A10';

if ( open my $IN1, "P1514.FIN" ) {

while( <$IN1> ) {

chomp;

# You don't need to assign a variable
# for temporary use, it just slows you
# down and takes up memory.

# Why are you saving the line if you're
# not using it later?
$phone{ unpack $layout1, $_ } = '';

}

# $IN1 loses scope here. Perl will
# automatically close the file.

} else {

# It's confusing when you get bogus data,
# or no data.

die "Unable to open P1514.FIN: $!\n";

}

if ( open my $IN2, "p1514.afo" ) {

my $counter = 0;

while( <$IN2> ) {

chomp;

# I'm not sure what you're trying to
# accomplish here, but taking what you
# have, you can increase the efficiency
# quite a bit by using a little more
# idiomatic perl.

print "Match: " . $counter++ . "\r" if exists $phone{ unpack $layout2, $_ };

}

} else {

die "Unable to open p1514.afo: $!\n";

}
1) WordPress likes to lowercase anything that sits between < and >. As is the case with this code, it got lowercased, but I do use uppercase filehandles. Also, you'll notice "# </in1>". This is because WordPress will automatically close my tags if I don't, so I just put it in the code.
2) I store the line as the value of the key in the hash, because I actually am using it later. I probably shouldn't have put it in my example code.
3) I think some of your example is the preference of the programmer, but in any case, it goes to show that there is more than one way to skin a cat! But I agree that it could be a bit more efficient.

In this case, no. The files that I am searching have no rhyme or reason to the layout. Well, they do, but it differs from file to file, and I don't know beforehand what the layout will be. I only know where certain variables are in the layout. So, with that, I can just use substr() or unpack(), as seen here, to get right to where I need to be in the file.
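For instance, both functions can pull the same fixed-width field out of a line by offset; a tiny sketch with a made-up record layout:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up fixed-width record: 3 bytes of other data, then a
# 10-character phone field, then more data.
my $line = 'XYZ5551234567REST';

# '@3 A10' in an unpack template means: move to absolute offset 3,
# then read a 10-byte ASCII field (trailing whitespace stripped).
my $via_unpack = unpack '@3 A10', $line;

# substr does the same thing by offset and length.
my $via_substr = substr $line, 3, 10;

# Both hold "5551234567".
```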

Matthew - In fixed-width data especially, unpack is much faster: it has no need to save state, load a complex engine, or handle the other background tasks the alternatives would need. He still *could* have used them, but this is much more efficient.

Harleypig - as for: print "Match: " . $counter++ . "\r"

Just to be pedantic right back at you: don't use the concat operator (.) in a print. If you think about it, print is built to handle lists, so why not just give it one? Benchmark the difference to verify, but using concat will force Perl to build a new string first and then print it, instead of printing the list directly.
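That benchmark could be run with the core Benchmark module; a quick sketch, where the in-memory filehandle is just to keep the output off the terminal so the timing measures print itself:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Time the two print styles from the comment above.  Output goes to
# an in-memory filehandle so the terminal doesn't dominate the timing.
my $sink = '';
open my $null, '>', \$sink or die "Cannot open in-memory handle: $!\n";

my $counter = 0;
cmpthese( 50_000, {
    concat => sub { print {$null} "Match: " . $counter++ . "\r" },
    list   => sub { print {$null} 'Match: ', $counter++, "\r" },
} );
```

cmpthese prints a rate-comparison table for the two subs; whether the difference is measurable on a given machine is exactly what the commenter suggests checking.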

{ 1 } Trackback

[...] One of the strong features in GeSHi is the ability to recognize keywords and functions built into the language, and provide a link to the official documentation about that keyword. You may have noticed this with my last post about optimizing searches in Perl. Clicking on one of the functions will take you to the Perl documentation site about that function. Try 'print' or 'open' to see what I am talking about. [...]
