I have a data warehousing application that identifies unique files by the MD5 of their content. I am using Digest::MD5 to calculate the digests. The database uses the binary digest (as returned by Digest::MD5::digest(), but I'd like the hex string too for logging.

I think I have discovered a bug in Digest::MD5. If I call its digest() method, subsequent calls to hexdigest() return a bogus value. The test program below illustrates this.

Please check my work before I file a bug report. In the meantime, I suppose I can workaround the problem by getting the hex string first.

On a related note, It would be great if I could independently check the values returned by Digest::MD5::digest(). I have some old code that purports to convert digests to strings, but it just does 'sprintf("%x", $digest)'. This doesn't seem to work anymore, if it ever did. I've been looking around at various techniques and haven't found an answer that produces the same string as md5sum. Now I'm getting worried that the hash itself may be wrong. What's the right way?

I have Perl 5.10 and the latest (2.38) version of Digest::MD5. I've tried this on an x86_64 machine running Linux FC9 and on an x86 box running Windows XP and cygwin.

UPDATE *deleted*

UPDATE2 I thought the problem transcended Digest::MD5 instances but I forgot to reinitialize the file handle. Never mind.

First of all, thanks a lot for replying! Now, onto the meat of your comments:

1. Cleaner test code. Ok, I'll try to do better in future posts. 2. use warnings - yes, good idea. 3. my $IN vs *IN, cool, but irrelevant. 4. use binmode. Agree, safer. Turns out to be Irrelevant for linux/cygwin though. 5. seek to file start before recalculating. Yes, it works, but it implies that Digest::MD5 needs to read all of the bytes for each call to *digest(). That's fine for passwords, but not so good for calculating the digest of large file contents, as I'm doing.

I guess the bottom line is that I (naively, as it turns out) thought that Digest::MD5 was caching the digest (or content) at addfile() then using the cached representation for subsequent calls to *digest(). I will recode accordingly. Perhaps something like this:

When I posted the adjusted code I decided not to go over the details and I didn't make all of the changes that I thought it needed. Basically what I'm talking about is writing code that not only works but is clean, efficient and overall good quality.

Your updated script is a bit obfuscated and does a few odd things.

You name your sub getFileDigest and call it it boolean context but if it fails (or should I say if the open call fails), you return a true value and if it succeeds, you return a false value. That's backwards from what it should do.

When an open call fails, you should include the reason. It is also better to use the 3 arg form of open.

Why are you applying the exact same map/unpack statement against the same var?

The use of `md5sum $file` only serves to make the script less portable.

The use of binmode is recommended in the module's doc and altho it doesn't give a clear reason part of the reason would be for portability.

I agree that seeking to the beginning of the file and recalculate the digest is inefficient. I'd need to run a few tests, but there probably is a way to run both hexdigest() and digest() without using seek and a second addfile.

Here's a cleaned up version that accomplishes the exact same thing as your script.

All your comments are appropos; however, the getFileDigest subroutine is part of a larger module. The driver program is throw-away code I whipped up for this forum thread only.

I think you are treading perilously close to a religious argument w.r.t. whether methods should fail with true or false In this case, I stuck to the convention used elsewhere in my application.

Agree that there's really no point in the Digest::MD5 OO interface for my use case. In fact, one could say that my problems with this module started with (erroneous) assumptions about the stateful-ness of the object.

I think you are treading perilously close to a religious argument w.r.t. whether methods should fail with true or false

Not at all. But the return value should reflect the context in which the sub is being used. If the the sub was called FailedToGetFileDigest then your return values would make sense. But returning true when you mean false and false when you mean true only adds confusion to anyone else that needs to work with the code. At the very least a comment should be added to clarify the meaning of the return value.

Quote

BTW, this program doesn't work Shocked. In order to use the functional interface to Digest::MD5, you have to pass the file content, not the file handle

OK, I'll bite. I stripped out all non-essential stuff from your program, parameterized it for method (oo or functional) and put the guts in a loop on a glob.

I then pointed this program at a directory containing 222 text files whose contents totalled 543kb. After running it a few times to populate the system file cache, the program took about 3-3/4 seconds to iterate over the directory on my laptop. The times differed by at most .03 second regardless of method.

So, to summarize this thread:

1. The functional interface requires your program to read the file into memory and pass it each time you use the interface. Therefore, contents which can be processed by the functional interface are limited to the available RAM.

2. The functional interface is idempotent, in that you get the same results for the same parameter, call after call, with no special treatment required.

3. The oo interface streams the data from the file handle you pass as a parameter. This makes the oo interface more appropriate for summing large files.

4. The oo interface is sensitive to the state of the passed file handle. Idempotent results are only achievable if the state of the file handle is reset between calls (e.g., seek $fh, 0,0).

5. There is no significant difference in execution speed between the functional and oo interfaces.

This was fun!

Cheers,

Larry

Code

#!/usr/bin/perl

use strict; use warnings; use Digest::MD5 qw< md5 md5_hex md5_base64 >; use Getopt::Std;