Computing cryptographic hashes of files on iOS and Mac OS X using the CommonCrypto APIs is fairly easy, but doing it in a way that keeps memory consumption low even with large files can be a little more difficult. The other day, I was reading a discussion about this on an iPhone development forum. The people there thought they had found the trick, but their memory footprint still grew with large files because they had forgotten something fundamental about memory management in Cocoa.

Updated

Friday, October 1, 2010: removed comment about the fact that I used character arrays on the heap with the more modular solution described at the end of the post; this is now fixed, and that more general solution is now as efficient as the simple one described here.

Sunday, October 17, 2010: added link to a simple GitHub repository that I created to show exactly how to integrate my function FileMD5HashCreateWithPath with a simple iOS or Mac application.

What was wrong with that solution?

Even though their solution read bytes from the file progressively instead of reading everything at once, it did not reduce the memory consumption of their program when computing hashes of large files. Their mistake was that the bytes read in the while loop ended up in an autoreleased instance of NSData. So, unless you create a local autorelease pool within the while loop, those NSData objects just accumulate until the enclosing autorelease pool is drained. But I think it would be very inefficient to add an autorelease pool in the while loop, because you would end up allocating and draining a new pool in every pass of the loop.

So, in my opinion, the right question is: how do we read those bytes without getting an autoreleased object?

How to get around that problem?

I looked for a solution, and I couldn’t find anything that would do the same thing as -[NSFileHandle readDataOfLength:] at the Foundation level without returning an autoreleased object. So I thought: we have to go deeper. I looked for something similar in Core Foundation, and sure enough, I found the CFReadStream API.

And since I was going to do this using Core Foundation to read those bytes, I decided to go all the way with Core Foundation, with a solution in pure C.

Here’s how you can efficiently compute the MD5 hash of a large file with CommonCrypto and Core Foundation:

Remember that FileMD5HashCreateWithPath transfers ownership of the returned string, so you must release it yourself.

I also created a small GitHub repository that may help you understand how to integrate that code in your project. It contains a very simple Xcode project, with a target for iOS and another one for Mac OS X. In both cases, the application just provides a simple button to compute the MD5 hash of the executable file (the binary). Here is where you can find that repository: FileMD5Hash GitHub repository.

Advantages of this solution

There are several nice things about this implementation:

first, it works as advertised: it computes the MD5 hash of the file correctly, and it doesn’t make the memory footprint of your app grow, even if you give it the path to a huge file;

even though the path argument is a CFStringRef, it’s really easy to use this from Objective-C, thanks to the fact that NSString and CFStringRef are toll-free bridged; cf. example above for usage;

it works just fine both on iOS and on Mac OS X;

by reusing sizeof(digest) instead of repeating CC_MD5_DIGEST_LENGTH (or its literal value) throughout the code, I kept the algorithm-specific constant in a single place, which makes the function easier to adapt to other cryptographic algorithms.
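The last point can be illustrated in isolation. The sketch below is plain C: DIGEST_LENGTH is a hypothetical stand-in for CC_MD5_DIGEST_LENGTH, SampleDigestToHex is a hypothetical helper, and the digest is filled with fixed sample bytes (the MD5 of empty input) where the real function would call CC_MD5_Final. The constant appears only once, in the declaration of digest; every other size derives from sizeof(digest).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for CC_MD5_DIGEST_LENGTH, so this compiles anywhere. */
#define DIGEST_LENGTH 16

/* Hypothetical helper: formats a fixed sample digest (the MD5 of empty input)
   as lowercase hex into out, and returns the number of hex characters. */
size_t SampleDigestToHex(char *out, size_t outSize) {
    /* The only place where the algorithm-specific constant appears. */
    unsigned char digest[DIGEST_LENGTH] = {
        0xd4, 0x1d, 0x8c, 0xd9, 0x8f, 0x00, 0xb2, 0x04,
        0xe9, 0x80, 0x09, 0x98, 0xec, 0xf8, 0x42, 0x7e
    };
    /* Two hex characters per byte, plus the terminating NUL. */
    char hash[2 * sizeof(digest) + 1];

    for (size_t i = 0; i < sizeof(digest); ++i) {
        /* Each call writes two hex digits and a NUL that the next call overwrites. */
        snprintf(hash + 2 * i, 3, "%02x", (unsigned)digest[i]);
    }
    snprintf(out, outSize, "%s", hash);  /* truncates safely if out is too small */
    return sizeof(hash) - 1;
}
```

Switching to a longer digest then only means changing the one declaration; the buffer size and the loop bound follow automatically.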

How about SHA1, SHA256, and others?

It’s really simple to adapt this function to other algorithms. Say you want to adapt it to get the SHA1 hash instead. Here’s what you need to do:

replace CC_MD5_CTX with CC_SHA1_CTX;

replace CC_MD5_Init with CC_SHA1_Init;

replace CC_MD5_Update with CC_SHA1_Update;

replace CC_MD5_Final with CC_SHA1_Final;

replace CC_MD5_DIGEST_LENGTH with CC_SHA1_DIGEST_LENGTH.

Or more simply, just do a find and replace to change every occurrence of the string “MD5” to “SHA1”. Voilà, you got it!

Another way to extend this to other algorithms is to make this function more modular, and basically take all of those things as arguments. This is a little more difficult, but I did it for my project TagAdA. With this more advanced and more modular solution, you have a third argument that represents the algorithm that you wish to use, and you only have one instance of the code associated with that logic in your binary, even if you use several of those cryptographic algorithms in your app. I even went to great lengths using the preprocessor to minimize the amount of duplicated code in my source file.
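To give a rough idea of what taking the algorithm as an argument can look like, here is a sketch in portable C. It is not the TagAdA code: the HashAlgorithm struct, the HashStream function, and the 32-bit FNV-1a checksum used as a stand-in algorithm are all hypothetical, and it reads with fread rather than CFReadStream so that it builds anywhere. On iOS or Mac OS X, each descriptor would instead wrap the matching CommonCrypto calls (CC_MD5_Init, CC_MD5_Update, CC_MD5_Final, and so on).

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum { kMaxDigestLength = 64, kChunkSize = 4096 };

/* Hypothetical descriptor: everything the streaming code needs to know. */
typedef struct {
    size_t contextSize;
    size_t digestLength;
    void (*init)(void *context);
    void (*update)(void *context, const void *data, size_t length);
    void (*final)(unsigned char *digest, void *context);
} HashAlgorithm;

/* Stand-in algorithm: 32-bit FNV-1a. It is NOT a cryptographic hash. */
static void Fnv1aInit(void *context) {
    *(uint32_t *)context = 2166136261u;
}
static void Fnv1aUpdate(void *context, const void *data, size_t length) {
    uint32_t h = *(uint32_t *)context;
    const unsigned char *bytes = data;
    for (size_t i = 0; i < length; ++i) {
        h ^= bytes[i];
        h *= 16777619u;
    }
    *(uint32_t *)context = h;
}
static void Fnv1aFinal(unsigned char *digest, void *context) {
    uint32_t h = *(uint32_t *)context;
    for (int i = 0; i < 4; ++i) {
        digest[i] = (unsigned char)(h >> (24 - 8 * i));  /* big-endian bytes */
    }
}
const HashAlgorithm kFnv1aAlgorithm = {
    sizeof(uint32_t), 4, Fnv1aInit, Fnv1aUpdate, Fnv1aFinal
};

/* The single copy of the streaming logic. Writes the digest as lowercase hex
   into out, which needs 2 * digestLength + 1 bytes. Returns 0 on success and
   -1 on failure. */
int HashStream(FILE *stream, const HashAlgorithm *algorithm,
               char *out, size_t outSize) {
    unsigned char digest[kMaxDigestLength];
    unsigned char buffer[kChunkSize];
    size_t count;
    void *context;
    int status = -1;

    if (!stream || !algorithm || !out) return -1;
    if (algorithm->digestLength > sizeof(digest)) return -1;
    if (outSize < 2 * algorithm->digestLength + 1) return -1;

    /* One allocation per call (not per chunk) for the algorithm's state. */
    context = malloc(algorithm->contextSize);
    if (!context) return -1;

    algorithm->init(context);
    while ((count = fread(buffer, 1, sizeof(buffer), stream)) > 0) {
        algorithm->update(context, buffer, count);
    }
    if (!ferror(stream)) {
        algorithm->final(digest, context);
        out[0] = '\0';
        for (size_t i = 0; i < algorithm->digestLength; ++i) {
            snprintf(out + 2 * i, 3, "%02x", (unsigned)digest[i]);
        }
        status = 0;
    }
    free(context);
    return status;
}
```

Supporting another algorithm then means adding one more descriptor, while the read loop and the hex conversion exist only once in the binary.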

Please make sure to add FileMD5Hash.c to the list of files that Xcode is supposed to compile for your target. One way to do that is to drag and drop FileMD5Hash.c to the “Compile Sources” build phase of your target.

Andrey September 18th, 2010

This didn’t work for me because I tried to use it in a .mm file. The solution is simple:

I decided to fork Pierre’s GitHub repository, and to add a simple Xcode project that shows how to integrate this code with a simple iOS or Mac application. This should document in more detail things that I intentionally omitted in the blog post (to keep it simple, and more readable).

@Neil You don’t even need to mention me in your README. The only thing I care about is that you keep my copyright notice in the source files, and that if you change the files in any way, you mention that in a comment in the source file. So just enjoy!

@Joan Your idea would work too, that’s true. However, when you say “much clearer”, I just want to say that it has to do with how familiar you are with Foundation and CoreFoundation. Some people might prefer to use CoreFoundation.

I don’t mind using CoreFoundation for some things, and this implementation is actually a little more efficient than what you’re suggesting. Cf. the Cocoa Fundamentals Guide:

“Because in iOS an application executes in a more memory-constrained environment, the use of autorelease pools is discouraged in methods or blocks of code (for example, loops) where an application creates many objects. Instead, you should explicitly release objects whenever possible.”

So I guess what I should tell you is this: if you feel more comfortable using Foundation level APIs and you don’t mind or can’t notice the slight performance hit, then you should definitely do it your way.

Joan Lluch March 20th, 2011

Joel, actually I feel very comfortable with CoreFoundation. My background is raw ‘C’ and I even programmed in assembler, so you can imagine what kind of things I am used to. I even have a strong preference (we could call it an obsession) for using Core Foundation collections instead of their Cocoa equivalents; specifically, I use NULL retain/release callbacks all the time on CFArrays and CFDictionaries.

Even when using cocoa I avoid explicit autorelease calls in my code. If I have to return a new object I always implement ‘create’ methods. I only leave implicit autoreleases when the object will be immediately retained anyway so the memory overhead is zero.

So my post was not really about what I would do but about what most developers might consider doing.

That said, I still believe that using Cocoa tends to be easier and more convenient for most developers. What the docs recommend about autorelease pools is precisely to avoid doing what the original code did, that is, actually *using* autorelease pools. By creating and draining an inner autorelease pool as per my suggestion, what we achieve is to release the objects right there, in fact avoiding the use of the global autorelease pool, which is what really has to be prevented.

At the end of the day we both are thinking alike and possibly using the same coding patterns, so that’s the important thing.

I use your trick with success. Great job.
But now I’ve a question for you: can I use your trick with a file on a remote site, that is, with a ‘filePath’ that looks like ‘http://…’?

Thanks,
Alex.

danny May 12th, 2011

thanks for the code, it’s very helpful!

Best Regards,

Daniel Oliva

danny May 12th, 2011

Hey Everyone!,

Here is the easiest way I found to get everything up and running:
1.) Download FileMD5Hash.c & FileMD5Hash.h from the linked GitHub repository.
2.) Xcode -> New Project -> Foundation Tool (Command Utility) -> Drag both the .c & .h files into the Source folder in Xcode.
3.) Follow Andrey's advice regarding the modification of FileMD5Hash.h.

*If an EXC_BAD_ACCESS error occurs, it's probably because you're trying to CFRelease(md5hash) when md5hash is nil; md5hash would be nil because CFReadStreamOpen probably failed…

Okay, lastly, thanks! And sorry for blowing up your forum with an error message!

Regards,

Daniel

vinnyt August 31st, 2011

On line 49, is there any reason you are declaring the array inside the for loop? I would hope the compiler is smart enough to allocate that array once and keep it around. It just scares me a bit that the C compiler might be dumb enough to reallocate it at every iteration of the loop, and even though it is a small chunk, when you run that loop thousands of times it is going to be troublesome.

jacksonadams December 13th, 2011

thank you very, very much! lifesaver

jacksonadams December 13th, 2011

THANK YOU!!! lifesaver

Matt March 15th, 2013

Hey Joel, this is fantastic! One issue — I can’t seem to compile your sample app for Mac 64-bit. Any plans to get that working? Thanks so much!