Sorting enormous files using a C# external merge sort

This post is kind of a follow-on from yesterday's fast but memory-intensive file reconciliation post. On StackOverflow it was suggested to me that when reconciling large files, it'd be more memory efficient to sort the files first, and then reconcile them line by line, rather than storing the entirety of the files in memory. Which brings up another challenge: how to sort a file that's bigger than RAM. For this example, I'll be using a 1 gig file with random 100-character records on each line, and attempting to sort it all using less than 50 MB of RAM. The technique I'm using is pretty close to the external merge sort. Here's the overview: split the file into smaller chunks, quicksort each chunk, then merge all the sorted chunks.

Splitting the chunks is pretty straightforward - and unoptimised! I simply read the input file line by line and write it out to a 'split' file, until I've reached 10 MB, and then open the next 'split' file:
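Something along these lines - a minimal sketch of the split step, where the chunk file naming and the rough 10 MB size check are my assumptions rather than the exact original code:

```csharp
// Sketch of the split step: read the input line by line and write out
// numbered ~10 MB chunk files. File names and the size check are assumptions.
using System;
using System.IO;

class Splitter
{
    public static void Split(string inputFile, long maxChunkBytes = 10 * 1024 * 1024)
    {
        int chunkNumber = 0;
        long bytesWritten = 0;
        StreamWriter writer = new StreamWriter(string.Format("split{0:d4}.dat", chunkNumber));

        foreach (string line in File.ReadLines(inputFile))
        {
            // Roll over to the next split file once this one reaches ~10 MB.
            if (bytesWritten >= maxChunkBytes)
            {
                writer.Close();
                chunkNumber++;
                bytesWritten = 0;
                writer = new StreamWriter(string.Format("split{0:d4}.dat", chunkNumber));
            }
            writer.WriteLine(line);
            bytesWritten += line.Length + Environment.NewLine.Length;
        }
        writer.Close();
    }
}
```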

Sorting the chunks is also pretty straightforward, as I use C#'s Array.Sort, because it does the job just fine, thank you very much. C# purists won't like my use of GC.Collect here, but it was mainly just to keep things under the 50 MB limit. In production code you'd probably leave it out and let the runtime figure out when to free the memory itself. Anyway:
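Here's a rough sketch of that sort step - it assumes the split files are named as in the sketch above, and that each sorted chunk is written out with a '.sorted' extension, which is my own naming choice:

```csharp
// Sketch of the sort step: load each chunk, Array.Sort it, write it back out.
using System;
using System.IO;

class Sorter
{
    public static void SortChunks()
    {
        foreach (string path in Directory.GetFiles(".", "split*.dat"))
        {
            string[] lines = File.ReadAllLines(path);
            Array.Sort(lines);
            File.WriteAllLines(Path.ChangeExtension(path, ".sorted"), lines);

            // Encourage the runtime to release the chunk's memory promptly,
            // purely to stay under the self-imposed 50 MB cap.
            lines = null;
            GC.Collect();
        }
    }
}
```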

The merge is the only really tricky bit of code. Basically, it opens a FIFO queue for each of the sorted chunks simultaneously, then repeatedly outputs the lowest record across all the queues until every queue is empty. You can get a better explanation than mine on Wikipedia: merge algorithm. Anyway, here's my code, which actually runs surprisingly quickly:
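A simplified sketch of the k-way merge follows. The original buffers each chunk through an in-memory FIFO queue; this version just keeps one reader and one current line per chunk and reads a line at a time, which keeps the idea visible at the cost of some speed:

```csharp
// Sketch of the merge step: one reader per sorted chunk, repeatedly write
// out whichever chunk's current line sorts lowest, until all are exhausted.
using System;
using System.IO;
using System.Linq;

class Merger
{
    public static void Merge(string[] sortedChunkPaths, string outputFile)
    {
        var readers = sortedChunkPaths.Select(p => new StreamReader(p)).ToArray();
        var current = readers.Select(r => r.ReadLine()).ToArray();

        using (var output = new StreamWriter(outputFile))
        {
            while (true)
            {
                // Find the chunk whose current line sorts lowest.
                int lowest = -1;
                for (int i = 0; i < current.Length; i++)
                {
                    if (current[i] == null) continue; // This chunk is exhausted.
                    if (lowest < 0 || string.CompareOrdinal(current[i], current[lowest]) < 0)
                        lowest = i;
                }
                if (lowest < 0) break; // All chunks are exhausted.

                output.WriteLine(current[lowest]);
                current[lowest] = readers[lowest].ReadLine();
            }
        }

        foreach (var reader in readers) reader.Close();
    }
}
```

With only a handful of chunks, the linear scan for the lowest line is fine; if you had hundreds of chunks, a priority queue would be the better way to pick the next record.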

One thing I noticed was that the splitting operation was slower than the sort and the merge. It should really be faster than them, since it is simpler and involves sequential rather than random IO. I believe it wouldn't be hard to optimise it to take less than 1 minute, which would bring the whole process down to 3 and a half minutes. That's pretty awesome for sorting a 1-gig file of 10 million records - roughly equivalent to sorting the Australian phone book! Another thing: I limited myself to 50 MB mainly for the challenge of it. I checked to see how much faster it ran with more RAM - 10x bigger chunks and merge buffers - and was surprised to find that it was slower overall. And here's the project if you want it. You'll need to write your own 1-gig sample file; I'm not uploading that! github.com/chrishulbert/ExternalMergeSort

Thanks for reading! If you found this article helpful, do me a favour and check out my upcoming web-app: ScrumFox agile team management. And if you want to get in touch, I'd love to hear from you: chris.hulbert at gmail.
