Until a few weeks ago, I had never needed to sort a file that was huge, as in too big to fit in memory. The basic algorithm for an external merge sort is easy enough, but it did take some thought and a web search didn't turn up much that was useful, so I decided it was worth posting even though it turns out to be rather simple.

Here's the basic algorithm for an external sort in English. (I wrote it in Java and can provide that on request, but I'm posting it in plain English to keep it generally useful.)

Until finished reading the large file:
    Read a large chunk of the file into memory (large enough that you get a lot of records, but small enough that it fits comfortably in memory).
    Sort those records in memory.
    Write them to a (new) file.

Open each of the files you created above.

Read the top record from each file.

Until no record exists in any of the files (that is, until you have read the entirety of every file):
    Write the smallest record to the sorted file.
    Read the next record from the file that had the smallest record.
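
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the Java implementation mentioned earlier, and it makes a few assumptions that are not in the post: each record is one line of text, records are compared as plain strings, and the number of chunk files stays below the operating system's open-file limit. The function names and the records_per_chunk parameter are made up for this example, and keeping the current top records in a heap is just one convenient way to find the smallest one.

```python
import heapq
import os
import tempfile
from itertools import islice


def write_sorted_runs(input_path, records_per_chunk, tmp_dir):
    """Phase 1: split the large file into sorted chunk files."""
    run_paths = []
    with open(input_path, "r", encoding="utf-8") as source:
        while True:
            # Read a chunk that fits comfortably in memory.
            chunk = list(islice(source, records_per_chunk))
            if not chunk:
                break
            # Assumption: one record per line. Make sure the file's last
            # record also ends with a newline so records never run together.
            chunk = [r if r.endswith("\n") else r + "\n" for r in chunk]
            chunk.sort()
            # Write the sorted chunk to a new file.
            fd, path = tempfile.mkstemp(suffix=".run", dir=tmp_dir, text=True)
            with os.fdopen(fd, "w", encoding="utf-8") as run:
                run.writelines(chunk)
            run_paths.append(path)
    return run_paths


def merge_runs(run_paths, output_path):
    """Phase 2: merge the sorted chunk files into one sorted file."""
    runs = [open(path, "r", encoding="utf-8") for path in run_paths]
    try:
        # Read the top record from each file (once, before the loop).
        heap = []
        for index, run in enumerate(runs):
            record = run.readline()
            if record:
                heap.append((record, index))
        heapq.heapify(heap)

        with open(output_path, "w", encoding="utf-8") as out:
            # Loop until no record is left in any of the files.
            while heap:
                # Write the smallest record to the sorted file.
                record, index = heapq.heappop(heap)
                out.write(record)
                # Read the next record from the file it came from.
                following = runs[index].readline()
                if following:
                    heapq.heappush(heap, (following, index))
    finally:
        for run in runs:
            run.close()


def external_sort(input_path, output_path, records_per_chunk=100_000):
    """Sort a line-oriented text file that may not fit in memory."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        run_paths = write_sorted_runs(input_path, records_per_chunk, tmp_dir)
        merge_runs(run_paths, output_path)
```

With placeholder file names, a call would look like external_sort("huge.txt", "huge.sorted.txt", records_per_chunk=500_000). The chunk size is the knob to tune: bigger chunks mean fewer intermediate files to merge, as long as each chunk still fits in memory.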

Does that make sense? I kept it very high level, but I'm happy to answer any questions about the smaller details.

Update: I noticed a slight bug in the algorithm. The line "Read the top record from each file" was inside the last loop, but it should have been outside of it. The post has been changed to reflect the correct way to do it.


I have just built some abstract structures called big queue and big array to simplify big data sorting and searching tasks on a single machine with limited memory. Basically, the algorithm used is similar to yours: an external merge sort.

I can sort 128 GB of data in 9 hours on a single machine, and then binary search the sorted data in almost no time.
