donkeykong has asked for the
wisdom of the Perl Monks concerning the following question:

Hi guys,
how would you mathematically prove that a hash is a more efficient and scalable data structure than an array, in a situation where you have to read a file, find all its unique keys and show the # of occurrences each unique key had in the file?
I came to the conclusion that the hash in big-oh notation is O(1) and the array is O(n), but again, how can I mathematically prove it? I appreciate your help.

Most hash insertions are O(1). When the hash has to be expanded, the insertion becomes O(N). Since Perl grows hashes exponentially, inserting N elements is O(N log N). ( See tilly's reply. Inserting N elements into a hash insertions is O(N). )

Most array appends are O(1). When the array has to be expanded, the append becomes O(N). Since Perl grows arrays exponentially, inserting N elements (one at a time) is O(N log N). ( See tilly's reply. Inserting N elements into a hash insertions is O(N). ) Inserting all N elements at once is just O(N), though. Once the array is loaded, you need to sort it, for a total of O(N log N).

So you end up with O(N log N) either way. The question that remains is what's factor, and how close to the bound (N log N) is the actual performance for your actual data.

The percentage of duplicates is going to matter. The more duplicates, the better the hash approach will be.

No, no, no. Perl's strategy for growing data structures makes the amortized average insert time O(1). So growing a data structure of size n is O(n), not O(n log(n)).

For example consider a hash in the worst case scenario, where you have just done a hash split. Then every element has been moved once. Half of them were around for the previous hash split, so have been around twice. Half of them were around the previous time, and so have been around 3 times. And so on. So the total amount of moving is

With arrays the logic is similar except that each time we only grow an extra 1/5, and the overhead turns out to be 6n.

The result is that building either the hash or the array is O(n). But sorting the array is O(n log(n)). Therefore the hash scales better.

But (very big but) note that log(n) grows very slowly. If the data structures are big enough that they live on disk, then accessing your hash involves a disk seek, while merge sort involves accessing data on disk sequentially. Disk seeks are painfully slow, while accessing data on disk sequentially is orders of magnitude more efficient. That means that for data that lives on disk, the array approach is faster. Given the necessary numbers, generally by a couple of orders of magnitude.

But (very small but) hashes still scale better than arrays. So for very big data structures, hashing will again be faster than the array approach. However you are extremely unlikely to ever encounter a data set that large, if you did it probably wouldn't fit on your hard drive, and processing it sequentially by either approach would be too slow anyways. So you will want to parallelize your data processing. And when you do that, you will find that merge sort parallelizes better than hash lookups so arrays still win.

Thereby demonstrating that the theoretical big-O is not always the end of the story.

A final random note. Contrary to your final comment, it is the array that benefits more from duplicates. That is because if you're slightly clever in your merge sort, then eliminating lots of duplicates will reduce the size of your large passes, speeding up the sort. Hashing, by contrast, goes at the same exact speed but will take up less memory. (Until you seek to disk...)

For example consider a hash in the worst case scenario, where you have just done a hash split. Then every element has been moved once. Half of them were around for the previous hash split, so have been around twice.

That assumes a hash split only after Ω(n) new inserts. But then in the worst case scenario where all elements hash to the same bucket, searching for a key has to walk a list, making the search linear instead of O(1).

I know hashes have changed in the past so that the linked list hanging of each bucket cannot grow unbounded. But that means that worst case, you either have hash splits in o(n) (lowercase o), or search times that aren't O(1).

Any "mathematical proof" will have to look at the C code and consider how often hash splits occur worst case, and how long the linked lists can be.

While there are other data structures available to him (such as disk-based structures and tries), I chose to only speak about the ones he mentioned (hashes and arrays) due to time constraints.

Contrary to your final comment, it is the array that benefits more from duplicates. That is because if you're slightly clever in your merge sort, then eliminating lots of duplicates will reduce the size of your large passes, speeding up the sort.

You'll reduce the size of large passes you never have to do with a hash.

You'll speed up the sort you don't have to do with a hash.

With a hash, you never have to deal with more than $num_duplicates items. With an array, you'll deal with at least $num_duplicates items. I don't understand how come you say the array benefits more.

reading the file is ~ n work and inserting the keys into a hash is ~ 2m work (where m is the number of unique keys, taking into account the rearrangements required as the hash grows -- assuming no pathology).

reading the file into an array is ~ 2n work (one to read and one to insert), but to sort it is ~ n log n (assuming no pathology), then you need to scan the array to spot the duplicates, which is a further ~ n work.

So hash: ~ n + 2m; array: ~ 2n + n log n.

Not all work is the same, however. Scanning the array certainly requires ~ n comparisions, but this is in a Perl loop -- which may dominate the running time. Further, this does not account for the cost of reorganising the array to remove duplicates, if that's what you want to do...

However, that may not be the complete story. What do you intend to do with the information ?

Suppose you want to print out the unique keys in key order:

the keys need to be extracted from the hash ~ m work, and sorted ~ m log m, and then printed ~ m.

the array can simply be printed for ~ m work -- most efficiently if that is combined with the scan for duplicates step.

so the totals now are hash: ~ n + 4m + m log m; and the array ~ 3n + n log n + m. So, where there are a lot of duplicates the hash approach is doing less work; where there are few duplicates there's apparently not a lot in it.

Suppose you want to print out the unique keys in original order (duplicates sorted by first instance):

with the hash you keep an auxiliary array with the keys in original order, so we have ~ n to read, ~ 2m to add to hash, ~ m to add to auxiliary array, ~ m to print, total: ~ n + 4m -- not a log n or log m in sight, so the total work decreases !

with an array... requires at least an auxiliary array containing the original array indexes; then the array indexes are sorted according to the key values; then the array indexes are scanned to count the number of times each key value appears; .... from a work perspective this is not apparently a big increase, but the extra code to work with the index array will affect the running time.

in this case the hash is a clear winner, not only is there no n log n component, but the work in the array sorting and scanning has clearly increased.

So what's the point of all this ? Well:

big-O-wise filling the hash is O(n) and filling the array and sorting it is O(n log n). But, for the whole process (assuming key order output) the hash is O(n + m log m) and the array is O(n + n log n) -- big-O is broad-brush, but you do need to consider the entire problem.

hashes carry quite a hefty memory overhead (and for key order output the entire hash is overhead)... so a hash implementation will start to page earlier than an array one -- big-O is all very well, but it's not the whole story.

while the array implementation appears little different (unless m is a lot smaller than n), it does involve the scan step, which is a Perl loop -- so we might have, roughly:

If you're trying to prove this for every file, I think you're doomed. In the special case where the file's keys are all small integers, the array will work faster than the hash. Operations on the array are generally faster, and its storage will be more compact to boot.

It's a rare special case, to be sure. Still, my point is that the input to some algorithm can have a significant effect on how efficient it is.

Which is faster? It depends on how much of @big_array gets eliminated by the filter(). If most of the array is filtered out, it's faster to do it with grep. If little of the array is filtered out, it's faster to skip the extra loop (grep).

All that having been said, I'd use a hash for the task that you describe. I just wouldn't try to prove that it will always be the fastest way to do it.