All your observations are totally correct, and I would suggest to everyone to re-read your reply before deciding on architecture for system like this one.

I had different end-goal: make map over perl structure scale across commodity low-end desktop machines. To be honest I was hoping to be able to do it on single machine, in-memory but it didn't work out for me.

Dataset isn't huge: 540 Mb with 121887 records with about 4.5K per record.

And I wanted to run this query as fast as possible on all my data (which affected my decision not to use disk for processing), so simpliest possible solution I could come up with is to load it all into perl hash.

This worked well for most fields, until I discovered that I have 4.5 milion entries for CR running following code:

Perl hash structures are nice, and code is clean and concise, but memory usage for results pushed me to swap (confirming that my anti-disk bias is correct). Having said that, I have used swap very effectively as first line of on-disk offload before, but with a result set which has random access pattern it doesn't mix well (disks are tipically SATA drives, so fast in linear reads, but slow for almost anything else.

To work around this problem, I implemented conversion from long fields names in CR (err... longer than 4 bytes for int ignoring perl decorations) into simple integer which enabled me to put 4.5 million CR records onto single machine.

I decided to use MD5 hash for each key, keep mapping of md5 to integer in-memory and replace (memory hungry) key with integer while preserving value.
This way, I pushed full key names to disk in form of int -> full_name mapping.

On disk storage had two iterations, in first one I used BerkeleyDB. It saved so much memory that whole database file could fit in /dev/shm :-)

But, storing fields on disk did come with huge performance penalty to my query time. It took longer than 3 seconds to complete query and I wasn't prepared to admit defeat. To speed it up, it was logical to shard it across machines. Even better, I could control worst possible query time by changing size of shard for each node to adjust for different systems.

To implement communication between nodes, I first implemented fancy protocol, but then decided to just ship Storable object directly to socket. Well, not quite directly, since I'm using ssh with compression to speed up network transfer. Since it's mostly bulk (send data to node or receive results) it improves performance on my 100Mb/s network for about 30%.

Whole idea is to have fast throw-away calculations on data which comes from semi-formatted text files (e.g. Apache logs) so your note about limited lifespan is so very right. This was design decision. With this model, I can start on single machine until I fill up memory (or query becomes too slow) and then spread it across other machines until I have whole dataset available or scale out for query speed.