I should be clear: the data structures are exactly as presented. @array is a simple array, while %hash is a hash of hashes of arrays. This HoHoA is really a DBM::Deep object, though, since I had expected that to be more efficient in its memory usage.

I should also point out: this code works properly with smaller datasets, but not with my full dataset, which includes millions of data-points in the HoHoA and tens of thousands in the arrays.

In my simple-minded view this indicates a memory-leak in the foreach loop. Am I missing something? In general, should I be doing something different rather than a foreach? Or am I vastly confuzzled and missing something obvious to ye great and mighty ones?

The scenario implies that the HoHoA is already loaded at the point that you go into this three-level loop over the data structure, and you obviously have not run out of memory at that point. So there must be something about how you are trying to access the data structure that is causing perl to consume a lot more memory than was needed to store the original data structure.

You say the structure is "not sparse", but if loading the structure happens to take up, say, 80% of available memory, then you might end up going over the limit if you auto-vivify a relatively small percentage of previously non-existent cells in the overall structure.

If the OP code actually works on a given (smaller) data set, it might turn out that this different form of the nested looping could produce the same output in less time (but I haven't tested that).

The point of this approach is that you can easily check whether a given hash element exists before trying to use its contents as an array ref, and skip it if it does not exist; this involves extra steps within the loops, but these could end up saving a lot of execution time overall. And if they save enough unnecessary memory consumption as well (making the difference between crashing and finishing), speed may be a lesser concern.
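Something along these lines is what I mean -- a sketch only, with the variable names and the exact shape of the structure assumed from the description in the OP rather than taken from the real code:

    use strict;
    use warnings;

    my @array = ( 'a' .. 'e' );                     # stand-in key list
    my %hash  = ( a => { b => [ 1, 2, 3 ] } );      # stand-in HoHoA

    for my $i ( @array ) {
        next unless exists $hash{$i};               # nothing stored under $i
        for my $j ( @array ) {
            next unless exists $hash{$i}{$j};       # skip without autovivifying
            for my $val ( @{ $hash{$i}{$j} } ) {
                # do great deeds
            }
        }
    }

The exists check at each level is what keeps the loop from creating the very cells it is asking about.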

update: I think the code as suggested above would be "good enough", if you're confident about how the data structure was being created, but I would be tempted to be more careful:
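Something in this direction, say (again just a sketch with assumed names): don't merely check that the element exists, also confirm it really holds an array reference before dereferencing it.

    use strict;
    use warnings;
    use Scalar::Util qw(reftype);

    my @array = ( 'a' .. 'e' );
    my %hash  = ( a => { b => [ 1, 2, 3 ] } );

    for my $i ( @array ) {
        next unless exists $hash{$i};
        for my $j ( @array ) {
            next unless exists $hash{$i}{$j};
            # reftype() reports the underlying type even for blessed refs,
            # which matters if the structure is really a tied DBM::Deep
            # object rather than plain hashes and arrays.
            next unless ( reftype( $hash{$i}{$j} ) || '' ) eq 'ARRAY';
            for my $val ( @{ $hash{$i}{$j} } ) {
                # do great deeds
            }
        }
    }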

I think I should have been more specific -- when I said "not sparse", I actually meant that the data structure had been fully created in advance with a loop exactly like this. I've added in the checks, and before the program crashes (about 3 hours of CPU) it does not ever autovivify.

One thing that goes on inside the #do great deeds is to change values in the Array part of the HoHoA, and I had expected that to be the cause of my memory leak. I'm really confuzzled as to why this:

would eat up memory. I would have guessed it would use one scalar's worth (perhaps 15 bytes of data), and that this memory would get reused over & over again. Instead, the script runs on this loop for a few hours before using up all the memory.
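For reference, the kind of statement I mean (a made-up stand-in with dummy names, not the actual line) is nothing more exotic than an in-place assignment at the bottom of the structure:

    use strict;
    use warnings;

    my %hash = ( a => { b => [ 0, 0, 0 ] } );   # stand-in HoHoA
    my ( $i, $j, $k ) = ( 'a', 'b', 1 );

    # one existing scalar is overwritten; no new keys or elements are created
    $hash{$i}{$j}[$k] = 42;

(With DBM::Deep in the picture, that write also goes through the tie to the file on disk.)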

and... wouldn't ya know it, there *were* elements being autovivified. I've fixed that problem, and now the code appears to be running correctly (appears = it's been running for a day without crashing yet -- it still has another couple of days to run, though).

Many thanks -- this has led me to one key problem in the code. Any monks who haven't done so yet, please ++ the parent.

How much memory do you have? Your data structures are immense. It looks like you have perhaps 100,000,000 arrays in the hash, and even if most are not there, you are autovivifying their keys (scalar(@array) different anonymous hashes) and a reference to an empty array.

All those structures cost fifty-some bytes each plus a hefty multiple of the number of data elements.
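To make the autovivification point concrete: even a pure read of a deep element quietly creates the intermediate levels. A small standalone demo:

    use strict;
    use warnings;

    my %hash;

    # An rvalue lookup of a deep element autovivifies the intermediate
    # level: after this line $hash{1} holds a brand-new anonymous hash,
    # even though we only "read" and the leaf itself was never created.
    my $ref = $hash{1}{2};

    print exists $hash{1}    ? "intermediate autovivified\n" : "untouched\n";
    print exists $hash{1}{2} ? "leaf created\n"              : "leaf still absent\n";

Multiply those empty intermediate structures across a 10,000 x 10,000 sweep and they add up.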

I don't see a leak, just prolific use of memory. Can you read and use less of the data at a time?

The machine itself has 32 GB, but I'm using a 32-bit compiled Perl so presumably it can only address 4 of those.

Also, 100 million arrays is a realistic estimate (it's not sparse). I had thought that by using DBM::Deep I would be storing the majority of this data on disk, so that the actual amount stored in memory would be much smaller than 100 million x size(data-structure). Am I misguided on that?
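The basic pattern I have in mind with DBM::Deep is just the tied-to-a-file usage from its documentation -- a generic sketch here, with a made-up file name and keys, not my actual setup:

    use strict;
    use warnings;
    use DBM::Deep;

    # The whole nested structure lives in the file; elements are read and
    # written through the tie as they are touched.
    my $db = DBM::Deep->new( "hohoa.db" );

    $db->{a}{b} = [ 1, 2, 3 ];       # stored on disk, not (only) in RAM
    my $third = $db->{a}{b}[2];      # fetched back through the tie

    print "$third\n";                # prints 3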

That isn't possible, because Perl doesn't use a separate garbage collector. It simply reference-counts all its data (scalars, arrays, hashes etc) and frees an item when its reference count drops to zero.

Actually this originally was in a database, but the performance.... Benchmarking suggested it would take 5 months to complete, and since this was to be an annual task we're aiming to get it down to two or three weeks. Since DB access was the second biggest cost, it appeared to be an obvious place to tune, so this is an attempt to *replace* the DB look-ups and make a space/time trade-off.

I can't say I've ever even tried to put that much data into memory at once, but I have dealt with some large collections. You might try chunking up the original data and storing it in temporary tables you can iterate through so you have less to work with at one time. If your database provides something like LOAD DATA INFILE, you can make use of it to dramatically reduce transaction times when it's time to put all the data back.
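For example, with MySQL and DBI a chunk can be bulk-loaded in a single statement along these lines (connection details, file and table names here are placeholders):

    use strict;
    use warnings;
    use DBI;

    # LOAD DATA LOCAL INFILE is MySQL-specific and must be enabled on both
    # the client (mysql_local_infile) and the server.
    my $dbh = DBI->connect(
        'dbi:mysql:database=mydb;mysql_local_infile=1',
        'user', 'password',
        { RaiseError => 1 },
    );

    # One statement loads a whole pre-computed chunk from a flat file,
    # instead of millions of individual INSERTs.
    $dbh->do(q{
        LOAD DATA LOCAL INFILE '/tmp/chunk_0001.tsv'
        INTO TABLE results
        FIELDS TERMINATED BY '\t'
        LINES TERMINATED BY '\n'
    });

    $dbh->disconnect;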

That all said, if it were mine to do, the single word that would keep me awake at night is "cluster".

Saying the code crashes on large datasets doesn't tell us much. Without some kind of error message I can only speculate at the problem.

With that said, I don't see an obvious error in your loop. This would lead me to believe that the problem is somewhere in the "# do great deeds" section, but without any more information it is hard to say.

If you really need to access 10,000 x 10,000 arrays, each containing gosh-only-knows-how-many elements, perhaps you want to rethink your algorithm, and maybe implement it in C, at least the core components.

As much as I love Perl, there are places where it is not the appropriate solution.

No, the C-style for ( initialisation; termination; iteration ) given in the OP does not duplicate the dataset, only the perl-style for ( list ) and the c-shell-style foreach ( list ) (which did cause the crash in the OP) do that.

The only time it makes sense to convert the c-style for to while is where there is no initialisation or termination, e.g.: for (;$i<$j;) {} seems a likely candidate for changing to a while.

Although arguably more perlish than for, foreach has the downside of duplicating the entire set in memory before iterating through it. Converting the foreach to a for ( k=0; ...; ) construction would avoid that.
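The conversion being described, as a minimal sketch with an assumed key list:

    use strict;
    use warnings;

    my @array = ( 'a' .. 'j' );     # stand-in for the real key list

    # foreach-style: hand the loop the whole list up front
    foreach my $i ( @array ) {
        # ...
    }

    # C-style for: keep only an index and fetch one element per iteration
    for ( my $k = 0; $k < @array; $k++ ) {
        my $i = $array[$k];
        # ...
    }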