crenz has asked for the
wisdom of the Perl Monks concerning the following question:

We have a system here that uses very large data structures. For some projects, the size of e.g. a certain hash can reach up to 100_000 elements, and the script becomes notably slower when it reaches this size. Is there any way to speed this up? I was thinking of using tie with a suitable Tie:: module that is faster than perl's default implementation. I guess it should be possible to be much faster if you limit yourself to storing literal values only, not references. Any recommendations?

At roughly 1-2K per element, 100_000 elements works out to 100M-200M of RAM. Unless each element is some huge data structure in its own right (like an array of hashes or somesuch), you're probably not doing a lot of swapping to disk or anything like that.

foreach has to build the full list in memory, then iterate over it; each brings in only one key/value pair at a time.
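A quick sketch of that difference, using a made-up hash %h (any large hash will do). keys materializes the complete key list before the first pass of the loop; each fetches a single key/value pair per iteration, so no big temporary list is ever built:

```perl
use strict;
use warnings;

my %h = map { $_ => $_ * 2 } 1 .. 100_000;

# foreach: keys() builds the full 100_000-element list up front
my $sum_foreach = 0;
foreach my $k ( keys %h ) {
    $sum_foreach += $h{$k};
}

# while/each: one key/value pair is fetched per iteration
my $sum_each = 0;
while ( my ( $k, $v ) = each %h ) {
    $sum_each += $v;
}

print "$sum_foreach $sum_each\n";
```

Both loops compute the same total; the difference is only in peak memory for the key list.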

My bet is on your algorithms, not your data structures. Maybe if you posted a few snippets of how you use this massive hash, we might be able to help you out.

Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

And if it IS, then why should it be? There are specialized magic iterators when the parser recognizes an impending iteration over constants, like for (1 .. 1_000_000). Why does perl not implement an automatic iterator when the parser notices a simple sort-free for (keys %foo)? That's such a common idiom that I would be amazed if it weren't getting special attention.

Anonymonk beat me to the punch. But, the reason for why foreach (keys) and while (each) behave differently has nothing to do with keys being an iterator or not. (Well, it does, but not really.) It has to do with the difference in behavior between foreach and while. foreach is defined to operate on a list. If you give it a list, then you're good. If, however, you give it a function or keyword, then it has to call that function/keyword and construct a temporary list with the return value(s). Incidentally, this is why the following doesn't DWIM:
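The snippet that followed appears to have been lost in this copy. The classic non-DWIM case being described is most likely foreach over each; here is a reconstruction of that idea, with a made-up hash %h:

```perl
use strict;
use warnings;

my %h = ( a => 1, b => 2, c => 3 );

# Looks like it should walk the whole hash, but it doesn't:
# each() is called ONCE to build foreach's list, returns a single
# (key, value) pair, and foreach iterates over just those two scalars.
my @seen;
foreach my $x ( each %h ) {
    push @seen, $x;
}

print scalar(@seen), "\n";    # one key and its value, not all six items
```

foreach evaluates each %h a single time to construct its list, so the loop body runs twice (once for the key, once for the value) and the rest of the hash is never visited.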


What do you do with the data? Do you construct a new structure often? Do you need to iterate through the whole structure or do you typically only need to access a few elements? Is this in a persistent application or in a (CGI) script that needs to start for each request?

100_000 elements is not that much (it takes about 11 MB on my machine if filled with simple integers), so what do you put in the keys and values?

If you suddenly see a notable drop in performance, you have most likely hit the threshold where the machine starts swapping.
I'm a bit surprised that you hit the limit so soon, though; what are you storing in your hash?

That's less than 8 MB, which shouldn't be much of a problem on a modern machine.

But fact is, Perl is memory hungry. The more structures you have, the more memory you use. The more complex the structures are, the more memory you use.

Speeding it up is only possible by using less memory at a time. Using tie as you propose is not going to solve it. If there were a known way of speeding up hash access, it would already be in the core! Not to mention that tying is a slow mechanism: for each hash access, no matter how trivial, a Perl subroutine gets called. It is possible to use the tie mechanism to store the hash on disk instead of in memory, but unless you would otherwise run out of memory, that's not going to change your performance for the better. Regular disk access is not likely to be faster than accessing your swap area.
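For completeness, a minimal sketch of such a disk-backed tie. This uses SDBM_File, which ships with core Perl (DB_File or BerkeleyDB from CPAN are common, more capable alternatives); the file name and data here are invented for illustration:

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use File::Temp qw(tempdir);

my $dir = tempdir( CLEANUP => 1 );

# Every read and write on %on_disk now goes through the tie layer
# (a Perl method call) and hits the dbm files on disk.
tie my %on_disk, 'SDBM_File', "$dir/demo", O_RDWR | O_CREAT, 0666
    or die "tie failed: $!";

$on_disk{"key$_"} = $_ for 1 .. 1000;
print $on_disk{key500}, "\n";

untie %on_disk;
```

This trades speed for memory: the hash no longer lives in RAM, but each access pays both the tie overhead and the disk I/O, which is exactly the caveat above.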

Well, sure, anything else in his program could be made faster too. But since we don't know the rest of his program, and his question isn't about that, it's nothing more than pure speculation, and rather pointless.

There are caveats with each method, inasmuch as they are not general-purpose hash replacements--Perl's hashes are about as good as you will get if you need their full power. But for any individual application, there are often more compact representations that can be utilised.

The trick is to work out what subset of the properties of a hash you require, and look for a solution that meets those requirements with a reduced memory footprint. Obviously that will come at the sacrifice of other properties that you don't need for the given application.
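As one illustration of that trade-off (a sketch of the general idea, not a drop-in replacement): if the keys happen to be small integers and the values fit in a fixed number of bits, a packed string indexed with vec() stores one byte per entry instead of a full hash entry, at the cost of arbitrary keys and values:

```perl
use strict;
use warnings;

# 100_000 "entries" in a single packed string: entry $i lives in
# byte $i. Values must fit in 8 bits; keys must be small integers.
my $store = '';
vec( $store, $_, 8 ) = $_ % 251 for 0 .. 99_999;

print vec( $store, 12_345, 8 ), "\n";
```

The whole structure is one 100_000-byte scalar, versus tens of bytes of overhead per element for a real hash; what you give up is exactly the flexibility you may or may not need.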

To be able to suggest an appropriate solution requires that you describe the use you are making of your large hash. Any solution that might result would probably not be a "module".

Thanks all for your comments, they have been very helpful. My colleague is sure that the slowness of the script is due to using (big) hashes of hashes of hashes, and that it would be much faster to reimplement the system in C. I'm not so sure that would be worth the effort, and would like to do some profiling first to find out what's going on. Your comments are good ammunition for that :-)

I think you're the smarter of the two of you. Rewrites, in my experience, should be a last resort. There is so much history attached to legacy products that it is nearly impossible to incorporate the years of bug fixes and gotchas that previous employees have packed in.

I'm not saying it's never warranted, but it certainly shouldn't be one of the first options, IMO.

If your problem is lookup time, consider making your hash multidimensional so perl has to look at fewer entries. This is akin to moving /home/joe to /home/j/joe on a system with thousands of users -- daemons only have to search 1/26th of the filespace to find their target file.
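A sketch of that idea with a two-level hash, splitting on the first character of the key (the %users name and sample data are invented). Bear in mind that Perl hash lookups are already amortized O(1), so it's worth measuring before assuming this helps:

```perl
use strict;
use warnings;

# Bucket users under the first letter of their name,
# akin to /home/joe -> /home/j/joe.
my %users;
for my $name (qw(joe jane alice anna bob)) {
    $users{ substr( $name, 0, 1 ) }{$name} = 1;
}

# A lookup now only consults the sub-hash for one letter:
print exists $users{j}{joe} ? "found\n" : "missing\n";
```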
