Lady_Aleena has asked for the
wisdom of the Perl Monks concerning the following question:

I have a little script that is running my computer out of memory, and I was wondering if there might be a way around this problem. I have been thinking about it for the last few days and haven't come up with anything. Someone a few days ago mentioned something called forking. I can guess at what that is, but for some reason I get the feeling that forking isn't well liked, but I may be wrong there.

When I run that with 5 initial children and 20 generations, I run out of memory. With 5 and 19, I don't. I have a feeling that my $children = int(rand(6)); is what is killing my memory when the number of children in a generation is in the billions. By the way, as with all my scripts, this is just a bit of fun, so I am not in any great rush.

The history of mathematics and (more recently) computer science has been driven by an almost singular goal: finding ways to calculate things with less work. Your calculations are simple, but the work you're doing is hard, and significant. It would be better to look again at the model, and see if there's a mathematical solution (i.e., a better algorithm) to accomplish the same thing with less work.

Your current solution is O(c^n), i.e., exponential. I suspect that there is an O(1) solution to the question of population growth rates. What that means is your current solution requires an exponential amount of work per additional generation, while there probably exists a non-iterative solution instead.

I recall we talked in CB about this a little. I may have been the person who mentioned forking, but it was in jest. My point was "If you think this solution crashes your computer, you ought to see what it does if the children are forks." (Which is akin to fork-bombing). That was intended as a musing, not as a suggestion. The concept behind a fork bomb is that each child spawns a new process, which in turn spawns a few new processes, and so on until the operating system and hardware simply can't accommodate them all. But that situation is actually similar to what's happening here, just without forks: you're spawning so many values, and holding onto so many of them, that the system cannot accommodate them all. And to what goal? To solve a problem that can be solved with a mathematical equation that doesn't require repetition.

I also briefly touched on the concept of Big O notation. You mentioned that is an unfamiliar topic. So we might as well discuss it a little here. Think of "Big O notation" as a measure of resources consumed. The resource may be computational work (time), complexity, or memory, for starters. Your code is fairly intensive of both time and memory. But we'll focus on how much work is being done.

I'm going to briefly discuss Big-O, and in terms that are oversimplified. Common Big-O notation is written in the format of O(x), where 'x' is a formula that describes how much work is being done in terms of 'n' (n being the number of items being worked on). O(n) means each additional item adds the same amount of work, no matter how many items there already are.

Other common orders include:

O(1) => No increased work as n grows.

O(log n) => work increases logarithmically as n grows. (ie, work increases, but at a rate slower than the growth of n.)

So if you look at your program, each new generation requires exponentially more work than the previous generation. This is a pretty 'bad situation' to be in from a programming standpoint. You have to start saying to yourself, can this be reduced to a statistical model that is characterized by a simple formula? If that's the case, your work drops to O(1).

Normally distributed random variables can be easily and efficiently implemented with the Box–Muller transform, which turns two uniformly distributed random numbers (what rand returns) into two normally distributed random numbers.

If this were a project of mine (and I needed an efficient approach), I'd follow the approach of individual random numbers per population item as long as the population is less than 20, and for higher numbers I'd use the normal distribution approximation.
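A minimal sketch of that hybrid approach, under a few assumptions: the names box_muller and next_generation are made up for illustration, the cutoff of 20 is the one suggested above, and each person's children are drawn with int(rand(6)) as in the question. For that draw the mean is 2.5 and the variance is 35/12, so the sum over n parents has mean 2.5*n and variance n*35/12.

```perl
use strict;
use warnings;

use constant PI => 4 * atan2(1, 1);

# Box-Muller transform: turn two uniform random numbers into two
# independent standard normal random numbers.
sub box_muller {
    my $u1 = 1 - rand();    # in (0, 1], so log() never sees 0
    my $u2 = rand();
    my $r  = sqrt(-2 * log($u1));
    return ($r * cos(2 * PI * $u2), $r * sin(2 * PI * $u2));
}

# Number of children born to $parents parents, each of whom has
# int(rand(6)) children (mean 2.5, variance 35/12 per parent).
sub next_generation {
    my ($parents) = @_;

    # Small populations: draw one random number per parent.
    if ($parents < 20) {
        my $children = 0;
        $children += int(rand(6)) for 1 .. $parents;
        return $children;
    }

    # Large populations: one normal draw approximates the whole sum.
    my ($z) = box_muller();
    my $estimate = 2.5 * $parents + $z * sqrt($parents * 35 / 12);
    $estimate = 0 if $estimate < 0;
    return int($estimate + 0.5);
}
```

With this, the cost of one generation no longer depends on how many people are in it, so 20 or 200 generations take about the same time per step.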

If there's interest, I can also show how the mean and variance of the random variable are calculated, but I'm tired right now and I'd do it tomorrow :-)

That is better. We might also change "a fixed amount of extra work gets added" to "a single unit of extra work gets added." Even that is probably more vague than just coming to an understanding of what logarithmic graphs look like as 'n' increases. Unfortunately, that's sort of difficult to post here. I do like trying to improve the description though. Wikipedia has a great writeup on Big-O notation. Why didn't we have that when I was in college? ;)

Billions of scalars would certainly blow out your memory, since you have only a few billion bytes of RAM and every scalar carries overhead.

If you're going to merely sum over all the children in @generation, why not simply do $generation{$outerLoopIndex +1} += int(rand(6)); directly, and avoid burning all that memory on very temporary values?
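A rough sketch of that idea, assuming the script only needs the size of each generation and not the individual children. The sub name simulate and its arguments are invented here, not taken from the original script.

```perl
use strict;
use warnings;

# Return a hash ref mapping generation number to generation size.
# Generation 0 holds the initial children.
sub simulate {
    my ($initial, $generations) = @_;
    my %generation = (0 => $initial);
    for my $g (0 .. $generations - 1) {
        $generation{$g + 1} = 0;
        # Add each person's children straight into the running total
        # instead of storing one value per person.
        $generation{$g + 1} += int(rand(6)) for 1 .. $generation{$g};
    }
    return \%generation;
}

my $sizes = simulate(5, 10);
printf "%2d: %d\n", $_, $sizes->{$_} for sort { $a <=> $b } keys %$sizes;
```

Only one number per generation is kept, so memory stays small. The running time still grows with the population, since there is one rand call per person.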

By the way, one thing I noticed and forgot to mention: you are making the assumption that each generation has up to 5 offspring. That means each couple must have up to 10, or each woman up to 10... or that both men and women can bear offspring of up to 5 each.

The assumption is that into each generation is born a certain number of children, and those children have a certain number of children, etc. Gender is not being taken into account, and the children's spouses are not being tallied here. And no, this is not a worm population growth chart. I am just trying to get an idea of how many generations it would take to get to about 250 billion people born into it, give or take a couple billion. I haven't taken interbreeding into account either, as after about a 5 generation gap some consider it safe to interbreed though some consider sooner than that safe. Doing that however would add yet even more complexity that I haven't even tried to think about adding.

Look, if you know the generational growth rate (i.e. for each person in generation N, there are 2.5 people in generation N+1, or whatever),
then just start with your 250 billion and work backwards.
(2.5 is the average of rand(6).)
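Working backwards can be sketched like this. It assumes the 250 billion is the size of a single generation, that the starting generation has 5 people as in the question, and that growth is exactly the average 2.5 per person. The sub name generations_needed is made up.

```perl
use strict;
use warnings;

# Count how many times $target must be divided by the growth rate
# before it drops to the starting size or below.
sub generations_needed {
    my ($initial, $target, $rate) = @_;
    my $generations = 0;
    while ($target > $initial) {
        $target /= $rate;
        $generations++;
    }
    return $generations;
}

print generations_needed(5, 250e9, 2.5), "\n";    # prints 27
```

The loop runs once per generation rather than once per person, so it finishes instantly no matter how large the target is.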

First, two off-topic optimizations: If you used an array for %generations you wouldn't need to sort the keys (just store generation x in array slot x). Also max(keys %generations)==$generations.

Also, you are summing up a lot of random values between 0 and 5 (distributed evenly). Basically, the average number of offspring per person you will get is 2.5, if you do this a lot. Change your last line to

to see this effect. It also shows how to calculate the next generation directly, if you don't mind getting a clean statistical average instead of your calculated numbers.
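The averaging effect can be checked with a short sketch. This is my own illustration, not the line referred to above; the sub name children_of and the figure of 100,000 parents are made up.

```perl
use strict;
use warnings;

# Simulate one generation and return the number of children born.
sub children_of {
    my ($parents) = @_;
    my $children = 0;
    $children += int(rand(6)) for 1 .. $parents;
    return $children;
}

my $parents  = 100_000;
my $children = children_of($parents);
printf "simulated: %d, average per parent: %.3f, predicted: %d\n",
    $children, $children / $parents, 2.5 * $parents;
```

The simulated average should land very close to 2.5, which is why multiplying the previous generation by 2.5 gives nearly the same answer as the full loop.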

By the way, if you want to try to not have an evenly distributed number of offspring (i.e. families with two kids should be more common than with five, at least in our century), you might define an array with a different distribution:
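One way such an array might look. The weights below are invented purely for illustration (they average about 2.27 children per person), and random_children is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical weights: a value that appears more often in the list
# is drawn more often. Here 2 children is the most common outcome.
my @children_per_person = (0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5);

# Pick one entry at random; rand @array yields a fractional index
# that perl truncates to an integer.
sub random_children {
    return $children_per_person[ rand @children_per_person ];
}

print join(' ', map { random_children() } 1 .. 20), "\n";
```

Replacing int(rand(6)) with random_children() in the loop is then the only change needed, and the list can be tuned to any distribution you like.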

Thank you for the idea of distribution. I toyed with the idea while writing it, but thought int(rand(6)) was easier. You are right though, it should be weighted towards fewer children per person of the previous generation. However, I found a situation where I have had to stop the loop. In the first few iterations of the main loop, a generation could end up with 0 children in it. At that point, going any further would be pointless. So while I could still push the generation with 0 children to an array instead of adding another key to the hash, max(keys %generations) != $generations. I am still working on the array versus hash idea; I still have to figure out everything I would have to change. I am just more comfortable dealing with hashes than arrays. Meanwhile, here is the updated script.

It can be made faster (by using a formula to calculate the number of children in each generation without the inner loop), but the above is simple, and most of all, won't run out of memory (assuming there are a few MB available to start perl).

Thanks for that tip, JavaFan. I used it, and you are so right. It is much faster and my memory doesn't run out! Unfortunately, I did come against another error "Range iterator outside integer range". I haven't quite got that figured just yet.
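One possible cause, assuming the inner loop iterates over a range such as 1 .. $count: perl raises "Range iterator outside integer range" when an endpoint of the .. operator is too large for its native integer type. A counting loop avoids the range operator altogether. The sub name count_children is made up for this sketch.

```perl
use strict;
use warnings;

# Same sum as "$children += int(rand(6)) for 1 .. $parents", but with
# a plain counter instead of the range operator.
sub count_children {
    my ($parents) = @_;
    my $children = 0;
    for (my $i = 0; $i < $parents; $i++) {
        $children += int(rand(6));
    }
    return $children;
}

print count_children(1000), "\n";
```

This only sidesteps the error. Looping billions of times is still slow, so a formula for the larger generations remains the better fix.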

You can also cut the number of 'concurrent' generations. Keeping just 3 generations is a good approximation for a human community. You can keep the number of offspring per generation in an array and keep the array 3 values long.
So, borrowing ideas from previous replies
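A minimal sketch of the three-generation window, assuming int(rand(6)) children per person as in the question. The sub name living_population is invented, and this is not the code originally posted here.

```perl
use strict;
use warnings;

# Simulate $generations generations, keeping only the three most
# recent generation sizes. Returns the newest generation's size and
# the sum of the window (the "living" population in this model).
sub living_population {
    my ($initial, $generations) = @_;
    my @window = ($initial);
    for (1 .. $generations) {
        my $born = 0;
        $born += int(rand(6)) for 1 .. $window[-1];
        push @window, $born;
        shift @window while @window > 3;    # drop the oldest generation
    }
    my $total = 0;
    $total += $_ for @window;
    return ($window[-1], $total);
}

my ($born, $total) = living_population(5, 10);
print "born: $born, living: $total\n";
```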

wiredrat, you got me thinking about what I was really after, and you are right about me wanting the total population. Up in Re^2: A script with a loop is running my computer Out of memory I was talking about the number 250 billion. Well, that is the total population I am trying to get without something clunking out on me. So, I did a little finagling and came up with the following, which returns the number born into a generation and the total population, which includes the previous two generations. So, while I am still storing every generation generated, the current code is faster even with the new addition of total population.

I have a little script that is running my computer out of memory, and I was wondering if there might be a way around this problem.

Someone a few days ago mentioned something called forking.

Whatever you do, forking is probably not going to fix an "out of memory" problem. Why? Because at the moment you fork, you make an exact clone of your process. If your original process is using, say, 221MB of memory, then the child process will also be using 221MB.

Is your script running in circles, and only running out of memory after a long time? Then a better strategy to think about is to regularly wipe your memory in the big loop. Either restart the program, by launching it again and then exiting; or undef your big data structures.
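A small sketch of the second option. The array @big and the pass count are made up for illustration; the point is only that the structure is emptied before the next pass starts.

```perl
use strict;
use warnings;

my @big;              # hypothetical large structure reused across passes
my $processed = 0;

for my $pass (1 .. 3) {
    push @big, int(rand(6)) for 1 .. 100_000;
    $processed += @big;    # use the data, then let it go
    undef @big;            # release the contents before the next pass
}

print "$processed values processed\n";    # prints "300000 values processed"
```

Note that perl generally keeps freed memory around for its own reuse rather than handing it back to the operating system, so this stops the process from growing but may not shrink it.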