
I'm experiencing a weird problem that makes me wonder if there is a problem with the std::list::sort member function when used on very large lists. I have a very large list (~20 GB) made up of a class (382 bytes each) of DNA sequence data. When the list is on the order of 12 GB it sorts fine, but if it gets up into the 17 GB range, when I call sort() it seems to block. The list has already been built, so I don't think it's a malloc issue. I also don't think it's a complexity problem, because the 12 GB list sorts fine (it takes a minute or so), so I'm wondering if the std sort method doesn't properly handle the excess memory available in 64-bit mode? Has anyone experienced this or know of a good place to ask this question?


What is your sorting criterion? What does it look like (in code)?

Problems with sorts can stem from writing incorrect sorting predicates. If you wrote one, please post it.

Yes, for this sort I'm just comparing the first item in the hash array...
I'm using the Eclipse Galileo release as an IDE, which I believe calls the GCC built into the Mac dev tools: i686-apple-darwin9-gcc-4.0.1
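
Roughly, the comparison is on the first hash element, something like this (a sketch only - the class name and member types here are stand-ins, not my actual declarations):

Code:
// Sketch only -- fitnessrecord and its members stand in for the real class.
const int MAX_HASH = 25;

struct fitnessrecord
{
    unsigned long hash[MAX_HASH];   // only the first few elements are filled
    // ... sequence data ...
};

// Strict weak ordering on the first element of the hash array.
bool byFirstHash(const fitnessrecord &a, const fitnessrecord &b)
{
    return a.hash[0] < b.hash[0];
}

// seqList.sort(byFirstHash);   // seqList is the std::list<fitnessrecord>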


thanks
-jason

OK, the last thing now is to see what your fitnessrecord class looks like, and if it has a user-defined copy ctor or assignment operator, what it does.

There is no need to post the implementation of the functions in your class, just the declaration, and if you have written one, the copy ctor/ assignment operator for the class.

MAX_HASH is currently set to 25... but I'm only actually filling the first 4 fields at the moment. Could it be an issue if I'm not properly initializing all the hash array elements? The number of fitnessrecords is approximately 40 million.

I've not fully read through your posts yet. I'll join in here too, but I suspect Paul is busy with that, so I'll wait on his return (my evening is about to get busy).

Generally the STL is reliable even at high volumes like this. Paul has already mentioned some of the likely things to investigate, but your problem may be little more than VM issues.

For example, is the problem simply that it seems to stop?

As the size gets larger, the performance of VM will degrade badly. A little VM dependence might add 10% to the time, a little more adds 50%, a little more than that and you're suddenly seeing triple or quadruple runtime, then after that (which you're reporting is about the limit here) - it turns from hours to weeks to years - literally.

I assume the algorithm you're running doesn't suit databases, and you intend to execute this on a machine with suitable RAM at some point?

Why list as the container?

Perhaps a vector of pointers?

The reason I ask...

For about 40 million elements, the vector of pointers would be under 1 GByte (320 MBytes). The data itself would still occupy room, but when using containers, if your sorting must swap or copy entries, the copy involves the entire object, which involves VM.

When the container is a set of pointers to your object, your storage method is more complicated, but the swap or copy during sort only involves the pointers, not the full objects. This 'lightens' the load on VM overall.

It might be helpful here.
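
To make that concrete, a rough sketch (the fitnessrecord declaration and member names here are assumed from the thread, not taken from the actual code):

Code:
#include <algorithm>
#include <vector>

// Placeholder for the real record -- members assumed for illustration.
struct fitnessrecord
{
    unsigned long hash[25];
    // ... ~382 bytes of sequence data ...
};

// Compare through the pointers so the sort swaps 8-byte pointers, not records.
struct ByFirstHashPtr
{
    bool operator()(const fitnessrecord *a, const fitnessrecord *b) const
    {
        return a->hash[0] < b->hash[0];
    }
};

void sortIndex(std::vector<fitnessrecord *> &index)
{
    // ~40 million pointers is roughly 320 MB -- small enough to stay resident,
    // while the many GB of record data stay wherever they were allocated.
    std::sort(index.begin(), index.end(), ByFirstHashPtr());
}

Build the records once in storage whose elements don't move (a deque, or individually new'd blocks), push a pointer to each into the vector, then sort the vector.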


I don't have access to g++ source code for list right now... but Visual C++ had a similar problem with sorting a very large number of elements. See the link below. Maybe there is a similar array that needs its size increased in g++. (The first two changes had to do with a coding error... the last dealt with performance in sorting large lists.)

Try letting it go overnight. It could be that OS X is screwing up its swapping for some reason and slowing down. UNIX uses lazy page swapping. 99% of the time it's faster than Windows' aggressive page swapping; it's just that 1% of the time that things go screwy and it slows down.

Check Activity Monitor to see what it's doing. If it's still working, you should see Page-ins and Page-outs incrementing.

Next, I tried a list<fitnessrecord *> - so, as I mentioned earlier, the swap was less burdened. This does improve timing somewhat.

Then, I split the data into a pair, something like

pair< fitnesshash, fitnessrecord *>

This would work with a class, a pair, however you like - the point is to separate the hash from the data, holding the data as a pointer.

This had the best performance.
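
In code, the idea is roughly this (fitnesshash here is just a stand-in for whatever type the key actually is):

Code:
#include <list>
#include <utility>

struct fitnessrecord;                              // full record stays behind a pointer

typedef unsigned long fitnesshash;                 // stand-in for the actual key type
typedef std::pair<fitnesshash, fitnessrecord *> hashentry;

// Only the small key is touched during comparison; the record stays out in VM.
bool byHash(const hashentry &a, const hashentry &b)
{
    return a.first < b.first;
}

// std::list<hashentry> entries;
// entries.sort(byHash);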

The reason can be seen in the following behavior.

In the list<fitnessrecord> or vector<fitnessrecord> strategy, the entire time the example ran, CPU usage was nearly zero while VM was the dominant bottleneck (and the disk was always busy). In fact, SQL was faster at sorting this volume than the C++ test.

With the list< pair<fitnesshash, fitnessrecord *> > strategy, the CPU usage was nearly zero after the data was built, until the sort algorithm passed through it (loading it back in from VM) - essentially leaving all the other data on disk (VM). At that point sorting consumed 100% CPU (of the core paying attention to the sort), disk action nearly stopped, the sort finished as quickly as smaller runs from that point, then it took its time deleting the data (it has to load from VM to call destructors, etc.).

I was able to complete a test on 50 million samples in a few minutes from start to sorted. The shutdown of the application took longer because of the VM involvement there, but at that point the work had long been finished.

Worth a thought.

Oh, another strategy may also be of interest: merge sorting. The idea is to sort lists of manageable size (say 1 million), write each to disk, sort the next, then "merge" the sorted collections. The merge is essentially "point to the first entry in each sorted collection, move the 'least' of the topmost into the output, and repeat". It's very simple.
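
A bare-bones sketch of just the merge step, assuming each sorted run has already been read back as a sorted range of keys (file I/O omitted; the names are illustrative):

Code:
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// K-way merge of already-sorted runs: repeatedly take the smallest head
// element among all runs and append it to the output.
typedef unsigned long hashkey;   // stand-in for whatever is actually compared

std::vector<hashkey> mergeRuns(const std::vector< std::vector<hashkey> > &runs)
{
    // Heap entry: (current key, (run index, position within run)), smallest key on top.
    typedef std::pair<hashkey, std::pair<size_t, size_t> > heapitem;
    std::priority_queue<heapitem, std::vector<heapitem>, std::greater<heapitem> > heads;

    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty())
            heads.push(heapitem(runs[r][0], std::make_pair(r, size_t(0))));

    std::vector<hashkey> out;
    while (!heads.empty())
    {
        heapitem top = heads.top();
        heads.pop();
        out.push_back(top.first);                     // emit the least of the run heads

        size_t r = top.second.first;
        size_t i = top.second.second + 1;
        if (i < runs[r].size())                       // advance that run and re-queue it
            heads.push(heapitem(runs[r][i], std::make_pair(r, i)));
    }
    return out;
}

With runs actually on disk you'd stream each one through a small buffer instead of holding whole vectors, but the loop is the same.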

Frankly, SQL is a good idea, too.

Either that, or lots of RAM

I'm about 85% sure that's the problem you're having. The time consumption is geometric with volume when VM is involved. It becomes ridiculous (as in days) at some point.

