A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, BFS and -ck.

Wednesday, 13 April 2011

Scalability of BFS?

So it occurred to me that for some time I've been saying that BFS may scale well only up to about 16 CPUs. That was a fairly generic guess based on the design of BFS, but it appears that machines with more cores and threads quite like BFS in the real-world benchmarks I'm getting back from various people. With the latest changes to BFS, which bumped the version up to 0.400, it should have improved further. I've tried googling for links to do with BFS and scalability, and the biggest machine I've been able to find that benefits from it is a 24-core machine running F@H (Folding@home). Given that this was with an older version of BFS, and that there were actual advantages even at 24 cores, I wonder where the point is at which it stops scaling? Obviously scalability is more than just "running F@H" and will depend entirely on architecture, workload, definition of scalability, and so on, but... I wanted to ask the community: what's the biggest machine anyone has tried BFS on, and how well did it perform? If someone has access to 16+ cores to try it out, I'd be mighty grateful for your results.

Do you use a fixed quantum? And what's the overhead in ms of the scheduler?

I've skimmed over your code; I think you have a lock on each process in the runqueue, and on the runqueue itself. Could you not use an atomic compare-and-swap on the process's lock and so ignore the runqueue lock? That would allow multiple CPUs to access the queue without waiting.

There is no meaningful process lock, only the runqueue lock. Contention on the runqueue lock is the most likely culprit for a scalability limit here, since it's a shared runqueue. Atomic compare-and-swap would be extraordinarily painful to use here because you'd be looking over a list while processes were disappearing onto other CPUs all the time, so every time you went back to the task you had chosen, you'd have to re-check that it was still there if you had not already taken it. The whole design would have to be rewritten from scratch to do so, and the repeated checking of each process's existence would greatly increase the overhead in the non-contended case - which is the target audience for BFS, as it was designed for commodity hardware and desktops. Lower-CPU-count machines already show less overhead with BFS than mainline, but I've been unable to test 16x or more. The scheduler overhead cannot be measured in ms but in nanoseconds. Runqueue contention demonstrably occurred on the 8x machine with a load of 1000, but it still did not contribute significantly to overall time compared with any other lock in the kernel. I once estimated that lock contention would become relevant from 16 CPUs, but that was an arbitrary figure I didn't have the hardware to prove. As mentioned, the fact that it shows scalability outstripping mainline at 24 CPUs suggests the limit is somewhat beyond that. No doubt when the number of CPUs becomes very high, the contention becomes relevant.

I have an i7 920 running Gentoo amd64 w/ HT (so 8 logical CPUs). Feel free to give a holler on IRC if you need test cases run on this hardware; I'd be more than happy to.

It's very hard for me to objectively tell the difference. Especially lately, I notice very little difference between {C,B}FS. My feeling, just from staring at htop, is that BFS gives slightly better throughput on big compiles.