A development blog of what Con Kolivas is doing with code at the moment with the emphasis on linux kernel, BFS and -ck.

Monday, 25 July 2011

3.0 BFS delays

Hi all

I haven't blogged much lately because I've been distracted from kernel hacking by bitcoin mining. For some crazy reason I took it upon myself to make mining software that did what I wanted, writing it the way I write kernel code. Anyway since it's unrelated I haven't posted about it here before, but if anyone's interested, the development thread is here:

Now about the kernel. To be honest I haven't followed the development of 3.0 almost at all, being totally pre-occupied with other things as I've taken time out from work as a sabbatical while I reassess work-life balance, long term career management (even to the point of considering changing line of work - anyone need a c programmer?) and spend time with family, friends and random other personal development things. No, I'm not quitting kernel development any time soon (again).

Anyway the thing is I'm going with Interplast next week to Nauru (of all places) as a volunteer anaesthetist for needy children for 10 days. I'm not sure if I'll find time to port BFS to 3.0 before then, or if I'll be able to do it while I'm actually there (doubtful). So just a heads up that it might be a while before we BF the 3.0 kernel.

It is NOT NUMA aware in the sense that it does any fancy shit on NUMA, but it will work on NUMA hardware just fine. Only the really big NUMA hardware is likely to suffer in performance, and this is theoretical only, since no one has that sort of hardware to prove it to me, but it seems almost certain. v0.300 onwards have NUMA enhancements.

@Ralph: It is too bad BFS can't get into the mainline as an optional scheduler, just as we have three different disk schedulers (noop, deadline, cfq). This would take some pressure off CK for BFS development and also allow input from others in a collaborative sense.

@CK: Everyone has a deep appreciation for the work you do! Thank you for it!

Apropos mainline: I think Linus rejected introduction into mainline because he thought it would complicate development to have different models of schedulers. So I wait for the moment when a new Linux version can be patched by the previous BFS version without errors. That would indicate the end of Linux scheduler development: the optimal moment to request inclusion of BFS into mainline!

But if the current mainline scheduler has no optimum, this moment will never occur ...

BFS is not about benchmarks: we want a responsive desktop even if benchmarks get worse. Because it is about the desktop, and many of us are using battery-powered notebooks, the longer the CPUs sleep the better!

Are there any tests out there that take these preferences into account?

Uuuuh, I had a look at patching linux-3.0 with BFS: this is going to be a big task for Kolivas.

At first it seems easy, one function to transfer. But there is a new feature that will cost Con a lot of effort. Work was done in mainline linux-3.0 to implement restrictions on Linux containers. A huge task, I guess, to make a BFS patch now ...

I don't understand all the talk about 4096-CPU Linux servers. All of them are HPC servers, essentially a large cluster. The largest SMP servers for sale today have 32 or 64 CPUs. There are no larger SMP servers for sale. The biggest IBM mainframe, the z196, has 24 CPUs.

The Linux trick is to connect lots of nodes on a fast switch and then use software to make it look like a single kernel. This is how Linux servers can have 4096 CPUs: it is a large cluster on a network emulating a single kernel.

For instance, the SGI Altix Linux server works this way. If you study the SGI Altix Linux customers, they all do HPC work (embarrassingly parallel work). None use such servers for SMP work.

Also, ScaleMP, which goes up to 8192 cores, works this way: http://www.theregister.co.uk/2011/09/20/scalemp_supports_amd_opterons/

"Since its founding in 2003, ScaleMP has tried a different approach. Instead of using special ASICs and interconnection protocols to lash together multiple server modes together into a shared memory system, ScaleMP cooked up a special hypervisor layer, called vSMP, that rides atop the x64 processors, memory controllers, and I/O controllers in multiple server nodes. Rather than carve up a single system image into multiple virtual machines, vSMP takes multiple physical servers and – using InfiniBand as a backplane interconnect – makes them look like a giant virtual SMP server with a shared memory space."

Indeed, mentioning 4096 CPUs in the BFS patch is mostly tongue-in-cheek because virtually all systems with any realistic availability are <= 64 logical devices. Nonetheless, scalability is an issue at 64 devices with the current BFS patch, though how big an issue, and in what areas, is up for debate. At low loads I expect BFS will not remotely have any scalability issue even at this number of cores/threads.