So you have taken the test and you think you are ready to get started with OS development? At this point, many OS-deving hobbyists are tempted to go looking for a simple step-by-step tutorial that would guide them through making a binary boot, doing some text I/O, and other "simple" stuff. The implicit plan is more or less as follows: any time they think of something that would be cool to implement, they'll implement it. Gradually, feature after feature, their OS would supposedly build up, slowly becoming superior to anything out there. This is, in my opinion, not the best way to get somewhere (if getting somewhere is your goal). In this article, I'll try to explain why, and what I think you should be doing at this stage instead.

Morin, "There *are* multi-chip x86 systems (e.g. high-end workstations), there *are* ARM systems (much of the embedded stuff, as well as netbooks), and there *are* systems with more than one RAM..." Sorry, but I'm not really sure what your post is saying?

Then I might have misunderstood your original post:

However, considering that shared memory is the only form of IPC possible on multicore x86 processors, we can't really view it as a weakness of the OS.

I'll try to explain my line of thought.

My original point was that shared RAM approaches make the RAM a bottleneck. You responded that shared RAM "is the only form of IPC possible on multicore x86 processors".

My point now is that this is true only for the traditional configuration: a single multi-core CPU and a single RAM, with no additional shared storage and no hardware message-passing mechanism. True as far as it goes, but that limits your conclusion to such traditional systems, and my response was aiming at the fact that there are many systems that do *not* use the traditional configuration. Hence shared memory is *not* the only possible form of IPC on such systems, and making the RAM a bottleneck through such IPC artificially limits system performance.

As an example to emphasize my point, consider CPU/GPU combinations with separate RAMs (gaming systems) vs. those with shared RAM (cheap notebooks). On the latter, RAM performance is limited and quickly becomes a bottleneck (no, I don't have hard numbers).

I wouldn't be surprised to see high-end systems in the near future, powered by two general-purpose (multicore) CPUs and a GPU, each with its own RAM (that is, a total of 3 RAMs) and without transparent cache coherency between the CPUs, only between cores of the same CPU. Two separate RAMs mean a certain amount of wasted RAM, but the performance might be worth it.

Now combine that with the idea of uploading bytecode scripts to server processes, possibly "on the other CPU", vs. shared memory IPC.

"my response was aiming at the fact that there are many systems that do *not* use the traditional configuration. Hence shared memory is *not* the only possible form of IPC on such systems, and making the RAM a bottleneck through such IPC artificially limits system performance."

"As an example to emphasize my point, consider CPU/GPU combinations."

I expect the typical use case is that the GPU caches bitmaps once and doesn't need to transfer them across the bus again. So I agree this helps alleviate shared-memory bottlenecks, but I'm unclear on how this could help OS IPC?

I'm curious, what role do you think GPUs should have in OS development?

"I wouldn't be surprised to see high-end systems in the near future, powered by two general-purpose (multicore) CPUs and a GPU, each with its own RAM (that is, a total of 3 RAMs) and without transparent cache coherency between the CPUs, only between cores of the same CPU."

This sounds very much like NUMA architectures, and while support for them may be warranted, I don't know how this changes IPC? I could be wrong, but I'd still expect RAM access to be faster than any hardware on the PCI bus.

"Now combine that with the idea of uploading bytecode scripts to server processes, possibly 'on the other CPU', vs. shared memory IPC."

I guess I may be thinking of something different than you. When I say IPC, I mean communication between kernel modules. It sounds like you want the kernel to run certain things entirely on the GPU, thereby eliminating the need to run over a shared bus. This would be great for scalability, but the issue is that the GPU isn't very generic.

Most of what a kernel does is I/O rather than number crunching, so it isn't clear how a powerful GPU is helpful.

> This sounds very much like NUMA architectures, and
> while support for them may be warranted, I don't
> know how this changes IPC?

NUMA is the term I should have used from the beginning to avoid confusion.

> > "As an example to emphasize my point, consider CPU/GPU combinations."
> I expect the typical use case is the GPU caches
> bitmaps once, and doesn't need to transfer them
> across the bus again. So I agree this helps
> alleviate shared memory bottlenecks, but I'm
> unclear on how this could help OS IPC?

It seems that my statement has added to the confusion...

I did *not* mean running anything on the GPU. I was talking about communication between two traditional software processes running on two separate CPUs connected to two separate RAMs. The separate RAMs *can* bring performance benefits if the programs are reasonably independent, and for microkernel client/server IPC, data caching and uploaded bytecode scripts improve performance even more and avoid round-trips.

The hint with the GPU was just to emphasize the performance benefits of using two separate RAMs. If CPU and GPU use two separate RAMs to increase performance, two CPUs running traditional software processes could do the same if the programs are reasonably independent.

Not to say that you *can't* exploit a GPU for such things (folding@home does), but that was not my point.

> I could be wrong, but I'd still expect RAM access to
> be faster than any hardware on the PCI bus.

Access by a CPU to its own RAM is, of course, fast. Access to another CPU's RAM is a bit slower, but what is much worse is that it blocks that other CPU from accessing its RAM *and* creates cache coherency issues.

Explicit data caching and uploaded scripts would allow, for example, a GUI server process to run on one CPU in a NUMA architecture, and the client application that wants to show a GUI to run on the other CPU. Caching would allow the GUI server to load icons and the like once at startup over the interconnect. Bytecode scripts could also be loaded over the interconnect once at startup, then allow the GUI server to react to most events (keyboard, mouse, whatever) without any IPC to the application process.

The point being that IPC round-trips increase the latency of the GUI (though not affecting throughput) and make it feel sluggish; data transfers limit both latency and throughput, and in a NUMA architecture you can't fix that with shared memory without contention at the RAM and cache coherency issues.