Linux and Symmetric Multiprocessing

“As evidenced by major central processing unit vendors, multi-core processors are poised to dominate the desktop and embedded space. With multiprocessing comes greater performance but also new problems. This article explores the ideas behind multiprocessing and developing applications for Linux that exploit SMP.”

41 Comments

As far back as 1998/9, x86 SMP motherboards were actually quite affordable, even for the serious hobbyist.

I bought a Supermicro P6DBE dual P3 mobo about this time and although I’m not using it now, it’s still in my cupboard in perfect working order.

At that time there was only one OS that (truly) could take advantage of such a luxury. That was BeOS.

BeOS on an SMP board at that time was an altogether different user experience. Multitasking was smooth and effortless even on a modest pair of 400 MHz P3s with 256 MB of RAM.

I have yet to discover a similar experience on a uniprocessor machine, even though I now use 1 GB of RAM and an AMD 3500+ CPU.

Of course, the main factor in all this is the ability of the OS, and of the native apps written for it, to take advantage of multiple cores. As a result of its breeding, Haiku should excel at that without much trouble at all.

Interestingly, the concept of threading, which is vital for parallelizable applications, is something Unix/Linux developers approach very conservatively. That is, they don’t use threads at all unless they are forced to.

In contrast, BeOS was a system that promoted threading heavily from the start. Even Windows is more about threading, while Unix folks still think fork() is a cool concept.

One problem was, for example, the Linux threading implementation, which sucked. At least nowadays POSIX threads are supported, but they still suck a little performance-wise… So if you wanted a fast application (without having scaling over SMP in mind), using threads was generally a _bad_ idea!

Many Linux and UNIX developers appear to be positively rabid about avoiding the use of threads. Mention threads and you can be fairly certain that someone will tell you how hard threads are, how difficult it is to manage shared resources, how enforcing concurrency is a headache and that debugging threaded code is just impossible.

It’s not really true of course, but it can certainly seem that way if you’re not used to working with threads. So there is a degree of self-reinforcement going on; developers think threads are hard, they don’t use threads, so they never learn that threads aren’t all that hard after all. Add to that the historically poor threading support on Linux and the existing catalogue of thread-unsafe libraries and code, and you can begin to see why threads are having a hard time gaining traction on traditional UNIX systems.

> So there is a degree of self-reinforcement going on; developers think threads are hard, they don’t use threads, so they never learn that threads aren’t all that hard after all.

Face it, most programmers can’t write multithreaded code. It’s just not easy. Your attitude reminds me of security-experienced programmers who say “it’s not that hard to write secure code”. Expert Windows users will tell you “it’s not that hard to use Windows without getting infected by a virus”. And Linux users will tell you that the Linux desktop is just as easy to use as any other desktop OS. Yeah, whatever.

And yes, the Unix world doesn’t use threads that much on the desktop, but it’s not that Unix knows nothing about threading. The Unix server market (the market that BeOS didn’t care about) has been using heavily threaded apps for a LONG time. Big Unix SMP servers existed well before BeOS was born. Companies have spent millions trying to make their databases/whatever faster by designing their apps from scratch to use threading to get more performance.

There’s no doubt that BeOS was the first desktop OS that cared about threads, and it’s obvious that X/GTK/Qt/Vista/Mac OS X will need to catch up with that. But the world is not a desktop, and BeOS wasn’t the first operating system that used threads heavily. These days Unix OSes (Linux included) can handle threads quite well. It’s the desktop userspace world that needs to catch up: its software was designed to run on machines with one CPU, unlike server-oriented apps. There’s nothing magic in what BeOS did; they just used threading a lot.

Who said anything about BeOS? Syllable uses threads just as heavily as BeOS did, but beyond that I have no association with Be or BeOS. You make a good point that threads have been available on UNIX servers for a very long time. Why hasn’t their use filtered down into more applications?

I maintain that writing threaded code is easy. It just requires practice; you have to be capable of thinking concurrently. Perhaps a lot of developers aren’t able to do that, or perhaps they’ve convinced themselves that they aren’t capable of doing that. I don’t know. But I don’t buy the argument that dealing with concurrency is particularly hard, and I certainly don’t believe that I or my fellow Syllable (& BeOS) developers somehow have a latent ability to write threaded code.

> I maintain that writing threaded code is easy. It just requires practice; you have to be capable of thinking concurrently. Perhaps a lot of developers aren’t able to do that

Low level threading is not easy – you have to take a new approach to coding, and a lot of developers aren’t easily able to think that way. Although I’m sure they could learn. The key is that you can create abstractions to make this much easier.

From personal experience, it’s not easy. At least not the first few times.

I know this because all my fellow students (well, not all) had problems with multi-threading and actually spent anywhere from a couple of days to a week implementing a multi-threaded project that would have taken a day to develop if it were single-threaded.

In fact, for most small apps it adds complexity, bugs, etc. This might discourage a lot of new programmers from writing multi-threaded apps.

Do note: the project assignment wasn’t easy, and it was specifically designed to have lots of threads, simply to test whether the programmer knows enough about synchronization, etc.

Well, it might sound silly to you guys, but here it goes (the project was about threading, not about building a complex app).

The project was twofold:

1: Write a Linux device driver to control the keyboard LEDs, so one could do: echo +++ > /dev/led to make all the keyboard LEDs go on. Granted, there was initial starting code.

2: Write a multi-threaded app that would read a file, save each word to a buffer, let a different thread read the word from the buffer and translate it into Morse code (one for each LED), and let yet another thread output it to the /dev/led device.

So: 1 thread to read the file AND put the words in a word buffer, then 3 threads (one for each LED) to translate those words to Morse code and save them in their own buffers, and a 5th thread to read those signals and write them to the /dev/led driver (respecting, naturally, that signals from LED thread 1 only affect the first keyboard LED, etc.).
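For readers who want to see the shape of that assignment, here is a rough sketch in Python rather than the original coursework language. It is not the student's code: the function name run_pipeline, the queue sizes, and the in-memory list standing in for the /dev/led device are all assumptions made for illustration, and the translators share one output queue instead of having one buffer each.

```python
import queue
import threading

# International Morse code for the letters a-z.
MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".", "f": "..-.",
    "g": "--.", "h": "....", "i": "..", "j": ".---", "k": "-.-", "l": ".-..",
    "m": "--", "n": "-.", "o": "---", "p": ".--.", "q": "--.-", "r": ".-.",
    "s": "...", "t": "-", "u": "..-", "v": "...-", "w": ".--", "x": "-..-",
    "y": "-.--", "z": "--..",
}


def run_pipeline(words, n_leds=3):
    """Reader -> n_leds translators -> writer, connected by bounded queues."""
    word_q = queue.Queue(maxsize=8)    # the "word buffer"
    signal_q = queue.Queue(maxsize=8)  # translated signals waiting for output
    written = []                       # hypothetical stand-in for /dev/led

    def reader():
        for word in words:
            word_q.put(word)
        for _ in range(n_leds):        # one end-of-input marker per translator
            word_q.put(None)

    def translator(led):
        while True:
            word = word_q.get()
            if word is None:
                signal_q.put(None)     # tell the writer this translator is done
                return
            code = " ".join(MORSE[c] for c in word.lower() if c in MORSE)
            signal_q.put((led, word, code))

    def writer():
        finished = 0
        while finished < n_leds:
            item = signal_q.get()
            if item is None:
                finished += 1
            else:
                written.append(item)   # only this thread touches the list

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    threads += [threading.Thread(target=translator, args=(i,)) for i in range(n_leds)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return written
```

Calling run_pipeline(["sos", "be"]) returns one (led, word, code) tuple per word, in whatever order the translators happened to finish. The queues do all the locking, which is probably why this version looks easier than what the students faced with hand-rolled buffers.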

> There’s no doubt that BeOS was the first desktop OS that cared about threads, and it’s obvious that X/GTK/Qt/Vista/Mac OS X will need to catch up with that.

Nope, sorry, that would be GEOS. They were a little more sane about threading than BeOS was too. BeOS went a little overboard, giving you another thread for every window you created, IIRC. GEOS just defaulted to a processing thread and a UI thread for each app. Adding more threads was easy enough.

The big thing GEOS did was using their own dialect of Objective C that integrated thread support into the language. All objects were assigned to a thread. Sending a message to an object in a different thread was no harder than to one in the same thread.

Of course, the big downside to GEOS was that it was designed at the time when a 286 was a luxury and the 386 was just starting to be talked about. So you got the joys of 16 bit code and memory management. Too bad MS blocked them out of the market before they got to do a 386 version.

As for the difficulty of threaded programming… yes, it’s very hard to graft threading onto old garbage code. Code written with the proper abstractions usually isn’t too bad to graft threading onto. If you’re designing something from scratch, threaded code really isn’t much harder to write.

“In contrast, BeOS was a system that promoted threading heavily from the start. Even Windows is more about threading, while Unix folks still think fork() is a cool concept.”

Somewhere else, I mentioned this problem regarding POSIX threads. Linux did not fully implement this standard, while BSD’s system kernel completely did. If a multiprocessing environment is available, it will be used by default (concrete parallel), else threading is used (pseudo parallel). Linux needs libpthread and fork() for this, which does not conform to the POSIX specification. But I think this situation has improved already.

I remember the good old times when SGI had great multiprocessor systems. Not just cores, but true processors! With their own channels, even with their own memory! Wow, were they fast! 🙂

> threading a lot. Even Windows is more about threading, while Unix folks still think fork() is a cool concept.

There is an important advantage of processes over threads: They force you to think in independent data sets kept by the different processes. If you use threads and their data working sets overlap, a lot of efficiency is wasted for the sake of cache coherency. Shared caches in multicore CPUs improve the situation (thus threads aren’t useless), but the problem still exists.

That doesn’t make sense. Either your data is shared or it isn’t, regardless of whether you use threads or processes.

If your data isn’t shared, there isn’t much difference between threads and processes.

Otherwise, you need a way to sync the data. You can either use shared memory, which brings you back to the same locking issues as threading, or you can use IPC to copy data back and forth, which is a much bigger drain on efficiency (both in execution time and programmer time).

Threads make it easier to use the data in a shared way when that isn’t needed (i.e. data could be used in a non-shared way with a little refactoring). Processes force you to do that refactoring.

Processes also give the OS an indication that data is non-shared which can be used to better distribute the workload onto multiple CPUs/cores. Of course, that indication can also be given to the OS manually through system functions, but the programmer then has to remember to do that.

What are you trying to do with threads that you run into issues like this? Threads have a lot less overhead to manage than processes. Context switching between threads is much faster than between processes, as you only have to swap the register contents – no need to switch page tables and invalidate portions of the cache.

> Interestingly, the concept of threading, which is vital for parallelizable applications, is something Unix/Linux developers approach very conservatively.

Most modern UNIX systems feature thread-level scheduling on both the kernel and userspace level, providing all three fundamental threading models: 1:1, M:N, and M:1.

The Linux kernel has supported 1:1 threading back to 2.0, since threads are really just processes that happen to share certain resources like address space and file descriptors. glibc provided linuxthreads, a library that allowed userspace processes to use POSIX threading calls to create threads via the clone() system call.

Linuxthreads had some major problems for two primary reasons: Traditional UNIX IPC services such as signals weren’t designed with threading in mind, and so to this day, signal delivery to threads is implemented in subtly different ways on the major UNIX distributions (so portable apps usually block signals on all but one peer thread). Also, prior to 2.6, Linux assigned a unique PID to each peer thread, which created various problems from (relatively) slow thread creation to confusing ps and other system monitor applications.

During the 2.5.x development series, IBM developers contributed a sophisticated M:N implementation called NGPT (Next-Generation POSIX Threads) to the Linux kernel and glibc. Meanwhile, kernel developers such as Ingo Molnar addressed linuxthreads’ problems directly in the major rewrite of the process management subsystem. The new clone() system call didn’t need any extra glue to support POSIX threads, so the glibc developers called it NPTL (Native POSIX Thread Library).

Although Ingo wasn’t surprised when NPTL ran circles around NGPT in every benchmark they threw at them, most everybody else had thought the snazzy M:N model from IBM would surely prevail. In hindsight, however, it seems obvious that the kernel can only optimally schedule threads if it knows about them. Particularly as 64-bit hardware alleviates kernel address space restrictions, even the pioneers of M:N threading admit that userspace thread schedulers continue to exist mostly for compatibility reasons.

> Unix folks still think fork() is a cool concept

On Linux, fork() is really boring, since it merely invokes the clone() system call handler with some additional parameters. On other UNIX systems, however, process and thread creation are completely different code paths.

> At least nowadays POSIX threads are supported, but they still suck a little performance-wise.

Linux 2.6 implements copy-on-write semantics for process and thread management, which results in thread creation and destruction on Linux being faster than on any other major UNIX implementation or on Windows. Yes, it also bypasses the zombie stage for orphaned children of init, and I would like UNIX buffs to stop claiming that this trick and kernel preemption are reasons why <insert UNIX flavor here> is superior to Linux. Commercial UNIX is superior to Linux in various ways, but process management and multi-processing are no longer among them. Fans of big UNIX will have better luck arguing on the basis of memory management.

Finally, to address the difficulty of multi-threaded application programming: in short, it’s difficult, it requires being very careful and disciplined when programming, and it requires a thorough understanding of the code (which you might not have written) before making any changes. It breaks the assumptions made by “copy-and-paste” programmers and generally trips up developers, even experienced programmers, who normally code in high-level languages. Newer APIs for multi-threaded programming are a double-edged sword. They make it easier to write good multi-threaded applications, but they also make it easier to write very bad multi-threaded applications.

Part of what makes kernel development hard is that you need to be careful about where you put things and how you access memory. Does this need to be atomic? Do I need to have a lock for this? If I try to take this lock, will we deadlock? Is this on the heap or on the stack? And so on. Multi-threading brings many of these same concerns up to userspace. Userspace developers get to experience what it’s like to have multiple threads executing with the same address space. Although they don’t service interrupts or execute on behalf of untrusted processes, multi-threaded programming is a taste of what it’s like to work in the kernel.

In fact, it could even be argued that supporting threads in the kernel is easier than writing multi-threaded applications. The kernel already supports multi-processing, so implementing threads is as simple as not allocating new resources for additional threads and instead linking them to the resources of the primary thread. How the userspace process mediates access to these shared resources is another story entirely.

But that wasn’t really so. I did extensive research during that time, and NGPT did perform better when you had a HUGE number of threads. But I don’t know many applications that use hundreds or thousands of threads…

Yeah that makes sense. In M:N threading you try to maintain some ratio of user threads to kernel-managed threads for a given process, depending on how often the threads tend to block. If they don’t block very often (CPU hogs), then the ideal ratio drops and M:N threading becomes really inefficient. But if the threads spend most of their time doing I/O (e.g. databases), then the desired ratio is high and M:N can schedule multiple user threads to run within a kernel-managed timeslice. If there are enough user threads to have at least one kernel-managed thread for each logical CPU at this high ratio, then performance for the process will be better than if 1:1 was used.

However, when M:N threading gets really efficient like this, the groups of I/O-bound user threads each look like a single CPU-bound thread to the kernel. Accordingly, the process scheduler will dynamically shrink their timeslices to help ensure that this process doesn’t monopolize the system. This is an example of how userspace threading makes it harder for the kernel to optimally schedule everything on the system.

Basically, if you have one massively multi-threaded I/O-bound process on the system, this is the only performance-critical process on the system, you don’t have a lot of CPUs, and you really optimize the threading library for the application, then you can get better performance out of NGPT. So for all you folks running busy databases on meager servers, you might be able to gain a little extra throughput with NGPT.

> Interestingly, the concept of threading, which is vital for parallelizable applications, is something Unix/Linux developers approach very conservatively. That is, they don’t use threads at all unless they are forced to.

Your assertion that threading is vital for parallelizable applications is incorrect.

> In contrast, BeOS was a system that promoted threading heavily from the start. Even Windows is more about threading, while Unix folks still think fork() is a cool concept.

A process based model for multiprocessing will actually scale better than a thread based one, all else being equal. Why? Because the thread-based approach has every thread sharing the same virtual memory space. When you share things you need to do locking or some form of synchronisation, and you’ll generally need to bounce cachelines around.

I’m not talking about sharing memory here — that can be done between processes very easily with shared mmaps. I’m talking about the actual management of the virtual address space.
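To make the "fork plus shared mmap" idea concrete, here is a minimal sketch in Python for a Unix system. The function name parallel_sum and the one-slot-per-worker layout are assumptions for illustration, not anything from the thread. Each child is a separate process with its own address space; only the anonymous shared mapping is visible to both sides, and because every worker writes to its own slot, no locking is needed.

```python
import mmap
import os
import struct


def parallel_sum(numbers, workers=2):
    """Sum `numbers` with forked worker processes sharing one anonymous mapping."""
    slot = struct.calcsize("q")              # one signed 64-bit slot per worker
    shared = mmap.mmap(-1, slot * workers)   # anonymous; MAP_SHARED by default on Unix
    pids = []
    for i in range(workers):
        pid = os.fork()
        if pid == 0:
            # Child: compute a private partial sum, publish it in its own slot.
            try:
                partial = sum(numbers[i::workers])
                struct.pack_into("q", shared, i * slot, partial)
            finally:
                os._exit(0)                  # never fall back into the parent's code
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)                   # wait until every child has exited
    total = sum(struct.unpack_from("q", shared, i * slot)[0] for i in range(workers))
    shared.close()
    return total
```

The parent only reads the slots after waitpid() has returned for every child, which is what makes the unsynchronized reads safe in this sketch.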

> One problem was, for example, the Linux threading implementation, which sucked. At least nowadays POSIX threads are supported, but they still suck a little performance-wise… So if you wanted a fast application (without having scaling over SMP in mind), using threads was generally a _bad_ idea!

I didn’t know they sucked performance wise. I know that Linux process creation and destruction is much faster than many other operating systems, so threads aren’t so big of a win. But where does Linux thread performance suck?

Well, I already suspected that what I wrote about fork() would not be understood the way I meant it.

Most UNIX programmers write programs that use neither threads nor forking. They tend to believe that a single application just has a single flow, and if it’s a daemon-type application or has to serve different customers, it can fork to do that.

I haven’t yet seen anyone who really thinks this way: “hey, this computation is really tough, I could parallelize it, just for god’s sake, and do it with fork() and mmap”. It’s as simple as that: as long as an SMP system is not clearly in mind as the primary target, even GUI applications will be written threadless, without any concurrency happening.

You’re right on the issue with shared memory. Thanks for pointing that out.

What I meant about the performance of Linux: yes, thread creation is fast. It’s just that if you have only a single processor, scheduling between the threads seems to waste a lot of CPU cycles … leading to a huge performance loss (I would say 10% to 20% as a rule of thumb) for the very same application compared to a version written without threads. So “nobody” would think of using threading or concurrency everywhere it is reasonable for SMP systems.

In contrast, in BeOS, avoiding threads was really hard, so developers got used to them from the start.

> Well, I already suspected that what I wrote about fork() would not be understood the way I meant it.

> Most UNIX programmers write programs that use neither threads nor forking. They tend to believe that a single application just has a single flow, and if it’s a daemon-type application or has to serve different customers, it can fork to do that.

Practically all the UNIX programming literature I have read on the internet or in print has listed natural SMP scalability among the pros of a fork vs asynchronous state machine programming model.

Not that my claim disputes yours, but I find it odd that you found most UNIX programmers to be unaware of basics like that. Could well be the case though, there are a lot of crappy programs out there.

> I haven’t yet seen anyone who really thinks this way: “hey, this computation is really tough, I could parallelize it, just for god’s sake, and do it with fork() and mmap”. It’s as simple as that: as long as an SMP system is not clearly in mind as the primary target, even GUI applications will be written threadless, without any concurrency happening.

GUI applications aren’t really the first thing that comes to mind when one mentions UNIX… but none of the GUI applications I run are even slightly CPU-intensive, so it would be needless to employ threads for the purpose of SMP scalability.

Actually one that is CPU intensive is a sound format converter. This one actually fork()s a non-graphical task to do the hard work, and shares data via the filesystem.

This is really a pretty widespread and fundamental UNIX concept (using multiple programs connected via pipes or named filesystem objects).

When I build a source tree, I use make -j, which spawns several copies of the compiler, and the compiler forks off first the C preprocessor, then the C compiler proper, then the assembler, then the linker is invoked, etc. All these use fork() and share data via the filesystem or file descriptors, and that provides SMP scalability.
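The same "several independent programs, talking over file descriptors" pattern can be sketched in a few lines of Python. This is only an illustration of the idea behind make -j, not how make works internally; the helper name sum_with_processes and the tiny child program are made up for the example. Every job is a separate process that the kernel is free to place on any CPU.

```python
import subprocess
import sys

# Hypothetical child program: read integers from stdin, print their sum.
CHILD = "import sys; print(sum(int(x) for x in sys.stdin.read().split()))"


def sum_with_processes(numbers, jobs=2):
    """Split the work across `jobs` child processes connected by pipes."""
    procs = []
    for i in range(jobs):
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        # Hand this child its share of the input, then close the pipe so it
        # sees end-of-file and starts computing while we launch the next one.
        proc.stdin.write(" ".join(str(n) for n in numbers[i::jobs]))
        proc.stdin.close()
        procs.append(proc)
    total = 0
    for proc in procs:
        total += int(proc.stdout.read())   # each child prints a single number
        proc.stdout.close()
        proc.wait()
    return total
```

For example, sum_with_processes(list(range(10)), jobs=3) runs three interpreters side by side and adds up their partial results. There is no shared memory at all, so there is nothing to lock.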

> In contrast, in BeOS, avoiding threads was really hard, so developers got used to them from the start.

So I’ll also mention that while multiple threads of execution are a necessary condition for SMP scalability, they are not a sufficient one. So if a BeOS application has multiple threads, it doesn’t say anything about SMP scalability. If you had one compute thread and one GUI thread, you might get 100.1% the speed of a single CPU on a 2 CPU system, because the GUI thread simply won’t have much work to do.
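The 100.1% figure can be checked with a little arithmetic. The sketch below assumes independent threads, one CPU per thread, and (purely as an illustration) that the GUI thread accounts for about 0.1% of the total work; the function name ideal_speedup is made up here. Under those assumptions the run takes as long as the busiest thread, so the speedup is total work divided by the largest share.

```python
def ideal_speedup(work_per_thread):
    """Best-case speedup over one CPU when every thread gets its own CPU.

    Assumes the threads are independent, so the wall-clock time is set by
    the thread with the most work.
    """
    busiest = max(work_per_thread)
    if busiest <= 0:
        raise ValueError("at least one thread must have work to do")
    return sum(work_per_thread) / busiest


# One compute thread with 99.9% of the work and one GUI thread with 0.1%:
print(ideal_speedup([99.9, 0.1]))   # about 1.001, i.e. roughly 100.1%

# Two evenly loaded compute threads:
print(ideal_speedup([50, 50]))      # 2.0
```

Adding threads only helps when the work is actually divided among them.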

Threading GUI applications is mainly done for responsiveness, rather than SMP scalability. The hard part of improving scalability remains parallelising

However, your example was noted, and I think people who have tried computers with more than one cpu, should try and explain to Joe User how this new marketing concept of “Dual Core CPU” is a totally different beast altogether and is nowhere near as good as a board with twin separate processors.

One core for wordprocessing and one core for virus scanning ? yeah, right, PC World.

Hum, if this was sarcasm, I don’t get it. Name one case where it makes the slightest difference for Joe User, whether both cores sit in the same socket or not?

Actually, dual-core is mostly superior to dual-socket because core-to-core communication is faster than socket-to-socket communication. You also get the possibility of shared caches. The downside is that each core has less available bandwidth to memory, but currently that isn’t a limiting factor. In the future you could get around this by creating a link to each core just like today they go to each socket.

There was really not that much useful information in the article. PThread mutex?? pffft. A good multi-threading article would talk about mutex implementation a bit (for example whether or not they do some spinning in user-mode before entering the kernel) and about some of the higher-level constructs that you might implement such as rw-locks, read-copy-update, and lockless algorithms.
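As a rough picture of the "spin a little in user mode before entering the kernel" idea, here is a toy lock in Python. It only illustrates the control flow: the class name SpinThenBlockLock and the spin count are assumptions, this is not how any pthread mutex is implemented, and under CPython's global interpreter lock the spinning phase buys nothing in practice.

```python
import threading


class SpinThenBlockLock:
    """Toy lock: try a few non-blocking acquires, then fall back to blocking."""

    def __init__(self, spins=100):
        self._lock = threading.Lock()
        self._spins = spins

    def acquire(self):
        # Optimistic phase: the holder may release very soon, so poll briefly.
        for _ in range(self._spins):
            if self._lock.acquire(blocking=False):
                return
        # Pessimistic phase: give up polling and sleep until the lock is free.
        self._lock.acquire()

    def release(self):
        self._lock.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()


def count_with(lock, threads=4, iterations=1000):
    """Increment a shared counter from several threads under `lock`."""
    counter = [0]

    def bump():
        for _ in range(iterations):
            with lock:
                counter[0] += 1

    workers = [threading.Thread(target=bump) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter[0]
```

With the default arguments, count_with(SpinThenBlockLock()) returns 4000. The trade-off a real implementation has to tune is the spin count: too low and short critical sections still pay for a sleep and wake-up, too high and waiting threads burn CPU that the lock holder could have used.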

Could someone with ‘nix experience explain to me why having multiple processes is a bad approach to concurrency?? Can’t you just fork with shared mem and shared handles and call the new process a thread? Isn’t it essentially the same thing? What was slow about the linux thread implementation? Or is it just that the glibc posix functions aren’t well-tuned? I think if you’re actually doing computations on the various threads that a little bit of overhead from kernel synchronization won’t matter because you’re not calling the kernel much. I suppose this really refers to concurrent servers, where the Windows IoCompletionPort mechanism has some real advantages.

The old Linux threading API essentially called fork and tried to share everything. The problem is not everything can get shared cleanly like that. The biggie was signal handling didn’t work like you’d expect. I believe there were issues with privileges and/or file handles in certain cases. The other issue is doing it like this doesn’t scale that well, as you get separate kernel level processes and the related overhead for each, which adds up fast.

What we really need for good threading is programming language support. There are attempts at this.

I suppose what I want is GCC to provide warnings like “variable hash_table used from multiple threads without locking” and being able to build code in debug mode with locking correctness provers similar to what’s been added to the Linux kernel.

This is all because in my opinion, writing multi-threaded code is quite easy. Debugging problems with it is very difficult. Either data is silently corrupted, or you kill performance by going overboard locking everything.
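One common middle path between those two failure modes is to keep each thread's working data private and take the lock only once, when merging. The sketch below is an illustration in Python, not anything proposed in the thread; the function name count_words and the word-count task are made up for the example.

```python
import threading
from collections import Counter


def count_words(chunks):
    """Count words across text chunks using one thread per chunk."""
    totals = Counter()          # the shared hash table
    lock = threading.Lock()     # protects `totals` and nothing else

    def worker(chunk):
        local = Counter(chunk.split())   # private to this thread: no lock needed
        with lock:                       # one short critical section per thread
            totals.update(local)

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals
```

Locking around every single word would also be correct, but every thread would then queue up on the same lock for the whole run. Dropping the lock entirely is the variant that may silently corrupt the table.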

I know tools like Intel’s VTune exist but they’re pretty expensive and not built-in to what I consider the standard Linux development environment: Make, gcc, emacs/vim.

> hash_table used from multiple threads without locking” and being able to build code in debug mode with locking correctness provers similar to what’s been added to the Linux kernel.

You won’t get this without using more abstract programming language devices. That means, the improvements will really come from changes to the programming languages and libraries. Better error messages from compilers alone won’t work as the compiler cannot detect high-level errors (e.g. lock misuse) when you give it low-level code (with locking semantics not “known” to the language itself).

I have not looked into BeOS, but my guess would be that this is exactly what they do: build all kinds of useful thread support functions into the standard libraries, so that application programmers keep their hands off the things that *are* complex about threads.

Detecting race conditions and violations of locking semantics is probably beyond the scope of the GCC project, but this stuff is possible to check at compile-time.

The most advanced static analysis tool is undoubtedly Coverity Prevent. This is the only product I know of that has static checkers for race conditions and locking semantics. It’s proprietary software, but they offer gratis licenses to any OSS project so long as they mention Coverity in any bug reports that result from its use. As they charge megabucks to proprietary software shops, it’s an unbelievable value. Linux, Apache, Mozilla, and MySQL all use it, and so should every OSS project.

I have done quite a bit of multi-threaded programming, and I can’t even begin to express my appreciation of how much object orientation has helped. There is (for the actual processing) no risk of interference, no contention, no locking needed – it just works!

For traditional *nix programmers/programming this may be a hard concept to grasp, as *nix programs are/were (traditionally) inherently single-threaded and often with global data, but just use C++ objects, or even C with proper object encapsulation, and it becomes just obvious. Once you have seen the light, there is no turning back, and I think that’s where BeOS and its API really did shine: it not only used multithreading to a large degree, it showed how easy it really can be!
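A small sketch of what that looks like in practice, written in Python instead of C++ and with made-up names (Job, run_jobs): each thread is handed its own object, works only on that object's data, and the results are collected after the threads have been joined. Nothing is shared during processing, so no lock appears anywhere.

```python
import threading


class Job:
    """Owns its input and its result; exactly one thread ever runs it."""

    def __init__(self, data):
        self.data = list(data)
        self.result = None

    def run(self):
        # Touches only this object's own state.
        self.result = sum(x * x for x in self.data)


def run_jobs(datasets):
    """Run one Job per dataset, each on its own thread, and gather the results."""
    jobs = [Job(d) for d in datasets]
    threads = [threading.Thread(target=job.run) for job in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                # after join(), reading job.result is safe
    return [job.result for job in jobs]
```

For example, run_jobs([[1, 2], [3]]) returns [5, 9]. The encapsulation is doing the work here: it is the division of data, not the objects as such, that removes the need for locking.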

OS/2 (and therefore NT) also used/uses multithreading to a large degree – especially for kernel load-management and delayed execution (using e.g. thread pools with work items). No wonder, if one is aware that IBM has been creating multi-CPU (virtual or physical, or combinations) since before I was even born.

Kernels such as Linux that weren’t designed with SMP in mind, but just got ad-hoc hacks added here and there as an afterthought to make one part or another work (reasonably efficiently, without too much lock contention) in such environments, have had many, many concurrency and performance problems. (Hopefully they are all but a memory today, but see this in light of earlier versions.)

But seriously, the “problem” of programming using multiple threads is really just the “problem” of proper division of work – delegation. This applies IRL too.

I think you just explained to me one of the reasons why the NT kernel is so object oriented. I thought the Object Manager was cool and stuff, but I never really thought about it in the light of multi-threading support.

At least when you have objects, you can conceptually tie together things under lock.

Maybe OpenMP is the future of easy, safe thread-based apps? GCC keeps track of locking, from the high-level point of view down to CPU instruction scheduling. Just place some #pragma directives in your existing code and you magically get a multithreaded version.