Unwelcome Advice

Generally speaking, you don’t want to deliver any kind of difficult news to customers, partners, etc. Some of us are lucky enough to talk to folks about the performance and capabilities of our processors, shipping and soon-to-ship. Some of us, however, face a somewhat more challenging situation: explaining how to tap into this performance. I find myself in this situation often, as I frequently talk to external developers about our ongoing research in programming for multi-core and terascale. The discussion typically goes in one of two directions (the relative distribution has changed over time).

1. Sometimes, the developers are trying to do the minimal amount of work they need to do to tap dual- and quad-core performance…and perhaps stretch this to our DP and MP (dual and quad socket…or up to 16 cores) systems. I suppose this was the branch most discussions took a couple of years ago.

2. Increasingly, we are discussing how to scale performance to core counts that we aren’t yet shipping (but in some cases we’ve hinted heavily that we’re heading in this direction). Dozens, hundreds, and even thousands of cores are not unusual design points around which the conversations meander. Over time, I find that developers migrate their thinking from the first kind of discussion to the second.

We have starkly different conversations about these two paths. For the incremental path, the performance bar is often much lower and the tools that programmers want support a more incremental adoption path. We tend to discuss how to use new tools with old tools, support legacy code that (in some cases) is scarcely supported internally by the developers themselves, and so on. The second path usually requires at least some degree of going back to the algorithmic drawing board and rethinking some of the core methods they implement. This also presents the “opportunity” for a major refactoring of their code base, including changes in languages, libraries, and engineering methodologies and conventions they’ve adhered to for (often) most of their software’s existence.

Ultimately, the advice I’ll offer is that these developers should start thinking about tens, hundreds, and thousands of cores now in their algorithmic development and deployment pipeline. This starts at a pretty early stage of development; usually, the basic logic of the application should be influenced because it drives the asymptotic parallelism behaviors. Consider a common pattern of optimization we’ve seen in single core tuning: the use of locally adaptive algorithms to heuristically reduce the computation time. By definition, this introduces dependences in the computation that are beneficial in the single core case but limit parallelism for multi-core. Similar choices are made about libraries and programming languages that optimize for single core performance (or even small-way parallelism), but sacrifice long-term scalability.
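
As a rough illustration of that trade-off, here is a minimal Python sketch. It is not taken from any real application; the function names (`newton_sqrt`, `adaptive_serial`, `independent_parallel`) and the choice of Newton's method are assumptions made only for the example. The adaptive version warm-starts each solve from the previous answer, which saves iterations but chains every element to the one before it. The independent version does more total work, yet no element waits on another.

```python
from concurrent.futures import ThreadPoolExecutor


def newton_sqrt(x, guess, tol=1e-12):
    """Return (root, iterations) for sqrt(x) via Newton's method from `guess`."""
    iterations = 0
    while abs(guess * guess - x) > tol * max(x, 1.0):
        guess = 0.5 * (guess + x / guess)
        iterations += 1
    return guess, iterations


def adaptive_serial(values):
    # Locally adaptive: each solve is warm-started from the previous answer.
    # Fewer iterations overall, but element i cannot start before i-1 is done.
    results, guess, total = [], 1.0, 0
    for x in values:
        guess, n = newton_sqrt(x, guess)
        results.append(guess)
        total += n
    return results, total


def independent_parallel(values, workers=4):
    # Every solve starts from the same fixed guess: more total iterations,
    # but no element depends on any other, so the work can be spread out.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        solved = list(pool.map(lambda x: newton_sqrt(x, 1.0), values))
    return [root for root, _ in solved], sum(n for _, n in solved)
```

The thread pool here only demonstrates that the work items are independent; in CPython, CPU-bound work like this would need processes or native code to actually run on several cores at once. The point is the shape of the dependence graph: the first version is a chain, the second is a flat set of tasks.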

Eventually, developers realize that the end point is on the other side of a mountain of silicon innovations, but there are two routes: a flat, but potentially longer and more circuitous route around the issues that arise with increased parallelism; and a direct route that the developer largely pays for earlier. Front-loading at least some of this transition is often less costly in the long run and positions them to more competitively reap the benefits of our silicon innovations over time. It’s not quite as simple as this binary choice, but you get the basic idea…program for as many cores as possible, even if it is more cores than are currently in shipping products.

Folks from traditional or emerging HPC verticals either know this (and have known it for many years) or come to this conclusion pretty quickly (*). For more mainstream application developers, this advice is usually unwelcome…but it is an encouraging sign that developers are increasingly coming to this realization on their own.

(*) However, HPC developers have the interesting problem of (depending on how you look at it) scaling down a “plane” of parallelism or scaling up an inner level of parallelism to map efficiently onto multi-core silicon in their clusters/data centers/grids…which has sometimes subtle differences from the single-, dual- and quad-core-based clusters they’ve been used to programming.

34 Responses to Unwelcome Advice

A serious problem for a relatively small class of applications that really are resource constrained today. Some games, large database systems, some scientific simulations, that sort of thing.
But for every application that faces this issue you have tens, hundreds of apps that already run plenty fast enough on today’s hardware. And a lot of them (web-facing apps) are already effectively scaled up – you just start another instance whenever needed and let the OS worry about scheduling.
Yes, for those of us who can make use of this kind of power it becomes a major headache. For most developers, it really doesn’t.

Good advice. One question: I’ve heard that as we migrate to processors with 10’s-1000’s of cores, the performance of individual cores will likely decrease, in favor of overall gains in the aggregate. True?
If so, it seems that systems like the Java VM, which does very sophisticated optimizations oriented for single processes, will require massive retooling.

All nice and good… But how long will it take until 64 and more cores are mainstream?
I can’t wait until 128 and more cores are available. And I don’t care if the cores are running around 500MHz.
There is currently a lot of hype in multicore land, but 2 cores and below are still mainstream. I have some doubts about whether the big players really can ship a product with more than 64 cores for a reasonable price… I think GPUs will dominate the performance markets. Intel and AMD will probably never make CPUs with 16 or more cores, except for some “proof of concept” samples.

Having more cores is fine, but the problem is that the rest of the hardware is not keeping up. Once a server has so many cores, the disk subsystem is not fast enough to keep up with what the CPUs are dishing out.

Let’s focus on Intel’s bread and butter: personal computer processors. One to eight cores today + whatever the (optional) GPU can contribute. And let’s ignore personal HPC such as Rosetta@Home and Folding @Home. The market would say performance is very good in 2008 and looking better in 2009. More than enough for (most) of today’s applications.
As an ex-programmer and IT analyst with 40 years’ experience dealing with complex integration and market research, I am quite sure the billions of lines of legacy programs will not be rewritten for terascale in my lifetime. Moreover, some ubiquitous apps like Microsoft Word and Excel would be hard pressed to use lots of processors when most of the time the application is waiting for a user keystroke.
I would like to see two approaches used to harness terascale personal computing:
1. Remove as much multi-processor and single-processor optimization coding as possible. Let automated tools do the job. The preferable model, and easiest for a mere human coder to deal with, is one program (thread) on one processor. My desktop has 1,095 active threads under Vista at this moment. It took Microsoft more labor to build Vista to manage those threads than the Apollo moon project required. With lots of (X86-compatible) processors, I like the idea of the OS handling the job more than expecting commercial apps to do the job right. The past five year history of PC threading/multi-core is an example of the current way that’s not moving fast enough.
2. Use the terascale processor pool for general purpose, I/O, comms, and graphics processing. If there aren’t enough active threads, put the processors to sleep.
The easiest set of applications to harness terascale processors are commercial server applications like databases, web servers, and apps that already are well-threaded. That suggests Xeon deserves terascale first. However, a couple of decades of systems engineering have gone into excruciatingly complex code to share a small number of cores. If the OS can give every thread its own (virtual) memory, OS, and processor resources, the need for processor contention management gets much simpler.
The tipping point for terascale-aware applications is more than a decade away, IMHO. Nevertheless, I agree with Intel that much can and should be accomplished sooner rather than later. - Peter

If it doesn’t become natural to use, people will not be able to use it. Period.
Think about your Geneva presentation: you require users to optimize their code for a given batch of CPUs (incl. cache size, admittedly with yet-to-be-introduced tools helping in some way).
Supporting code on one platform with evolving revisions (no secret: Linux) is hard already. Exploiting an inhomogeneous CPU farm deployed additionally eats up human resources (think of retirement plans).
Not to mention that although most HPC code is custom (and thereby very often legacy), the most interesting applications are not mere raytracing algos or MC simulations: they are just components of a huge SW stack, which has to deal with data IO, formatting, feeding, etc.
Most people still haven’t realised that the same problems are also killing all incarnations of GRID/Cloud computing (-> we already have terascale, thousands of cores; admittedly with bandwidth issues at the interconnects). If usage is not transparent (which it isn’t) it will not fly.

@janne, once you have a raytracing OS GUI, your possibilities are endless; you CAN create whatever you want.
Once all companies understand the implications of a major programming overhaul, the switch will be quick and painful, and a new era of computing will be possible (think Minority Report). I wonder who will get all parties involved around a big round table?? IBM maybe???

D. Peck, you beat me to it – was about to ask the same question.
Just look at hacks Intel is doing in the 5000 series chipsets to support only a couple of cores, which does not give people confidence that 32+ cores is coming anytime soon – and that’s for the whole system, with the yet-to-be-released CSI.
You guys are Intel research, not some starry-eyed startup with only a deck of PowerPoints. Please do not push your customers to make investments before you’ve come up with something that remotely hints at a solution to the memory and cache architectural problems.

Using up to 64k CPUs at the time (called the Connection Machine 2) we explored new programming models for all kinds of applications. Bring on these > 1000 cores. We have a way to program them called Antiobjects. It’s a good way to address all kinds of AI applications:
Repenning, A., Collaborative Diffusion: Programming Antiobjects. in OOPSLA 2006, ACM SIGPLAN International Conference on Object-Oriented Programming Systems, Languages, and Applications, (Portland, Oregon, 2006), ACM Press. http://www.cs.colorado.edu/~ralex/papers/PDF/OOPSLA06antiobjects.pdf
We would be happy to try our idea on some of these new chips.
Prof. Alexander Repenning, University of Colorado

I doubt Anwar Ghuloum’s opinion. Given Intel’s current core size, I think a hundred-core CPU is impossible. A Nehalem core is about 25mm^2, not including L2 cache. Even if Intel moves to a 22nm process, it can build at most a 16-core CPU, not a hundred-core CPU. And 22nm is currently the limit of Intel’s process capability.
Obviously, Intel has made its CPU core too complex and too big, which will keep it from boosting performance much over the next few years. Without a doubt, Nehalem’s performance is good, but rival AMD’s CPU core is far smaller than Nehalem’s. AMD can build a 6-core CPU at 45nm, while Nehalem manages only 4 cores at 45nm. Those 2 extra cores give AMD an advantage and let it outperform Nehalem.
If Intel really wants to build hundred-core CPUs, it must simplify the core and stop developing ever more complex instruction sets, such as SSE 6, which make the core hard to shrink.

I think the author wants you to expand beyond simple parallelism. If you had 1000 cores, what would you do? Manually writing code to use them, even with the best MPI and other libraries, is still a challenge.
The key is that we move to the next level of what I like to call the “Adaptive CPU”. With 1000 cores, why can’t we develop smarter hardware? Why can’t the hardware with all those cores analyze the running code to parallelize it? For example, you run your branch prediction across multiple cores at the same time. You take a miss? Well, just point to the register on the core that is correct. Or the hardware recognizes an iterative loop or recursion and executes it across many resources at once.
Even better, a neural network in hardware. Since, as many astute readers point out, applications run just fine on 1 core, a neural network of cores could analyze the code and make predictions. I’m not dreaming here, but if you have 1000 cores, you might as well get them helping the process rather than just being more raw execution units.
Having smart hardware assist that can essentially reassemble code in flight to make it more efficient would be great. I remember reading Intel had research going on in this area for software. I believe the belief is even single threaded code can be parallelized using different algorithms. Extending it to the hardware would be a logical step and save the industry a lot of grief.

The idea that we have to redesign software implies that you’re thinking of having the processors work collaboratively (i.e., not 1000 separate programs doing different things).
This would be where Amdahl’s Law kicks in…. D. Peck and sw guy have very valid concerns about cache and memory; even if these get fixed, many applications have serialization that will put a cap on the possible speedup with more cores. While some applications can get great parallel speedup (ray tracing, graphics in general), it’s not all applications — and not even most of them.
There’s a reason that Thinking Machines, Transputer, MasPar, and a ton of others went out of business….
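
For readers who want to put numbers on the Amdahl's Law point, a small Python helper is sketched below. The function name `amdahl_speedup` and its parameters are chosen only for this example; the formula itself is the standard one, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of cores.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when `parallel_fraction` of the work scales
    perfectly across `cores` and the remainder stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: 95% of the runtime is parallelizable.
    for n in (2, 16, 128, 1000):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

With 95% of the work parallelizable, the bound stays under 20x no matter how many cores are added, because the 5% serial portion never shrinks. This ignores the cache and memory effects raised earlier in the thread, which would lower the figure further.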

Telling people to prepare for thousands of cores sounds very nice but what you are really saying is that they should begin to use thousands of threads in their applications. This is absurd to the extreme. Programmers have enough trouble juggling a few threads as it is.
The truth is that, deep down, you guys at Intel know that multithreading is not the answer to parallel computing. There is no use in denying it (I know; you’ve visited my blogs hundreds of times in the past few months). You cannot bring yourself to admit it because the admission would be too painful for some of you. You are living in a fantasy world.
Fortunately, some of us have our feet firmly planted on the ground and we can read the writing on the wall. There is an alternative way to design and program parallel computers that does not involve threads at all. Multithreading’s days are numbered. Read this article to find out why: Parallel Computing: Why the Future Is Non-Algorithmic
My (hopefully, not unwelcome) advice to Intel is, don’t be too complacent or too confident while the ground is shifting underneath your feet. The rest of the world is fast realizing what the correct solution to the parallel programming problem is going to be and they can clearly see that threads are not in the picture. You may wake up one day and find yourself on the wrong side of the parallel revolution.

I think that Intel is not being very creative in their recommendations. They would prefer shrink-wrapped software that states on the box “32 or more cpu cores required”. But this is not how software adds value for the customer. Web browsers, word processors and spreadsheets are already fast enough, thanks. Nobody is going to artificially parallelize tasks that are hardly serializable.
So I think Intel should come up with software suggestions that require multi-core CPUs in order to function well and that add value for the customer. Raytraced GUIs are a particularly bad example; after all, there’s a limit to the eye candy one can sustain. How about focusing on user interaction? I can imagine CPU core clusters that run OS services performing speech-input analysis or gesture recognition, with a hardware architecture that easily allows the OS to allocate a number of cores and a memory area to a service’s thread pool. A kind of “service oriented architecture” for CPU cores, if you like. That would make it much easier for application programmers to accelerate exactly those operations that could benefit from the multiple cores. But requiring programmers to parallelize tasks at all costs is, sorry, downright ridiculous.

Back in the 80s, when there was a similar apparent technological obstacle in developing faster processors, we were told that the transputer and the Occam parallel programming language were the only way of the future. We escaped back then (call it a lucky or an unlucky escape). I honestly hope this current hype will die a similar death – perhaps a truly innovative manufacturer will come up with the necessary technology – just as the obstacle was overcome in the 80s. I very much disrespect Intel for trying to hide the fact that it essentially gave up on the innovation front. Sir, we DO need fast single-thread performance, otherwise we could just use several tens of thousands of 4MHz 8088 cores, eh? (Before you dismiss me as being an anti-parallelism advocate, I’ll let you know that my thesis was a modern distributed Occam implementation.)
Another thing is, a modern PC already has HUGE computing power… In fact, CPUs that were available 5 years ago already had such power. In my experience, everywhere I looked, I saw idle PCs running Microsoft Word, waiting for keystrokes (with 3 GHz processors). And because of your clever marketing, even these PCs were regularly upgraded in the enterprise sector. In 10 years, will we have 500-core machines running Microsoft Word and Solitaire? Will that be good for us?

It’s not a fantasy world where people write code that scales to thousands of threads. Second Life does it with microthreading. Erlang does it. Timber does it with active objects. There are several nice patterns and technologies that can be used to build scalable parallel apps.
First of all, I think we need good microthreading libraries, like those that Second Life uses. Libraries that can do microthreading without the programmer’s awareness. These should be integrated into environments like Mono and .NET that can automatically switch between threading models, or move threads within a cluster. This would make it possible to use parallel patterns without giving up performance when many threads are running on one core.
The next step is to promote parallel patterns like active objects, for example by extending languages like C# and Java to handle them well. Transfer functionality from Timber and Erlang to these languages.
Give us the tools and I’m sure the code will come.
I also think that the current focus on multicore processors has been bad for the old multiprocessor architectures. If I want to run two quad-core processors today, I either have to buy an expensive Xeon/Opteron setup or build myself a cluster. Decently priced motherboards that can take 2, 4 or 8 consumer CPUs are not around anymore. Such technology would mean more processor sales, more cores per computer and more pressure on developers to go parallel.

By the way, active objects should communicate through messages and not shared variables. Locks should be avoided by not sharing data. It’s already common to keep local variables private and share using getters and setters.
Locks are probably the most abused technology ever.
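
A minimal Python sketch of the active-object idea described above follows. The `Counter` class and its message names are hypothetical, invented for this illustration, and it is not how Timber or Erlang implement the pattern. The object's state is touched only by its own worker thread; callers interact with it purely by posting messages to an inbox.

```python
import queue
import threading


class Counter:
    """Active object: private state owned by a single worker thread."""

    _STOP = object()  # sentinel message that shuts the worker down

    def __init__(self):
        self._inbox = queue.Queue()
        self._total = 0  # only the worker thread reads or writes this
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Messages are handled one at a time, in arrival order.
        while True:
            message = self._inbox.get()
            if message is self._STOP:
                break
            kind, payload = message
            if kind == "add":
                self._total += payload
            elif kind == "get":
                payload.put(self._total)  # reply on the caller's own queue

    def add(self, amount):
        self._inbox.put(("add", amount))

    def get(self):
        reply = queue.Queue(maxsize=1)
        self._inbox.put(("get", reply))
        return reply.get()

    def stop(self):
        self._inbox.put(self._STOP)
        self._thread.join()
```

The queue does its own synchronization internally, so locking has not vanished entirely. What changes is that application code never shares mutable data and never takes a lock itself, which is the discipline the comment is arguing for.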

It seems clear to me that this is what Intel wants developers to focus on. Our software is supposed to provide the market for these future processors. How about providing free downloads of compilers and libraries?

For those who are s***** programmers like this submitter, there are ways to thread like this. It’s not about juggling a few threads like they do now (by splitting the work into audio, visual, etc.) but about coding it to spread all work evenly. The submitter is stuck on trial-and-error math while the world uses algebra. Trickier, more expensive, but with a higher payoff.

The idea of possibly thousands of cores being available must be one of the dumbest ideas I’ve ever heard. Performance tests are already showing the limits of multicore performance. There is only so much that any algorithm can be practically parallelised, and lots of algorithms have dependent steps where one step depends on another already being done. Eventually WE WILL NEED MORE CPU CYCLES PER SECOND.

I am looking forward to the time when each character I enter into MS Word is handled by a different processor.
Seriously, I don’t see the benefit of lots of CPUs on a PC. Maybe you can share frames of a video between several CPUs, but the bottleneck will still be the hard disc and the network.

The Erlang “solution” to the parallel programming problem is not the solution, otherwise we would not be here discussing the problem. The functional programming model has major drawbacks, not the least of which is that most programmers are not familiar with it and have a hard time wrapping their minds around it.
Another problem is that the Erlang message passing model forces the OS to copy entire arrays onto a message channel. This is absurd because performance takes a major hit. Shared memory messaging is the way to go. Certainly, using shared memory can be very problematic in a multithreaded environment but that is not because it is wrong. It is because multithreading is wrong. Threads are inherently non-deterministic and, as a result, hard to manage. Memory (data) is the environment that parallel agents use to accomplish a task. Agents sense and react to changes in their environment: reactive parallelism is the way it should be. Memory should be shared for the same reason that we humans share our environment. Switch to a non-algorithmic, deterministic model that does not use threads and all the problems with shared memory will disappear.
Furthermore, in spite of its proponents’ insistence on the use of words like “micro-threading”, Erlang enforces coarse-grain parallelism. In my opinion, if your OS, or programming language or multicore processor does not use fine-grain parallelism, it is crap. There are lots of situations that call for fine-grain processing. I have yet to see a fine-grained parallel quicksort implemented in Erlang.
Finally, a true parallel programming paradigm should be universal. There is no reason to have one type of processor for graphics and another for general purpose computing. If that’s what you are proposing, then it is obvious that your model is wrong and in serious need of being replaced.
In conclusion, I will reiterate what I have said before. What is needed to solve the parallel programming crisis is a non-algorithmic, synchronous, reactive software model. By the way, I am not the only one ringing the non-algorithmic/reactive bell. Check out the work of Peter Wegner and Dina Goldin at Brown University. They’ve been ringing that bell for years but nobody is listening. You people are hard of hearing. And you are hard of hearing because you are all Turing Machine worshippers. It’s time for a Kuhnian revolution in computer science. Parallel Computing: Why the Future is Non-Algorithmic

@Louis Savain
I am curious. One of your reasons for discarding Erlang is because it requires developers to become familiar with the alien functional programming model, but your proposed alternate solution of going Non-Algorithmic is far, far more alien.
What am I missing?

Memory architectures, not processor parallelism, are likely to be the bigger issue as we approach the problem of programming for hundreds of cores. If we are to program for the future, what will the future aspects of memory latency, locality and bandwidth look like? Chips will likely become pad-limited before we can reach a dozen channels of DDR to a single chip. Will Intel go to a high-latency (per core), high-bandwidth GDDR connection per socket like nVIDIA’s already massively parallel GPUs? What will future cache hierarchies look like? Strict L1-L2-L3-etc. or something more complex like a GPU? You can’t just claim some programming language or technique is going to be your magic bullet; they may help but you always need to program to the virtues and flaws of the hardware at hand if you want decent performance. If you don’t, those who do will push you out of the market.

@Michael Bacarella
The non-algorithmic synchronous software model that I am proposing is not rocket science. It will be familiar to anybody who has taken a course in Boolean logic. If you can understand a logic circuit, you’ll have no trouble grasping non-algorithmic programming. Besides, it facilitates the use of plug-compatible modules and opens the way to drag-and-drop programming.
@Anonymous
I agree that the memory bandwidth problem is a nasty one. It’s the biggest problem facing the multicore processor industry, even worse than the parallel programming crisis, in my opinion. Unless there is some sort of breakthrough in quantum tunneling or optical memory, we will probably need a whole new kind of computer that does away with the central processor. Unless the problem is solved soon, the industry is in serious trouble.

I totally agree that having more cores is better, but the problem is that the rest of the hardware is not keeping up. Once a server has so many cores, the disk subsystem is not fast enough to keep up with what the CPUs are dishing out.

I also agree with Treyslay that performance depends not only on the processor but also on the associated hardware and memory devices. Hence all the hardware that supports the processor should also be upgraded when you move to more cores. Thanks, Anwar, for posting helpful information.

As a non-professional in this field but an ever-curious soul, I have always wondered if the future of many-core processors will be that of, say, 128 very low-power (maybe Intel Atom grade) RISC cores, as opposed to the comparatively powerful monolithic cores being manufactured now.
If this were the case, we may have a serious issue with legacy software that is single threaded and had developers that made the assumption that multi threading would not be required of their low intensity software. Is it really that likely?
Further than that, the average Windows installation runs between 50 and 130 processes; do we really need to “parallelise” any but the most resource-intensive programs?

I agree with Treyslay that performance depends not only on the processor but also on memory devices, because when you load a lot of heavy software you need extra memory as well. Hence all the hardware that supports the processor should also be upgraded when you move to more cores. Really useful info.