Interview with Intel's James Reinders

Here at DaniWeb, we've talked a lot about Intel Parallel Studio. I recently had the chance to sit down with Intel's James Reinders and find out more about his take on Parallel Studio 2011. Mr. Reinders (pronounced Rhine-ders) is a senior engineer at Intel and has been with the company since 1989. He's the "chief evangelist" for their Parallel Studio product, so as you can imagine, he had a lot to say about it. We had a great interview, and here's what came out of it.

DaniWeb (JC): You wrote an O'Reilly book. Tell me a little bit about what that is.

Reinders: It's a nutshell book. Its title is Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. It came out about 3 and a half years ago at the same time that we open-sourced Threading Building Blocks. It's what we thought of as a complete set of what you need to add to C++ to do parallel programming effectively. We've got algorithms; we've got data structures because STL is not thread safe, which required fixes on data structures; we've got portable locks and atomic operations. And by "portable" I mean the language doesn't define how to do an atomic operation or lock, so you tend to use critical sections on Windows, and mutexes on Linux. Then the next thing you know you've got code that doesn't compile in both places unless you do #ifdef's all over. Well we just have an equivalent operation for that; you use it, and it'll compile everywhere. And it has some other things, timers, and so on, and it implements a task-stealing algorithm.
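The portability problem he describes (critical sections on Windows, mutexes on Linux) is exactly what TBB's portable locks and atomic operations hide behind one interface. As a rough sketch of the same idea, note that C++11 later gave the language itself portable primitives; the function name below is my own illustration, not TBB's API:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Illustrative sketch: a portable atomic increment that compiles
// unchanged on Windows and Linux, the way a TBB atomic does --
// no #ifdef between CRITICAL_SECTION and pthread_mutex needed.
int count_with_atomics(int n_threads, int increments_per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    return counter.load();
}
```

The point mirrors his: write against one portable abstraction and let the implementation map it to the platform's primitives.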

It's been amazing. I enjoyed putting the book together. I've gotten some really great feedback on the book. It's been translated to other languages, including Korean, Japanese, and Chinese, which is just stunning to me that O'Reilly decided there was that kind of demand. And threading building blocks, of course, has really taken off. It's been mentioned as the most widely-used method for parallelism in C and C++ now, so it's an easy way to add parallelism to your applications.

DaniWeb (JC): You mentioned that on Windows, when using native threads instead of Threading Building Blocks, you do it differently from, say, Linux. Windows has various parallelism APIs built into the operating system. Are you calling into the operating system in the building blocks library?

Reinders: When it's appropriate. In the generic source, we can use critical sections and mutexes, but in some cases we have deeper hooks into the instruction set. To keep Threading Building Blocks portable, it understands the higher-level things it can call into, and in some cases the assembly-language features are available for the processors it knows about. There are also some people who have ported it to take advantage of Sparc processors, PowerPCs, and so on. Sometimes it's smarter to do those hand-coded parts in assembly. You have to make sure at the end of the day that it all works together. It's fully interoperable.

DaniWeb (JC): According to the web site, Parallel Building Blocks is one of the new features of Parallel Studio; it didn't exist in the previous version. First, tell me what the difference is between Parallel Building Blocks and Threading Building Blocks, and then tell me more about both.

Reinders: Threading Building Blocks, or TBB, which is what the O'Reilly book is about, is an open source library. Parallel Building Blocks, on the other hand, is sort of a name for a family of models.

I like to think of Parallel Studio 2011 as being made of two parts. One is tools to help a developer: compilation tools, debugging, performance analysis, and so on. The other is programming models, such as Parallel Building Blocks, which builds on what we've done with Intel Threading Building Blocks but creates a somewhat more general family of models that work well together. So if you take a look at Intel Threading Building Blocks, which is a key part of this family, it's a template library aimed at C++ programmers (but you can use it in C, and some people do), and a template library that's highly portable.

But the compiler doesn't really know what you're doing; it can't help you directly. The other challenge is that, as a template library, it doesn't really address data parallelism directly. So as I think about Intel Threading Building Blocks, the two biggest limitations on it are: one, we didn't have a compiler-assisted version because it's a template library. And two, we didn't address data parallelism straight on. So Parallel Building Blocks, in addition to having Threading Building Blocks, also includes two new things – Intel Cilk Plus, and Array Building Blocks, or ArBB.

Cilk Plus consists primarily of three keywords that are added to the language that our compiler understands. One thing that strikes people right away is it looks more comfortable in the code than calling a template. They're incredibly easy to understand; one keyword is just cilk_spawn, which you follow by a function call. Another is cilk_sync, which just says that I've spawned some functions, and let's wait until they all finish. And then the other is cilk_for, a parallel for loop that you can use instead of a regular for statement, and the iterations will be executed in parallel. Three simple keywords, they look very intuitive, very easy to use, and the compiler then understands that you're introducing parallelism and can do a variety of things behind the scenes for you.
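The cilk_spawn/cilk_sync/cilk_for keywords require Intel's compiler (or another Cilk-aware one), so as an approximate analogue in standard C++, the same fork/join idea can be expressed with std::async; the mapping in the comments is my own, not from the interview:

```cpp
#include <future>

// Rough standard-C++ analogue of Cilk Plus fork/join. A real Cilk
// runtime uses work stealing rather than one OS thread per spawn,
// so this is only a sketch of the semantics, suitable for small n.
long fib(long n) {
    if (n < 2) return n;
    // cilk_spawn fib(n - 1): run the call potentially in parallel
    auto x = std::async(std::launch::async, fib, n - 1);
    long y = fib(n - 2);
    // cilk_sync: wait for spawned work before combining results
    return x.get() + y;
}
```

In Cilk Plus the compiler sees the keywords and can schedule, balance, and analyze the parallelism itself, which is exactly the advantage over a library call that he describes.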

There are a couple of other things included in Cilk Plus. One we call hyperobjects. Once you start executing functions and loop iterations in parallel, global variables can become problematic. If you have a single variable that you're updating from multiple parallel instantiations, that can be a problem, and you want to solve it with a reduction. Cilk Plus has a very clever thing where you take a global variable, change it to a hyperobject, and mention that it's for a summing reduction. Behind the scenes, it creates separate copies of the variable for every task that you kick off, and it helps you do the reduction. One of the challenges with a reduction: on a quad-core you're going to spin off four tasks, on a ten-core you're going to spin off ten; you want to create as many copies as there are cores. The thing is, when you say cilk_for, you never say how many ways it's going to be split up, because at runtime we figure out what type of machine we're on. The hyperobject is the equivalent for the data side. It replicates the data as much as necessary for the machine you're on, and it helps you with the reduction. It's a really clever idea. This idea came out of Cilk Arts (a company we acquired last year). It's an incredible addition to the language, and it's very practical. Once you start throwing in parallel fors and parallel spawns, this is the next thing that hits you really hard. The hyperobject is simple to use and very effective for adding parallelism.
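What a summing hyperobject does under the hood can be sketched in standard C++: give each worker a private copy of the "global" sum and combine the copies at the end, so there is no shared update on the hot path. The function name and fixed chunking below are my own illustration:

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Sketch of a summing reducer: one private "view" of the sum per
// worker, then an explicit reduction step at the end. A Cilk Plus
// hyperobject creates the views and does this merge automatically.
long reduced_sum(const std::vector<int>& data, int n_workers) {
    std::vector<long> partial(n_workers, 0);   // one copy per worker
    std::vector<std::thread> workers;
    size_t chunk = (data.size() + n_workers - 1) / n_workers;
    for (int w = 0; w < n_workers; ++w)
        workers.emplace_back([&, w] {
            size_t lo = w * chunk;
            size_t hi = std::min(data.size(), lo + chunk);
            for (size_t i = lo; i < hi; ++i) partial[w] += data[i];
        });
    for (auto& t : workers) t.join();
    // the reduction the hyperobject performs for you
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

The appeal of the hyperobject is that all of this bookkeeping, including picking the number of copies to match the machine, disappears from user code.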

And then we allow for some array notations. When we look at what we can do for data parallelism, it turns out to be incredibly valuable these days. A lot of programs want to stream a lot of data through and do operations on it: images, video, audio signals. Our microprocessors give us two wonderful ways to handle that parallelism. Unfortunately, using them both at the same time at the coarse level is a nightmare. If you're trying to use SSE or the new AVX instructions, you structure your loop a certain way so you can present the data to the processor a few elements at a time. That's called vectorization, and the compiler can do it for you. The other approach is that you've got multiple cores, so I divide my data up and give a chunk to one core, and so on. But if you try to do both of those together in a loop manually, you can do it, but because of the work, you're only going to do it for really important loops, and you're going to pull your hair out trying to get it right. The innermost loop you're going to want to vectorize, and you're only going to want to do a certain number of chunks at a time; the outer loop you want to break up by cores. Well, why are we doing any of that? What we really should say is: I want to, for example, do an array multiplication. Then tell the computer, "You figure out how to spread that out across SIMD, SSE, and AVX, and whether I have four cores, ten cores, whatever."
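The hand-coded two-level split he calls a nightmare looks roughly like this in standard C++: the outer loop is chunked across cores by hand, and the inner loop is kept as a plain contiguous loop so the compiler's auto-vectorizer can turn it into SSE/AVX. All names here are illustrative:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Manual two-level data parallelism: threads for the cores,
// a simple contiguous inner loop for the vector units.
void multiply_arrays(const std::vector<float>& a,
                     const std::vector<float>& b,
                     std::vector<float>& out, unsigned n_cores) {
    std::vector<std::thread> pool;
    size_t chunk = (a.size() + n_cores - 1) / n_cores;
    for (unsigned c = 0; c < n_cores; ++c)
        pool.emplace_back([&, c] {
            size_t lo = c * chunk;
            size_t hi = std::min(a.size(), lo + chunk);
            for (size_t i = lo; i < hi; ++i)   // vectorizable inner loop
                out[i] = a[i] * b[i];
        });
    for (auto& t : pool) t.join();
}
```

Even this tame version hard-wires the chunking and the thread count, which is exactly the per-loop busywork he argues the programming model should take over.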

In Cilk Plus we've extended the C/C++ language so that you can operate on arrays and vectors, and we've also introduced Array Building Blocks, which comes out of two things: a research project out of Intel called CT, combined with our acquisition of RapidMind. We've merged the two to create Array Building Blocks. It's a very sophisticated way to handle data parallelism. We've got a couple of neat demos of it here. Both are very similar in that they've got a C program, and they're using Array Building Blocks. They're both pretty readable, but the code doesn't try to take advantage of SIMD or multi-cores; instead, the Array Building Blocks system is able to take care of that and perform the data parallelism automatically. It's about a ten-times difference from non-parallel because of the combination of SIMD and quad-core. It's very compelling, and we get back to writing code that looks like what we wanted it to look like, as opposed to having all these iterations unrolled.
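ArBB itself is a separate Intel library, but the flavor of "state the whole-array operation and let the system decide how to run it" has a modest standard-C++ analogue in std::valarray, whose operators apply elementwise; this sketch (function name mine) only shows the notation, not ArBB's automatic SIMD/multicore execution:

```cpp
#include <valarray>

// Whole-array expression with no explicit loop: each operator
// applies elementwise across the arrays. ArBB-style systems take
// expressions like this and map them onto SIMD lanes and cores.
std::valarray<float> scaled_product(const std::valarray<float>& a,
                                    const std::valarray<float>& b) {
    return a * b * 2.0f;
}
```

The readability point from the demos carries over: the code says what is computed, and the "how" is left to the runtime.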

DaniWeb (JC): So the code looks readable and is intuitive.

Reinders: Right, and then magically underneath it just automatically vectorizes and distributes it. So together all these things help us with parallelism, and we give it the banner name of Parallel Building Blocks. And that includes the Cilk Plus, and the Intel Threading Building Blocks and the Array Building Blocks. Remember, Threading Building Blocks is open source. Cilk is integrated into our compiler, which isn't open source, but we're hoping the extensions make sense to others and that other compiler vendors will consider adding it. And we'll work with anybody interested. And Array Building Blocks is still in beta, and we're taking feedback on what sort of openness there makes sense.

DaniWeb (JC): So let's say you use Parallel Building Blocks, and you write a program that makes use of all that, and then somebody runs it on a computer that has an AMD processor in it.

Reinders: It'll just work. Absolutely. There's nothing specific to Intel that will affect the parallelism. With Threading Building Blocks, one of the reasons it's been popular is that people are running it on all sorts of different machines—even Xboxes!

DaniWeb (JC): Let's talk a bit about education. Back when I was in school in the late 80s, I did some graduate work and they were talking about parallel programming back then. They were multiple separate processors built by Cray and others. Do a lot of the techniques they came up with back then still apply today with our multi-cores?

Reinders: Absolutely. The fundamentals of how you break up algorithms and distribute them still apply. You're looking for where the coarsest parallelism is so you can fire off work that's independent. It's kind of like dividing tasks among multiple people. If you do it wrong and they're always having to consult with each other, they don't get a lot of work done. That sort of fundamental thing is there.

What's changed, though, and what really fascinates me, is parallelism is everywhere today, and it's cheap. When we were working with those supercomputers back then, they were expensive, and you had to get everything out of them. You didn't just play around and try using them for things. The cell phones today have more power than those supercomputers 20-30 years ago. A lot of the apps you have on your phone, you would never have run on a supercomputer. Now we have parallelism everywhere and it can be used to enhance the user experience in many ways. What's really interesting to me is looking at how many application areas will probably be investigated for parallelism that we never looked at in supercomputing. Maybe they're not totally efficient, but who cares. As long as the app works well. So that to me is a really exciting element of it. I have a deep background for years in supercomputers, and this is an exciting area. You don't have to worry about getting the ultimate speed up; you want it to be good enough so you can ship the app. It's a big change.

DaniWeb (JC): What should people do if they want to get started in parallel programming? And what about students learning serial programming first and then later in life learning parallel? Sort of like when I first learned object-oriented: there was "We could do it this way, but for this project we're going to 'turn on' object-oriented." But over time it just became automatic, and leaving out classes is like programming with one hand tied behind your back. So if students are learning serial programming and they get out, then suddenly there are all these other parallel aspects. What do you think for people today who want to learn it—what kind of hurdles have they got to come over, and is it feasible for students to start out at square one learning parallelism?

Reinders: I think, first of all, if we are teaching people programming, we should introduce parallelism as a concept pretty early on. I don't think that means that we change the curriculum over completely or that we substitute parallel programming classes. But, for instance, I think it's a mistake to talk about data structures and spend an entire semester or year and never talk about the implications of concurrency or parallelism. If you're building a list but it can't be accessed concurrently, you should at least understand when that applies and when it doesn't. You're going to be writing applications in a concurrent environment, and you need to understand concurrency. It needs to be a fundamental part of classes that will touch on it.

Now, the University of California Berkeley starts its freshmen off with a class that teaches map reduce early on. When I found out that's what they're doing, I thought, Wow, that's an easy one. And you can tell people later all the complications, like not everything is that parallel. But you should see some parallelism. So I think teaching parallelism at an undergraduate level is critical. I think any curriculum that doesn't integrate parallelism is making a big mistake. I see hundreds if not thousands of universities that have made some moves in this direction. It'll take time to figure out the best ways to teach it, but I think introducing map reduce, and looking at parallelism when you're doing data structures—those are things to do early on.

But what to do about all of us who learned serialism first and then want to learn parallelism later in life… Well, it's probably a lot to expect us to re-enroll and go through undergraduate again [laughs] and hope that we pick up the tidbits here and there. I can't stress enough the importance of just jumping in and trying things. So the question would be, What do you want to try? I would stay away from programming to Windows threads and pthreads directly. The reason I say that is, as interesting as it is, think about assembly language programming: how important is that to most programmers? Well, if you really want to understand all the internals of the machine, assembly language taught you that when you took that course in school. But other people just sort of toss that out and say, I work at the Java level, or I work at the Fortran level. When you look at parallel programming, Windows threads and pthreads are like the assembly-language version of it. Traditionally, that's all that we've really taught, because parallel programming meant getting out the crudest tools possible. Now we have the opportunity to program at much higher levels. And this has been coming on board for a while. Research projects like MIT's with Cilk and Intel's with CT—and there are many others out there, and have been for decades—are starting to produce these new products; now we've got Array Building Blocks, Intel Threading Building Blocks, and a number of technologies that are now products.

And if you're a C++ programmer, I'd go grab Intel Threading Building Blocks. If you'd rather focus more on C, I'd take a look at the Cilk Plus things. Get at least an evaluation copy of the compiler from Parallel Studio and play with that. Microsoft and .NET have their parallel extensions, too, which are worth taking a look at. Stay at a high level. Just do some simple things. Start with a parallel-for, or take a look at how to do a map reduce algorithm. And just start doing it! Do simple things, and what I hope happens is you realize two things. One, you start to algorithmically understand what makes a good parallel program and a bad parallel program as far as how to decompose data, and your brain will start to shift a bit in how you think about applications. And the other thing is you'll start to understand deadlocks and races, what causes them, and it'll become intuitive.

I've been doing parallel programming long enough that when I look at an algorithm, I'm thinking of how to do it in parallel. And I'm thinking about how the data and tasks can be separated so they don't overlap any more than they need to. And the other thing is I'm not focused on efficiency of a single thread as much. And that takes a little practice. If you have a wonderful algorithm that runs in, say, ten seconds, but it can't scale—you run it in two cores, and it still runs in ten seconds—then compare that to one that runs in twenty seconds on a single core, but it scales. Maybe in two cores the scalable one runs in ten seconds; well, that doesn't sound like a huge win. But when you go to four cores, let's say it runs in six seconds. That's scalable. And some ways of writing programs will scale, and some ways won't.

In the past we've never paid attention to that kind of scaling. When I was learning programming, we never talked about scaling, because it was always going to run on one processor. But scaling becomes an issue because you can't automate it. It's about the algorithm; you don't write a bubble sort and hope that your compiler turns it into a quicksort. There are a lot of algorithms out there that could be recoded. How did somebody look at the sorting algorithms and come up with quicksort, if you've been doing bubble sorts your whole life? It's an algorithmic thing. Parallelism has that challenge to it, and it's the one challenge of parallelism that's truly eternal: writing a good, scalable algorithm is something that good programmers learn and do. Things like Threading Building Blocks take care of the housekeeping. The things you would have needed to know in the assembly-language version, you don't really need to worry about. But on the algorithm side you need to worry about it.

DaniWeb (JC): So the more you do it, it becomes natural, just like my object-oriented programming analogy. I don't "turn on" object-oriented programming; it's just natural to me to use it.

Reinders: Right, so I would take advantage of any opportunity to try coding things up; it can be frustrating if you're trying to do it on a schedule before parallelism has become intuitive to you. I hope that tools like Parallel Studio can go a long way in helping to relieve the schedule pressure. When the program wasn't working, in the past you were just going to have to be brilliant and figure out where that data race was. Now the tool can tell you directly where it is, or the tool can tell you what's limiting the scaling—which loop, which lock. In the past we just guessed. These were intelligent guesses, based on our expertise, and we were right some of the time. And the other times we would just beat our heads against the walls. But with tools like Parallel Studio, I'm amazed at what we're able to do automatically, like deadlock detection and scaling analysis, that helps make it possible to go in and write a parallel program on a schedule. So let me ask you, have you played with Parallel Studio?

DaniWeb (JC): Yes! I reviewed the first version a year ago, and when I was playing with it, that's when it hit me—years ago I was learning how to write algorithms for multiple processors, and now we have these multiple cores on a single chip, and it's basically the same thing. For some reason it had not hit me that those multiple cores are doing the same thing. And then I was remembering what they taught me back then, and I could finally try out all this stuff they taught me back then.

Reinders: Yeah, I took a parallel programming class at University of Michigan when I was a Master's student, and we were supposed to get a Cube in. It didn't show up during the term I took the parallelism class! [laughs] So everything was done on paper. I loved the class, but I would have loved to try it out. But it wasn't until I was employed that I tried it out, and when I actually saw it speed up, I was like, Wow! And now, with multiple cores, we can all do it, and that's exciting.

DaniWeb (JC): It's much more accessible now. It's not just a matter of learning it and hoping you can find a supercomputer; you can actually try it out right there on your own computer.

Reinders: I think that's going to lead to amazing things. I know that any time we've taken the power of supercomputers and put it on a desktop for scientists, they tend to dabble and try things they might not have done otherwise. Now you look at the power of these handheld devices, people are putting amazing apps on those. They never did that when the computer was less accessible. Now it's in your hand, people do some amazing things. Parallel Studio is like that for us. We know that we're making it much, much easier for people to do parallel applications, and much more likely we'll see parallel applications because of Parallel Studio. But I can't even begin to imagine what people will come up with.
