
The BlueGene/Q processors that will power the 20 petaflops Sequoia supercomputer being built by IBM for Lawrence Livermore National Laboratory will be the first commercial processors to include hardware support for transactional memory. Transactional memory could prove to be a versatile solution to many of the issues that currently make highly scalable parallel programming so difficult. Most research so far has been done on software-based transactional memory implementations; the BlueGene/Q-powered supercomputer will allow much more extensive real-world testing of the technology and the concepts behind it. The inclusion of the feature was revealed at Hot Chips last week.

BlueGene/Q itself is a 64-bit PowerPC-based system-on-chip built around IBM's multicore-oriented, 4-way multithreaded PowerPC A2 design. Each 1.47-billion-transistor chip includes 18 cores. Sixteen will be used for running actual computations, one will run the operating system, and the final core will be used to improve chip reliability. For BlueGene/Q, a quad floating point unit, capable of up to four double-precision floating point operations at a time, has been added to every A2 core. At the intended 1.6GHz clock speed, each chip will be capable of a total of 204.8 GFLOPS within a 55 W power envelope. The chips also include memory controllers and I/O connectivity.
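(Those numbers line up if each lane of the quad FPU is counted as performing a fused multiply-add, i.e. two floating point operations per cycle, which appears to be the accounting in use: 16 compute cores × 1.6 GHz × 8 flops per cycle = 204.8 GFLOPS. By the same arithmetic, 20 petaflops divided by 204.8 gigaflops per chip works out to roughly 98,000 chips, consistent with the "about 100,000" figure below.)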

The 18th core is a redundant spare. If a fault is detected in one of the chip's cores, that core can be disabled and its work transparently remapped to the spare. Detection and remapping of faulty cores can be done at any stage of the system's manufacture—not just when the chip wafer is being tested, but also after the chip has been installed into Sequoia. Sequoia will use about 100,000 of the chips in total to reach its 20 petaflops target. That huge scale makes the ability to remap faulty cores important: IBM estimates that, given the number of chips in the supercomputer, one chip will fail every three weeks, on average.

Traditional multithreading: locks and serialization

Transactional memory is an approach to concurrency that has the potential to make efficient parallel programming a great deal easier than it is today. Parallel programming is easy when a task can be broken up into many independent threads that don't share any data; each part can run on a processor core, and no coordination between cores is necessary. Things get more difficult when the different parts of the task aren't completely independent—for example, if different threads need to update a single value that they share.

The traditional solution is to use locks. Every time a thread needs to alter the shared value, it acquires the lock. No other thread can acquire the lock while one thread holds it; they just have to wait. The thread with the lock can then modify the shared value (which may require a complex computation, and hence can take a long time), and then release the lock. The release of the lock in turn allows the waiting threads to continue. This system works, but it has a number of problems in practice. If updates to the shared value occur only infrequently—and hence, it's rare for a thread to ever have to wait—the lock-based system can be very efficient. However, that efficiency tends to rapidly diminish whenever updates to the shared value are frequent: threads spend a lot of their time waiting for the lock to become available, and can't do any useful work while they're waiting.
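In code, the pattern looks something like this; a minimal C++ sketch, with the shared counter and names invented for illustration:

Code:

#include <mutex>

std::mutex total_lock;   // protects shared_total
long shared_total = 0;   // the value the threads share

void add_sample(long sample) {
    total_lock.lock();       // wait here while another thread holds the lock
    shared_total += sample;  // update the shared value (possibly a long computation)
    total_lock.unlock();     // release, letting one waiting thread proceed
}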

Locks also prove quite difficult for programmers to use correctly. Though the case of a single shared value is easy to handle, real programs are rarely so simple. A program with two locks, A and B, is susceptible to a problem called deadlock. If two threads need both locks, they have a choice; they can either acquire lock A followed by lock B, or they can acquire lock B followed by lock A. As long as every thread acquires the locks in the same order, there's no problem. However, if one thread acquires lock A first, and the other acquires lock B first, then the two threads can get stuck—the first waits for lock B to become free, the second waits for lock A to become free, and neither can ever succeed. This is a deadlock.
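A minimal C++ sketch of the hazard (the lock names and functions are invented for illustration):

Code:

#include <mutex>

std::mutex lock_a, lock_b;

void thread_one() {
    std::lock_guard<std::mutex> hold_a(lock_a); // acquires A first...
    std::lock_guard<std::mutex> hold_b(lock_b); // ...then waits for B
}

void thread_two() {
    std::lock_guard<std::mutex> hold_b(lock_b); // acquires B first...
    std::lock_guard<std::mutex> hold_a(lock_a); // ...then waits for A: deadlock
}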

This problem might seem easy to avoid, and indeed when a program only has two locks, it normally is—but it becomes harder to ensure that every part of the program does the right thing as the program becomes more complex. Add more locks, for other bits of shared data, and it becomes harder still.

Transactional memory: the end of locks

Transactional memory is designed to solve this kind of problem. With transactional memory, developers mark the portions of their programs that modify the shared data as being "atomic." Each atomic block is executed within a transaction: either the whole block executes, or none of it does. Within the atomic block, the program can read the shared value without locking it, perform all the computations it needs to perform, and then write the value back. At the end, it commits the transaction. The clever part happens with the commit operation: the transactional memory system checks to see if the shared data has been modified since the atomic operation was started. If it hasn't, the commit just makes the update and the thread can carry on with its work. If the shared value has changed, the transaction is aborted, and the work the thread did is rolled back. Typically when this happens, the program will simply retry the operation.
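GCC ships an experimental software implementation of exactly this model (compile with -fgnu-tm); a minimal sketch, with the shared counter invented for illustration:

Code:

long shared_total = 0;   // shared between threads; note: no lock anywhere

void add_sample(long sample) {
    __transaction_atomic {       // begin the transaction
        shared_total += sample;  // read, modify, and write the shared value
    }                            // commit; on conflict, roll back and retry
}

Being a software implementation it is slow, but the programming model is the same one BlueGene/Q accelerates in hardware.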

Transactional memory potentially offers a number of advantages over the lock-based scheme. First, it's optimistic: instead of each thread needing to acquire a lock just in case another thread tries to perform a concurrent operation, the threads assume that they'll succeed. It's only in the case of actual concurrent modifications that one thread will be forced to retry its work. Second, there's no deadlock scenario, since there are no locks. Third, the programming model is, broadly speaking, one that developers are quite familiar with; the notion of transactions and roll-back is familiar to most developers who've used relational databases, as they offer a somewhat similar set of features. Fourth, atomic blocks arguably make it a lot easier to construct large, correct programs: an atomic block with nested atomic blocks will do the right thing, but the same isn't necessarily true of lock-based programs.

(It's worth pointing out that transactional memory has a number of complexities of its own: for example, what if a transaction needs to do something that can't be rolled back, like sending data over a network or drawing on the screen? The best way to approach this kind of issue, and many others, is still an area of active research.)

The hardware advantage

Up until now, transactional memory research has mostly focused on software-based implementations. Real processors don't actually support transactional memory, so it has to be emulated in some way. Some schemes make use of virtual machines to do this—there are transactional memory modifications for the .NET and Java virtual machines, for example—others use native code, and require programmers to use special functions for accessing shared data, so that the transactional memory software can ensure the right things happen in the background. A consistent feature of all of these implementations is that they tend to be slow—sometimes very slow. Although transactional memory makes it easier to produce bug-free programs, careful use of locks (or other multithreading techniques) can yield much greater performance.

BlueGene/Q moves transactional memory into the processor itself. It's the first commercial processor to do so, though Sun's Rock processor—cancelled when the company was purchased by Oracle—would also have included a transactional memory capability. The transactional memory implementation lives predominantly in the chip's 32MB level 2 cache. IBM did not describe the system in enormous detail at Hot Chips, but did reveal a few specifics. Data in the cache carries a version tag, and the cache can store multiple versions of the same data. Software tells the processor to begin a transaction, does the work it needs to do, and then tells the processor to commit the work. If other threads have modified the data in the meantime—creating multiple versions—the cache rejects the transaction and the software must try again. If no other versions were created, the data is committed.

The same versioning facility can also be used for speculative execution. Instead of having to wait for up-to-date versions of all the data it needs—which might require, for example, waiting for another core to finish a computation—a thread can begin executing with the data it has, speculatively performing useful work. If it turns out that the data was up-to-date, it can commit that work, giving a performance boost: the work was done before the final value was delivered. If it turns out that the data was stale, the speculative work can be abandoned, and re-executed with the correct value.

A logical evolution

The transactional memory support is in some ways a logical extension of a feature that has long been a part of the PowerPC processor, "load-link/store-conditional," or LL/SC. LL/SC is a primitive operation that can be used as a building block for all kinds of thread-safe constructs. This includes both well-known mechanisms, like locks, and more exotic data structures, such as lists that can be modified by multiple threads simultaneously without any locking at all. Software transactional memory can also be created using LL/SC.

LL/SC has two parts. The first is the load-link. The program uses load-link to retrieve a value from memory. It can then perform the work it needs to do on that value. When it's finished, and needs to write a new value back to memory, it uses store-conditional. Store-conditional will only succeed if the memory value has not been modified since the load-link. If the value has been modified, the program has to go back to the beginning and start again.

LL/SC is in fact found on many processors—PowerPC, MIPS, ARM, and Alpha all use it. x86 doesn't; it has an alternative mechanism called "compare and swap." Most LL/SC systems are quite restrictive. For example, they may not be able to track writes to individual bytes of memory, but only entire cache lines, meaning that the SC operation can fail even if the monitored value wasn't actually modified. SC will also typically fail if, for example, a context switch (which flushes the cache) occurs between the LL and the SC. Some implementations will even make the SC fail if any value gets written to memory between the LL and the SC.
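C++ exposes this machinery portably through std::atomic: compare_exchange_weak typically compiles down to an LL/SC pair on PowerPC and ARM (and to CMPXCHG on x86), and the "weak" variant is allowed to fail spuriously for exactly the reasons described above. A minimal sketch, with the counter invented for illustration:

Code:

#include <atomic>

std::atomic<long> shared_total{0};

void add_sample(long sample) {
    long seen = shared_total.load();  // plays the role of the load-link
    // compare_exchange_weak plays the store-conditional role: it only
    // succeeds if the value is unchanged, and (like SC) it may fail
    // spuriously, so it must run in a retry loop. On failure, 'seen' is
    // automatically reloaded with the value currently in memory.
    while (!shared_total.compare_exchange_weak(seen, seen + sample)) {
        // retry with the freshly observed value
    }
}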

Transactional memory is a kind of LL/SC on steroids: each thread in a transaction can, in effect, perform an LL on many different memory locations, and the commit operation performs a kind of SC that takes effect on those multiple locations simultaneously, with either every store succeeding or failing together.

Will it deliver?

The implementation of the transactional memory itself is complex. Ruud Haring, who presented IBM's work at Hot Chips, claimed that "a lot of neat trickery" was required to make it work, and that it was a work of "sheer genius." After careful design work, the system was first built using FPGAs (chips that can be reconfigured in software) and, remarkably, it worked correctly the first time. As complex as it is, the implementation still has its restrictions: notably, it doesn't offer any kind of multiprocessor transactional support. This isn't an issue for the specialized Sequoia, but it would be a problem for conventional multiprocessor machines: threads running on different CPUs could make concurrent modifications to the same shared data, and the transactional memory system wouldn't detect it.

BlueGene/Q's hardware support allows transactional memory to be used with little or no performance penalty, which in turn means it can be exercised by real-world software, to see if it really is as useful in practice as it appears to be in theory. Haring said that the transactional memory "feels good," but the team was still tuning compilers and software support, so it did not yet have any real-world data.

As specialized as Sequoia is, the insight it will give into the utility of transactional memory will be invaluable. The combination of ease-of-use advantages for programmers and the performance potential (both of transactional memory and speculative execution) makes transactional memory very appealing. Software implementations, however, have for the most part reached a performance impasse: the performance issues are so severe that they put the entire approach in jeopardy. If this hardware implementation proves successful, it could be the first of many. But if it doesn't work out—if it fails to deliver the performance and reliability that transactional memory is assumed to provide—it could sound the death knell for a once-promising solution to the multicore conundrum.

Keep in mind that Crysis with all features turned up is still a pretty good measure of "good" hardware. Of course your shit laptop ran Crysis, but it probably wouldn't run it with everything turned all the way up.

Please do shut up with the Crysis jokes. The game is now almost 4 years old, the jokes even more so.

And no, it cannot run Crysis because Crysis is compiled for x86 processors, while the Sequoia uses PowerPC chips. And that's putting aside the fact that it's a GPU-dependent game, and supercomputers don't really have much in the way of GPU hardware.

Very little, at least with existing databases. As the article mentions, databases already support robust concurrency controls. However, it may be possible to modify the implementation of a database to make use of hardware transactional memory in the future.

I see parallels here to memory paging, which was originally implemented in software but eventually became supported in all modern hardware.

Keep in mind that Crysis with all features turned up is still a pretty good measure of "good" hardware. Of course your shit laptop ran Crysis, but it probably wouldn't run it with everything turned all the way up.

Yes, but the shitty question wasn't if it could run Crysis on max, it was if it could run Crysis.

Very little, at least with existing databases. As the article mentions, databases already support robust concurrency controls. However, it may be possible to modify the implementation of a database to make use of hardware transactional memory in the future.

I see parallels here to memory paging, which was originally implemented in software but eventually became supported in all modern hardware.

Ok. It was a bit of a layman's question, as I have a roommate who does a lot of SQL work from home, and I remember seeing (nolock) in the code quite a bit. I'm guessing that is something completely different now.

Serious question: Why must a transactional rollback and retry be required in the event of a modified access? Why couldn't an additional operation be executed first to determine if the previously committed change actually modified any information relevant to the transaction in question?

I've never heard transactional memory pitched as a solution for reliability and performance. In my experience, it's always pitched as being an easier programming model, and that's all. Hardware TM just tries to be competitive with complex lock-based programming models, while being extremely simple to program (and impossible to mess up). Deadlock and livelock go away for transactional memory unless the HTM system introduces some of its own, which some academic HTM solutions in the past have, but then that's the architect's problem, not the programmer's. With lock-based multi-threaded programs, handling concurrency wrong totally breaks your program, but with HTM, handling concurrency incorrectly only hurts performance. Again, TM is just about making the programmer's life easier.

Using a conservative estimate of power consumption, Sequoia will use at least 25,000 kilowatts of power. That's about $2,500... PER HOUR. That's $22,000,000 a year in electricity if it's used 24/7, and yes, it will be used 24/7.
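(Those figures imply an electricity rate of roughly ten cents per kilowatt-hour: 25,000 kW × $0.10/kWh = $2,500 per hour, and $2,500 × 8,760 hours ≈ $22 million per year.)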

That's one of the outstanding questions, really, when it comes to "how should transactional memory be exposed to developers?"

In an intuitive sense, it would be great if transactional memory could co-operate with other transaction managers--for example, Windows has a transactional file system, transactional access to the registry, and of course, databases. The more things you have that are transactional, the more you can put into an atomic block. While some things like network access might not ever be truly transactional, one could envisage higher-level protocols that included this property.

This would provide quite a compelling development model--your atomic block could write to the file system or the database, and roll back all those changes if a conflicting in-memory change occurs.

Transactional memory has lots of nuances and complexities. The basic concept is easy enough--but it turns out there are lots of decisions that can be made. For example, what kind of isolation do you provide? Consider something like this, with x and y shared and both starting at zero:
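Code:

int x = 0, y = 0;   // shared values

// left-hand thread (no transaction):
if (x != y)
    while (true) { }   // the infinite loop

// right-hand thread:
atomic {
    x++;
    y++;
}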

Question is, can that hang? Can the left-hand thread see the x++ but not the y++, and hence enter its infinite loop? With "weak" TM systems that update data in-place, the dirty reads are visible, and the program can hang. In "strong" TM systems that don't update in-place, the modified values are invisible until the transaction is committed, at which point they're atomically made visible (so other threads can either see x == y == 0, or they can see x == y == 1, with no other option). Problem is, at least for existing software TM implementations, in-place modification and dirty reads make things a hell of a lot more efficient.

There are other issues. Transactions work well with transactions (as long as you have a reasonable policy for how nested transactions could work) but perhaps counterintuitively, transactions don't necessarily work well with code that uses locks, or even some lock-free algorithms. For example, consider this:

Code:

volatile int flag = 0; // this is some type that can be modified atomically, i.e. does not need locks for other threads to read it

// left-hand thread:
flag = 1;
while (flag == 1) { }

// right-hand thread:
while (flag != 1) { }
flag = 2;

With no transactions or locks, that routine is safe; it won't hang. But what happens if those routines happen to be run inside transactions, and we have strong isolation (no writes visible outside the transaction)?

Code:

volatile int flag = 0; // this is some type that can be modified atomically, i.e. does not need locks for other threads to read it

// left-hand thread:
atomic {
    flag = 1;              // with strong isolation, invisible until commit
    while (flag == 1) { }  // spins forever, so the transaction never commits
}

// right-hand thread:
atomic {
    while (flag != 1) { }  // never sees the other thread's write of 1
    flag = 2;
}

It turns out that now both routines hang. Neither can see the signal that the other has sent. For the left-hand one, flag is always set to 1; for the right-hand one, it's always set to 0.

The broader in scope you make your transactions, the hairier things become. If you start allowing transactions to wrap lock-based code (or lock-less code), interacting with transactional databases, file systems, and more, you can suffer all kinds of problems.

Maybe this can all be worked out in an effective and efficient way. But maybe it can't. It may be that the best route is to keep transactional memory as its own thing, keep it as a low-level thing, and leave it to skilled developers to use--in much the same way as LL/SC and CAS are handled.

Serious question: Why must a transactional rollback and retry be required in the event of a modified access? Why couldn't an additional operation be executed first to determine if the previously committed change actually modified any information relevant to the transaction in question?

That is theoretically feasible, but the effort required in many cases is probably similar to the effort to just rerun. There are lots of cases where that's not true, but it's usually better to err on the side of simplicity and reducing the chance of incorrect data.

But it brings up at least one problem in optimistic transactions: if too many of your transactions fail b/c the data was modified by another thread, you're redoing work over and over again hoping you'll get to be the one that successfully modifies the data first.

This can lead to a situation where you have thread A performing a lot of quick updates to the data and thread B doing a lot of work for one update. Thread B may NEVER be able to commit because by the time it gets done with its work, thread A has modified it again.

Given the comments on "sheer genius" and "neat trickery", they may have solved that problem.

I've never heard transactional memory pitched as a solution for reliability and performance. In my experience, it's always pitched as being an easier programming model, and that's all.

"easier way to program" == "fewer deadlocks and other problems" == "more reliability""easier way to program" == "greater use of multiple threads and multiple cores" == "more performance"

Quote:

Hardware TM just tries to be competitive with complex lock-based programming models, while being extremely simple to program (and impossible to mess up). Deadlock and livelock go away for transactional memory unless the HTM system introduces some of its own, which some academic HTM solutions in the past have, but then that's the architect's problem, not the programmer's. With lock-based multi-threaded programs, handling concurrency wrong totally breaks your program, but with HTM, handling concurrency incorrectly only hurts performance. Again, TM is just about making the programmer's life easier.

If the programmer doesn't care about performance, he can use a single-threaded system and not use any locking at all. No need for transactional memory in that situation.

Serious question: Why must a transactional rollback and retry be required in the event of a modified access? Why couldn't an additional operation be executed first to determine if the previously committed change actually modified any information relevant to the transaction in question?

This is basically what is done, but that check needs to be done at the end of the transaction. Lazy transactions (as opposed to eager) work like this:

1. begin transaction
2. do useful work
3. make sure none of my reads were written by other threads since I began the transaction (this is basically what you were talking about)
4. commit my writes to the shared virtual memory (this step may cause other threads to abort when they get to their own stage 3)
5. end transaction
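To make that concrete, here's a toy lazy-versioning scheme sketched in C++ (names and structure are invented for illustration, with a single global commit lock; this is not how BlueGene/Q's hardware does it):

Code:

#include <atomic>
#include <mutex>
#include <unordered_map>

struct Slot {                       // one shared memory location
    std::atomic<int> value{0};
    std::atomic<unsigned> version{0};
};

struct Transaction {
    std::unordered_map<Slot*, unsigned> reads;  // slot -> version first seen
    std::unordered_map<Slot*, int> writes;      // slot -> buffered value

    int read(Slot& s) {
        auto w = writes.find(&s);
        if (w != writes.end()) return w->second; // see our own pending write
        reads.emplace(&s, s.version.load());     // stages 1-2: record version
        return s.value.load();
    }

    void write(Slot& s, int v) { writes[&s] = v; } // lazy: buffer, don't publish

    bool commit() {
        static std::mutex commit_lock;           // serialize commits
        std::lock_guard<std::mutex> g(commit_lock);
        for (auto& [slot, ver] : reads)          // stage 3: validate the read set
            if (slot->version.load() != ver)
                return false;                    // someone wrote it: abort
        for (auto& [slot, val] : writes) {       // stage 4: publish the write set
            slot->value.store(val);
            slot->version.fetch_add(1);          // makes others' stage 3 fail
        }
        return true;                             // stage 5: done
    }
};

A caller runs the transaction body, then retries the whole thing until commit() returns true.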

jimCA wrote:

But it brings up at least one problem in optimistic transactions: if too many of your transactions fail b/c the data was modified by another thread, you're redoing work over and over again hoping you'll get to be the one that successfully modifies the data first.

This is the problem of starvation and there are many ways to get around it, such as not allowing younger transactions to abort older transactions after the older transaction has been aborted X number of times. You could think of a million others. Even in that bad case you described, that's still not livelock. Useful work is still being done (young transactions are committing, hence work is being done). It's just not "fair" to the older transaction.

Serious question: Why must a transactional rollback and retry be required in the event of a modified access? Why couldn't an additional operation be executed first to determine if the previously committed change actually modified any information relevant to the transaction in question?

It's the only general way of achieving correctness.

In fact, the big ability that transactional memory and LL/SC have over CAS is that they can detect that a change has occurred even if it's not visible to the program. This is important for something called the ABA problem. To prevent the ABA case, you want to roll back due to memory writes even if the modification didn't change the value.
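The classic illustration is a lock-free stack pop built on CAS (a sketch with invented names; the dangerous interleaving is described in the comments):

Code:

#include <atomic>

struct Node { int value; Node* next; };
std::atomic<Node*> stack_top{nullptr};

int pop() {
    Node* old_top = stack_top.load();
    Node* new_top = old_top->next;   // snapshot of the next pointer
    // Suppose another thread now pops old_top, pops more nodes, then pushes
    // old_top's memory back onto the stack: stack_top equals old_top again
    // (A -> B -> A), but new_top is stale. CAS compares only the value, so
    // it succeeds and corrupts the stack. LL/SC or a transaction would
    // notice the intervening writes and fail instead.
    stack_top.compare_exchange_strong(old_top, new_top);
    return old_top->value;
}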

"easier way to program" == "fewer deadlocks and other problems" == "more reliability""easier way to program" == "greater use of multiple threads and multiple cores" == "more performance"

If the programmer doesn't care about performance, he can use a single-threaded system and not use any locking at all. No need for transactional memory in that situation.

I can see where you're coming from on the reliability point, but I can't really agree with the performance thing. It's not like transactional memory makes it easier than locks to break up your problem into X pieces. It can make the coding of it smoother and easier, but it doesn't really improve performance. In fact, TM often (read: almost always) hurts performance compared to a well-coded lock-based solution. The ideal for TM is to break even with locking, while being simpler to program.

Again, the point of TM is to make writing a single-threaded and multi-threaded version of the same program practically the same experience, so it's geared toward people who want to get more performance out of their programs for little more effort beyond writing the single-threaded version.

I can see where you're coming from on the reliability point, but I can't really agree with the performance thing. It's not like transactional memory makes it easier than locks to break up your problem into X pieces. It can make the coding of it smoother and easier, but it doesn't really improve performance. In fact, TM often (read: almost always) hurts performance compared to a well-coded lock-based solution. The ideal for TM is to break even with locking, while being simpler to program.

But you only use locking and multithreading because you care about performance.

If you don't care about performance, you don't use multiple threads at all--TM may be relatively easy, but single-threaded is easier still. The very fact that someone is using multiple threads in the first place means that they care about performance.

You're coming at it from the angle that the alternative to/opposite of transactional memory is judicious use of locking. But it isn't. The alternative is single-threaded programs, or at a pinch, multithreaded programs with a single global lock that's used to protect every shared data structure. Both of these solutions are correct and will avoid multithreading issues. They just don't perform worth a damn.

Problem is, neither do typical STM implementations; they're generally slower than the single-threaded code would be. So they give you the worst of all worlds, really--greater complexity than pure single-threaded code, and worse performance even when using multiple cores.

"easier way to program" == "fewer deadlocks and other problems" == "more reliability""easier way to program" == "greater use of multiple threads and multiple cores" == "more performance"

If the programmer doesn't care about performance, he can use a single-threaded system and not use any locking at all. No need for transactional memory in that situation.

I can see where you're coming from on the reliability point, but I can't really agree with the performance thing. It's not like transactional memory makes it easier than locks to break up your problem into X pieces. It can make the coding of it smoother and easier, but it doesn't really improve performance. In fact, TM often (read: almost always) hurts performance compared to a well-coded lock-based solution. The ideal for TM is to break even with locking, while being simpler to program.

Again, the point of TM is to make writing a single-threaded and multi-threaded version of the same program practically the same experience, so it's geared toward people who want to get more performance out of their programs for little more effort beyond writing the single-threaded version.

I think you're wrong. The point isn't to make writing a single-threaded and multi-threaded version of the same program practically the same experience. It's to avoid the pitfalls of deadlock and livelock.

But if you want to look at it that way, this does greatly enhance the performance of some applications. If a program is sufficiently large and sufficiently complex, with each thread writing frequently to locked values, this will greatly increase performance, probably well past single-threaded performance. If you create a deadlock between two threads, those two threads won't ever execute, and even more threads can begin to pile up waiting for the locks to be released (remember, even one lock being in use halts a thread until it's released, and moving that thread out of a core requires a context switch, which either kills the process and forces a restart, or moves it to the side in an SMT solution, which increases the likelihood of other deadlocks if it holds a lock on anything). Also note that once a thread deadlocks, it can take down a whole core, even with SMT, depending on where it is in the pipeline. Once you've locked up enough cores, your performance WILL be impacted.

That's the point of DrPizza's quote about misusing locks equating to your program breaking under locking, compared to your performance simply degrading under hardware TM.

I've never heard transactional memory pitched as a solution for reliability and performance. In my experience, it's always pitched as being an easier programming model, and that's all. Hardware TM just tries to be competitive with complex lock-based programming models, while being extremely simple to program (and impossible to mess up). Deadlock and livelock go away for transactional memory unless the HTM system introduces some of its own, which some academic HTM solutions in the past have, but then that's the architect's problem, not the programmer's. With lock-based multi-threaded programs, handling concurrency wrong totally breaks your program, but with HTM, handling concurrency incorrectly only hurts performance. Again, TM is just about making the programmer's life easier.

The potential for performance issues entered my mind, in that you could end up with a thread at a virtual standstill because a different thread (or threads) keeps changing the values from under its proverbial feet.

On-chip transactional memory seems targeted more at compilers than end-developers, much as end-developers don't worry much about how many registers a chip has. The on-chip TM will only be able to do so much variable bookkeeping, and would then have to switch over to a software TM. It would be a nightmare to have to worry about low-level details like that to utilize transactional memory.

That's one of the outstanding questions, really, when it comes to "how should transactional memory be exposed to developers?"


It's probably worth noting that there's a proposed draft standard for (software) TM in C++ (C++1x) that addresses lots of these issues: http://software.intel.com/en-us/blogs/2 ... cts-for-c/. It may or may not be relevant to IBM's HTM implementation, but it does address many of them.

Intel also has a prototype version of icc that can handle the draft's "atomic" blocks that anyone can download and play with. It's a bit outdated but is still useful.

Haskell offers its users a sturdy, efficient software transactional memory system that works quite well and is easy to understand. The type system prevents transactional code from doing anything that can't be rolled back. Of course, understanding Haskell itself is a prerequisite for this, which is not a trivial feat for most people.

The problem with all these in-hardware features is that developers normally never get to see them outside of specialized hardware like this. IBM tried before, with the CELL chip, to pitch a new programming model to the masses, arguing (not wrongly) that the days of substantial speed increases happening beneath the 15th layer of programming abstraction are over.

Now, developers like their abstraction layers, and they certainly didn't switch to programming SPEs. Let's suppose IBM adds this feature to Power chips. Hell, even DB2 will most likely not support it, since it runs on 5 million different platforms and processor architectures, and I am pretty sure that locking mechanisms are so ingrained in the code that it would be a bitch to support different versions on different platforms. Besides, the allure of easier programming is lost once you have to support the harder versions as well.

It pisses me off that we only get cool stuff like this for nuclear weapons research. I'm happy that these concepts are being explored; it's the way they have to get stuffed in with your traditional, please-model-my-bomb supercomputers that makes me angry.
