Oracle Blog

Stay Tuned

Monday Sep 02, 2013

This is a technical white paper with a fairly challenging title, but it actually describes the contents quite well.

We wrote this paper because the new SPARC T5 and M5 Oracle servers provide so much main memory, plus so many cores and threads, that one may wonder how to manage and deploy such a system. They're pretty unique in the market.

This is why we joined forces and set ourselves the goal of providing a holistic view on this topic.

The paper is written in a modular way and readers can select the individual topics they're interested in, but of course we hope you'll read it front to back and find it useful. Perhaps the Glossary at the end comes in handy too.

The first part covers the processor and system architectures, but only to the extent we felt was needed for the remainder of the paper. There are several other white papers that go into an awful lot more detail on this.

The next part targets those developing (or thinking about developing) parallel applications who are looking for tips and tricks on what choices need to be made and how to make real-world codes scale. That is no mean feat, but it is rewarding and long lasting. Think about it. The trend is upward and the size of systems can be expected to continue to scale up. Any investment made today in improving scalability is going to help. There is a learning curve as well, and the sooner one begins, the better.

We feel these chapters are also of use to those not directly involved in writing parallel code. It helps to understand what happens under the hood, and this may explain things one has observed experimentally, for example why there may be a diminishing return on adding more and more threads.
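One common way to reason about such a diminishing return is Amdahl's law. It is not named in this post, so treat the sketch below as an illustration under assumptions: the function name amdahl_speedup and the 95% parallel fraction are made up for the example, and the model ignores synchronization cost and memory contention, which real codes also pay for.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the run time scales with threads."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    # Assumed value for illustration: 95% of the run time can run in parallel.
    for threads in (1, 2, 4, 8, 16, 32, 64, 128):
        print(f"{threads:4d} threads: speedup {amdahl_speedup(0.95, threads):6.2f}")
```

In this model the serial 5% caps the speedup at 20, no matter how many threads are added, which is why each doubling of the thread count buys less than the one before.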

The third part covers the virtualization features available and how these can be used to configure the system to one's needs, for example to run legacy applications that require an older software environment side by side with applications running in a more modern environment. On top of that, each of these applications can be multi-threaded, providing the optimal configuration per application.

The paper concludes with a brief coverage of key Solaris features. Not everybody realizes the importance of a scalable OS and how much ongoing engineering investment is needed to continue to improve the scalability of Solaris.

Sunday Aug 05, 2012

IWOMP stands for "International Workshop on OpenMP". It is a workshop held once a year, rotating across the US, Asia and Europe. IWOMP started in 2005 and since then has been held every year. It is the place to be for those interested in OpenMP. The talks cover usage of OpenMP, performance, suggestions for new features and updates on upcoming features. There is usually also a tutorial day prior to the workshop.

On June 11-13, 2012, CASPUR in Rome, Italy, hosted IWOMP 2012. This was the 8th workshop and, as always, the event was very well organized, with a variety of high-quality talks and two interesting keynote speakers. The beautiful and very special Horti Sallustiani had been chosen as the workshop venue.

This year was rather special since Bjarne Stroustrup, the creator of C++, had accepted the invitation to give the opening keynote talk. The focus of his presentation was how to use C++ to write more reliable and robust code.

Let's consider a very simple example, the computation of a = b + c. This boils down to the following (pseudo-assembler) instructions that need to be executed:

load @b, r1
load @c, r2
add r1,r2,r3
store r3, @a

The first two instructions load variables b and c from an address in memory (here symbolized by @b and @c respectively). These values go into registers r1 and r2. The third instruction adds the values in r1 and r2. The result goes into register r3. The fourth instruction stores the contents of r3 into the memory address symbolized by @a.
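To make the four steps concrete, here is a minimal sketch of a toy interpreter for exactly these pseudo-instructions. Everything in it is an assumption for illustration: the run function, the dictionary standing in for memory, and the values 2 and 3 for b and c. It models only what each instruction does, not how long it takes.

```python
def run(program, memory):
    """Execute load/add/store pseudo-instructions; return the final registers."""
    registers = {}
    for line in program:
        opcode, operands = line.split(maxsplit=1)
        args = [arg.strip() for arg in operands.split(",")]
        if opcode == "load":        # load @addr, reg
            address, reg = args
            registers[reg] = memory[address.lstrip("@")]
        elif opcode == "add":       # add src1, src2, dst
            src1, src2, dst = args
            registers[dst] = registers[src1] + registers[src2]
        elif opcode == "store":     # store reg, @addr
            reg, address = args
            memory[address.lstrip("@")] = registers[reg]
        else:
            raise ValueError(f"unknown instruction: {line}")
    return registers


PROGRAM = [
    "load @b, r1",
    "load @c, r2",
    "add r1,r2,r3",
    "store r3, @a",
]

memory = {"b": 2, "c": 3}    # assumed example values for b and c
registers = run(PROGRAM, memory)
print(memory["a"])           # prints 5
```

Note that the add cannot start before both loads have delivered their values, which is why the time a load takes matters so much.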

If we're lucky, both b and c are in a nearby cache and the load instructions only take a few processor cycles to execute. That is the good case, but what if b or c, or both, have to come from very far away? Perhaps both of them are in the main memory and then it easily takes hundreds of cycles for the values to arrive in the registers.

Meanwhile the processor is doing nothing and simply waits for the data to arrive. Actually, it does something. It burns cycles while waiting. That is a waste of time and energy. Why not use these cycles to execute instructions from another application or thread in case of a parallel program?

That is exactly what latency hiding on the SPARC T-Series processors does. It is a hardware feature totally transparent to the user and application. As soon as there is a delay in the execution, the hardware uses these otherwise idle cycles to execute instructions from another process. As a result, the throughput capacity of the system improves because idle cycles are no longer wasted and therefore more jobs can be run per unit of time.
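The effect on throughput can be illustrated with a small simulation. This is a deliberately simplified sketch, not a model of the actual SPARC thread selection logic: it assumes one single-issue pipeline, identical threads that each execute a fixed number of one-cycle instructions and then wait a fixed number of cycles on memory, and a pipeline that stays with a thread until it stalls. The function name and the numbers (4 instructions, 96 stall cycles) are made up for the example.

```python
def pipeline_utilization(threads: int, work: int, stall: int,
                         cycles: int = 100_000) -> float:
    """Fraction of cycles in which a single-issue pipeline issues an instruction.

    Each thread repeats: `work` one-cycle instructions, then a memory access
    during which it cannot issue for `stall` cycles.
    """
    ready_at = [0] * threads        # first cycle at which each thread may issue
    remaining = [work] * threads    # instructions left before the next stall
    busy = 0
    current = 0                     # thread the pipeline tries first
    for cycle in range(cycles):
        for offset in range(threads):
            t = (current + offset) % threads
            if ready_at[t] <= cycle:
                busy += 1
                remaining[t] -= 1
                if remaining[t] == 0:   # thread now waits on memory
                    ready_at[t] = cycle + 1 + stall
                    remaining[t] = work
                current = t             # stay with this thread until it stalls
                break
        # If no thread was ready, this cycle is simply idle.
    return busy / cycles


if __name__ == "__main__":
    for n in (1, 8, 32):
        print(f"{n:2d} threads: utilization {pipeline_utilization(n, 4, 96):.2f}")
```

With these assumed numbers a single thread keeps the pipeline busy only 4% of the time, 8 threads reach 32%, and 32 threads fill every cycle. No individual thread runs faster; the gain is purely in how much work the pipeline completes per unit of time.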

This feature has been in the SPARC T-series from the beginning, so why this paper?

The difference with previous publications on this topic is in the amount of detail given. How this all works under the hood is fully explained using two example programs. Starting from the assembly language instructions, we demonstrate how these programs execute. To really see what is happening we go down to the processor pipeline level, where the gaps in the execution are, and show how these idle cycles are filled by other copies of the same program running simultaneously.

Both the SPARC T4 and the older UltraSPARC T2+ processor are covered. You may wonder why the UltraSPARC T2+ is included. The focus of this work is on the SPARC T4 processor, but to explain the basic concept of latency hiding at this very low level, we start with the UltraSPARC T2+ processor because it is architecturally a much simpler design. From the single-issue, in-order pipelines of this processor we then shift gears and cover how this all works on the much more advanced dual-issue, out-of-order architecture of the T4.

The analysis and performance experiments have been conducted on both processors. The results depend on the processor, but in all cases the theoretical estimates are confirmed by the experiments.

If you're interested in reading a lot more about this and finding out how things really work under the hood, you can download a copy of the paper here.

A paper like this could not have been produced without the help of several other people.

I want to thank the co-author of this paper, Jared Smolens, for his very valuable contributions and our highly inspiring discussions. I'm also indebted to Thomas Nau (Ulm University, Germany), Shane Sigler and Mark Woodyard (both at Oracle) for their feedback on earlier versions of this paper. Karen Perkins (Perkins Technical Writing and Editing) and Rick Ramsey at Oracle were very helpful in providing editorial and publishing assistance.

About

Ruud van der Pas is a Senior Staff Engineer in the Microelectronics organization at Oracle. His focus is on application performance, for both single-threaded and multi-threaded programs.
He is also a co-author of the book Using OpenMP.