Making Sense of Parallel Programming Terms

By Susan Morgan, January 2008

This article explains common parallel and multithreading concepts, and differentiates between the hardware
and software aspects of parallel processing. It briefly explains the hardware architectures
that make parallel processing possible. The article describes several popular parallel programming models. It
also makes connections between parallel processing concepts and related Sun hardware and software
offerings.

Parallel Processing and Programming Terms

The terms parallel computing, parallel processing, and parallel programming are sometimes used in
ambiguous ways, or are not clearly defined and differentiated. Parallel computing is a
term that encompasses all the technologies used in running multiple tasks simultaneously on
multiple processors. Parallel processing, or parallelism, is accomplished by dividing a single runtime task
into multiple, independent, smaller tasks. The tasks can execute simultaneously when more than
one processor is available. If only one processor is available, the tasks execute
sequentially. On a modern high-speed single processor, the tasks might appear to
run at the same time, but in reality they cannot execute simultaneously.

Parallel programming, or multithreaded programming, is the software methodology used to implement parallel processing. The
program must include instructions to inform the runtime system which parts of the
application can be executed simultaneously. The program is then said to be parallelized.
Parallel programming is a performance optimization technique that attempts to reduce the
“wall clock” runtime of an application by enabling the program to handle many activities
simultaneously.

Parallel programming can be implemented using several different software interfaces, or parallel programming models.

Parallel Processing

Parallel processing is a general term for the process of dividing tasks into
multiple subtasks that can execute at the same time. These subtasks are known
as threads, which are runtime entities that are able to independently execute a
stream of instructions. Parallel processing can occur at the hardware level and at
the software level. Distinguishing between these types of parallel processing is important. At
the software level, an application might be rewritten to take advantage of parallelism
in the code. With the right hardware support, such as a multiprocessing system,
the threads can then execute simultaneously at runtime. If not enough processors or cores
are available for all the threads to run simultaneously, certain tasks might still
execute one after the other. The common way to describe such non-parallel execution
is to say these tasks execute sequentially or serially.

Parallelism in the Hardware

Execution of a parallel application is dependent on hardware design. However, even when
the system is capable of parallel execution, the software must still divide,
schedule, and manage the tasks.

Multiprocessors – More than one processor can be active simultaneously. The processors use shared memory to communicate and share data. The allocation of tasks between the processors is handled by the operating system, so the system is able to execute multiple jobs simultaneously. The simultaneous execution improves the overall throughput, and for a given workload, reduces the turnaround time for the applications when compared to a system with a single processor. In certain cases, this reduction might not be sufficient because executing the single application still takes too long. At that point, parallel programming might be considered as a way to address this problem. The application developer needs to select a suitable parallel programming model such as POSIX Threads or OpenMP to implement the parallelism. Most Sun hardware is available in multiprocessor configurations, with a few entry level servers and workstations having one processor. Sun offers a comparison of Sun server families and a comparison of Sun workstations that detail the number of processors in Sun machines.

Multicore processors – More than one core, or processing unit, in a single chip can be active simultaneously. Multicore processing is sometimes called chip-level multiprocessing (CMP) because multiple processors are on a single chip. The cores use shared memory, a shared system bus, and, in some cases, shared caches, to communicate and share data with each other. The cores generally have their own processing units and registers. The architecture of each core varies with different processor implementations. The operating system views each core as a processor, and handles the allocation of tasks between the cores. A multicore processor is like a multiprocessor system implemented on a single chip. Although differences exist, especially with respect to the sharing of resources, from an application point of view multiple processors and multicore processors are effectively the same. Therefore, with single-threaded applications running on a multicore processor, the throughput of a multijob workload is increased by executing more than one application simultaneously. For example, on a dual-core processor, two programs can run at the same time. For a parallel application, the independent tasks can be scheduled onto the various cores of the processor. In both cases, however, the performance might not be as good as on a true multiprocessor design. The performance largely depends on the multicore implementation, and how many shared resources are needed by the applications that are running simultaneously. Sun servers are available with single-core or dual-core AMD Opteron processors, dual-core or quad-core Intel Xeon processors, and with single-core UltraSPARC III and UltraSPARC IIIi processors or dual-core UltraSPARC IV and UltraSPARC IV+ processors.

Multithreaded processors – These processors contain a number of multithreaded cores, which switch between a number of active threads. Some processor cores implement vertical multithreading (VMT), which enables the core to execute multiple threads in an interleaved fashion. If one software thread stalls waiting for a resource (data to come from memory, input/output, and so on), another thread immediately takes over execution. When the second thread stalls, the first thread or another waiting thread takes over. VMT enables processor cycles to be used more efficiently. Early VMT designs suffered from too much resource sharing, and in many cases, overall performance could be improved by disabling resource sharing. Current processors that use vertical threads include the dual-core SPARC64 VI processor, which has two vertical threads per core, and was developed by Fujitsu. The Sun SPARC Enterprise M-series servers use the SPARC64 VI processor.

A further refinement of hardware multithreading technology is called simultaneous multithreading (SMT). A truly multithreaded processor, specifically designed for SMT, does not have a resource sharing problem. Sun has introduced the term chip multithreading (CMT) for a processor design with multiple cores in which each core is multithreaded. The UltraSPARC T1 is the first processor implementing this design, and is deployed in the T1000 and T2000 server models. The UltraSPARC T2 is the latest generation to extend these concepts further, and is deployed in the T5120 and T5220 server models.

Sun uses the term throughput computing for its strategy of processor design that greatly increases throughput, or the amount of work that can be done by a computer in a given period of time. The processors use chip multithreading, fully exploited through optimizations in the Solaris OS, to achieve these performance gains. This term is adapted from networking terminology, in which throughput is defined as the rate at which a computer or network sends or receives data. For more information about throughput computing, see the Throughput Computing White Paper.

The term CoolThreads refers to the CMT-based processor line from Sun, the first of which is the UltraSPARC T1. The CoolThreads name reflects the processor's chip multithreading architecture, and its low power usage, which causes less heat to be dissipated, resulting in a cooler chip. For more information about CoolThreads and the UltraSPARC T2 and UltraSPARC T1, see the Sun Servers with CoolThreads Technology overview.

The Solaris 10 OS is optimized for running on CoolThreads processors. Other operating systems are being ported to the UltraSPARC architecture through the OpenSPARC community.

Cluster computing – A cluster is a group of computers, generally called nodes, working together as a single system. Often the nodes are the same type of computer, running the same operating system, and belonging to the same administrative domain. Special cluster software running on the nodes and a high-speed network connecting the nodes enable rapid communication between them. Clusters can be configured to provide high availability (HA), for situations where the hardware and software must always be up and running. Hardware and software failures in a node do not cause the cluster to fail because built-in redundancy in HA configurations enables other nodes to pick up the tasks of a failed node while the cluster continues to run. Examples of environments requiring high availability are online reservation or ordering systems.

A cluster of systems can also be used as a large parallel computer, useful for high performance computing (HPC). Clusters configured for HPC might be used to run parallelized scientific applications, for example. Usually the HPC and HA uses of a cluster are not combined. When used for HPC environments, all of a cluster's available resources are used for the tasks at hand. If a failure occurs, the hardware or software is fixed and restarted.

Grid computing – This term refers to a heterogeneous mix of networked computers working together, similar to a cluster but potentially working across administrative domains or organizations. The nodes on a grid can range from a small group of systems located in the same room to a large set of networked computers installed around the world. Even a cluster can be a node in a grid. Each node in the grid runs special software that enables it to make optimal use of the available resources like CPU cycles and storage that are contributed by the nodes on the grid. Often, the grid software can be configured so that any possible spare CPU cycle is used to run applications. This technique enables optimal use of the system. Originally, grids were used to run scientific applications. More recently, grid use has extended to other environments, including environments where clusters have traditionally been used. As a result, the difference between a cluster and a grid is not always very clear. The system software is often the main differentiator.

Parallelism by Software Programming Models

The Solaris OS kernel and most Solaris services have been multithreaded and optimized
for many years in order to take advantage of multiprocessor architectures. Sun continues
to invest in parallelizing and optimizing Solaris software to fully support emerging parallel
architectures. For a single application to benefit from a multiprocessor architecture, including clusters
and grids, the program should be parallelized using one of the parallel programming
models. In all cases, the performance gain from parallelism must outweigh the
processing overhead that comes with the programming model. The creation and
management of threads are examples of processing overhead.

The programming model used in any application depends on the underlying hardware architecture
of the system on which the application is expected to run. Specifically, the
developer must distinguish between a shared memory system and a distributed memory system.
In a shared memory architecture, the application can transparently access any memory location. A
multicore processor is an example of a shared memory system. In a distributed memory
environment, the application can only transparently access the memory of the node it
is running on. Access to the memory on another node has to be
explicitly arranged within the application. Clusters and grids are examples of distributed memory
systems.

Shared Memory Programming Models

Shared memory, or multithreaded, programming is sometimes also called threaded programming. In this
context, threads are lightweight processes that exist within a single
operating system process. Threads share the memory address space and state information of
the process that contains them. The containing process is sometimes also called the
parent process. The shared memory model is supported on computers that have multiple
processors, where each core or processor has access to the same shared memory.
Such a system has a single address space. Communication and data exchange between the
threads takes place through shared memory.

Parallel programming can be implemented for shared memory systems using any of the
following models.

Automatic parallelization – When the program is compiled, the compiler tries to identify the parallelism in the application. The focus is on loops, either a single loop or a set of nested loops, as this area is typically where most of the execution time is spent. Through a dependence analysis, the compiler determines whether parallelizing a loop is safe. If it is safe, the compiler generates the right parallel infrastructure for parallel execution at runtime. The developer merely has to use the appropriate option on the compiler to activate this feature. With the Sun Studio compilers, this option is the -xautopar option. The -xloopinfo option, which displays parallelization messages, is also highly recommended.

POSIX threads and Solaris threads – The Solaris OS supports two shared-memory threading models. The standard POSIX threads API, usually abbreviated as Pthreads, is available for applications written in C. The older Solaris threads API, which predates the Pthreads standard, is also supported. The POSIX threads API is the standard supported on many UNIX-based operating systems. Use of this standard increases portability. Both libraries are included in the standard C library libc in the Solaris OS. See the pthreads(5) man page for a comparison of both APIs.

For condensed information about Pthreads programming, see the POSIX Threads Programming tutorial at www.llnl.gov. For a more comprehensive treatment, see the books Programming with POSIX Threads by David R. Butenhof and Programming with Threads by Steve Kleiman, Devang Shah, and Bart Smaalders.

OpenMP – This API specification is for implementing parallel programming on a shared memory system. OpenMP offers a higher-level model than POSIX threads and also provides additional functionality. In many cases, an OpenMP implementation is built on top of a native threading model like POSIX threads. OpenMP consists of a set of compiler directives, runtime functions, and environment variables. Fortran, C, and C++ are supported.

The compiler directive plays a key role in OpenMP. By inserting directives in the source, the developer specifies what parts of the program can be executed in parallel. The compiler transforms these specified parts of the program into the appropriate infrastructure, such as a function call to an underlying multitasking library. OpenMP has four main advantages over other programming models:

Portability – Although OpenMP is not an official standard, a program using OpenMP is portable to another OpenMP compiler or environment.

Ease of use – The developer does not have to create and manage threads at the level of POSIX threads, for example. Thread management is handled by the compiler and underlying multitasking library.

The application can be parallelized step by step – The developer specifies the sections that can be executed in parallel, and can thus incrementally parallelize the application as necessary.

The sequential version of the program is preserved – If the program is not compiled with the compiler option for OpenMP, the directives in the code are ignored. This behavior effectively disables parallel execution for that source and the program runs sequentially again.

The Sun Studio documentation set includes the Sun Studio 12: OpenMP API User’s Guide, which describes issues specific to the Sun Studio implementation of the OpenMP API.

Distributed Memory Programming Models

Developers can implement the parallelism in an application by using a very low-level
communication interface, such as sockets, between networked computers. However, using such a
method is the equivalent of using assembly language for application programming: very powerful,
but also very low level. As a result, an application parallelized using such an
API might be hard to maintain and extend.

The Message Passing Interface (MPI) model is commonly used to parallelize applications for
a cluster of computers, or a grid. Like OpenMP, this interface is an
additional software layer on top of basic OS functionality. MPI is built on
top of a software networking interface, such as sockets, with a protocol such
as TCP/IP. MPI provides a rich set of communication routines, and is widely
available.

An MPI program is a sequential C, C++, or Fortran program that runs
as a set of processes, on a subset of the processors or on all processors or cores in the
cluster. The programmer implements the distribution of the tasks and the communication between
them, and decides how the work is allocated to the various processes.
To this end, the program needs to be augmented with calls to MPI
library functions, for example, to send information to and receive information from other processes.

MPI is a very explicit programming model. Although some convenience functionality is provided,
such as a global broadcast operation, the developer has to specifically design the
parallel application for this programming model. Many low-level details also need to
be handled explicitly.

The advantage of MPI is that an application can run on any
type of cluster that has the software to support the MPI programming model.
Although originally MPI programs mainly ran on clusters of single processor workstations
or PCs, running an MPI application on one or more shared
memory computers is now common. An optimized MPI implementation can then also take
advantage of the faster communication over shared memory for those processes executing
on the same system.

Open MPI is an open-source effort by a consortium of research, academic, and industry partners to build an MPI library that combines technologies and resources from several MPI projects. Open MPI is the basis for the Sun HPC ClusterTools 7 software. You can download this software for free from the Sun HPC ClusterTools 7 page.

For a detailed overview of MPI, see the Message Passing Interface (MPI) Tutorial at www.llnl.gov.
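
The explicit style that MPI requires can be sketched as follows (the cyclic work distribution is a simple choice made for the example; building and running the program assumes an MPI installation with the usual mpicc and mpirun driver commands). Each process computes a partial sum over its share of the iterations, and MPI_Reduce collects the result on rank 0:

```c
#include <mpi.h>
#include <stdio.h>

/* Each process (rank) sums its share of the iteration space;
 * MPI_Reduce combines the partial results on rank 0.
 * Build with mpicc and launch with, for example,
 * "mpirun -np 4 ./a.out". */
int main(int argc, char **argv)
{
    int rank, size;
    long local = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Cyclic distribution: rank r handles i = r, r+size, ... */
    for (long i = rank; i < 1000; i += size)
        local += i;

    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %ld\n", total);   /* 0+1+...+999 = 499500 */

    MPI_Finalize();
    return 0;
}
```

Note that the distribution of work and the collection of results are both spelled out by the programmer; nothing is inferred by a compiler, which is what makes MPI an explicit programming model.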

Hybrid Programming Models

With the emergence of multicore systems, an increasing number of clusters and grids
are parallel systems with two layers. Within a single node, fast communication
through shared memory can be exploited, and a networking protocol can be used
to communicate across the nodes. Programs can take advantage of both shared memory
and distributed memory.

The MPI model can be used to run parallel applications on clusters
of multicore systems. MPI applications run across the nodes as well as within each
node, so both parallelization layers, shared and distributed, could be used through MPI.
In certain situations, however, adding the finer-grained parallelization offered by a shared
memory programming model such as Pthreads or OpenMP is more efficient. Typically, parallel
execution over the nodes is achieved through MPI. Within one node, Pthreads or
OpenMP is used. When two programming models are used in one application, the
application is said to be parallelized with a hybrid or mixed-mode programming model.

Another hybrid programming model that is sometimes used combines Pthreads and
OpenMP. This type of application runs on a single shared-memory system. The work
performed by each POSIX thread is further parallelized using OpenMP, taking advantage
of the additional parallelism available within that thread.

Sun Parallel Application Development Software

Sun offers software products to support the technologies discussed in this article.

For Shared Memory Systems

Sun software for shared memory systems includes:

Threads – POSIX threads and Solaris threads libraries are both included in the
Solaris libc library.

OpenMP – An implementation of OpenMP for C, C++ and Fortran is included
in the Sun Studio software, which is free to download. The -xopenmp compile and
link-time option instructs the Sun Studio compiler to recognize OpenMP directives and runtime functions
in a program. The OpenMP runtime support library, libmtsk, provides support for
thread management, synchronization, and scheduling of work. The library is implemented on top
of the POSIX threads library.

For Distributed Memory Systems

An implementation of MPI is included in Sun HPC ClusterTools. This product also includes
driver compile scripts and tools to query and manage the jobs at runtime.
Note that multiple versions of Sun HPC ClusterTools are available. The ClusterTools 5
and ClusterTools 6 software includes the Sun implementation of MPI, called Sun
MPI. The ClusterTools 7 software includes the newer open-source implementation of MPI, called
Open MPI. The Sun HPC ClusterTools 7.1 Software Migration Guide describes the differences between Sun MPI and Open
MPI to help in upgrading applications that use Sun MPI functions to
run with Open MPI. For complete ClusterTools information, see Sun HPC ClusterTools 7.1 Documentation.