Cameron and Tracey Hughes

Dr. Dobb's Bloggers

Fault Tolerance: We Gotta Get There!!

When parallel processing is required, virtually every aspect of the software design and implementation is affected. The developer is faced with what we call the "10 challenges of concurrency".

Here are the 10 challenges of concurrency:

Software decomposition into instructions or sets of tasks that need to execute simultaneously

Communication between two or more tasks that are executing in parallel

Concurrently accessing or updating data by two or more instructions or tasks

Identifying the relationships between concurrently executing pieces of tasks

Controlling resource contention when there is a many-to-one ratio between tasks and resource

Determining an optimum or acceptable number of units that need to execute in parallel

Creating a test environment that simulates the parallel processing requirements and conditions

Recreating a software exception or error in order to remove a software defect

Documenting and communicating a software design that contains multiprocessing and multithreading

Implementing the operating system and compiler interface for components involved in multiprocessing and multithreading

Some of the concurrency challenges have to be checked in the testing phase and accounted for in exception handlers. These challenges are:

Incorrect/inadequate communication between two or more tasks that are executing in parallel

Data corruption as a result of unsafe updating of data by two or more instructions or tasks

Resource contention when there is a many-to-one ratio between tasks and resource

An unacceptable number of units that need to execute in parallel

Missing or incomplete documentation for communicating a software design that contains multiprocessing and multithreading

The mechanisms that synchronize communication and data or device access between concurrently executing threads or processes (for instance, mutexes and semaphores) are used to control and prevent the errors that arise from Challenge 2. Timed mutexes can be used to control and prevent the errors that arise from Challenge 3. Documentation often receives the least attention and the fewest dedicated resources, yet it is one of the most important components of a software deployment. As with everything else in parallel programming and multithreading, documentation is even more critical for these classes of application. The testing process should verify and validate that the design documentation and the post-production documentation match! Table 1 shows which mechanisms can be used to control and prevent some of the 5 challenges.
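As a rough illustration of the two mechanisms just mentioned, the sketch below uses the standard C++ library. The names SafeCounter, count_in_parallel, and use_resource are hypothetical and chosen only for this example. A std::mutex guards a shared counter against unsafe concurrent updates, and a std::timed_mutex lets a task give up waiting on a contended resource after a timeout instead of blocking indefinitely.

```cpp
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared counter; every access goes through the mutex.
class SafeCounter {
public:
    void increment() {
        std::lock_guard<std::mutex> guard(mutex_);  // released at scope exit
        ++value_;
    }
    long value() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return value_;
    }
private:
    mutable std::mutex mutex_;
    long value_ = 0;
};

// Runs `threads` workers that each increment the counter `per_thread` times.
long count_in_parallel(int threads, int per_thread) {
    SafeCounter counter;
    std::vector<std::thread> workers;
    for (int i = 0; i < threads; ++i) {
        workers.emplace_back([&counter, per_thread] {
            for (int j = 0; j < per_thread; ++j) counter.increment();
        });
    }
    for (auto& w : workers) w.join();
    return counter.value();
}

// Tries to acquire a contended resource, giving up after `timeout`.
// Returns true if the lock was obtained, false if the wait timed out.
bool use_resource(std::timed_mutex& resource, std::chrono::milliseconds timeout) {
    // This constructor calls try_lock_for(timeout) on the mutex.
    std::unique_lock<std::timed_mutex> lock(resource, timeout);
    if (!lock.owns_lock()) return false;  // caller can retry or report the failure
    // ... use the resource here ...
    return true;
}
```

A caller might treat a false return from use_resource as a condition to log, retry, or escalate, which is one way a timeout turns silent contention into something a test or an exception handler can observe.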

Table 1. Types of semaphores

Mutex Semaphore: Mechanism used to implement mutual exclusion in a critical section of code.

Read-write Locks: Mechanism used to implement a read-write access policy between tasks.

Multiple Condition Variable: Same as an event mutex but includes multiple events or conditions.

Condition Variables: Mechanism used to broadcast a signal between tasks that an event has taken place. When a task locks an event mutex, it blocks until it receives the broadcast.
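Two of the mechanisms in Table 1 can be sketched with standard C++ facilities. This is a minimal sketch, assuming a C++17 compiler (std::shared_mutex); the class names Event and SharedSetting and the function publish_and_read are hypothetical, invented for illustration.

```cpp
#include <condition_variable>
#include <mutex>
#include <shared_mutex>
#include <thread>

// Condition variable: a task blocks until another task broadcasts that an event occurred.
class Event {
public:
    void signal() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            fired_ = true;
        }
        cv_.notify_all();  // broadcast to every waiting task
    }
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return fired_; });  // predicate guards against spurious wakeups
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool fired_ = false;
};

// Read-write lock: many readers may hold the lock together; a writer holds it alone.
class SharedSetting {
public:
    int read() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return value_;
    }
    void write(int v) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        value_ = v;
    }
private:
    mutable std::shared_mutex mutex_;
    int value_ = 0;
};

// Hypothetical usage: a writer publishes a value, then signals a waiting reader.
int publish_and_read(int v) {
    Event ready;
    SharedSetting setting;
    int seen = -1;
    std::thread reader([&] {
        ready.wait();           // blocks until signal() is called
        seen = setting.read();  // shared (read) access
    });
    setting.write(v);           // exclusive (write) access
    ready.signal();
    reader.join();
    return seen;
}
```

Calling publish_and_read(42) should return 42, because the reader cannot proceed until the writer has stored the value and signaled the event.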

The mechanisms listed in Table 1 are low-level mechanisms. Fortunately, using the features of higher-level component libraries such as TBB or the standard C++ concurrent programming library will take some of the tedium away during the testing process. These issues are meant to be dealt with in Layers 2 and 3 of the PADL (Parallel Application Design Layers) analysis model. Several words used in discussions of testing, error handling, and fault tolerance are often used incorrectly or loosely. Table 2 contains the basic definitions.
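To give a feel for how a higher-level facility can hide the low-level mechanisms, here is a small sketch using std::async and std::future from the standard C++ library. The function parallel_sum is a hypothetical example, not something taken from TBB or from the PADL model.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sums a vector by splitting it into two halves that may run in parallel.
long long parallel_sum(const std::vector<int>& data) {
    const auto mid = data.begin() + static_cast<std::ptrdiff_t>(data.size() / 2);
    // std::async returns a future; no explicit thread, mutex, or
    // condition variable appears in this code.
    std::future<long long> upper = std::async(std::launch::async, [&data, mid] {
        return std::accumulate(mid, data.end(), 0LL);
    });
    const long long lower = std::accumulate(data.begin(), mid, 0LL);
    return lower + upper.get();  // get() blocks until the asynchronous half finishes
}
```

Because the two halves touch disjoint ranges of read-only data, no locking is required here, which leaves fewer synchronization paths to exercise during testing.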

Table 2. Terms

Defect: A flaw in any aspect of software or software requirements that contributes or may potentially contribute to the occurrence of one or more failures.

Error: An inappropriate decision made by a software engineer/programmer that leads to a defect in the software.

Exception Handling: A mechanism for managing exceptions (unanticipated conditions during the execution of a program) that changes the normal flow of the execution of a program/software.

Failure: An unacceptable departure from the operation of a software element that occurs as a consequence of a fault.

Fault: A defect in the software due to human error that when executed under particular conditions causes failure.

Fault Tolerance: A property that allows a piece of software to survive and recover from the software failures caused by faults (defects) introduced into the software as a result of human error.

Reliability: The ability of the software to perform a required function under specified conditions for a stated period of time.

Since some of the terms in Table 2, such as error, failure, and fault, are commonly used in many different ways, we have provided simple definitions for how they can be used. The extent to which our software is able to minimize the effects of failure is a measure of its fault tolerance. Achieving fault-tolerant software is one of the primary goals of any software engineering effort. However, the distinction between fault-tolerant software and well-tested software is often misunderstood and blurred. Sometimes the responsibilities and activities of software verification, software validation, and exception handling are erroneously interchanged. To work toward our goal of using the C++ exception handling mechanism to help us achieve logically fault-tolerant software, we must first be clear about where exception handling fits in the scheme of things.
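As one possible illustration of exception handling serving fault tolerance in a concurrent setting, the sketch below relies on the fact that an exception thrown inside a std::async task is rethrown when the caller invokes get() on the future. The functions parse_positive and parse_or_default are hypothetical, and the fallback value is just one simple recovery strategy among many.

```cpp
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical task that fails for some inputs.
int parse_positive(const std::string& text) {
    const int value = std::stoi(text);  // throws std::invalid_argument on non-numeric text
    if (value <= 0) throw std::out_of_range("value must be positive");
    return value;
}

// Runs the task concurrently; if it fails, recovers by returning a fallback.
int parse_or_default(const std::string& text, int fallback) {
    std::future<int> result = std::async(std::launch::async, parse_positive, text);
    try {
        return result.get();  // rethrows any exception raised inside the task
    } catch (const std::exception&) {
        return fallback;      // recovery path: the program survives the failure
    }
}
```

With this sketch, parse_or_default("17", 1) yields 17, while a malformed input such as "abc" yields the fallback instead of terminating the program. Testing can show that the handler works for the inputs tried; it is the handler itself that lets the software survive a failure at run time.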
