Reliability, Availability, and Maintainability

Definition: Reliability, Availability, and Maintainability (RAM or RMA) are system design attributes that have significant impacts on the sustainment or total Life Cycle Costs (LCC) of a developed system. Additionally, the RAM attributes impact the ability to perform the intended mission and affect overall mission success. The standard definition of reliability is the probability of zero failures over a defined time interval (or mission), whereas availability is defined as the percentage of time a system is considered ready to use when tasked. Maintainability is a measure of the ease and rapidity with which a system or equipment can be restored to operational status following a failure.

Keywords: availability, maintainability, RAM, reliability, RMA

MITRE SE Roles and Expectations: MITRE systems engineers (SEs) are expected to understand the purpose and role of Reliability, Availability, and Maintainability (RAM) in the acquisition process, where it occurs in systems development, and the benefits of employing it. MITRE SEs are also expected to understand and recommend when RAM is appropriate to a situation and if the process can be tailored to meet program needs. They are expected to understand the technical requirements for RAM as well as the strategies and processes that encourage and help end users and other stakeholders to actively participate in the RAM process. They are expected to monitor and evaluate contractor RAM technical efforts and the acquisition program's overall RAM processes and to recommend changes when warranted.

Background

Reliability is the wellspring for the other RAM system attributes of availability and maintainability. Reliability was first practiced in the early start-up days for the National Aeronautics and Space Administration (NASA) when Robert Lusser, working with Dr. Wernher von Braun's rocketry program, developed what is known as "Lusser's Law" [1]. Lusser's Law states that the reliability of any system whose components must all work is equal to the product of the reliabilities of its components. This extends the weakest link concept: such a system can be no more reliable than its least reliable component, and is usually less reliable.
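As a minimal sketch of Lusser's Law, the following Python snippet multiplies component reliabilities for a system in which every component must work. The function name and the three component values are illustrative assumptions, not figures from any program.

```python
from math import prod


def series_reliability(component_reliabilities):
    """Reliability of a system that needs every component to work (Lusser's Law)."""
    for r in component_reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"reliability must be in [0, 1], got {r}")
    return prod(component_reliabilities)


# Hypothetical per-mission reliabilities for three components.
components = [0.99, 0.95, 0.98]
system = series_reliability(components)

print(f"system reliability: {system:.4f}")
print(f"weakest component:  {min(components):.4f}")
```

The product is always at or below the smallest input, which is why adding components to a series design lowers system reliability unless each one is very reliable.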

The term "reliability" is often used as an overarching concept that includes availability and maintainability. Reliability in its purest form is more concerned with the probability of a failure occurring over a specified time interval, whereas availability is a measure of something being in a state (mission capable) ready to be tasked (i.e., available). Maintainability is the parameter concerned with how the system in use can be restored after a failure, while also considering concepts like preventive maintenance and Built-In-Test (BIT), required maintainer skill level, and support equipment. When dealing with the availability requirement, the maintainability requirement must also be invoked because some level of repair and restoration to a mission-capable state must be included. Clearly, logistics and logistics support strategies are also closely related and are dependent variables at play in the availability requirement. This takes the form of sparing strategies, maintainer training, maintenance manuals, and identification of required support equipment. The linkage of RAM requirements and the dependencies associated with logistics support illustrates how the RAM requirements have a direct impact on sustainment and overall LCC.

In simple terms, RAM requirements are considered the upper level, overarching requirements that are specified at the overall system level. It is often necessary to decompose these upper level requirements into lower level design-related quantitative requirements such as Mean Time Between Failure/Critical Failure (MTBF or MTBCF) and Mean Time To Repair (MTTR). These lower level requirements are specified at the system level; however, they can be allocated to subsystems and assemblies. The most common allocation is made to the Line Replaceable Unit (LRU), which is the unit that has the lowest level of repair at the field (often called organic) level of maintenance.
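The relationship between the upper level RAM requirements and the lower level MTBF and MTTR values can be sketched in Python. This is an illustration under stated assumptions, not a prescribed method: it assumes a constant failure rate (exponential failure times), uses the common inherent availability ratio MTBF / (MTBF + MTTR), and treats the LRUs as a series arrangement in which any LRU failure fails the system. The LRU names and all numbers are hypothetical.

```python
import math


def mission_reliability(mtbf_hours, mission_hours):
    """Probability of zero failures over the mission, assuming a constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)


def inherent_availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time the system is up, counting only corrective repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series_mtbf(lru_mtbfs_hours):
    """System MTBF when any LRU failure fails the system: the failure rates add."""
    return 1.0 / sum(1.0 / m for m in lru_mtbfs_hours)


# Hypothetical LRU-level allocations, in hours.
lru_mtbfs = {"receiver": 5000.0, "processor": 10000.0, "power_supply": 10000.0}

system_mtbf = series_mtbf(lru_mtbfs.values())
availability = inherent_availability(system_mtbf, mttr_hours=2.0)
reliability_24h = mission_reliability(system_mtbf, mission_hours=24.0)

print(f"system MTBF:           {system_mtbf:.0f} h")
print(f"inherent availability: {availability:.5f}")
print(f"24-hour reliability:   {reliability_24h:.5f}")
```

Running the allocation in this direction (LRU values up to system values) is a quick consistency check that the lower level numbers can actually support the system level requirement.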

Much of this discussion has focused on hardware, but the complex systems used today are integrated solutions consisting of hardware and software. Because software performance affects the system RAM performance requirements, software must be addressed in the overall RAM requirements for the system. The wear or accumulated stress mechanisms that characterize hardware failures do not cause software failures. Instead, software exhibits behaviors that operators perceive as a failure. It is critical that users, program offices, the test community, and contractors agree early as to what constitutes a software failure. For example, software "malfunctions" are often recoverable with a reboot, and the time for reboot may be bounded before a software failure is declared. Another issue to consider is frequency of occurrence: even if each reboot recovers within the defined time window, how often reboots are needed gives an indication of software stability. User perception of what constitutes a software failure will surely be influenced by both the need to reboot and the frequency of "glitches" in the operating software.
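One way to make an agreed software failure definition concrete is to encode it as a scoring rule for test events. The sketch below is hypothetical: the `RebootEvent` fields, the function name, and the threshold values are assumptions for illustration, and a real program would take them from the failure definition agreed among the user, program office, test community, and contractor.

```python
from dataclasses import dataclass


@dataclass
class RebootEvent:
    hours_into_test: float   # when the malfunction occurred
    recovery_seconds: float  # time the reboot took to restore service


def assess_software_stability(events, test_hours,
                              max_recovery_seconds, max_reboots_per_100_hours):
    """Score reboot events against an agreed recovery window and frequency limit."""
    # A reboot that exceeds the recovery window is charged as a software failure.
    failures = [e for e in events if e.recovery_seconds > max_recovery_seconds]
    # Every reboot, chargeable or not, counts toward the frequency measure.
    reboot_rate = 100.0 * len(events) / test_hours
    return {
        "chargeable_failures": len(failures),
        "reboots_per_100_hours": reboot_rate,
        "stable": not failures and reboot_rate <= max_reboots_per_100_hours,
    }


# Hypothetical test log: three reboots in 500 hours of testing.
events = [RebootEvent(10.0, 45.0), RebootEvent(120.0, 200.0), RebootEvent(300.0, 30.0)]
result = assess_software_stability(events, test_hours=500.0,
                                   max_recovery_seconds=60.0,
                                   max_reboots_per_100_hours=1.0)
print(result)
```

Keeping the recovery window and the frequency limit as separate checks reflects the two concerns in the text: a single slow recovery and a pattern of frequent quick ones are different signals.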

One approach to assessing software "fitness" is to use a comprehensive model to determine the current readiness of the software (at shipment) to meet customer requirements. Such a model needs to address quantitative parameters (not just process elements). In addition, the method should organize and streamline existing quality and reliability data into a simple metric and visualization that are applicable across products and releases. A novel, quantitative software readiness criteria model [2] has been developed to support objective and effective decision making at product shipment. The model has been "socialized" in various forums and is being introduced to MITRE work programs for consideration and use on contractor software development processes for assessing maturity. The model offers:

An easy-to-understand composite index

The ability to set quantitative "pass" criteria from product requirements

Easy calculation from existing data

A meaningful, insightful visualization

Release-to-release comparisons

Product-to-product comparisons

A complete solution, incorporating almost all aspects of software development activities.

Using this approach with development test data can measure the growth or maturity of a software system along the following five dimensions [2]:

Software functionality

Operational quality

Known remaining defects (defect density)

Testing scope and stability

Reliability.
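The formulas of the readiness model in [2] are not reproduced here, so the following Python sketch is not that model. It only illustrates the general idea of a composite index with quantitative "pass" criteria, using a plain weighted average over the five dimensions listed above. The dimension keys, scores, weights, and thresholds are all hypothetical.

```python
DIMENSIONS = (
    "functionality",
    "operational_quality",
    "known_remaining_defects",
    "testing_scope_and_stability",
    "reliability",
)


def readiness_index(scores, weights=None):
    """Weighted average of per-dimension scores, each normalized to [0, 1]."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    if weights is None:
        weights = {d: 1.0 for d in DIMENSIONS}  # assumption: equal weighting
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total_weight


def passes(scores, per_dimension_floor, index_floor):
    """Pass only if every dimension and the composite index meet their floors."""
    worst = min(scores[d] for d in DIMENSIONS)
    return worst >= per_dimension_floor and readiness_index(scores) >= index_floor


# Hypothetical scores for two releases of the same product.
release_1 = dict(zip(DIMENSIONS, (0.90, 0.80, 0.70, 0.85, 0.75)))
release_2 = dict(zip(DIMENSIONS, (0.95, 0.85, 0.80, 0.90, 0.85)))

for name, scores in (("release 1", release_1), ("release 2", release_2)):
    print(name, f"index={readiness_index(scores):.2f}",
          "pass" if passes(scores, 0.75, 0.80) else "fail")
```

The per-dimension floor keeps a strong score in one area from masking a weak one in another, which a composite index alone would allow.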

Government Interest and Use

Many U.S. government acquisition programs have put greater emphasis on reliability. The Defense Science Board (DSB) performed a study on Developmental Test and Evaluation (DT&E) in May 2008 and published findings [3] that linked test suitability failures to a lack of a disciplined systems engineering approach that included reliability engineering. The Department of Defense (DoD) has been the initial proponent of systematic policy changes to address these findings, but similar emphasis has been seen in the Department of Homeland Security (DHS) as many government agencies leverage DoD policies and processes in the execution of their acquisition programs.

As evidenced above, the strongest government support for increased focus on reliability comes from the DoD, which now requires most programs to integrate reliability engineering with the systems engineering process and to institute reliability growth as part of the design and development phase [4]. The scope of reliability involvement is further expanded by directing that reliability be addressed during the Analysis of Alternatives (AoA) process to map reliability impacts to system LCC outcomes [5]. The strongest policy directives have come from the Chairman of the Joint Chiefs of Staff (CJCS) where a RAM-related sustainment Key Performance Parameter (KPP) and supporting Key System Attributes (KSAs) have been mandated for most DoD programs [6]. Elevation of these RAM requirements to a KPP and supporting KSAs will bring greater focus and oversight, with programs not meeting these requirements subject to reassessment, reevaluation, and program modification.

Best Practices and Lessons Learned [7] [8]

Subject matter expertise matters. Acquisition program offices that employ RAM subject matter experts (SMEs) tend to produce more consistent RAM requirements and better oversight of contractor RAM processes and activities. The MITRE SE has the opportunity to "reach back" to bring MITRE to bear by strategically engaging MITRE-based RAM SMEs early in programs.

Consistent RAM requirements. The upper level RAM requirements should be consistent with the lower level RAM input variables, which are typically design related and called out in technical and performance specifications. A review of user requirements and flow down of requirements to a contractual specification document released with a Request For Proposal (RFP) package must be completed. If requirements are inconsistent or unrealistic, the program is placed at risk for RAM performance before contract award.

Ensure persistent, active engagement of all stakeholders. RAM is not a stand-alone specialty called on to answer the mail in a crisis, but rather a key participant in the acquisition process. The RAM discipline should be involved early in the trade studies, where performance, cost, and RAM should be part of any trade-space activity. The RAM SME needs to be part of requirements development with the user, drawing on a defined concept of operations (CONOPS) to determine what realistic RAM goals can be established for the program. The RAM SME must be a core member of several Integrated Product Teams (IPTs) during system design and development to establish insight and a collaborative relationship with the contractor team(s): RAM IPT, Systems Engineering IPT, and Logistics Support IPT. Additionally, the RAM specialty should be part of the test and evaluation IPT to address RAM test strategies (reliability growth, qualification tests, environmental testing, BIT testing, and maintainability demonstrations) while interfacing with the contractor test teams and the government operational test community.

Remember—RAM is a risk reduction activity. RAM activities and engineering processes are risk mitigation activities used to ensure that performance needs are achieved for mission success and that the LCC are bounded and predictable. A system that performs as required can be employed per the CONOPS, and sustainment costs can be budgeted with a low risk of cost overruns. Establish reliability Technical Performance Measures (TPMs) that are reported on during Program Management Reviews (PMRs) throughout the design, development, and test phases of the program, and use these TPMs to manage risk and mitigation activities.

Institute the Reliability Program Plan. The Reliability (or RAM) Program Plan (RAMPP) is used to define the scope of RAM processes and activities to be used during the program. A program office RAMPP can be developed to help guide the contractor RAM process. The program-level RAMPP will form the basis for the detailed contractor RAMPP, which ties RAM activities and deliverables to the Integrated Master Schedule (IMS).

Employ reliability prediction and modeling. Use reliability prediction and modeling to assess the risk in meeting RAM requirements early in the program when a hardware/software architecture is formulated. Augment and refine the model later in the acquisition cycle, with design and test data during those program phases.
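An early reliability model can be as simple as a reliability block diagram evaluated in a few lines. The Python sketch below assumes independent failures and active redundancy. The architecture (a sensor feeding a processor feeding a display) and the block reliabilities are hypothetical placeholders that would later be replaced with design and test data.

```python
from math import prod


def series(*reliabilities):
    """All blocks must work: multiply the block reliabilities."""
    return prod(reliabilities)


def parallel(*reliabilities):
    """Active redundancy: the group fails only if every independent branch fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical mission reliabilities for each block.
sensor, processor, display = 0.98, 0.90, 0.99

baseline = series(sensor, processor, display)
redundant = series(sensor, parallel(processor, processor), display)

print(f"single processor:    {baseline:.4f}")
print(f"redundant processor: {redundant:.4f}")
```

Comparing the two results shows where the architecture carries the most risk: here the processor dominates, so redundancy at that block gives the largest gain.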

Reliability testing. Be creative and use any test phase to gather data on reliability performance. Ensure that the contractor has planned for a Failure Review Board (FRB) and uses a robust Failure Reporting And Corrective Action System (FRACAS). When planning a reliability growth test, realize that the actual calendar time will be 50–100% more than the actual test time to allow for root cause analysis and corrective action on discovered failure modes.
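The calendar-time rule of thumb above can be applied directly when building a schedule. This Python sketch turns planned test hours into a calendar-hour range using the 50 to 100 percent allowance from the text. The function name and the 2,000-hour figure are illustrative assumptions.

```python
def growth_test_calendar_hours(test_hours, overhead_low=0.5, overhead_high=1.0):
    """Return (low, high) calendar hours, adding 50-100% for analysis and fixes."""
    if test_hours < 0:
        raise ValueError("test_hours must be non-negative")
    return test_hours * (1.0 + overhead_low), test_hours * (1.0 + overhead_high)


# Hypothetical plan: 2,000 hours of actual reliability growth testing.
low, high = growth_test_calendar_hours(2000.0)
print(f"plan for {low:.0f} to {high:.0f} calendar hours")
```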

Don't forget the maintainability part of RAM. Use maintainability analysis to assess the design for ease of maintenance, and collaborate with Human Factors Engineering (HFE) SMEs to assess impacts to maintainers. Engage with the Integrated Logistics Support (ILS) IPT to help craft the maintenance strategy, and discuss levels of repair and sparing. Look for opportunities to gather maintainability and testability data during all test phases. Look at Fault Detection and Fault Isolation (FD/FI) coverage and the impact on repair time lines. Also consider and address software maintenance activity in the field as patches, upgrades, and new software revisions are deployed. Be aware that the ability to maintain the software depends on the maintainer's software and information technology skill set and on the capability built into the maintenance facility for software performance monitoring tools. A complete maintenance picture includes defining scheduled maintenance tasks (preventive maintenance) and assessing impacts to system availability.
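The effect of FD/FI coverage and scheduled maintenance on availability can be explored with a small calculation. The Python sketch below rests on simplifying assumptions: repair time is a two-way mix of faults that BIT isolates automatically and faults that need manual troubleshooting, and availability counts corrective and preventive maintenance downtime per operating hour. All function names and input values are hypothetical.

```python
def mean_repair_hours(isolation_coverage, hours_if_isolated, hours_if_manual):
    """Expected repair time given the fraction of faults isolated automatically."""
    if not 0.0 <= isolation_coverage <= 1.0:
        raise ValueError("isolation_coverage must be in [0, 1]")
    return (isolation_coverage * hours_if_isolated
            + (1.0 - isolation_coverage) * hours_if_manual)


def availability_with_pm(mtbf_hours, mttr_hours, pm_interval_hours, pm_duration_hours):
    """Uptime fraction counting both corrective and preventive maintenance downtime."""
    corrective_downtime = mttr_hours / mtbf_hours                # per operating hour
    preventive_downtime = pm_duration_hours / pm_interval_hours  # per operating hour
    return 1.0 / (1.0 + corrective_downtime + preventive_downtime)


# Hypothetical comparison of two FD/FI coverage levels.
for coverage in (0.90, 0.95):
    mttr = mean_repair_hours(coverage, hours_if_isolated=1.0, hours_if_manual=5.0)
    a = availability_with_pm(mtbf_hours=1000.0, mttr_hours=mttr,
                             pm_interval_hours=500.0, pm_duration_hours=2.0)
    print(f"coverage={coverage:.2f}  MTTR={mttr:.2f} h  availability={a:.5f}")
```

In this example the scheduled maintenance term is larger than the corrective term, a reminder that preventive tasks need to be included when assessing system availability.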

Understand reliability implications when using COTS. Understand the operational environment, the COTS hardware design envelopes, and their impact on reliability performance. Use Failure Modes and Effects Analysis (FMEA) techniques to assess integration risk and characterize system behavior during failure events.
