OSSera Tackles Fault-Tolerant Fault Management - don't skip a beat

OSSera's OSS Explorer platform is distinguished by its multi-threaded, symmetrically distributed architecture, which can provide a fault-tolerant fault management monitoring solution with 99.999% availability.
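To put the 99.999% figure in concrete terms, the downtime budget it implies can be worked out directly. The sketch below is only arithmetic: it assumes a 365-day year, and the function name is illustrative rather than part of any OSSera product.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_minutes_per_year(availability: float) -> float:
    """Return the maximum yearly downtime, in minutes, allowed by an availability ratio."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


# Compare three common availability targets.
for label, ratio in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    print(f"{label}: {downtime_minutes_per_year(ratio):.2f} minutes per year")
```

At "five nines" the budget is a little over five minutes of downtime per year, which leaves very little room for a restart or failover step.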

Unlike other monitoring solutions that rely on a High Availability add-on component, OSSera's OSS Explorer platform does not have a Primary and Secondary system architecture, so there is no Secondary system that must be kept in sync with the Primary.

Unlike other monitoring solutions that rely on a Hot and/or Cold Standby system, OSSera's OSS Explorer platform has no transition time from the runtime production system to a Hot and/or Cold Standby system.

These industry options are common to High Availability Clusters (HACs). In summary, HACs are groups of computers that support server applications (e.g., mission-critical fault management software applications) that can be reliably utilized with minimum downtime. HACs detect hardware/software faults and immediately restart the application on another system without requiring administrative intervention, a process known as failover (see the HAC definition from Wikipedia below).

Fault-Tolerance

Unlike High Availability Clusters, a fault-tolerant system must be architected to continue running without skipping a beat.

Unlike High Availability Clusters, a fault-tolerant system does not have a failover system that is restarted.

A fault-tolerant system is designed from the ground up for reliability...

OSSera's Fault Management architecture is built upon OSS Explorer, which has been designed from the ground up to be fault-tolerant. It distributes processing across a multi-server/multi-core virtualized environment and shifts the load transparently based upon the available processors and servers.
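The idea of symmetric peers sharing load, with no primary or standby, can be illustrated with a small toy model. This is not OSS Explorer's implementation: the `Node` and `SymmetricPool` names, the least-loaded dispatch rule, and the event names are all hypothetical, chosen only to show new work flowing to the remaining peers when one is lost.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical peer server; every node can handle any event."""
    name: str
    alive: bool = True
    handled: list = field(default_factory=list)


class SymmetricPool:
    """Toy dispatcher: every node is a peer; there is no primary or standby."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def dispatch(self, event):
        # Pick the least-loaded node among those currently alive.
        live = [n for n in self.nodes if n.alive]
        if not live:
            raise RuntimeError("no live nodes left")
        target = min(live, key=lambda n: len(n.handled))
        target.handled.append(event)
        return target.name


pool = SymmetricPool([Node("a"), Node("b"), Node("c")])
for i in range(6):
    pool.dispatch(f"alarm-{i}")

pool.nodes[1].alive = False  # simulate losing one server

# New events keep flowing to the surviving peers; nothing is restarted.
for i in range(6, 10):
    pool.dispatch(f"alarm-{i}")
```

In this toy, events already accepted by the lost node are not recovered; a real fault-tolerant design would also need to replicate or re-drive in-flight state, which is outside the scope of the sketch.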

Therefore, based upon the definitions below, fault tolerance is even more reliable and available than High Availability, because a fault-tolerant system does not have to "resubmit", "restart", and/or "failover" to a secondary system.

Imagine being able to handle disaster recovery, event storms, and maintenance upgrades without skipping a beat. Never lose sight of critical resources and services.

Fault-tolerance or graceful degradation is the property that enables a system (often computer-based) to continue operating properly in the event of the failure of (or one or more faults within) some of its components. A newer approach is progressive enhancement. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naïvely-designed system in which even a small failure can cause total breakdown. Fault-tolerance is particularly sought-after in high-availability or life-critical systems.

The ability to continue non-stop when a hardware failure occurs. A fault-tolerant system is designed from the ground up for reliability by building multiples of all critical components, such as CPUs, memories, disks and power supplies into the same computer. In the event one component fails, another takes over without skipping a beat.

Tandem and Stratus were the first two manufacturers that were dedicated to building fault-tolerant computer systems for the transaction processing (OLTP) market.

High Availability

Many systems are designed to recover from a failure by detecting the failed component and switching to another computer system. These systems, although sometimes called fault tolerant, are more widely known as "high availability" systems, requiring that the software resubmits the job when the second system is available.

Redundant Hardware

True fault-tolerant systems with redundant hardware are the most costly because the additional components add to the overall system cost. However, fault-tolerant systems provide the same processing capacity after a failure as before, whereas high availability systems often provide reduced capacity.

The monitoring of error indications in a computer system in order to log the occurrences and send alerts to system administrators and field service. Fault management software keeps track of hardware faults such as memory parity errors and software crashes. The proper analysis of the frequency and type of such errors is intended to initiate a repair order before a total breakdown occurs.

High-availability clusters (also known as HA clusters or failover clusters) are groups of computers that support server applications that can be reliably utilized with a minimum of down-time. They operate by harnessing redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate filesystems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.
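The failover process described in this definition (detect the fault, then restart the application on another node) can be sketched as a toy model. The `Cluster` class and its method names are hypothetical and do not correspond to any real clustering product; node configuration steps such as mounting filesystems are omitted.

```python
class Cluster:
    """Toy HA cluster: one active node runs the application, the rest stand by."""

    def __init__(self, node_names):
        self.healthy = {name: True for name in node_names}
        self.active = node_names[0]
        self.restarts = 0

    def mark_failed(self, name):
        # Simulate a hardware/software fault on the named node.
        self.healthy[name] = False

    def monitor(self):
        """Detect a failed active node and restart the application elsewhere."""
        if self.healthy[self.active]:
            return self.active
        standbys = [n for n, ok in self.healthy.items() if ok]
        if not standbys:
            raise RuntimeError("no healthy node to fail over to")
        self.active = standbys[0]
        self.restarts += 1  # the application is restarted on the new node
        return self.active


cluster = Cluster(["node1", "node2"])
cluster.mark_failed("node1")
new_active = cluster.monitor()
print(new_active, cluster.restarts)
```

The restart counter marks the step that distinguishes failover from fault tolerance: service resumes, but only after the application has been started again on a different node.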

High availability clusters (HAC) improve availability of applications by failing them over or switching them over in a group of systems as opposed to High Performance Clusters which improve performance of applications by allowing them to run on multiple systems simultaneously.

About Us - OSSera, Inc. is a global provider of Operational Support System (OSS) solutions for IT organizations, service planning, service operations, and network operations. OSSera's multi-threaded, symmetrically distributed platform fully leverages modern multi-core server hardware to provide higher flexibility, reliability, and scalability for service and resource management solutions. OSSera's products support the TM Forum's suite of standards, especially in the areas of Service Management, Fault Management, Performance Management, Data Mediation, and Configuration Management.