The idea of continuous computing via fault-tolerant (FT) servers is not a new one. Companies such as IBM and Sun Microsystems have been providing high-end FT server solutions to financial institutions and service companies that need uncompromised access to their data.

The goal of fault-tolerant servers is to provide nonstop, 24/7 computing, even in the event of a component failure or software crash. But fault-tolerant servers usually require proprietary hardware and software; they have been predominantly Unix platforms built on RISC processors. These costly systems can be justified only in the enterprise market.

Until recently, businesses operating in an Intel/Windows 2000 environment had little choice but to rely on clustering, a technique in which multiple systems are networked so that if one server crashes, another takes over. Beyond load balancing, the main goal of clustering is to recover from unplanned outages. Fault-tolerant servers, on the other hand, are designed to prevent outages from occurring at all.
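The takeover behavior that clustering relies on can be sketched in a few lines. This is only an illustration: the function name `pick_active`, the node names, and the health map are assumptions, not part of any clustering product.

```python
def pick_active(nodes, healthy):
    """Return the first node, in priority order, that is still healthy."""
    for node in nodes:
        if healthy.get(node, False):
            return node
    # Every node is down: the cluster as a whole has an outage.
    raise RuntimeError("cluster outage: no healthy node")


nodes = ["server-a", "server-b"]
healthy = {"server-a": True, "server-b": True}
primary = pick_active(nodes, healthy)   # server-a carries the load

healthy["server-a"] = False             # server-a crashes
standby = pick_active(nodes, healthy)   # server-b takes over
```

The sketch shows recovery only after a crash has already happened, which is the gap fault-tolerant hardware is meant to close.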

Thanks to a cooperative effort by NEC and Stratus Technologies, fault-tolerant servers are no longer limited to proprietary components and software. The companies have partnered to produce the first Intel-based fault-tolerant servers designed specifically to run under Windows 2000 using industry-standard components.

At first glance, the NEC Express5800/320 LA and the Stratus ftServer 3210 server are identical. They both make use of dual-modular redundancy (DMR) architecture, the key to hardware fault tolerance. With two separate CPU expansion I/O trays and six hot-swappable hard drive bays, DMR is essentially two computers acting as one in a redundant array. The trays, or customer-replaceable units, are all hot-swappable, as are the individual PCI, I/O boards, dual power supplies, and front-panel components.

Each CPU tray can hold two processors. The Stratus ftServer 3210 we tested was a two-way system that came with two Pentium III/800 CPUs in each tray, 1GB of RAM, and six 18GB SCSI hard drives. The NEC system was configured as a one-way system with one Pentium III/800 processor per tray, 256MB of RAM, and two 36GB hard drives. Both servers have a 256K cache, a 100-MHz front-side bus, and memory banks expandable to a total of 2GB. And each can handle up to six 36GB high-speed hard drives.

The I/O trays contain three PCI expansion slots and core I/O support for integrated SCSI and Ethernet connections. The front-panel controller contains three USB ports, a VGA connector, CD-ROM and SuperDisk drives, and an LCD panel for viewing system status.

The processors run in lockstep mode: They execute the same instructions at the same time and appear as a single processor to the OS. A specialized fault-tolerant hardware/software combination monitors the system and keeps all trays running in duplex mode. If any tray or component within the tray should fail, the system will fail over and isolate the failed tray. The interruption is transparent to the user and the OS, although the surviving tray will now be running in simplex mode until the problem tray is repaired.
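The duplex-to-simplex transition can be modeled roughly as follows. This is a simplified sketch, not the vendors' implementation: the class names `Tray` and `LockstepPair`, the `healthy` self-check flag, and the use of addition as the "instruction" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    DUPLEX = "duplex"    # both trays executing in lockstep
    SIMPLEX = "simplex"  # one tray isolated, the survivor carries on
    DOWN = "down"        # no tray left


@dataclass
class Tray:
    name: str
    healthy: bool = True  # hypothetical self-check result

    def execute(self, a: int, b: int) -> int:
        if not self.healthy:
            raise RuntimeError(f"{self.name} reported a hardware fault")
        return a + b


class LockstepPair:
    def __init__(self, first: Tray, second: Tray) -> None:
        self.active = [first, second]
        self.isolated = []

    @property
    def mode(self) -> Mode:
        return {2: Mode.DUPLEX, 1: Mode.SIMPLEX}.get(len(self.active), Mode.DOWN)

    def run(self, a: int, b: int) -> int:
        """Issue the same operation to every active tray; isolate any that fault."""
        results = []
        for tray in list(self.active):
            try:
                results.append(tray.execute(a, b))
            except RuntimeError:
                self.active.remove(tray)
                self.isolated.append(tray)
        if not results:
            raise RuntimeError("no healthy tray left")
        return results[0]  # the caller sees one answer, as if from one processor

    def repair(self, tray: Tray) -> None:
        """Return a repaired tray to service, restoring duplex mode."""
        tray.healthy = True
        self.isolated.remove(tray)
        self.active.append(tray)
```

A caller that invokes `run` before and after one tray's `healthy` flag is cleared gets the same result both times; only the `mode` property reveals that the pair has dropped to simplex.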

Both the NEC and Stratus servers use software fault tolerance in the form of "hardened" device drivers to prevent drivers from writing outside of their exclusive memory space. For customers who require specific devices that are not fault-tolerant-certified, Stratus offers a service that hardens drivers to meet FT requirements.
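The idea of confining a driver to its own memory space can be illustrated with a bounds-checked write. The sketch below is only an analogy in user-level Python; the `DriverRegion` class and its layout are hypothetical and say nothing about how the actual hardened drivers are built.

```python
class DriverRegion:
    """A driver's exclusive slice [base, base + size) of a shared memory buffer."""

    def __init__(self, memory: bytearray, base: int, size: int) -> None:
        if base < 0 or size < 0 or base + size > len(memory):
            raise ValueError("region does not fit in memory")
        self._memory = memory
        self._base = base
        self._size = size

    def write(self, offset: int, data: bytes) -> None:
        # Reject any write that would start before or run past the region.
        if offset < 0 or offset + len(data) > self._size:
            raise MemoryError("write outside the driver's region rejected")
        start = self._base + offset
        self._memory[start:start + len(data)] = data


memory = bytearray(16)
region = DriverRegion(memory, base=4, size=4)
region.write(0, b"\x01\x02")          # inside the region: allowed
try:
    region.write(3, b"\xff\xff")      # would spill one byte past the region
    spilled = True
except MemoryError:
    spilled = False
```

The rejected write leaves the neighboring bytes untouched, which is the property the hardening is meant to guarantee.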

Perhaps the biggest difference between the NEC and Stratus boxes is service. NEC's market strategy targets an existing customer base, most of which have an established IT organization. Accordingly, NEC has included tools that enable IT professionals to manage the various fault-tolerant components, whether the server is online or not. NEC's ESMPro/FT Edition software, an SNMP agent, allows IT administrators to take problematic components offline for repair or upgrade, as well as monitor the server's health from a remote location.

Included in NEC's WMA (Workstation Management Application) program is a Remote Virtual Disk utility, which enables boot-up, diagnostics, and firmware upgrades from a remote console and gives the administrator the ability to interact with pre-boot POST operations. A jump switch button provides a one-touch solution to updating firmware without bringing the server down (it puts one tray in simplex mode while the other is being upgraded). Additionally, a fault detection and alert utility will send a message to a specified pager should a failure occur. NEC offers several levels of support that include installation, telephone support, and parts replacement programs.
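The one-tray-at-a-time firmware update described above follows a general rolling-update pattern, sketched here under stated assumptions. The function `rolling_firmware_update`, the tray names, and the version strings are hypothetical and do not reflect NEC's software.

```python
def rolling_firmware_update(trays, new_version):
    """Update each tray in turn while the remaining tray keeps serving.

    `trays` maps a tray name to its firmware version. Returns a log of
    (tray being updated, trays serving in the meantime) tuples.
    """
    log = []
    for name in list(trays):
        serving = [other for other in trays if other != name]
        if not serving:
            raise RuntimeError("cannot update: no other tray to carry the load")
        trays[name] = new_version  # the server runs in simplex mode during this step
        log.append((name, serving))
    return log


trays = {"tray-0": "1.0", "tray-1": "1.0"}
log = rolling_firmware_update(trays, "1.1")
```

At every step the log records at least one tray still in service, which is why the server as a whole never has to go down.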

Stratus's service is an integral part of its core FT strategy. The Stratus Service Network (SSN) provides continuous links between systems, channel partners, and worldwide customer assistance centers.

We connected to the SSN via modem. When we removed a CPU tray, the modem automatically dialed the SSN and reported the problem. We were able to view a list of problems for our system on the company's password-protected Web site, including actions such as automatic, next-day parts shipments and a history of service actions. A representative called within minutes to confirm the problem.

As we continued to pull power supplies, hard drives, and I/O, the Stratus service organization responded. Stratus offers various levels of support, including a basic 9:00-to-5:00 program, a 24/7 unlimited service with a one-hour response time, and a Business-Critical offering that includes 24/7 support, 30-minute response time, and a Microsoft Windows 2000 kernel support program for OS-related problems.

For small to medium businesses that run mission-critical applications where a downed server can have a serious financial impact, these fault-tolerant servers offer a high degree of reliability and can help reduce the high cost of IT support and maintenance associated with complex cluster networks, and all without migrating to more costly "big iron" platforms.

About the Author

As a Contributing Editor for PCMag, John Delaney has been testing and reviewing monitors, TVs, PCs, networking and smart home gear, and other assorted hardware and peripherals for almost 20 years. A 13-year veteran of PC Magazine's Labs (most recently as Director of Operations), John was responsible for the recruitment, training and management of t...
