NT Cluster-in-a-Box

Wouldn't you like to guarantee users 100 percent uptime? Database users would never have to worry about the system going down, you could guarantee 24-hour-a-day availability for your e-commerce systems, and you'd get that big promotion for making it all happen. That guarantee is the Holy Grail of clustering.

Big Iron
Data General's entry into the Windows NT clustering market is NT Cluster-in-a-Box. The case is reminiscent of midrange computers of years past: The unit measures 28"*36"*60" and houses two Quad 200MHz Pentium Pro motherboards, 1.5GB of RAM, two 2GB Barracuda hard disks, an 18GB CLARiiON RAID array, Error-Correcting Code (ECC) memory, and a split SCSI bus. You can order the system with disk arrays capable of holding either 20 or 30 disks, which also support the 2GB, 4GB, and 9GB drives, for a maximum of 270GB.

NT Cluster-in-a-Box comes with four 4-port OSICOM (formerly RNS) LAN cards for a total of sixteen 10/100Mbps autosense ports, and two 33.6Kbps modems for remote access. Top this configuration off with a rack-mounted Apex Outlook video, keyboard, and mouse switch and a 17" monitor all fed by two 30 amp 220 volt power feeds. But wait; there's more: The whole unit is rack-mounted with factory-completed cabling, dual power supplies for the individual systems, enhanced cooling within the individual systems, and dual 220 volt power feed internal power buses. The CLARiiON disk array includes dual storage processors, fault tolerant disk array technology (using RAID 1 and RAID 5 and supporting other RAID levels), N+1 (where N = number needed for operation) power and cooling, repair-under-power features, and a battery-backed-up 8MB write cache per storage processor (with a maximum of 64MB per processor). Add all these features up, and you are dealing with more than an enterprise-class NT machine.

How Does It Work?
Enough with the hardware. In this review, I want to determine how well this unit performs as a clustering solution. Think of the Cluster-in-a-Box as two separate systems connected via Ethernet. The Fast/Wide SCSI controllers are configured as a split bus--each system's SCSI bus terminates at the storage processor in the CLARiiON. With this configuration, you can service one clustered system's connection without interrupting the other clustered system, and a SCSI cable that comes off and breaks one SCSI bus can't cause a total failure. The system can also run in shared bus mode (and will have to for Wolfpack). Each system can function independently of the other and can run any standard NT application. Data General's intent is to have each system doing meaningful work. System A can be a primary SQL Server machine, and System B can be a primary Web machine. Each machine can be the primary backup for the other, so System A backs up the Web server and System B backs up the SQL Server. The system comes completely configured and ready to go, right out of the crate (that's right, crate).

By default, the system is configured as a symmetric cluster. The CLARiiON RAID array is divided into three partitions: The first is a 2GB partition (a RAID 1 mirrored pair) designated drive T. The second partition, V, is an 8.3GB RAID 5 partition. The third partition, U, is a 2GB RAID 1 mirrored pair. Partitions T and V are available to the primary system, named AViiONA. Partition U is available to the second system in the cluster, AViiONB. Each system also has a 2GB Seagate Barracuda drive to load operating systems and other applications. The CLARiiON RAID array is primarily for sharing data. All the RAID drives are hardware RAID, not the slower software-based RAID NT provides.

By default, the CLARiiON RAID array transfers control of drives T and V to the AViiONB system if the AViiONA system fails. After AViiONA, the primary system, comes back online, AViiONB returns control of drives T and V to AViiONA. The two systems continuously communicate their status to each other over two Ethernet connections per machine. These links, known as heartbeats, keep the two systems in the cluster in contact with each other.
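The takeover and return of drives described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not how FirstWatch or the CLARiiON is implemented: the class names, the `report_heartbeats` method, and the rule that a system counts as failed only when both of its heartbeat links are silent are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    drives: set                                 # drives this node controls by default
    owned: set = field(default_factory=set)     # drives it controls right now


class TwoNodeCluster:
    """Toy model of drive failover and failback between two systems."""

    def __init__(self, a, b, links=2):
        self.nodes = {a.name: a, b.name: b}
        for node in self.nodes.values():
            node.owned = set(node.drives)
        # One flag per Ethernet heartbeat link (assumed: two links per machine).
        self.heartbeat_ok = {name: [True] * links for name in self.nodes}

    def _peer(self, name):
        return next(n for key, n in self.nodes.items() if key != name)

    def report_heartbeats(self, name, link_states):
        """Record which heartbeat links from `name` are still answering."""
        self.heartbeat_ok[name] = list(link_states)
        self._reconcile()

    def _reconcile(self):
        for name, node in self.nodes.items():
            peer = self._peer(name)
            alive = any(self.heartbeat_ok[name])
            if not alive:
                # Failover: the surviving peer takes control of the drives.
                peer.owned |= node.drives
                node.owned.clear()
            elif node.drives & peer.owned:
                # Failback: the node is online again, so the peer returns its drives.
                peer.owned -= node.drives
                node.owned |= node.drives


a = Node("AViiONA", {"T", "V"})
b = Node("AViiONB", {"U"})
cluster = TwoNodeCluster(a, b)
cluster.report_heartbeats("AViiONA", [False, False])   # AViiONA goes silent
print(sorted(b.owned))
cluster.report_heartbeats("AViiONA", [True, True])     # AViiONA comes back
print(sorted(a.owned), sorted(b.owned))
```

In this model, losing a single heartbeat link changes nothing, which is one plausible reason to run two links per machine.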

For this review I tested how well NT Cluster-in-a-Box handles SQL Server in a clustered environment. I installed SQL Server on both systems in the cluster. I used an HP NetServer LX Pro and Bluecurve's Dynameasure software to simulate 100 SQL users pounding away on AViiONA. In the middle of the simulation, I cut power to AViiONA and checked whether AViiONB took over without the users knowing anything had happened.

VERITAS FirstWatch
NT Cluster-in-a-Box uses third-party clustering software by VERITAS. FirstWatch is available from VERITAS on NT and UNIX platforms separately. I tested VERITAS FirstWatch with CLARiiON support added. FirstWatch uses agents to transfer control of disk drives and services between systems in the cluster. NT Cluster-in-a-Box comes with the VERITAS clustering software installed. All you have to do is assign IP addresses to the network cards on either system, and you're ready to go. Figure 1 shows system configuration for NT Cluster-in-a-Box with VERITAS FirstWatch software.

I recommend that you read the manual. Although clustering seems simple in theory, it is very complex in practice. The VERITAS software is a UNIX port that is not yet fully integrated with NT. The result is a mix of Windows programs and text GUI programs. The manual contains all the information you will need to configure VERITAS for SQL clustering, but it is very terse and difficult to read. If you invest the money for NT Cluster-in-a-Box, spend extra for a support contract.

Installing SQL
SQL Server 6.5 is not very cluster friendly. To make SQL work in a clustering environment, you have to perform a little trickery. You must install the executables to a drive local to each system. You also have to be able to share the master and tempdb databases between the systems. Luckily, during installation, SQL Server 6.5 lets you specify where you want to create the master. I installed SQL on the AViiONA system and created the master and tempdb databases on the T drive on the AViiONA system. Next, I stopped the SQL services on AViiONA and then manually failed AViiONA with VERITAS' HAmon program, shown in Screen 1 (you can also use the Web-based GUI for managing FirstWatch software, as Screen 2, page 84, shows). This procedure transferred control of drive T to AViiONB. Switching over to the AViiONB server, I installed SQL Server on the local drive and told SQL to create the master and tempdb databases on the T drive now under control of the AViiONB server.

You're probably thinking, "But wait, you already have a master and tempdb on the T drive." You are correct, and SQL Server recognizes this fact. Unfortunately SQL will not let you use an existing master database, so you must overwrite it. After I overwrote the database, I stopped SQL services on AViiONB and manually failed over AViiONB to transfer control of the T drive back to AViiONA.

Setting Up the SQL Server Agent
Next I installed the VERITAS SQL agent. This software monitors SQL Server on a given system and transfers control to the other system in the cluster if the server running SQL fails.

Because Data General ships NT Cluster-in-a-Box with all the software installed, all I had to do was activate the SQL agent and configure it. Configuring the SQL agent was easy; for the most part, the defaults worked just fine. The only difference between configuring the agents on AViiONA and AViiONB was telling the SQL agent which states to have SQL active in on a particular system. For AViiONA, the primary SQL Server host, I wanted SQL to be active when the server is Online Primary and when it is providing Dual Services. Screen 3 shows the SQL agent configuration screen. With Online Primary, the system provides its own services only; with Dual Services, it provides its own services and services for the other system in the cluster that has failed. On AViiONB, I want SQL active only when the system is providing Dual Services, because I want only one copy of SQL running at a time.
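The agent settings described above amount to a small lookup table: for each host, the set of states in which SQL should be active. The following sketch restates that table in Python to show why only one copy of SQL runs at a time. The state names come from the review; the `NodeState` enum, the `SQL_ACTIVE_STATES` table, and both helper functions are hypothetical names for illustration, not part of the VERITAS software.

```python
from enum import Enum


class NodeState(Enum):
    ONLINE_PRIMARY = "Online Primary"   # system provides its own services only
    DUAL_SERVICES = "Dual Services"     # system also covers its failed partner


# States in which the SQL agent keeps SQL Server active on each host.
SQL_ACTIVE_STATES = {
    "AViiONA": {NodeState.ONLINE_PRIMARY, NodeState.DUAL_SERVICES},
    "AViiONB": {NodeState.DUAL_SERVICES},
}


def sql_should_run(host, state):
    """Return True if SQL should be active on `host` in `state`."""
    return state in SQL_ACTIVE_STATES[host]


def running_copies(states):
    """`states` maps host name to its NodeState, or to None if the host is down."""
    return [host for host, state in states.items()
            if state is not None and sql_should_run(host, state)]


# Normal operation: both systems up, each providing its own services.
print(running_copies({"AViiONA": NodeState.ONLINE_PRIMARY,
                      "AViiONB": NodeState.ONLINE_PRIMARY}))
# AViiONA has failed and AViiONB is covering for it.
print(running_copies({"AViiONA": None,
                      "AViiONB": NodeState.DUAL_SERVICES}))
```

In both cases exactly one host ends up in the list, which is the goal of configuring AViiONB for Dual Services only.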

Testing the Cluster
Now for the good stuff. I set up Bluecurve Dynameasure 1.5 on an HP NetServer LX Pro Quad 200MHz Pentium Pro with 512MB of RAM and 2GB of Fast/Wide SCSI disk. I used this machine to simulate 100 clients performing a mix of standard SQL operations. Dynameasure generally requires a separate server to control the test, a separate server for administrative duties, and separate machines for the client simulation. (For more information on Dynameasure and SQL testing, see Joel Sloss, "Microsoft SQL Server 6.5 Scaleability," January 1997.) Because I was testing the availability of NT Cluster-in-a-Box rather than its throughput, I opted to use the HP for all these tasks.

After I created the test databases on the target machine, AViiONA, I fired up the test. The first part of the test simulated five users performing standard database reads and writes. I manually failed over AViiONA with the HAmon program. Within 30 seconds, AViiONA failed and AViiONB took over, yet three of the five clients stopped testing. When AViiONB takes over, it assumes AViiONA's IP address and reestablishes client connections. Because failover relies on this IP address takeover, your SQL Servers must run both the TCP/IP Sockets and Named Pipes protocols, but your clients must use only TCP/IP to communicate with SQL. Although the entire failover process takes less than 30 seconds, the clients time out and have to reconnect to the server. In my test, three clients were not able to reestablish that connection and stopped testing. The other clients reestablished the connection and continued testing.

I continued to fail over the servers manually, back and forth, for the next 45 minutes while Dynameasure built up to 100 clients simultaneously connected to SQL Server on AViiONA.

Now that all 100 clients were connected to AViiONA, the big moment had arrived. I switched the video to the AViiONB server and got HAmon up and running so that I could view the status of the two machines. Next, I pushed the power button on AViiONA and heard the fans screech to a halt. First the heartbeat indicators on the HAmon program changed from Yes to No. Next the status of the AViiONA changed to Unknown; immediately AViiONB began the takeover process and was up in Dual Services mode providing SQL Server services for AViiONA, which, for all intents and purposes, was dead.

Again about 60 percent of the clients stopped functioning, but all recovered when restarted by the next Dynameasure test run. I powered on AViiONA and watched the HAmon program on AViiONB. When AViiONA came online, AViiONB relinquished control, and AViiONA once again provided SQL service to the Dynameasure clients. For the next 24 hours, I alternately failed the two systems and never encountered a problem.

The fact that I had to manually restart the clients is a shortcoming that all clients in a clustering environment currently exhibit (at least for NT). Because an interruption on the network occurs, the clients must reconnect to continue. The good news is that the clients can reconnect to what they perceive as the same machine, and the clustering software takes care of routing the request to the correct place.
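A client application can handle this reconnection itself instead of waiting for someone to restart it. The sketch below shows one generic way to do so in Python; it is an illustration, not what Dynameasure or any SQL client library does. The `run_with_reconnect` function and its parameters are hypothetical, and the defaults (six retries, 10 seconds apart) are assumptions picked only so that the retry window comfortably exceeds the roughly 30-second failover observed in this review.

```python
import time


def run_with_reconnect(connect, operation, retries=6, delay=10.0, sleep=time.sleep):
    """Run operation(conn), reconnecting and retrying if the connection drops.

    connect   -- callable that opens and returns a new connection
    operation -- callable that takes the connection and does the work
    retries   -- how many times to retry after the first failed attempt
    delay     -- seconds to wait between attempts
    """
    conn = None
    for attempt in range(retries + 1):
        try:
            if conn is None:
                # The address stays the same after failover, so the client
                # simply connects to what it perceives as the same machine.
                conn = connect()
            return operation(conn)
        except ConnectionError:
            conn = None                  # drop the dead connection
            if attempt == retries:
                raise                    # give up and let the caller decide
            sleep(delay)


# Simulated server that refuses the first two connection attempts.
attempts = {"count": 0}

def fake_connect():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ConnectionError("server is failing over")
    return "connection"

result = run_with_reconnect(fake_connect, lambda conn: "query ran on " + conn,
                            sleep=lambda seconds: None)
print(result, "after", attempts["count"], "attempts")
```

Any work in flight when the connection dropped still has to be resubmitted by the operation, so it should be safe to run more than once.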

Success in a Box
In the beginning, I had a little trouble setting up the clustering software, largely because of sub-par documentation and my eagerness to test the system. I strongly recommend that anyone considering this solution pay for the support contract available from Data General, at least for the initial installation. However, after I configured everything, this unit performed flawlessly. I failed over the two systems more than 50 times without problems. Of all the systems we tested, the Data General was the most reliable.

I did encounter a problem with the video switch freezing up occasionally. I solved this problem by not switching back and forth between servers during the initial boot process. In addition, the rack-mounted Barracuda drives had some problems initially as a result of the mounting tracks on the drive units being too long. Data General is aware of this problem and supplied a replacement part.

If you need to implement a clustering solution now and are worried about the future, Data General has you covered. The company has announced its support for Wolfpack on NT Cluster-in-a-Box. Data General is testing Wolfpack beta versions now. The Data General NT Cluster-in-a-Box provides high availability (97 percent or greater uptime) rather than fault tolerance (100 percent uptime). You must make sure your applications can reestablish a connection to a server in case of failover. Another point worth mentioning is that NT Cluster-in-a-Box allows for failover for routine maintenance, giving you even higher availability. If the system fails over four times a month (twice for a system failure, failing over and back, and twice for routine maintenance), and each failover means 30 seconds of downtime, the total is 1440 seconds of downtime a year, or less than 0.005 percent downtime (greater than 99.99 percent uptime).
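The downtime estimate is easy to check by hand. The short calculation below uses only the figures from the paragraph above (four failovers a month, 30 seconds each); the one added assumption is a 365-day year. As a fraction of the year the downtime is about 0.00005, which is roughly 0.005 percent.

```python
FAILOVERS_PER_MONTH = 4
SECONDS_PER_FAILOVER = 30
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumes a 365-day year

# Total downtime over 12 months, in seconds.
downtime_seconds = FAILOVERS_PER_MONTH * 12 * SECONDS_PER_FAILOVER

# The same downtime as a fraction of the year, then as a percentage.
downtime_fraction = downtime_seconds / SECONDS_PER_YEAR
downtime_percent = 100 * downtime_fraction
uptime_percent = 100 - downtime_percent

print(f"Downtime per year: {downtime_seconds} seconds")
print(f"Downtime: {downtime_percent:.4f} percent")
print(f"Uptime:   {uptime_percent:.4f} percent")
```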
