Sun

Update 2010/07/20: We had another failure; after replacing several components the system now appears fine again. The new lesson learned is that apparently the ILOM updates the "System Information: Components" inventory during boot. If the system won't boot or hasn't booted yet, ILOM just shows old (incorrect) information. Additionally, ILOM power readings are unreliable. The old ILOM didn't show any power consumption when the system was running, and the new ILOM (with latest firmware) looks different but still doesn't show 'Actual Power'.

We have had serious hardware and service problems with a Sun server recently. Unfortunately, while the hardware problems can be chalked up to incredibly bad luck, the other problems indicate serious corporate and support flaws at Oracle.

Prologue

We bought a new high-end server (X4540) a year ago, and a 2-hour onsite service contract. After installing it, we discovered the system only saw half its RAM. I called Sun, and they sent out a Field Engineer with a new motherboard. Unfortunately the replacement motherboard didn't work. After 3 days of parts replacement -- a second replacement motherboard, some RAM, and a replacement CPU -- they were unable to get either replacement motherboard to boot. They did eventually get the original motherboard to see all the RAM, though, so we resumed using it.

A little while later, the server became inaccessible. A reboot cleared the problem temporarily, and we discovered the cause was a bad patch breaking Sun's ipf firewall. After a couple weeks of requesting the fix (as a patch), I removed the bad patch and the firewall worked again.

April

In April, this X4540 lost a disk, which should have resulted in an automatic ZFS rebuild onto a hot spare, and the filesystem problems cascaded to disable about 20 dependent systems. I called support Thursday night, asking why the hot spares had not been utilized, and was told the problem was almost certainly a bad disk coupled with a bad disk controller on the motherboard.

Friday morning, an FE brought a new motherboard; unfortunately it didn't work. He got another motherboard and CPU, but the system still wouldn't boot. The daytime phone rep didn't know what was going on, so he escalated to another phone rep who told me (condescendingly) that he knew a lot about the X4500 & X4540 hardware, but it turned out he didn't actually know the basic component configuration. This third phone rep insisted that a bad CPU was causing all our problems, including phantom DIMMs reported in empty slots, etc. He also insisted we needed new DIMMs, a new CPU, and a couple more disks. The whole process -- mostly waiting on hold -- took long enough to kill my phone's battery.

Saturday I met another FE back at the machine room to pick up the parts, which were due by 9am. We got some of them by 10am, but others didn't arrive until later. We resumed the parts replacement dance, and again spent several hours on the phone (I brought my charger this time!), fortunately with a different phone rep (#4, for those of you counting along at home). This gent noticed that the system reported 0V coming into the motherboard, and lots of other voltages were off. At the end of the day, we agreed we needed a new chassis, as the X4540 routes power through the power supplies, into the power distribution board, through the chassis, and into the motherboard (so a fault in any major component can screw up CPU power input, and thus everything). The chassis was the only component we hadn't replaced yet. The phone reps, however, explained that the chassis could not be on-site until Monday morning. So much for our 2-hour SLA -- our Regional Service Manager explained it means an FE will be on-site within 2 hours, but they make no commitment at all on parts delivery. I asked the support reps how we could replace our lemon (at this time, refusing to boot from their fifth motherboard) which they were unable to repair, and was told the service organization could not authorize a replacement. So I called our sales rep, who referred the question back to a counterpart in the service arm.

Sunday nothing happened -- they were unable to provide a replacement chassis.

Monday morning, the second FE and I met at the machine room with a Senior System Engineer to assist and supervise. At this point (including the earlier RAM problems), we had had a complete failure to handle RAID recovery, 4 'bad' motherboards, 3 'bad' drives, and 2 'bad' CPUs. They were escalating internally, and the chassis was due at 9am. At 10am, the FE called the distribution center to ask where the chassis was, and was told it was 'almost there'. At about 11:45 the courier arrived, bringing a few small components, but not the required chassis. Someone in the warehouse had sent the wrong box. The courier explained that it would take 60-90 minutes to get us the chassis, because it took him that long to drive to the machine room from the warehouse -- meaning he left after 10am. So not only did the warehouse send the wrong part, they sent it after the promised delivery time: when they told us at 10am that the delivery was 'almost there', it hadn't even left the warehouse. More calls, and someone explained that the chassis was not available -- they would have to send one from Boston, and it couldn't arrive Monday at all.

Tuesday they sent back an FE with 2 SSEs and the chassis, and the system came up. This ended the outage that had disabled 20 machines since Thursday night.

May/June

A month later, we received some disk alerts, apparently because we were supposed to mark the ZFS pool as repaired after the incident -- but we didn't know that, and Sun never told us.
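For anyone else bitten by this: the step we apparently missed is standard ZFS administration rather than anything Sun-specific. A sketch -- the pool name 'tank' is made up for illustration:

```shell
# After a failed device has been replaced and resilvered, ZFS still remembers
# the old errors until you explicitly clear them.
zpool status -x    # reports the pool as degraded/errored until cleared
zpool clear tank   # mark the devices repaired and zero the error counters
```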

On the next reboot, Solaris started logging errors that both the boot disks were offline (while running from these same disks). Eventually I was told that this was due to a bug in the kernel patch, which I backed out.

After rebooting we started seeing errors from another disk. When I asked the Sun case owner how to fail over to the hot spare until we could physically swap it out, he eventually sent me an unhelpful snippet from the manual page. Our SSE actually sent me a separate document with the correct command.
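For reference, manually pressing a hot spare into service is standard ZFS administration; this is a sketch from memory with made-up pool and device names, not the exact text the SSE sent:

```shell
# Replace the failing disk with one of the pool's configured hot spares.
zpool status tank                  # identify the failing device and the spares
zpool replace tank c1t5d0 c5t7d0   # resilver onto spare c5t7d0 in place of c1t5d0
# Once the physical disk is swapped in and resilvered, return the spare
# to the spares list:
zpool detach tank c5t7d0
```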

New policy: no more Solaris patching. Between this bug and the patch which broke networking, clearly Sun no longer tests patches adequately, and we cannot trust them.

After the disk replacement, the system once more sees only 32gb of its RAM. We are ordering a replacement storage system from another company, and will avoid breathing on this X4540 until we can migrate off it. It's clearly not trustworthy, and Sun is clearly incapable of supporting it.

Recap

Over two incidents I spent about 6 days at the machine room, well over 20 hours on the phone (much of it on hold), and watched Sun replace 4 motherboards, at least 2 CPUs, several RAM sticks (although they never just sent a full set of 16 4gb DIMMs), a PDB, and a chassis. That amounts to every major component of an X4540. The chassis should have been replaced Friday or Saturday, but only arrived Tuesday.

Lesson

On this one system, we have seen multiple failures of multiple different types.

Undiagnosed failure (apparently in the chassis), which prevented 4 motherboards from working.

SATA controller failure (the first I've ever heard of).

Automatic ZFS hot spares didn't fail over.

A 'backline' phone tech was completely wrong, and obnoxious.

Warehouse staff failed to send the right part, failed to deliver parts on time on all 3 days, and lied about courier/delivery status.

Warehouse stocking is inadequate -- it took us 3 days to get a part.

Support escalation was a complete failure. It took about 3 weeks before I got any response from management other than "I'll get back to you."

In less than 18 months, this system has experienced 2 major hardware incidents, encompassing over a week of downtime. ZFS hot sparing has not yet worked, but has instead failed twice.

We have twice installed recommended patches with serious flaws, once making the system entirely unusable.

We have had entirely too many problems appear after reboots. Perhaps there is a disk scanning process that is automatically started after rebooting, but the result is that we do not trust this machine, and are afraid to reboot it.

Oracle's support is a mess. I feel like an idiot for buying this system.

Check contract SLAs carefully. I believed that this support level included parts availability within 4 hours (EMC, at least, used to make a big deal out of their 4-hour parts availability in NYC, for instance), but Sun makes no commitment for timeliness of parts replacement.

Today's issue: the RPM installs into /gridware/sge (fine), but the installer doesn't work unless you put the SGE software in the desired directory before running it. This is not the way to do it, guys! So I'd install the RPM, move all the files it just installed into the desired destination directory, run the installer, then move the original files back to /gridware/sge (hopefully without disturbing the actually installed version), and then remove the RPMs, I guess, since they don't do anything useful. Okay, I give up -- I accept that Sun's SGE RPMs are worthless. I give in -- I'll install from source. Oh, and while cleaning up I noticed that the RPMs install unclaimed files, so deleting the RPMs leaves cruft behind. The person who built the RPMs must really hate their job.

And... the source tarballs are tarred, gzipped, and zipped. Who was the genius behind this? We can hope that Oracle will kill whatever insane Sun website policy required such redundant packaging, but I'm not holding my breath -- more hoping that SGE continues to exist after Oracle notices it.

Just finding the files is amazingly complicated. http://gridengine.sunsource.net/ appears to be the old SGE site -- it doesn't offer 6.3 releases -- but it lacks a pointer to the new site. I thought 6.3 wasn't really available yet, until I remembered seeing a totally different download site, and found it again.

I found 4 ways to get SGE electronically (there are also CD media, but for clusters who cares?):

CVS source. The CVS tree includes instructions for building, but I found several inaccuracies and problems. I didn't get it built, so I don't know how serious the problems are. Compiling from source isn't really appropriate for AMIs anyway, so I stopped working through the issues.

RPMs should install into the right places and be ready to go with chkconfig, but instead Sun decided to unpack them into /gridengine/sge, which doesn't even follow Sun's /opt convention. Worse, they do not install init scripts, or even provide init scripts suitable for symlinking. Instead the unpacked installer must be run to customize the init script templates. What were they (not?) thinking?!? The inst_sge installer doesn't actually copy any files -- you have to manually copy them to the right place, making the RPM even less useful (the workaround is probably to make /gridengine/sge a symlink to the desired location, assuming rpm will install under a symlink).
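The symlink workaround mentioned above would look something like this. Untested, and it assumes rpm will happily install through a symlinked directory; paths and package names are purely illustrative:

```shell
# Redirect the RPM's hardcoded /gridengine/sge into the directory we want.
mkdir -p /opt/sge
ln -s /opt/sge /gridengine/sge
rpm -ivh sge-*.rpm    # package filenames are a guess; the files should
                      # land under /opt/sge via the symlink
```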

At this point, you might say "Wow, documentation is needed to explain this hideously complicated situation!" And you'd be right, but apparently Sun hasn't figured that out. When I went looking for an explanation of this convoluted state of affairs, the best I could find was http://wiki.gridengine.info/wiki/index.php/Main_Page#Is_Grid_Engine_commercial_or_open_source_software.3F, which hints that the split may be symptomatic of a deliberate commercial vs. open source split. If so, Sun's mishandling it amazingly -- these pages do not identify themselves as referring to the open source or commercial flavor, or even acknowledge the existence of an alternate product.

To make sorting this out just that little bit tougher, the binaries (both tarball & RPMs) completely lack documentation -- not even a URL for Sun's online docs. Adding insult to injury, the install docs explain how to unpack a tarball, but don't even acknowledge the existence of the RPMs. Apparently Sun decided that the RPM must be equivalent to the tarball -- it provides the files so you can run Sun's installer -- instead of being a proper RPM, which should fully install the software. This would be obnoxious and shortsighted if I hadn't already noticed that Fedora has an SGE RPM, and Scalable Systems produced an SGE RPM in 2002 -- including full integration with either plain RHEL or Rocks. Apparently Sun doesn't want something that works -- instead they prefer to force people to use their lame installer, which took over 1,500 lines for a basic install!

Sun and Microsoft are about the only 2 large companies based on the proposition that Linux isn't the best operating system. IBM supports several OSes, but they strongly support Linux for most applications. HP & Dell are happy to sell hardware to run anything a customer wants to pay for, although they are both Windows-biased (and Dell continues to have serious trouble with Linux).

That's why I was so surprised and amused to discover that ILOM, Sun's Integrated Lights-Out Management system which is used to manage Sun's current x86 servers, is running Linux. So Sun is using Linux to make Solaris systems more reliable. I found a reference to Linux underpinning ILOM a few weeks ago, and still chuckle every time I think of it. I had a better reference, but cannot find it today.

That said, this was probably the right choice. Nobody's going to build a tiny system management system around Solaris, and rebuilding one and coping with the inevitable bugs in such a constrained and important system would have been a huge waste.

Sun has done several things to differentiate the new 7000 series from their existing server models.

They all use the Fishworks custom build of OpenSolaris. This is obviously not the very latest release, due to Sun's testing and qualification cycle. I was surprised it's not Solaris 10. It's easier to get patches into OpenSolaris than Solaris proper; for example, the kernel CIFS server has been in OpenSolaris for a while, but will not make it into Solaris proper until Solaris 11.

Sun has integrated ZFS patches (and presumably non-ZFS patches as well); this is easier to do in or on top of OpenSolaris proper. They are all intended to reach Solaris eventually.

Sun adds the Fishworks web-based GUI. It handles all admin tasks, including installation (hopefully better than the Solaris 10U3 installer, which was too stupid to set up hot spares on a Thumper), patching, and configuration (including networking, ZFS, and sharing). The GUI is pretty extensive -- it handles link aggregation, LDAP/AD integration, DTrace analysis, fault isolation, etc.

All models reserve a pair of mirrored disks for the OS, configuration, and logs.

Although the X4540 supports chaining J4500 JBoDs for increased capacity, the 7210 does not. This is unfortunate, as the 7210 & J4500 are twice as dense (48 top-accessible 3.5" bays) as the J4400 (24 front-accessible 3.5" bays).

They can "phone home" to Sun for diagnostics; Sun can proactively send replacement components (drives), and can also detect a crashed host if it doesn't make the daily call.

The 7210 & 7410 offer Logzilla, and the 7410 includes Readzilla, which are not otherwise available.

The J4500 JBoD is basically a lobotomized Thor -- CPUs removed in favor of SAS ports. There is a small price savings, but ZFS makes it easy to present all the disks as a larger ZFS pool. If that's not a hard requirement, multiple X4540s provide better performance.

Readzilla & Logzilla are quite interesting. Readzilla is a 100gb 2.5" SSD (flash drive). It's intended to serve as cache in a 7410 controller, which has 6 available bays for Readzillas. Sun doesn't support normal hard drives in these bays, because that would interfere with failover. So instead they reserve the 7410's internal drive bays for the OS and read cache.

Logzilla is a more exotic SSD. It's a 3.5" 18gb low-latency store for filesystem logs (journals): the ZIL (ZFS Intent Log). Logzilla combines DRAM (the working cache), flash memory (to store the data from DRAM in case of a failure), and a supercapacitor with enough juice to copy the data from DRAM to flash in an emergency.

Basically, when an application (particularly a database) writes data and needs to ensure it has been recorded, it instructs the operating system to flush the data to stable storage, to ensure that even in the event of a crash or power outage the data won't be lost. File systems do this too, to ensure that the metadata (directory structure) is valid -- it's not safe to create a file if its parent directory might not have been created/recorded, for instance. The problem is that disks are the main type of stable storage, and writing to disk takes significant time -- the data must be transferred from the CPU to the disks, and then the disks need to spin around and write the data in the right places. This is aggravated by RAID levels 2-6, which require extra disk reads and parity calculations. The application (user) ends up wasting time waiting for data to be stored safely on disk.

Storing data in a DRAM cache is much faster, but if the system crashes or power fails, data in DRAM is lost. So when an application requests a flush, Sun copies the data from DRAM in the 7410 controller to DRAM in the Logzilla, and the application continues. This way, even if the OS crashes or power fails, Logzilla itself has enough intelligence to copy the data to flash. Since flash doesn't require power to retain data (just drain your iPod to confirm this), the data is available when the system is ready to read it and flush it to disk.

Our Sun presenter, Art, talked about wear leveling and a 5-year lifespan for Logzilla's flash, but I don't understand why this is a factor -- it seems like the flash should only be written to in case of emergency. Clearly I'm missing something.

So the architecture ends up slightly odd -- Readzilla cache is inside the 7410 controllers, while Logzilla cache is outside the controllers in the JBoDs. This is because all the data needs to be available to both controllers in a redundant configuration. If controller A gets data from a client, writes it to Logzilla, and then crashes, controller B can access the Logzilla and its data via the shared SAS fabric, so no data is lost -- just as it can access the 1tb disks. Internally, this is a zfs import operation, and Logzillas are just part of the pool. Readzilla doesn't have this constraint, though -- if controller A fails with data on its Readzillas, controller B can just fetch the data from the SATA disks. There's a performance hit as the cache refills, but no data loss. The design assumes that much more data is read from Readzillas (via private SAS connections) than written to Logzillas (via shared SAS connections) -- a safe bet.

Right now, the X4540 looks more attractive to me. The 7210 price is considerably higher, I don't think we really need Logzillas/Readzillas, and 7210s do not support J4500s for extending the zpool. The 7410 is impressively engineered, but we don't need HA clustering and it takes up much more rack space than the considerably cheaper X4540. As you add J4400s, the density gradually approaches 50% of the X4540's. Sun's list price is $116k for a 7410 with 2 Logzillas, 1 Readzilla, & 34 disks in 8U -- compared to $62k for a 4U X4540 with 48 disks. No, I don't know why the single-controller 7410 comes with a 12-bay J4200, rather than the 24-bay J4400, but Sun doesn't sell J4200s for the 7000 series.

Through all of this, don't be taken in by Sun's (or any other vendor's) capacity numbers. A 1,000,000,000,000 byte "1tb" disk provides about 931gb of usable space, because operating systems use base 2: 10^12 / 2^30 = 931gb (10^12 / 2^40 = .909tb). Even worse, some of those disks are needed for parity and hot spares, so the realistic capacity of a Thor with RAIDZ is in the 30-35tb range -- less under RAIDZ2 -- and each Logzilla or hot spare subtracts from the usable space.
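The arithmetic is easy to check with bash's arithmetic expansion:

```shell
# "1tb" marketing capacity = 10^12 bytes; an OS gb = 2^30 bytes.
echo $(( 10**12 / 2**30 ))          # gb per "1tb" drive -> 931
echo $(( 39 * (10**12 / 2**30) ))   # 39 data drives' worth -> 36309 gb, ~35tb
```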

Sun has a handy table of usable space for the 7110 & 7210, but note that it ignores the base 10 vs. base 2 differential, so remember that those "1tb SATA" drives are really .9tb. Unfortunately to calculate sizes for a 7410 you need a program that's part of the Fishworks installation (details on that page).

Tangential irony: Sun offers VirtualBox as a free virtualization system, but the Sun Unified Storage Simulator is a VMware VM. It provides the full software stack, so you can run through the installation procedure and set up shares (and run the 7410 capacity planner), but the storage is VMware virtual volumes rather than real disks. Clever, but why isn't this available as a VirtualBox image too?? Perhaps because VirtualBox only supports 3 disk devices -- fix it, guys!

Update: As of 2009/02/02, Sun offers a VirtualBox image, but for some reason it's 1,136mb instead of 418mb. Now it's a VB issue to make their images more efficient, rather than the Fishworks team's task to provide a VB image. I just found a nice overview from the launch.

Update: As of 2009/02/04, Sun's VirtualBox image is format v1.5, which requires conversion to format v1.6 to run under VB 2.1.2, released a couple weeks ago. The included 'install' script wasn't executable; when I did run it, it complained 16 times while creating the VM image. The conversion didn't work quite right, either -- I had to manually reattach the 16 virtual SATA disks. On the other hand, this demonstrates that VB can indeed use more than 3 virtual hard disks, which is well done by the VB team.

I'm setting up a 2gbps connection for the aforementioned Thumper, and wading through all the various ways of doing things, many of which do not work. Here's what I've found.

IPMP is IP Multipathing. This automatically shifts an IP address between multiple interfaces to ensure that even if the primary interface goes down, the IP is still available on another interface. IPMP is for high availability, and doesn't inherently provide more bandwidth than a single connection.

Sun Trunking supports combining up to 4 or 16 interfaces (depending on the underlying hardware) into a single higher speed link. Sun apparently used to charge for Trunking, but bundled v1.3 into the first release of Solaris 10 for free, and then removed it from later releases of Solaris 10 (quite surprising behavior, actually).

Apparently more large changes are planned under Project Clearview, but it's been underway at least since 2005, so I'm not sure how soon Clearview will be complete.

dladm itself is in a state of flux, too. I found a couple of blogposts which describe how to create aggregates, using a numeric key as identifier. But Sun's manual page says this is deprecated in favor of aggregate names. Unfortunately, the manual page's recommended syntax doesn't work -- it describes something newer than the current release of Solaris 10 (10/08, or Solaris 10 U6). That's ridiculous. Perhaps Sun's (commercial) Solaris documentation actually describes OpenSolaris...

Link aggregation (or teaming, bonding, or channeling) doesn't change the physical properties of the physical links. There are still multiple independent circuits at their rated speed (1gbps Ethernet in our case), rather than a real higher-speed link (there is no such thing as a 2gbps circuit) -- the host & switch just balance traffic between the multiple available channels.

There are a few different algorithms for determining which packets go across which link to spread the traffic (hopefully evenly). Basically all three consist of taking a hash of some packet header data and matching that against one of the available channels. With the L2 policy, the determination is based on MAC addresses. With the L3 policy, it's determined by IP addresses. With the L4 policy, more of the TCP/UDP headers are used (including ports). On Solaris, the policy is chosen with dladm's -P option (-P L2, -P L3, or -P L4) when creating or modifying an aggregate.
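On Solaris 10 U6 this still uses the older key-based syntax; a sketch with illustrative device names (later releases replace the numeric key with a link name):

```shell
# Build a 2-port aggregate (key 1) hashing on L4 (TCP/UDP port) headers.
dladm create-aggr -P L4 -d e1000g0 -d e1000g1 1
dladm show-aggr 1     # verify the aggregate's members and policy
# Newer (OpenSolaris-era) syntax uses named links instead of numeric keys:
# dladm create-aggr -P L4 -l e1000g0 -l e1000g1 aggr0
```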

We're using L4 because most of our traffic is from an instrument to one server, and then from that server to a backup server. The backup server mostly communicates with a single partner, so hashing on MAC or IP would leave all the traffic over a single link, and gain no benefit from the second 1gbps link. For a general purpose backup server (NetBackup, NetWorker, etc.), it shouldn't matter, unless one backup client is much faster/busier than the others and tends to dominate network bandwidth, in which case L4 might be useful.

dladm is noteworthy because it's one of very few Solaris commands that makes a persistent change directly. Things like configuring IP addresses and changing running processes generally need to be codified in configuration files or init scripts to persist past reboot, but dladm takes care of recording and re-applying configuration changes behind the scenes. This is a welcome attempt to make administration simpler and easier, also seen in ZFS.

We have a couple X4500 "Thumper" servers here. An X4500 is basically a standard dual-Opteron chassis, with 48 SATA drive bays added on. This provides a raw capacity of 48tb with 1tb drives, or over 30tb usable in a realistic RAID environment. I love the informative labels on the outside (I haven't yet popped the cover to see the internals):

If you open the cover, you must close it within 60 seconds to avoid frying components.

If you lift the chassis, you will bend the sheet metal. I am told after removing all 48 drives and both power supplies, it's still difficult for one person to lift the empty chassis.

Solaris cannot yet boot from RAIDZ, so it's common to install the OS onto a single disk, mirror it with Sun Volume Manager (DiskSuite), and then use the 46 remaining disks for RAID and hot spares. RAIDZ works better with < 10 drives in a stripe, and can be used with single parity (RAIDZ1) or double (RAIDZ2).

A reasonable configuration would be 2 hot spares, 4 9-disk RAIDZ sets, and an 8-disk RAIDZ set -- all RAIDZ1 or all RAIDZ2, because mixing different drive types or protection levels in a single ZFS pool is not recommended. A higher performance configuration would use smaller RAIDZ sets, or even ZFS mirroring. 4 9-disk and 1 8-disk RAIDZ1 stripes provide 4 * 8 + 7 = 39 drives of capacity. With RAIDZ2 it's 4 * 7 + 6 = 34 drives usable. Of course, '1tb' drives don't provide a real 1tb usable, because drive manufacturers use base 10 and operating systems use base 2. The formatted capacity of our '1tb' drives is 931gb, so we would max out at 39 * 931gb = 36,309gb, or about 35tb usable. Impressive, but not quite the 48tb advertised.
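Scaled down for brevity, creating such a pool would look something like this; a real 46-disk X4540 layout just extends the same pattern, and the device names are illustrative:

```shell
# Two 9-disk RAIDZ1 stripes plus two hot spares; a full X4540 layout would
# continue with more raidz vdevs up to 44 data disks.
zpool create tank \
  raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c2t0d0 \
  raidz c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c3t0d0 c3t1d0 \
  spare c3t2d0 c3t3d0
zpool status tank    # confirm the vdev layout and spares
```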

My current problem: Only 2 of the 48 disks are bootable, and their naming within Solaris is not consistent. In the list below, 46 disks report as ATA-HITACHI, but c4t0 & c4t4 report as DEFAULT. I'm pretty sure I should just mirror onto c4t4, but not enough to proceed with putting the server into production without verification. I've found docs across the Internet that refer to c5t0 & c5t4, as well as c6t0 & c6t4, so it's not as simple as it really should be.

"Which disks can I boot from?" is a fundamental question for a system that cannot boot from most of its attached disks, but when I called 800-USA-4-Sun yesterday, I explained my issue to a very nice gentleman who told me to pick the first two disks from format output. That's apparently wrong for an X4500, and after I explained the issue further he started finding and reading the same docs I had been. A particular favorite is "Upgrading to ILOM 2.0.2.5 Changes Controller IDs" in the Sun Fire X4500 Server Product Notes, which refers to an hd command not present on my fresh Solaris 10 10/08 installation or the pre-installed Solaris 10 11/06 that came on the other Thumper. I also found Important Solaris OS Installation and Bootable Hard Disk Drive Guidelines in the Product Notes, which says to use the first two disks returned by cfgadm -al -- c0t0 & c0t1 on this system. That's not right either, though those instructions are meant for use within the Solaris installer.

Today I called back, and was eventually told there is only one person who would have the answer to my question -- but he was busy on another call!

This makes the "okay-to-remove" LEDs on the drives essential -- with a RAIDZ1 set, if one drive goes bad and I accidentally remove a good drive in the same set, at best the whole pool will go offline (and no data will be lost).