Behold Thumper: Sun’s SunFire X4500 storage server

In July of 2006, Sun added an interesting new machine to their lineup of Opteron-based x86 servers. It was codenamed ‘Thumper’ during development and designated the SunFire X4500 when it was launched. It is essentially a 4U rack-mount, dual-AMD Socket 940 server with 48 SATA disks. In my real life as a sysadmin for a large company, I was intrigued by the new direction in storage systems that Sun is experimenting with. As a PC enthusiast, I was impressed by the simplicity and scale of it.

The recipe for the Thumper is simple: commodity bits we’re all familiar with.

Sun, as we can see, has built the Thumper around the copious system bandwidth available on the Opteron platform, an area where AMD is still competitive with the fastest Intel Xeon CPUs. The 48 SATA drives are connected via a passive backplane to power and to the six Marvell SATA controller chips on the mainboard. From there, the design of the system shows attention to balancing I/O throughput along the entire path back to the CPUs. Each drive is connected to a dedicated SATA port, unlike even most high-end storage systems, which group multiple drives on a common bus. Each of the six eight-port controller chips has a dedicated PCI-X connection, which works out to about 133MB/s per drive, plenty to keep each drive reading and writing data as fast as it can move bits on and off the platters. Those PCI-X connections are, in turn, bridged onto the 8GB/s CPU HyperTransport links by AMD 8132 chips, again leaving enough headroom for all the controller chips to feed data into the CPUs at once. The two PCI-X slots also have dedicated connections, and the system peripherals, including four Gigabit Ethernet ports, hang off downstream HT links on the tunnel chips.
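The per-drive figure is easy to sanity-check. A back-of-envelope sketch, assuming 64-bit/133MHz PCI-X buses and eight drives per controller as described above:

```shell
# Rough check of the bandwidth claims. Assumptions: each Marvell controller
# sits on its own 64-bit, 133MHz PCI-X bus (8 bytes x 133MHz = 1064 MB/s),
# serves 8 drives, and there are 6 controllers in total.
awk 'BEGIN {
  pcix = 8 * 133                                  # MB/s per PCI-X bus
  printf "%d MB/s per drive\n", pcix / 8          # bus bandwidth split 8 ways
  printf "%.1f GB/s aggregate\n", 6 * pcix / 1000 # all six controllers at once
}'
```

The aggregate works out to roughly 6.4GB/s, which lines up with the 6GB/s of disk bandwidth claimed below.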

So what does this add up to? A file server with 48 individual disks and a theoretical 6GB/s of disk bandwidth. Because the disk controllers are simple SATA host adapters with no RAID intelligence, the installed OS is going to see all 48 as individual devices. If you were to install Windows, I suppose you would have drive letters from C: to AX:, but would you really want the poor machine to suffer like that? The solution is to use your operating system of choice’s software RAID functionality. Software RAID has fallen out of favor these days, in favor of dedicated hardware to offload that task. That made a lot of sense when 200MHz Pentium Pro processors cost $1000, but most servers these days have plenty of CPU cycles available. Additionally, the RAID controller has become the bottleneck between disk and CPU in many current server configurations.

Another downside of software RAID has always been increased complexity in the OS configuration. Sun has given us another neat piece of technology to assist here: ZFS, a new filesystem available in Solaris 10. With ZFS, all of the various layers of storage management have been rolled up into the filesystem: configuring RAID sets, aggregating them into one logical volume, and then formatting and mounting it as a filesystem is accomplished in a single step. There are some examples out there, and while those are some of the longest commands you might ever have to type, most of their length is taken up listing all the disk device names.
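To illustrate the single-step idea, here is a hypothetical Solaris 10 session. The pool name and the c-t-d device names are made up for the example, and a real X4500 command line would list far more disks:

```shell
# Build one pool from two RAID-Z groups of five disks each, striped together.
# ZFS creates the pool AND mounts a filesystem at /tank in this one step:
# there is no separate format, newfs, or mount command to run.
zpool create tank \
    raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 \
    raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0

# Show the resulting layout and health of the pool.
zpool status tank
```

On a machine with 48 drives, the tedious part is simply enumerating all those device names on the `zpool create` line.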

I know this all reads like an advertisement, and maybe I’ve drunk the purple Kool-Aid, but it’s hard for a server geek not to get excited about this. The combination of the X4500 and ZFS delivers a level of performance and capacity that matches some high-end enterprise storage arrays. Simple published benchmarks put the real-world read throughput of this configuration over 1GB/s. That’s a level of performance that would take an exotic configuration of four 4Gb/s host bus adapters to equal in the Fibre Channel world, and that’s if your array controllers were capable of feeding data at that rate. All this comes at a cost that is very low by enterprise storage standards: the model with 48 1TB drives lists for about $60,000, a delivered cost of about $1.25 per gigabyte. This presents new vistas of capability to system engineers, and new challenges as well. We can offer a petabyte of online storage for a little over $1M, taking up only two racks in the computer room. Problems that would have broken the entire IS budget are approachable now, but while we can afford the primary disk, the tape infrastructure to back it all up remains unattainable, not to mention that it would take weeks to move 1PB from disk to tape.
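The cost figures are straightforward to verify. A quick sketch, assuming decimal units (48 × 1TB = 48,000 GB per box), the $60,000 list price, 4U per chassis, and 42U racks:

```shell
# Check the cost-per-GB and petabyte figures quoted above.
awk 'BEGIN {
  printf "$%.2f per GB\n", 60000 / 48000           # $60,000 over 48,000 GB
  units = int((1000 + 47) / 48)                    # boxes for 1 PB (1000 TB), rounded up
  printf "%d boxes, $%.2fM, %d rack units\n", units, units * 60000 / 1e6, units * 4
}'
```

Twenty-one boxes is 84U of rack space, which is exactly two full 42U racks, for about $1.26M.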

Good problems to have, I guess. At least I don’t have to worry about where to save all my Linux install ISOs anymore.

P.S.:

I’ve been putting in some time working with phpBB 3 lately. I expect we’ll be taking some downtime in the next month or so to upgrade.

I was looking at this a few weeks ago. However, my budget cannot cover it ($20K). It’s very cheap for what you get; I spent around $18K and only got a 6TB iSCSI setup.

Very cool and very cheap.

Anonymous Hamster

12 years ago

Why bother with tapes? Why not just have another one or two of these things set up offsite? Of course, now the limitation is bandwidth between sites.

evermore

12 years ago

The fastest tape library I found in a quick search was the Exabyte Magnum 224, which has an uncompressed storage rate of a bit under 1Gbps. A gigabit WAN link is expensive, but if you’re buying two $60,000 computers just for storage, maybe not too bad.

The thing about tapes, though, is that you can continuously archive the data, never overwriting the old data if you don’t want to. Just use new tapes each time. Or you could have enough tapes to make a backup every day for a week before rotating back to the first tape of the week. Buying SEVEN of these might be a bit much for many companies, considering it’s meant to be a cheap(er) SATA-based option for companies that may not be sitting on piles of cash. Of course, with the time it’d take to perform a full backup, you won’t be rotating backups daily, even if you do incrementals.

I don’t know where the petabyte figure even came from though. It’s only a 48TB storage system. You’d need 20 of them to hit a PB. 48TB at 1Gbps, assuming no compression and no delays with tape changes, would take 106.67 hours. Call it one solid week with actual throughput and mechanical delays. By the time you get halfway done, the stuff you’ve already written is no longer the same “version” of the things you still have to write. Plus you’d need to use about 40 to 60 tapes, enough to consume all the storage in an entire tape library.
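That 106.67-hour figure checks out, assuming decimal terabytes and the stated 1Gbps with no overhead:

```shell
# 48 TB pushed through a 1 Gbps pipe, no compression, no tape-change delays.
awk 'BEGIN {
  bits = 48 * 10^12 * 8              # 48 decimal terabytes, in bits
  printf "%.2f hours\n", bits / 10^9 / 3600
}'
```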

Wow, it took 4 years to develop this monster. Only 1 year to go before AMD totally obsoletes the Socket 940 Opterons, which reach the end of their longevity program.

Evermore, the bridge chips can’t sit between the CPUs. The link between the CPUs is the coherent link.
The CPU to Bridge HT links are non-coherent.

Obviously, because the system was designed 4 years ago, Sun designed in PCI-X rather than PCI-Express, which would have been a much better solution from a bandwidth and reliability standpoint.

If the Opteron had a built-in LPC interface, there would have been no need for the 8111 chip. Those 8132s are monster-sized when you have to use five of them.

I can see Sun replacing those Opterons/8132s with Intel QuickPath-based components in the near future, reducing the primary vendors from 3 to 2.

evermore

12 years ago

Wasn’t clear, I didn’t mean “between” them as in replacing the coherent link. Just between them as in the bridge chip having an HT link to both CPUs instead of just one, so the traffic from one CPU doesn’t have to go through the second in order to get to the bridge. The processors only have 3 HT links available, but they’d be able to eliminate bridges on each side if the two CPUs were connecting through one. Trying to lay it out in my head, I guess with all those drives they’re trying to integrate, and with only 3 HT links on each CPU and only 2 on each bridge chip, they wouldn’t be able to do it right using those components.

Does any CPU have an LPC interface integrated, even in the server world? Usually the LPC link is just off the southbridge. Since they’re not using any actual southbridge they need the I/O hub, even if only for the USB ports, since the LPC link wouldn’t be able to run those, nor the VGA controller (which doesn’t need to occupy the PCI-X slot since it’s linked into the 8111, probably PCI).


UberGerbil

12 years ago

I do like the cropped version of the photo that shows up when this rotates into the top “featured article” spot. You can see what it is, but if you’re not looking closely (or you weren’t a techy) it just looks like the Borg cube, or some kind of abstract art. Blow it up to wall size and it could be something by Burtynsky: http://www.edwardburtynsky.com/

Convert

12 years ago

I think I would need a sturdier rack to hold that thing. In addition to the small portable crane of course.

SuperSpy

12 years ago

And here I sit, complaining that it’s taking forever to back up ~600 GiB while I rebuild the RAID volume on my home server.

Flying Fox

12 years ago

In case some of you people can’t read properly: notice the big banner on top stating that this is a blog post.

WaltC

12 years ago

Given a choice, I always prefer to read opinions labeled as “blogs” than to read opinions labeled as “news”…;) That’s really the confusing aspect for most people: they often don’t know the difference between an objective, unbiased news story and a news story that only pretends to be objective and unbiased.

In this case, what’s not to like? The author likes something and tells us why he does. It’s difficult to see anything “wrong” with that…;) And, refreshingly, he sticks to his subject and doesn’t use it as a platform to bash anything else on the planet. There’s definitely an example here that I would hope would be highly emulated across the Internet by all the aspiring tech journalists out there: if the title of your article or review is “A,” then don’t spend 60% of the wordage in your article bashing “B.”

Krogoth

12 years ago

Wow, this is like a uber-NAS device or a powerful component of a SAN. 😉

UberGerbil

12 years ago

Well, no. It’s got way too much CPU to be just a NAS device (even with software RAID soaking up some cycles). That’s a bit like putting a gaming GPU in an ATM or point-of-sale system.

This box is intended to run a DB server, not just be (relatively) dumb storage. Sun is suggesting it should be used for database workloads like TPC-H: http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=107101502

sroylance

12 years ago

It may be over-spec’ed as a ‘NAS’, but I’m pretty sure it’s cheaper per GB than any NAS device of similar performance or capacity. There are no software or license charges; Solaris and ZFS are free. If you can deal with the complexity of configuring the OS, you can get almost all the features of a NetApp filer out of ZFS.