When I first got involved in data center networking in the early 1980s, there were several competing technologies. The two leaders were Ethernet and Token Ring, and although Bob Metcalfe had invented Ethernet, his first company 3Com actually sold both. Within a couple of years, economics, obstinacy by IBM, and a patent troll had taken Token Ring out of the picture, and Ethernet ruled. It quickly evolved from its shared-media topology: in 1987 SynOptics introduced the first Ethernet Hub, and two years later Kalpana broke the mold with the first Ethernet switch. Many of us concluded that whatever the future LAN technologies might look like, they would be called Ethernet.

The history of protocol stacks roughly paralleled that of LAN technology. In the early 1980s there were many candidates – NetWare, XNS, ARCnet, NETBEUI, OSI, AppleTalk, and others, as well as TCP/IP. By the end of the decade, TCP/IP had won. Some companies rehosted their application protocols on top of TCP/IP (I’m ashamed to say that my name is on the RFCs for NetBIOS-over-TCP), but most disappeared or pivoted away, like Novell.

Over the last 20 years, we’ve seen a steady process of convergence around Ethernet and TCP/IP. (Metro Ethernet is a fascinating and unexpected example.) Fibre Channel was introduced in 1988 as a replacement for HIPPI in storage area networking. Twenty years later some companies tried to layer the FC protocols directly over Ethernet (FCoE). Most regard this as a failed experiment: although it slightly simplified cabling, the FC protocols were too inflexible to work well in a noisy LAN, and the lack of routability conflicted with data center networking practices. Instead, people started to experiment with storage protocols running over TCP/IP: iSCSI for block access, S3-like HTTP-based protocols for moving large objects around, and the perennial NFS and CIFS for file access.

One area that has so far remained untouched by this process of convergence is the connection between storage devices and computers. Even though the actual technologies have evolved – IDE, ATA, ATAPI, PATA, SCSI, ESDI, SATA, eSATA – the most common storage interconnection topologies are pretty much the same as those IBM introduced with the S/360 mainframe in 1964: a controller device integrated into the computer, communicating with a small number of storage devices over a private short-range interconnect. The “private” bit is important; although various techniques have been created for shared (multi-master) access to the interconnect, all were relatively expensive, and none are supported by the consumer-grade drives which are often used for scale-out storage systems.

Historically, storage servers have been constructed as “black box” turnkey systems, from the Auspex NFS servers in the 1980s to the storage arrays from vendors like EMC and NetApp. More recently, people have been constructing interesting scale-out storage services from commodity hardware, using an x86 with a tray of consumer-grade disks as a building block. However, these architectures are constrained by the single point of failure and performance bottleneck introduced by the private interconnect between CPU and disks. (One odd consequence is that it is often hard to put together an economical “proof of concept” system, because the scale-out algorithms perform poorly with a small number of nodes.)

Over the years there have been various attempts at re-inventing this pattern. Most of these are based on the idea of moving more of the processing to the disk itself, taking advantage of the fact that every disk already has a certain amount of processing capacity to do things like bad sector remapping. Up until now, these efforts have been unsuccessful because of cost or architectural mismatch. But that’s about to change.

Yesterday Seagate introduced its Kinetic Open Storage Platform, and I’m simply blown away by it. It’s a truly elegant design, “as simple as possible, but no simpler”. The physical interconnect to the disk drive is now Ethernet. The interface is a simple key-value, object-oriented access scheme, implemented using Google Protocol Buffers. It supports key-based CRUD (create, read, update and delete); it also implements third-party transfers (“transfer the objects with keys X, Y and Z to the drive with IP address 1.2.3.4”). Configuration is based on DHCP, and everything can be authenticated and encrypted. The system supports a variety of key schemas to make it easy for various storage services to shard the data across multiple drives.
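To give a feel for the access model, here is a rough sketch in Python. This is an in-memory stand-in of my own, not the actual Kinetic client library; every class and method name here is an assumption for illustration only.

```python
# A rough, in-memory stand-in for the access model: key-based CRUD plus
# third-party transfer between drives. Not the actual Kinetic client
# library; all names are illustrative assumptions.

class KineticDrive:
    """Models one Ethernet-attached key-value drive."""

    def __init__(self, address):
        self.address = address  # e.g. "1.2.3.4"
        self.store = {}         # key (bytes) -> value (bytes)

    # Key-based CRUD
    def put(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

    def delete(self, key):
        self.store.pop(key, None)

    # Third-party transfer: "send the objects with these keys to that drive"
    def transfer(self, keys, target):
        for key in keys:
            if key in self.store:
                target.put(key, self.store[key])

src = KineticDrive("10.0.0.1")
dst = KineticDrive("1.2.3.4")
src.put(b"X", b"object-x")
src.put(b"Y", b"object-y")
src.transfer([b"X", b"Y"], dst)  # third-party transfer: X and Y now on dst
```

On the wire, requests and responses like these would be encoded as Protocol Buffers messages over the drive’s Ethernet interface.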

Don’t fall into the trap of thinking that this means we’ll see thousands upon thousands of individual smart disks on the data center LANs. That’s not the goal. (Or I don’t think it is.) EMC or NetApp can still use these drives to build big honking storage arrays, if they want to. The difference is that they have much more freedom in designing the internals of those arrays, because they don’t have to use one kind of (severely constrained) technology for one kind of traffic (disk data) and a completely different kind of technology for their internal HA traffic. They’re free to develop new kinds of internal topologies based on Ethernet, and to implement their services more efficiently using the Kinetic API.

For those vendors who are building out commodity-based scale-out storage, things are even more exciting. It becomes possible to build extremely scalable, highly available configurations using commodity Ethernet switches. And the servers used to implement the external storage service – Swift, Gluster, Ceph, NFS – are likely to change, too: CPU, RAM for caching, multiple NICs, little or no PCI, a little SSD, and no moving parts. Perhaps someone will integrate one into a top-of-rack switch, to produce a very efficient dense array for cool or cold storage.

A bunch of very smart engineers at Seagate have developed this system (that’s Jim Hughes, allowing me to touch a prototype unit), but they know it won’t be accepted if it’s proprietary. So they’re opening up the protocol, the clients, and a simulator for design verification. If everything works out, this will become the new standard interface for disk drives. (And, well, any kind of mass storage.)

“The “private” bit is important; although various techniques have been created for shared (multi-master) access to the interconnect, all were relatively expensive, and none are supported by the consumer-grade drives which are often used for scale-out storage systems.”

I was working on multi-master storage systems using parallel SCSI in 1994. Nowadays you can get an FC or SAS disk array for barely more than a JBOD enclosure. Shared storage is neither new nor expensive. It’s not common at the single-disk layer, but it’s not clear why that should matter.

“Don’t fall into the trap of thinking that this means we’ll see thousand upon thousands of individual smart disks on the data center LANs. That’s not the goal.”

…and yet that’s exactly what some of the “use cases” in the Kinetic wiki show. Is it your statement that’s incorrect, or the marketing materials Seagate put up in lieu of technical information?

“they don’t have to use one kind of (severely constrained) technology for one kind of traffic (disk data) and a completely different kind of technology for their internal HA traffic.”

How does Kinetic do anything to help with HA? Array vendors are not particularly constrained by the interconnects they’re using now. In the “big honking” market, Ethernet is markedly inferior to the interconnects they’re already using internally, and doesn’t touch any of the other problems that constitute their value add – efficient RAID implementations, efficient bridging between internal and external interfaces (regardless of the protocol used), tiering, fault handling, etc. If they want to support a single-vendor object API instead of several open ones that already exist, then maybe they can do that more easily or efficiently with the same API on the inside. Otherwise it’s just a big “meh” to them.

At the higher level, in *distributed* filesystems or object stores, having an object store at the disk level isn’t going to make much difference either. Because the Kinetic semantics are so weak, they’ll have to do for themselves most of what they do now, and performance isn’t constrained by the back-end interface even when it’s file based. Sure, they can connect multiple servers to a single Kinetic disk and fail over between them, but they can do the same with a cheap dual-controller SAS enclosure today. The reason they typically don’t is not because of cost but because that’s not how modern systems handle HA. The battle between shared-disk and shared-nothing is over. Shared-nothing won. Even with an object interface, going back to a shared-disk architecture is a mistake few would make.

The assessment of future storage system architectures is mine, not Seagate’s. As for performance, my assumption (again, personal) is that all of the really latency-sensitive applications have already migrated from disk to flash or RAM. However there will be exabytes of warm, cool and cold data which need to live on cheap, cheap disk (occasionally powered down, but always accessible).

I would like to see Kinetic drives cluster together using some sort of swarming and distribution algorithm, and completely remove the need for middleware like OpenStack Swift, Hadoop, etc. The applications would directly talk to the Seagate Kinetic drives using REST APIs. Is this possible?

It’s not really feasible, Saqib. At the very least, those clients need to have an accurate picture of where all the disks are. Otherwise, two clients might create the same key on two different disks, with different contents. That’s not even eventual consistency; it’s just a mess. Who’s going to maintain that information, or that needed for security? Who’s going to recognize when data needs to be re-replicated, or rebalanced across an ever-changing set of drives? Sure, you can have the clients do all that, but as soon as you start putting authoritative information on clients and expecting them to look after it, they’re effectively partial servers. That in turn gets you into a whole world of problems associated with a large and unstable set of servers, some of them not configured optimally for the role, and that’s an even harder problem than more explicitly server-oriented systems have.
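To make the placement problem concrete, here is a small sketch. The drive addresses and the hashing scheme are illustrative assumptions of my own, not anything from the Kinetic specification.

```python
import hashlib

# The point being argued: every client must shard over the *same*,
# authoritative view of the drive set, or two clients can create the
# same key on two different drives.

DRIVES = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]

def drive_for_key(key: bytes, drives=DRIVES) -> str:
    """Deterministically map a key to one drive in the given view."""
    digest = hashlib.sha256(key).digest()
    return drives[int.from_bytes(digest[:8], "big") % len(drives)]

# Two clients sharing the same view always agree on placement...
assert drive_for_key(b"user/42/photo.jpg") == drive_for_key(b"user/42/photo.jpg")

# ...but a client with a stale view (say, it missed a drive removal)
# routes many keys to the wrong drive, which is exactly why someone
# authoritative has to maintain and distribute that view.
stale_view = DRIVES[:3]
mismatched = [k for k in (("key-%d" % i).encode() for i in range(100))
              if drive_for_key(k) != drive_for_key(k, stale_view)]
```

Keeping that view consistent across an ever-changing set of drives is the coordination work that the middleware layer does today.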

In some ways the system you describe is a lot like how GlusterFS (the project I work on) is structured. We’ve actually had to solve a lot of the coordination problems that Seagate’s “marketeers” don’t even seem to know about. We try to keep the lowest-level “bricks” as dumb as possible, with as much logic as possible on the clients, and still we rely on richer semantics than Kinetic has to maintain adequate behavior and performance.

Before anything like this could work, the Kinetic folks would have to write about ten times more code than they have already, to run on drives that are then effectively micro-servers. They’d probably need faster processors and more memory too – increasing cost and power/heat, decreasing density. At that point why not just use something like the ARM micro-servers that are already here? I have a quad-core one upstairs right now, about the size of a credit card. Tweak that a little bit, amortize the cost of one over several 2.5″ drives (which can be bought for peanuts today), and you have something that can beat Kinetic in every dimension.

“We’ve actually had to solve a lot of the coordination problems that Seagate’s “marketeers” don’t even seem to know about.”

I’ll let Seagate speak for themselves, but I’m pretty sure that the team knows a hell of a lot about coordination problems. (I know some of the systems they built in previous lives.) They deliberately focussed on a single problem: replacing the private disk interconnect with a well understood shared access technology, Ethernet. How you use this is up to you; let a thousand sharded key-value flowers bloom.

Yes, of course you can build a more powerful microserver. So what? As I wrote earlier, I think that the sweet spot for this is very large volume cool/cold storage, with many of the drives being powered off at any time. (It supports wake-on-LAN.) Cost and power are critical; not latency, not rich semantics. It’s easy to put the smart stuff somewhere else (maybe in a TOR switch?).

(1) Geoff, I wish Seagate *would* speak for themselves. Those coordination problems are non-trivial, and the claims Seagate is making about using Kinetic without servers (e.g. https://developers.seagate.com/display/KV/Distributed+File+Systems) are just total hand-waving. Even if we accept your implied appeal to authority, that authority isn’t doing a very good job sharing what must be some very special knowledge. If the particular people at Seagate involved in this project really did know about these coordination problems, they’d be trying to educate people about how to make the thing in their picture work without so much as a conditional-put call. I’ve already explained more than they have, and I’m one guy (who’s supposed to be on sabbatical).

(2) Very cold storage and powered-off disks make the coordination problem *worse*. Let’s say your data is sharded across many of these drives, and you want to set a key. You *need to know* if it already exists on one of those powered-off drives, or else you risk creating inconsistent copies. Therefore you either need to wake up those drives to check (defeating the original purpose), or you need redundant information about which keys are where, which means you need more coordination.

(3) If you acknowledge the need to have some “smart stuff” anywhere, then you’ve already improved on Seagate’s flawed vision, and putting that inside a switch isn’t any better than the server effectively *being* the switch. Remember, we’re talking about something that might be the size of a credit card and consume barely more power than a single drive. How is it better to hang multiple 1Gb/s Kinetic drives off a server embedded in a switch than to hang multiple 6Gb/s SATA drives off a server that’s closer to what already exists? Alternatively, you could go all pNFS and allow clients to access the drives directly after receiving permission/directions, but that would require a far better security model than Kinetic has so you’re boxed out again.

Basically Seagate’s picture of a system without servers *doesn’t work* with their current functionality. They claim that their new interface streamlines system design, but it’s not so. A system broadly similar to this could work, but that would require significant evolution from what has been announced so far. IMO it’s irresponsible of them to make such grandiose claims based on so little actual substance.
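The conditional-put primitive this argument turns on can be sketched in a few lines. This is an illustrative model of the semantics being asked for, not a real Kinetic API.

```python
# Sketch of a conditional put: a write succeeds only if the caller's idea
# of the key's current version matches the store's. Without a primitive
# like this, two racing writers can silently create divergent copies.
# All names here are illustrative assumptions.

class VersionMismatch(Exception):
    pass

class VersionedStore:
    def __init__(self):
        self.data = {}  # key -> (version, value); version 0 means "absent"

    def conditional_put(self, key, value, expected_version):
        current = self.data.get(key, (0, None))[0]
        if current != expected_version:
            raise VersionMismatch("expected %d, saw %d" % (expected_version, current))
        self.data[key] = (current + 1, value)
        return current + 1

store = VersionedStore()
v1 = store.conditional_put(b"k", b"writer-A", expected_version=0)  # succeeds

# Writer B raced with a stale version: it is rejected instead of
# silently overwriting writer A's copy.
try:
    store.conditional_put(b"k", b"writer-B", expected_version=0)
    outcome = "overwrote"
except VersionMismatch:
    outcome = "rejected"
```

Writer B can then re-read the key, see version 1, and decide how to reconcile, which is the coordination step a bare put cannot provide.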

Disclaimer

I work for Verizon. The opinions expressed here are my own views and do not represent Verizon or any of my past employers.

This blog is focussed on technical and business aspects of cloud computing; if you prefer to read about philosophy, atheism, sports, or politics, you might want to check out my personal blog, geoffarnold.com, or my Facebook page.