Accounts

October 25, 2013

Seagate Kinetic Storage — In the words of Geoff Arnold: The physical interconnect to the disk drive is now Ethernet. The interface is a simple key-value object oriented access scheme, implemented using Google Protocol Buffers. It supports key-based CRUD (create, read, update and delete); it also implements third-party transfers (“transfer the objects with keys X, Y and Z to the drive with IP address 1.2.3.4”). Configuration is based on DHCP, and everything can be authenticated and encrypted. The system supports a variety of key schemas to make it easy for various storage services to shard the data across multiple drives.
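The drive's real protocol is defined with Google Protocol Buffers over Ethernet; the following is only a rough sketch of the access model it describes — key-based CRUD plus third-party transfer. The class and method names here are mine, not the actual Kinetic API.

```python
# Toy model of a Kinetic-style drive: a key-value store reachable at an
# address, supporting CRUD and third-party transfers to another drive.

class KineticDrive:
    def __init__(self, address):
        self.address = address      # e.g. an IP address obtained via DHCP
        self.store = {}             # key (bytes) -> value (bytes)

    def put(self, key, value):      # create / update
        self.store[key] = value

    def get(self, key):             # read
        return self.store.get(key)

    def delete(self, key):
        self.store.pop(key, None)

    def transfer(self, keys, target):
        """Third-party transfer: push objects with these keys to another drive."""
        for k in keys:
            if k in self.store:
                target.put(k, self.store[k])

a = KineticDrive("1.2.3.1")
b = KineticDrive("1.2.3.4")
a.put(b"user:42", b"profile-bytes")
a.transfer([b"user:42"], b)         # "transfer key X to drive 1.2.3.4"
print(b.get(b"user:42"))
```

The interesting part is `transfer`: the drives move data between themselves, without a host shuttling bytes in the middle.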

Masters of Their Universe (Guardian) — well-written and fascinating story of the creation of the Elite game (one founder of which went on to make the Raspberry Pi). The classic action game of the early 1980s – Defender, Pac Man – was set in a perpetual present tense, a sort of arcade Eden in which there were always enemies to zap or gobble, but nothing ever changed apart from the score. By letting the player tool up with better guns, Bell and Braben were introducing a whole new dimension, the dimension of time.

October 4, 2013

When I was in college, a friend of mine told me he liked to take his code out for a walk every now and then. By that he meant recompiling and running all of his programs, say once a week. I asked him why he would want to do that. If a program compiled and ran the last time you touched it, why shouldn’t it compile and run now? He simply said I might be surprised.

The only way to archive digital information is to keep it moving. I call this #movage instead of #storage. Proper movage means (...)

February 21, 2013

We wanted to give you a brief update on what we’ve learned so far from our series of interviews with players and practitioners in the in-memory data management space. A few preliminary themes have emerged, some expected, others surprising.

Performance improves as you put data as close to the computation as possible. We talked to people in systems, data management, web applications, and scientific computing who have embraced this concept. Some solutions go to the lowest levels of the hardware (L1 and L2 cache). The next generation of SSDs will have latency closer to that of main memory, potentially blurring the distinction between storage and memory. For performance and power-consumption reasons, we can imagine a future where the primary way systems are sized is by the amount of non-volatile memory* deployed.

Putting data in-memory does not negate the importance of distributed computing environments. Data size and the ability to leverage parallel environments are frequently cited reasons. The same characteristics that make distributed environments compelling also apply to in-memory systems: fault tolerance and parallelism for performance. An additional consideration is the ability to gracefully spill over to disk when main memory is full.
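None of the vendors we spoke with described their spillover mechanics in detail, but the idea can be illustrated with a toy two-tier store: a bounded in-memory tier that evicts least-recently-used entries to a "disk" tier instead of dropping them.

```python
from collections import OrderedDict

# Toy illustration of graceful spillover: a bounded hot tier that evicts
# LRU entries to a cold tier (a dict standing in for disk) when full.

class SpilloverStore:
    def __init__(self, mem_capacity):
        self.mem = OrderedDict()   # hot tier, kept in LRU order
        self.disk = {}             # stand-in for on-disk storage
        self.capacity = mem_capacity

    def put(self, key, value):
        self.mem[key] = value
        self.mem.move_to_end(key)
        while len(self.mem) > self.capacity:
            cold_key, cold_val = self.mem.popitem(last=False)
            self.disk[cold_key] = cold_val       # spill, don't lose

    def get(self, key):
        if key in self.mem:
            self.mem.move_to_end(key)            # refresh recency
            return self.mem[key]
        return self.disk.get(key)                # slower path, still correct

store = SpilloverStore(mem_capacity=2)
for k in ("a", "b", "c"):
    store.put(k, k.upper())
print(sorted(store.mem), sorted(store.disk))     # ['b', 'c'] ['a']
```

The key property is that exceeding memory degrades performance rather than correctness: `get("a")` still works, it just goes to the cold tier.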

There is no general-purpose solution that can deliver optimal performance for all workloads. The drive for low latency requires different strategies depending on write or read intensity, fault tolerance, and consistency. Database vendors we talked with have different approaches for transactional and analytic workloads, in some cases integrating in-memory technology into existing or new products. People who specialize in write-intensive systems identify hot data (i.e., frequently accessed data) and keep it in memory.

Hadoop has emerged as an ingestion layer and the place to store data you might use. The next layer identifies and extracts high-value data that can be stored in-memory for low-latency interactive queries. Due to resource constraints of main memory, using columnar stores to compress data becomes important to speed I/O and store more in a limited space.
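One reason columnar layouts compress so well is that the values in a single column are homogeneous and often repetitive, so even a trivial scheme like run-length encoding shrinks them dramatically. A minimal sketch:

```python
# Run-length encode one column of a table: consecutive repeats collapse
# into (value, count) pairs. Real columnar stores use fancier codecs
# (dictionary, delta, bit-packing), but the principle is the same.

def rle(column):
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# 12 values in the column, but only 3 runs to store:
status_column = ["ok"] * 6 + ["error"] * 2 + ["ok"] * 4
print(rle(status_column))   # [['ok', 6], ['error', 2], ['ok', 4]]
```

In a row-oriented layout the same "ok" values would be interleaved with other fields and the runs would disappear, which is why the column orientation itself is what buys the compression.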

While it may be difficult to make in-memory systems completely transparent, the people we talked with emphasized programming interfaces that are as simple as possible.

Our conversations to date have revealed a wide range of solutions and strategies. We remain excited about the topic, and we’re continuing our investigation. If you haven’t yet, feel free to reach out to us on Twitter (Ben is @BigData and Roger is @rogerm) or leave a comment on this post.

* By non-volatile memory we mean the next-generation SSDs. In the rest of the post “memory” refers to traditional volatile main memory.

August 20, 2012

It wasn’t enough for Dr. George Church to help Gilbert “discover” DNA sequencing 30 years ago, create the foundations for genomics, create the Personal Genome Project, drive down the cost of sequencing, and start humanity down the road of synthetic biology. No, that wasn’t enough.

He and his team decided to publish an easily understood scientific paper (“Next-Generation Digital Information Storage in DNA”) that promises to change the way we store and archive information. While this technology may take years to perfect, it provides a roadmap toward an energy-efficient, archival storage medium with a host of built-in advantages.

The paper demonstrates the feasibility of using DNA as a storage medium with a theoretical capacity of 455 exabytes per gram. (An exabyte is 1 million terabytes.) Now, before you throw away your massive RAID 5 cluster and purchase a series of sequencing machines, know that DNA storage appears to be very high latency. Also know that Church, Yuan Gao, and Sriram Kosuri are not yet writing 455 exabytes of data; they’ve started with a more modest goal of writing Church’s recent book on genomics to a 5.27-megabit “bitstream.” Here’s an excerpt from the paper:

We converted an html-coded draft of a book that included 53,426 words, 11 JPG images and 1 JavaScript program into a 5.27 megabit bitstream. We then encoded these bits onto 54,898 159nt oligonucleotides (oligos) each encoding a 96-bit data block (96nt), a 19-bit address specifying the location of the data block in the bit stream (19nt), and flanking 22nt common sequences for amplification and sequencing. The oligo library was synthesized by ink-jet printed, high-fidelity DNA microchips. To read the encoded book, we amplified the library by limited-cycle PCR and then sequenced on a single lane of an Illumina HiSeq.

If you know anything about filesystems, this is an amazing paragraph. They’ve essentially defined a new standard for filesystem inodes on DNA. Each 96-bit block has a 19-bit descriptor. They then read this DNA bitstream by using something called Polymerase Chain Reaction (PCR). This is important because it means that reading this information involves generating millions of copies of the data in a format that has been proven to be durable. This biological “backup system” has replication capabilities “built-in.” Not just that, but this replication process has had billions of years of reliability data available.
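The block-plus-address scheme from the excerpt is simple enough to sketch. The paper encodes one bit per base (A or C for 0, G or T for 1), choosing between the two options to avoid problematic repeats; I've omitted the two 22-nt flanking amplification sequences, so these toy oligos are 115 nt rather than the paper's 159 nt.

```python
import random

# Sketch of the paper's addressing scheme: split a bitstream into 96-bit
# data blocks, prefix each with a 19-bit address, and map every bit to a
# nucleotide (A/C = 0, G/T = 1, picked pseudo-randomly to avoid repeats).

BLOCK_BITS, ADDR_BITS = 96, 19

def to_oligos(bits, rng=random.Random(0)):
    oligos = []
    n_blocks = (len(bits) + BLOCK_BITS - 1) // BLOCK_BITS
    for addr in range(n_blocks):
        block = bits[addr * BLOCK_BITS:(addr + 1) * BLOCK_BITS]
        addressed = format(addr, f"0{ADDR_BITS}b") + block
        oligos.append("".join(
            rng.choice("AC" if b == "0" else "GT") for b in addressed))
    return oligos

data = "01" * 96                      # 192 bits -> two addressed blocks
oligos = to_oligos(data)
print(len(oligos), len(oligos[0]))    # 2 115
```

Because every oligo carries its own address, the "file" can be reassembled no matter what order the sequencer reads the molecules back in, which is exactly the inode-like property noted above.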

While this technology may only be practical for long-term storage and high-latency archival purposes, you can already see that this paper makes a strong case for the viability of this approach. Of all biological storage media, this work has demonstrated the longest bit stream and is built atop a set of technologies (DNA sequencing) that have been focused on repeatability and error correction for decades.

In addition, DNA storage has advantages over tape or hard drives — a steady-state storage cost of zero, a lifetime that far exceeds that of magnetic storage, and very small space requirements.

If you have a huge amount of data that needs to be archived, the advantages of DNA as a storage medium (once the technology matures) could quickly translate to significant cost savings. Think about the energy requirements of a data center that needs to store and archive an exabyte of data. Compare that to the cost of maintaining a sequencing lab and a few Petri dishes.

For most of us, this reality is still science fiction, but Church’s work makes it less so every day. Google is uniquely positioned to realize this technology. It has already been established that Google’s founders pay close attention to genomics. They invested an unspecified amount in Church’s Personal Genome Project (PGP) in 2008, and they have invested in a company much closer to home: 23andMe. Google also has a large research arm focused on energy savings and efficiency, with scientists like Urs Hölzle looking for new ways to get more out of the energy that Google spends to run data centers.

If this technology points the way to the future of very high latency, archival storage, I predict that Google will lead the way in implementation. It is the perfect convergence of massive data and genomics, and just the kind of dent that company is trying to make in the universe.

April 11, 2012

The default approach to most complex problems is to engineer a complex solution. We see this in IT generally, and in cloud computing specifically. Experience with large-scale systems has taught us, however, that this tendency is misguided: simpler solutions are best for solving complex problems. When developers write code, they talk about "elegant code," meaning a concise, simple solution to a complex coding problem.

In this article, I hope to provide further clarity around what I mean by simple solutions and how they differ from more complex ones.

Simple vs. complex

Understanding simplicity seems ... well, simple. Simple systems have fewer parts. Think of a nail. It's usually a single piece of steel with one end flattened, and it does just one thing, so not much can go wrong. A hammer is a slightly more complex, yet still simple, tool. It might be made of a couple of parts, but it has one function and not much can go wrong. In comparison, a power drill is a significantly more complex tool, and computers are far, far more complex. As the number of parts in a system grows, and as the number of functions it provides grows, so does its complexity.

Related to this phenomenon is the simplicity or complexity of the parts themselves. Simple parts can be assembled into complex systems that remain reliable precisely because each part is simple, whereas assembling complex systems from complex parts leads toward fragile, brittle Rube Goldberg contraptions.

Complexity kills scalable systems. If you want higher uptime and reliability in your IT system, you want smaller, simpler systems that fail less often because they have fewer, simpler parts.

An example: storage

A system that is twice as complex as another isn't just twice as likely to fail; it's four times as likely to fail. To illustrate this and drive home some points, I'm going to compare direct-attached storage (DAS) and storage-area network (SAN) technologies. DAS is a very simple approach to IT storage. It has fewer features than SAN, but it can also fail in fewer ways.

In the cloud computing space, some feel that one of Amazon Web Services' (AWS) weaknesses is that it provides only DAS by default. To counter this, many competitors run SAN-based cloud services only, taking on the complexity of SAN-based storage as a bid for differentiation. Yet AWS remains the leader in every regard in cloud computing, mainly because it sticks to a principle of simplicity.

If we look at DAS versus SAN and trace the path data takes when written by an application running "in the cloud," it would look something like this figure:

(A quick aside: Please note that I have left out all kinds of components, such as disk drive firmware, RAID controller firmware, complexities in networking/switching, and the like. All of these would count here as components.)

A piece of data written by the application running inside of the guest operating system (OS) of a virtual server flows as follows with DAS in place:

1. The guest OS filesystem (FS) software accepts a request to write a block of data.

2. The guest OS writes it to its "disk," which is actually a virtual disk drive using the "block stack" (BS) [1] in its kernel.

3. The guest OS has a para-virtualization (PV) disk driver [2] that knows how to write the block of data directly to the virtual disk drive, which in this case is provided by the hypervisor.

4. The block is passed by the PV disk driver not to an actual disk drive but to the hypervisor's (HV) virtual block driver (VBD), which is emulating a disk drive for the guest OS.

At this point we have passed from the "virtualization" layer into the "physical" or real world.

5. Repeating the process for the hypervisor OS, we now write the block to the filesystem (FS) or possibly volume manager (VM), depending on how your virtualization was configured.

6. Again, the block stack handles requests to write the block.

7. A disk driver (DD) writes the data to an actual disk.

8. The block is passed to a RAID controller to be written.

Obviously, laid out like this, the entire task already seems somewhat complex, but this is modern computing with many layers and abstractions. Even with DAS, if something went wrong, there are many places we might need to look to troubleshoot the issue — roughly eight, though the steps I listed are an approximation for the purpose of this article.

SAN increases the complexity significantly. Starting at step 7, we take a completely different route:

9. Instead of writing to the disk driver (DD), in a SAN system we write to a network-based block device: the "iSCSI stack," [3] which provides SCSI [4] commands over TCP/IP.

10. The iSCSI stack then sends the block to the hypervisor's TCP/IP stack to send it over the network.

11. The block is now sent "across the wire," itself a somewhat complicated process that might involve things like packet fragmentation and MTU issues, TCP window size issues/adjustments, and more — all thankfully left out here in order to keep things simple.

12. Now, at the SAN storage system, the TCP/IP stack receives the block of data.

13. The block is handed off to the iSCSI stack for processing.

14. The SAN filesystem and/or volume manager determines where to write the data.

15. The block is passed to the block stack.

16. The block is passed to the disk driver to write out.

17. The hardware RAID writes the actual data to disk.

In all, adding the SAN conservatively doubles the number of steps and moving parts involved. Each step or piece of software may be a cause of failure. Problems with tuning or performance may arise because of interactions between any two components. Problems with one component may spark issues with another. To complicate matters, troubleshooting may itself be complex: the guest OS might be Windows, the hypervisor could be Linux, and the SAN might be some other OS altogether, all with different filesystems, block stacks, iSCSI software, and TCP/IP stacks.

Complexity isn't linear

The problem, however, is not so much that there are more pieces, but that those pieces all potentially interact with each other and can cause issues. The problem is multiplicative. There are twice as many parts (or more) in a SAN, and that creates four times as many potential interactions, each of which could be a failure. The following figure shows all the steps as both rows and columns. Interactions could theoretically occur between any pair of steps. (I've blacked out the squares where a component intersects with itself because that's not an interaction between different components.)

This diagram is a matrix of the combinations, where we assume that the hypervisor RAID in my example above isn't part of the SAN solution. The lighter-colored quadrant in the upper left is the number of potential interactions or failure points for DAS, and the entire diagram depicts those for SAN. Put more in math terms, there are N * (N-1) possible interactions/failures. With this DAS example, that means there are 8 * (8-1) or 56. For the SAN, it's 240 (16 * (16-1)) minus the hypervisor RAID (16) for 224 — exactly four times as many potential areas of problems or interactions that may cause failures.
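The counting in that paragraph can be checked in a few lines:

```python
# Ordered pairs of distinct components: N * (N - 1) potential interactions.

def interactions(n):
    return n * (n - 1)

das = interactions(8)                 # the 8-step DAS path
san = interactions(16) - 16           # minus the hypervisor-RAID row/column
print(das, san, san // das)           # 56 224 4
```

Doubling the component count quadruples the interaction count, which is the article's central point about complexity being multiplicative rather than additive.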

How things fail and how uptime calculations work

To be certain, each of these components and interactions has a varying chance of failure. Some are less likely to fail than others. The problem is that calculating your likely uptime works just like the matrix: the effect is multiplicative, not additive. If you want to predict the average uptime of a system with two components that each offer 99% uptime, it's 99% × 99% ≈ 98% uptime.

If every component of our DAS or SAN system is rated for "five 9s" (99.999%) uptime, our calculation is as follows:
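Treating the eight DAS components and sixteen SAN components as a serial chain, a quick sketch of the arithmetic:

```python
# Serial availability multiplies: each additional component at 99.999%
# shaves a little more off the system's expected uptime.

def availability(n_components, per_component=0.99999):
    return per_component ** n_components

print(f"{availability(8):.5f}")    # DAS path, 8 components  -> 0.99992
print(f"{availability(16):.5f}")   # SAN path, 16 components -> 0.99984
```

Every component in the chain must be up for the system to be up, which is why the per-component availabilities multiply rather than average.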

The point here is not that DAS is "four 9s" or SAN is "five 9s," but that by adding more components, we have actually reduced our likely uptime. Simpler solutions are more robust because there are fewer pieces to fail. We have lost a full "9" by doubling the number of components in the system.

An anecdote may bring this home. Very recently, we were visited by a potential customer who described a storage issue. They had moved from a DAS solution to a SAN solution from a major enterprise vendor. Two months after this transition, they had a catastrophic failure. The SAN failed hard for three days, bringing down their entire cloud. A "five nine" enterprise solution actually provided "two nines" that year. In fact, if you tracked uptime across year boundaries (most don't), this system would have to run without a single failure for 10-plus years to come even close to what most consider a high uptime rating.

It's worth noting here that another advantage of a DAS system is smaller "failure domains." In other words, a DAS system failure affects only the local server, not a huge swath of servers as happened with my anecdote above. This is a topic I plan to cover in detail in future articles, as it's an area that I think is also not well understood.

Large-scale cloud operators are the new leaders in uptime

Once upon a time, we all looked to the telephone system as an example of a "high uptime" system. You picked up the phone, got a dial tone, and completed your call. Nowadays, though, as the complexity of networks has increased while moving to a wireless medium, carrier uptimes for wireless networks have gone down. As data volumes increase, carrier uptimes have been impacted even further.

The new leaders in uptime and availability are the world's biggest Internet and cloud operators: Google, Facebook, Amazon, eBay, and even Microsoft. Google regularly sees four nines of uptime globally while running one of the world's largest networks, the largest plant of compute capability ever, and some of the largest datacenters in the world. This is not a fluke. It's because Google and other large clouds run simple systems. These companies have reduced complexity in every dimension in order to scale up. This higher availability also comes while moving away from or eliminating complex enterprise solutions, running datacenters at extreme temperatures, and otherwise playing by new rules.

Wrapping up

What have we learned here? Simple scales. What this means for businesses is that IT systems stay up longer, have better resiliency in the face of failures, cost less to maintain and run, and are just plain better when simplified. Buying a bigger, more complex solution from an enterprise vendor is quite often a way to reduce your uptime. This isn't to say that this approach is always wrong. There are times when complex solutions make sense, but stacking them together creates a multiplicative effect. Systems that are simple throughout work best. Lots of simple solutions with some complexity can also work. Lots of complex solutions means lots of failures, lots of operational overhead, and low uptime.

[1] A "block stack" is a term I use in this article to represent the piece of kernel software that manages block devices. It’s similar in concept to the "TCP/IP stack" used to manage the network for a modern operating system. There is no standard name for the "block stack" as there is for the networking software stack.

[2] Para-virtualization drivers are a requirement in all hardware virtualization systems that allow for higher performance disk and network I/O. They look like a standard operating system driver for a disk, but understand how to talk to the hypervisor in such a way as to increase performance.

[3] In this case, I use iSCSI as the SAN, but AoE, FC, or FCoE are all similar in nature.

[4] The observant reader will note that at the guest layer, we might be writing to a virtual SATA disk, then using SCSI commands over TCP/IP, and finally writing via SCSI or SATA. Each of these protocols potentially has its own issues and tuning requirements. For simplicity, I've left the details of the challenges there out of these examples.

September 1, 2011

Here are a few of the data stories that caught my attention this week.

IBM's record-breaking data storage array

IBM Research is building a new data storage array that's almost 10 times larger than anything built before. The array comprises 200,000 hard drives working together, with a storage capacity of 120 petabytes — that's 120 million gigabytes. To give you some idea of the capacity of the new "drive," writes MIT Technology Review, "a 120-petabyte drive could hold 24 billion typical five-megabyte MP3 files or comfortably swallow 60 copies of the biggest backup of the Web, the 150 billion pages that make up the Internet Archive's WayBack Machine."

Data storage at that scale creates a number of challenges, including — no surprise — cooling such a massive system. But other problems include handling failure, backups and indexing. The new storage array will benefit from other research that IBM has been doing to help boost supercomputers' data access. Its General Parallel File System was designed with this massive volume in mind. The GPFS spreads files across multiple disks so that many parts of a file can be read or written at once. This system already demonstrated that it can perform when it set a new scanning speed record last month by indexing 10 billion files in just 43 minutes.
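The striping idea behind GPFS can be sketched in a few lines. This is a toy model, not GPFS itself: a file is cut into fixed-size chunks spread round-robin across disks, so many chunks can be read or written at once.

```python
# Toy file striping: chunks distributed round-robin across disks.
# Each chunk keeps its index so the file can be reassembled in order.

def stripe(data, n_disks, chunk_size):
    disks = [[] for _ in range(n_disks)]
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for i, chunk in enumerate(chunks):
        disks[i % n_disks].append((i, chunk))   # (chunk index, payload)
    return disks

def reassemble(disks):
    chunks = sorted(c for disk in disks for c in disk)
    return b"".join(payload for _, payload in chunks)

disks = stripe(b"abcdefghij", n_disks=3, chunk_size=2)
print(reassemble(disks) == b"abcdefghij")   # True
```

With the chunks spread out, a read of the whole file can hit all three disks in parallel, which is where the aggregate bandwidth of a 200,000-drive array comes from.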

IBM's new 120-petabyte drive was built at the request of an unnamed client that needed a new supercomputer for "detailed simulations of real-world phenomena."


Infochimps' new Geo API

The data marketplace Infochimps released a new Geo API this week, giving developers access to a number of disparate location-related datasets via one API with a unified schema.

According to Infochimps, the API addresses several pain points that those working with geodata face:

Difficulty in integrating several different APIs into one unified app

Lack of ability to display all results when zoomed out to a large radius

Limitation of only being able to use lat/long

To address these issues, Infochimps has created a new simple schema to help make data consistent and unified when drawn from multiple sources. The company has also created a "summarizer" to intelligently cluster and better display data. And finally, it has also enabled the API to handle queries other than just those traditionally associated with geodata, namely latitude and longitude.
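Infochimps hasn't published the summarizer's internals, but a common approach to this clustering problem is grid-based bucketing: snap each point to a grid cell sized to the current zoom level and report one count per cell instead of every raw point. A sketch under that assumption (function names are mine):

```python
from collections import Counter

# Grid-based point summarization: bucket (lat, lon) points into cells of
# cell_deg degrees and count per cell, so a zoomed-out map shows clusters
# rather than thousands of overlapping markers.

def summarize(points, cell_deg):
    cells = Counter()
    for lat, lon in points:
        cells[(int(lat // cell_deg), int(lon // cell_deg))] += 1
    return cells

points = [(40.71, -74.01), (40.72, -74.05), (34.05, -118.24)]
cells = summarize(points, cell_deg=1.0)
print(cells)   # the two New York points collapse into a single cell
```

Shrinking `cell_deg` as the user zooms in progressively splits the clusters back into individual points.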

As we seek to pull together and analyze all types of data from multiple sources, this move toward a unified schema will become increasingly important.

Hurricane Irene and weather data

The arrival of Hurricane Irene last week reiterated the importance not only of emergency preparedness but of access to real-time data — weather data, transportation data, government data, mobile data, and so on.

We've been through hurricanes before. What's different about this one is the unprecedented levels of connectivity that now exist up and down the East Coast. According to the most recent numbers from the Pew Internet and Life Project, for the first time, more than 50% of American adults use social networks. 35% of American adults have smartphones. 78% of American adults are connected to the Internet. When combined, those factors mean that we now see earthquake tweets spread faster than the seismic waves themselves. The growth of an Internet of things is an important evolution. What we're seeing this weekend is the importance of an Internet of people."

May 12, 2011

Big data creates a number of storage and processing challenges for developers — efficiency, complexity, cost, among others. London-based data storage startup Acunu is tackling these issues by re-engineering the data stack and taking a new approach to disk storage.

In the following interview, Acunu CEO Tim Moreton discusses the new techniques and how they might benefit developers.

Why do we need to re-engineer the data stack?

Tim Moreton: New workloads mean we must collect, store and serve large volumes of data quickly and cheaply. This poses two challenges. The first is a distributed systems challenge: How do you scale a database across many cheap commodity machines, and deal with replication, nodes failing, etc? There are now many tools that provide a good answer to this — Apache Cassandra is one. Then, the second challenge is once you've decided on the node in the cluster where you're going to read or write some data, how do you do that efficiently? That's the challenge we're trying to solve.

Most distributed databases see it as outside their domain to solve this problem: they support pluggable storage backends, and often use embedded tools like Berkeley DB. Cassandra and HBase go further and implement their own storage engines based on Google's BigTable — these amount to file systems that run in userspace as part of their Java codebases.

The problem is that underneath any of these sits a storage stack that hasn't changed much over 20 years. The workloads look different from 20 years ago, and the hardware looks very different. So, we built the Acunu Storage Core, an open-source Linux kernel module that contains optimizations and data structures that let you make better use of the commodity hardware that you already have.

It offers a new storage interface, where keys have any number of dimensions, and values can be very small or very large. Whole ranges can be queried, and large values streamed in and out. It's designed to be just general-purpose enough to model simple key-value stores, BigTable data models like Cassandra's, Redis-style data structures, graphs, and others.
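This isn't the Acunu Storage Core itself (which is a kernel module built around more sophisticated data structures), but the interface it describes — ordered keys with efficient whole-range queries — can be sketched with a sorted array and binary search:

```python
import bisect

# Sketch of a range-queryable key-value store: keys kept sorted so that
# "whole ranges can be queried" without scanning the entire keyspace.

class SortedKV:
    def __init__(self):
        self.keys, self.values = [], []

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value              # update in place
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def range(self, lo, hi):
        """All (key, value) pairs with lo <= key < hi, in key order."""
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_left(self.keys, hi)
        return list(zip(self.keys[i:j], self.values[i:j]))

kv = SortedKV()
for k in ("user:1", "user:2", "user:3", "widget:9"):
    kv.put(k, k.upper())
print(kv.range("user:", "user;"))   # every "user:" key, in order
```

Keeping keys ordered is what makes BigTable-style data models (and Cassandra's) cheap to scan by prefix, at the cost of more work on insert than a plain hash table.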


Why would big data stores need versioning?

Tim Moreton: There are many possible reasons, but we're focusing on two. The first is whole-cluster backup. Service outages like Amazon's, and Google having to restore some Gmail data from tape, remind us that just because our datasets may be different, backup can still be pretty important. Acunu takes snapshots at intervals across a whole cluster and you can copy these "checkpoints" off the cluster with little impact on your cluster's performance. Or, if you mess something up, you can roll back a Cassandra ColumnFamily to a previous point in time.

Speeding up your dev/test cycle is the second reason for versioning. Say you have a Cassandra application serving real users. If you want to develop a new feature in your app that changes what data you store or how you use it, how do you know it's going to work? Most people have separate test clusters and craft test data; others experiment to see if it works on a small portion of their users. Our versioning lets you take a clone of your production ColumnFamily and give it to a developer or automated test run. We're working on making sure these clones are entirely isolated from the production version so whatever you do to it, you won't affect your real users. This lets you try out new code on the whole dataset. When you're confident your code works, you can throw the clone away. This speeds up the dev cycle and reduces the risks of putting new code into production.
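Moreton doesn't spell out the mechanism here, but clones this cheap are typically copy-on-write: the clone shares the parent's data until it is written to, so taking one is nearly free and writes never touch production. A minimal sketch (names are illustrative, not Acunu's):

```python
# Copy-on-write clone: reads fall through to the parent snapshot,
# writes land in a private delta, and production is never modified.

class Clone:
    def __init__(self, parent):
        self.parent = parent      # read-only snapshot of production data
        self.delta = {}           # this clone's private writes

    def get(self, key):
        return self.delta.get(key, self.parent.get(key))

    def put(self, key, value):
        self.delta[key] = value   # parent stays untouched

production = {"user:1": "alice"}
test_clone = Clone(production)
test_clone.put("user:1", "mallory")      # experiment freely on the clone
print(test_clone.get("user:1"), production["user:1"])  # mallory alice
```

Throwing the clone away is just dropping the delta, which is why the dev/test cycle described above gets so much faster.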

What kinds of opportunities do you see this speed boost creating?

Tim Moreton: The decisions around what data gets collected and analyzed are often economic. Cassandra and Hadoop help to make new data problems tractable, but we can do more.

In concrete terms, if you have a Cassandra cluster, and you're continuously collecting lots of log entries or sensor data, and you want to do real-time analytics on that, then our benchmarking shows that Acunu delivers those results up to 50 times faster than vanilla Cassandra. That means you can process 50 times the amount of data, or work at greatly increased detail, or do the same work while buying and managing much less hardware. And this is comparing Acunu against Cassandra, which is in our view the best-of-breed datastore for these types of workloads.

Do you plan to implement speedups for other database systems?

Tim Moreton: Absolutely. Although the first release focuses on Cassandra and an S3-compatible store, we have already ported Voldemort and memcached. The Acunu Storage Core and its language bindings will be open source, and we are actively working with developers on several other databases. Cassandra already gives us good support for a lot of the Hadoop framework. HBase is on the cards, but it's a trickier architectural fit since it sits above HDFS.

You'll be able to interoperate between these various databases. For example, if you have an application that uses memcached, you can read and write the same data that you access with Cassandra — perhaps ingesting it with Flume, then processing it with Cassandra's Hadoop or Pig integrations. We plan to let people use the right tools and interfaces for the job, but without having to move or transform data between clusters.

July 16, 2010

GPL WordPress Theme Angst -- a podcaster brought together Matt Mullenweg (creator of WordPress), and Chris Pearson (creator of the Thesis theme). Chris doesn't believe WordPress's GPL should be inherited by themes. Matt does, and the SFLC and others agree. The conversation is interesting because (a) they and the podcaster do a great job of keeping it civil and on-track and purposeful, and (b) Chris is unswayed. Chris built on GPLed software without realizing it, and is having trouble with the implications. Chris's experience, and feelings, and thought processes, are replicated all around the world. This is like a usability bug for free software. (via waxpancake on Twitter)

480G SSD Drive -- for a mere $1,599.99. If you wonder why everyone's madly in love with parallel, it's because of this order-of-magnitude+ difference in price between regular hard drives and the Fast Solution. Right now, the only way to rapidly and affordably crunch a ton of data is to go parallel. (via marcoarment on Twitter)

Pandas and Lobsters: Why Google Cannot Build Social Software -- this resonates with me. The primary purpose of a social application is connecting with others, seeing what they're up to, and maybe even having some small, fun interactions that though not utilitarian are entertaining and help us connect with our own humanity. Google apps are for working and getting things done; social apps are for interacting and having fun. Read it for the lobster analogy, which is gold.

Wayfinder -- The majority of all the location and navigation related software developed at Wayfinder Systems, a fully owned Vodafone subsidiary, is made available publicly under a BSD licence. This includes the distributed back-end server, tools to manage the server cluster and map conversion as well as client software for e.g. Android, iPhone and Symbian S60. Technical documentation is available in the wiki and discussions around the software are hosted in the forum. Interesting, and out of the blue. At the very least, there's some learning to be done by reading the server infrastructure. (via monkchips on Twitter)

Bacteria-Powered Micro-Machines -- A few hundred bacteria are working together in order to turn the gear. When multiple gears are placed in the solution with the spokes connected like in a clock, the bacteria will begin turning both gears in opposite directions and it will cause the gears to rotate in synchrony for a long time. Video embedded below (via BoingBoing)