I’ve worked in or near the database engine world for more than 25 years. And, ironically, every company I’ve ever worked at has been working on a massive-scale, parallel, clustered RDBMS system. The earliest variant was IBM DB2 Parallel Edition, released in the mid-90s. It’s now called the Database Partitioning Feature.

Massive, multi-node parallelism is the only way to scale a relational database system, so these systems can be incredibly important. Very high-scale MapReduce systems are an excellent alternative for many workloads, but some customers and workloads want the flexibility and power of being able to run ad hoc SQL queries against petabyte-sized databases. These are the workloads targeted by massive, multi-node relational database clusters, and there are now many solutions available, with Oracle RAC perhaps the most well-known, alongside Vertica, Greenplum, Aster Data, ParAccel, Netezza, and Teradata.

What’s common across all these products is that big databases are very expensive. Today, that is changing with the release of Amazon Redshift. It’s a relational, column-oriented, compressed, shared-nothing, fully managed, cloud-hosted data warehouse. Each node can store up to 16TB of compressed data, and up to 100 nodes are supported in a single cluster.

Amazon Redshift manages all the work needed to set up, operate, and scale a data warehouse cluster, from provisioning capacity to monitoring and backing up the cluster, to applying patches and upgrades. Scaling a cluster to improve performance or increase capacity is simple and incurs no downtime. The service continuously monitors the health of the cluster and automatically replaces any component, if needed.

The core node on which Redshift clusters are built includes 24 disk drives with an aggregate capacity of 16TB of local storage. Each node has 16 virtual cores and 120GB of memory and is connected via a high-speed, non-blocking 10Gbps network. This is a meaty node, and Redshift supports up to 100 of them in a single cluster.

There are many pricing options available (see http://aws.amazon.com/redshift for more detail), but the most favorable comes in at only $999 per TB per year. I find it amazing to think of having the services of an enterprise-scale data warehouse for under a thousand dollars per terabyte per year. And, this is a fully managed system, so much of the administrative load is taken care of by Amazon Web Services.

Fast and Powerful – Amazon Redshift uses a variety of innovations to obtain very high query performance on datasets ranging in size from hundreds of gigabytes to a petabyte or more. First, it uses columnar storage and data compression to reduce the amount of I/O needed to perform queries. Second, it runs on hardware that is optimized for data warehousing, with locally attached storage and 10GigE network connections between nodes. Finally, it has a massively parallel processing (MPP) architecture, which enables you to scale up or down, without downtime, as your performance and storage needs change.

You have a choice of two node types when provisioning your own cluster: an extra large node (XL) with 2TB of compressed storage or an eight extra large node (8XL) with 16TB of compressed storage. You can start with a single XL node and scale up to a 100-node 8XL cluster. XL clusters can contain 1 to 32 nodes, while 8XL clusters can contain 2 to 100 nodes.

Scalable – With a few clicks of the AWS Management Console or a simple API call, you can easily scale the number of nodes in your data warehouse to improve performance or increase capacity, without incurring downtime. Amazon Redshift enables you to start with a single 2TB XL node and scale up to a hundred 16TB 8XL nodes for 1.6PB of compressed user data. Resize functionality is not available during the limited preview but will be available when the service launches.
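
As a rough sketch of how little is involved, here is what provisioning and then resizing a cluster looks like with the modern boto3 SDK (which postdates this post); the node type name, credentials, and cluster sizes are illustrative assumptions rather than a definitive recipe.

```python
# Minimal sketch: provision a Redshift cluster, then resize it with one call.
# Node type name, credentials, and sizes below are illustrative assumptions.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Provision a 4-node cluster of extra large (XL) nodes.
redshift.create_cluster(
    ClusterIdentifier="my-warehouse",
    NodeType="dw.hs1.xlarge",        # assumed XL node type name
    ClusterType="multi-node",
    NumberOfNodes=4,
    DBName="analytics",
    MasterUsername="admin",
    MasterUserPassword="ChangeMe123!",
)

# Later, scale the same cluster out to 8 nodes with a single API call.
redshift.modify_cluster(
    ClusterIdentifier="my-warehouse",
    NumberOfNodes=8,
)
```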

Inexpensive – You pay very low rates and only for the resources you actually provision. You benefit from the option of On-Demand pricing with no up-front or long-term commitments, or even lower rates via our reserved pricing option. On-demand pricing starts at just $0.85 per hour for a two terabyte data warehouse, scaling linearly up to a petabyte and more. Reserved Instance pricing lowers the effective price to $0.228 per hour, under $1,000 per terabyte per year.
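
The reserved-price claim is easy to sanity check; the short sketch below just reproduces the arithmetic from the hourly rate to the per-terabyte-per-year figure.

```python
# Back-of-envelope check: $0.228/hour for a 2TB XL node is just under
# $1,000 per terabyte per year.
hourly_rate = 0.228          # USD per hour, reserved pricing (from the post)
tb_per_node = 2              # compressed storage per XL node
hours_per_year = 24 * 365

per_tb_per_year = hourly_rate * hours_per_year / tb_per_node
print(f"${per_tb_per_year:,.2f} per TB per year")   # -> about $999
```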

Fully Managed – Amazon Redshift manages all the work needed to set up, operate, and scale a data warehouse, from provisioning capacity to monitoring and backing up the cluster to applying patches and upgrades. By handling all these time-consuming, labor-intensive tasks, Amazon Redshift frees you up to focus on your data and business insights.

Secure – Amazon Redshift provides a number of mechanisms to secure your data warehouse cluster. It currently supports SSL to encrypt data in transit, includes web service interfaces to configure firewall settings that control network access to your data warehouse, and enables you to create users within your data warehouse cluster. When the service launches, we plan to support encrypting data at rest and Amazon Virtual Private Cloud (Amazon VPC).

Reliable – Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. All data written to a node in your cluster is automatically replicated to other nodes within the cluster and all data is continuously backed up to Amazon S3. Amazon Redshift continuously monitors the health of the cluster and automatically replaces any component, as necessary.

Designed for use with other AWS Services – Amazon Redshift is integrated with other AWS services and has built-in commands to load data in parallel to each node from Amazon Simple Storage Service (S3) and Amazon DynamoDB, with support for Amazon Relational Database Service and Amazon Elastic MapReduce coming soon.
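
For example, the parallel load from S3 is a single SQL COPY statement issued against the cluster over a standard PostgreSQL connection; in the sketch below the host, table, bucket, and authorization clause are placeholders, and the exact COPY options depend on your data format.

```python
# Sketch of a parallel load from S3 into a Redshift table via the COPY
# command. Host, table, bucket, and the credentials clause are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-warehouse.example.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="ChangeMe123!",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY page_views
        FROM 's3://my-bucket/page_views/'
        CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
        GZIP DELIMITER '|';
    """)
```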

Petabyte-scale data warehouses no longer need to command retail prices upwards of $80,000 per core. You don’t have to negotiate an enterprise deal and work hard to get the 60 to 80% discount that always seems magically possible in the enterprise software world. You don’t even have to hire a team of administrators. Just load the data and get going. Nice to see.

Facebook recently released a detailed report on their energy consumption and carbon footprint: Facebook’s Carbon and Energy Impact. Facebook has always been super open with the details behind their infrastructure. For example, they invited me to tour the Prineville datacenter just prior to its opening:

Reading through the Facebook Carbon and Energy Impact page, we see they consumed 532 million kWh of energy in 2011, of which 509 million kWh went to their datacenters. High-scale data centers have fairly small daily variation in power consumption as server load goes up and down, and there is some variation due to external temperature conditions, since hot days require more cooling than chilly days. But highly efficient datacenters tend to be affected less by weather, spending only a tiny fraction of their total power on cooling. Assuming a flat consumption model, Facebook is averaging, over the course of the year, 58.07MW of total power delivered to its data centers.

Facebook reports an unbelievably good 1.07 Power Usage Effectiveness (PUE) which means that for every 1 Watt delivered to their servers they lose only 0.07W in power distribution and mechanical systems. I always take publicly released PUE numbers with a grain of salt in that there has been a bit of a PUE race going on between some of the large operators. It’s just about assured that there are different interpretations and different measurement techniques being employed in computing these numbers so comparing them probably doesn’t tell us much. See PUE is Still Broken but I Still use it and PUE and Total Power Usage Efficiency for more on PUE and some of the issues in using it comparatively.

Using the Facebook PUE number of 1.07, we know they are delivering 54.27MW to the IT load (servers and storage). We don’t know the average server draw at Facebook, but they have excellent server designs (see Open Compute Server Design), so they likely average at or below 300W per server. Since 300W is an estimate, the short sketch below also works through 250W and 400W per server.
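
This is a back-of-envelope calculation, assuming a flat consumption model and treating the per-server wattages as guesses rather than published Facebook figures:

```python
# Back-of-envelope reproduction of the estimates above: annual datacenter
# energy -> average power -> PUE-adjusted IT load -> implied server counts.
# The per-server wattages are rough guesses, not published Facebook figures.
annual_kwh = 509_000_000                 # datacenter energy consumed in 2011
hours_per_year = 24 * 365.25

avg_power_mw = annual_kwh / hours_per_year / 1000    # ~58.07 MW average
it_load_mw = avg_power_mw / 1.07                     # ~54.27 MW at PUE 1.07

for watts_per_server in (250, 300, 400):
    servers = it_load_mw * 1_000_000 / watts_per_server
    print(f"{watts_per_server}W/server -> ~{servers:,.0f} servers")
```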

As a comparative data point, Google’s data centers consume 260MW in aggregate (Google Details, and Defends, Its Use of Electricity). Google reports their PUE is 1.14, so we know they are delivering 228MW to their IT infrastructure (servers and storage). Google is perhaps the most focused in the industry on low-power servers. They invest deeply in custom designs and are willing to spend considerably more to reduce energy consumption. Estimating their average server power draw at 250W, the sketch below looks at +/-25W around that average consumption rate.
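
Again, the per-server draw is an assumption; only the 260MW total and the 1.14 PUE come from Google’s reporting:

```python
# Google: 260MW aggregate at a reported PUE of 1.14 -> ~228MW of IT load.
# Per-server draws of 225/250/275W are assumptions, not Google figures.
it_load_mw = 260 / 1.14

for watts_per_server in (225, 250, 275):
    servers = it_load_mw * 1_000_000 / watts_per_server
    print(f"{watts_per_server}W/server -> ~{servers:,.0f} servers")
```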

I find the Google and Facebook server counts interesting for two reasons. First, Google was estimated to have 1 million servers more than 5 years ago. The number may have been high at the time, but it’s very clear that they have been super focused on workload efficiency and infrastructure utilization. To grow their search and advertising businesses as much as they have without growing the server count at anywhere close to the same rate (if at all) is impressive. Continuing to add computationally expensive search features and new products and yet still being able to hold the server count near flat is even more impressive.

In a past blog entry, One Size Does Not Fit All, I offered a taxonomy of four different types of structured storage systems, argued that Relational Database Management Systems are not sufficient, and walked through some of the reasons why NoSQL databases have emerged and continue to grow market share quickly. The four database categories I introduced were: 1) features-first, 2) scale-first, 3) simple structured storage, and 4) purpose-optimized stores. RDBMSs own the first category.

DynamoDB targets workloads fitting into the scale-first and simple structured storage categories, where NoSQL database systems have been so popular over the last few years. Let’s look at these two categories in more detail.

Scale-First: Scale-first applications are those that absolutely must scale without bound, and being able to do this without restriction is much more important than more features. These applications are exemplified by very high scale web sites such as Facebook, MySpace, Gmail, Yahoo, and Amazon.com. Some of these sites actually do make use of relational databases, but many do not. The common theme across all of these services is that scale is more important than features and none of them could possibly run on a single RDBMS. As soon as a single RDBMS instance won’t handle the workload, there are two broad possibilities: 1) shard the application data over a large number of RDBMS systems, or 2) use a highly scalable key-value store.
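
As a toy illustration of option 1, the sketch below hash-shards records across a fixed set of database instances; the shard names and hash choice are arbitrary, and the hard parts (resharding, routing metadata, cross-shard queries) are exactly what this approach leaves to the application team.

```python
# Toy hash-sharding router: map each key deterministically to one of a
# fixed set of RDBMS instances. Shard names here are placeholders.
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user-42"))   # the same user always routes to the same shard
```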

And, Simple Structured Storage: There are many applications that have a structured storage requirement but really don’t need the features, cost, or complexity of an RDBMS. Nor are they focused on the scale required by the scale-first structured storage segment. They just need a simple key-value store. A file system or BLOB store is not sufficiently rich, in that simple query and index access is needed, but nothing even close to the full set of RDBMS features is needed. Simple, cheap, fast, and low operational burden are the most important requirements of this segment of the market.

The DynamoDB service is a unified, purpose-built hardware platform and software offering. The hardware is based upon a custom server design using flash storage, spread over a scalable, high-speed network joining multiple data centers.

DynamoDB supports a provisioned throughput model. A DynamoDB application programmer decides the number of database requests per second their application should be capable of supporting and DynamoDB automatically spreads the table over an appropriate number of servers. At the same time, it also reserves the required network, server, and flash memory capacity to ensure that request rate can be reliably delivered day and night, week after week, and year after year. There is no need to worry about a neighboring application getting busy or running wild and taking all the needed resources. They are reserved and there whenever needed.
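
A minimal sketch of the provisioned throughput model using the modern boto3 SDK (which postdates this post); the table name, key schema, and capacity numbers are illustrative assumptions.

```python
# Create a DynamoDB table with a declared read and write request capacity;
# the service spreads the table over however many servers that requires.
# Table name, key schema, and capacity values are illustrative.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="user-sessions",
    AttributeDefinitions=[{"AttributeName": "session_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "session_id", "KeyType": "HASH"}],
    ProvisionedThroughput={
        "ReadCapacityUnits": 5000,    # reads/second reserved for this table
        "WriteCapacityUnits": 2000,   # writes/second reserved for this table
    },
)
```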

The sharding techniques needed to achieve high request rates are well understood industry-wide, but implementing them does take some work. Reliably reserving capacity so it is always there when you need it takes yet more work. Supporting the ability to allocate more resources, or fewer, while online and without disturbing the current request rate takes still more work. DynamoDB makes all this easy. It supports online scaling from very low transaction rates up to applications requiring millions of requests per second, with no downtime and no disturbance to the currently configured application request rate while resharding. These changes are made online simply by changing the DynamoDB provisioned request rate up and down through an API call.
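
Continuing the sketch above, rescaling is a single call against the same table; the new rates here are illustrative.

```python
# Change the provisioned request rate online; no downtime, and any
# resharding happens behind the scenes. Values are illustrative.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.update_table(
    TableName="user-sessions",
    ProvisionedThroughput={
        "ReadCapacityUnits": 50000,   # scale reads up 10x
        "WriteCapacityUnits": 20000,  # scale writes up 10x
    },
)
```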

In addition to supporting transparent, online scaling of provisioned request rates up and down over 6+ orders of magnitude with resource reservation, DynamoDB is also both consistent and multi-datacenter redundant. Eventual consistency is a fine programming model for some applications, but it can yield confusing results under some circumstances. For example, if you set a value to 3 and then later set it to 4, then read it back, 3 can be returned. Worse, the value could be set to 4, verified to be 4 by reading it, and yet 3 could be returned later. It’s a tough programming model for some applications, and it tends to be overused in an effort to achieve low latency and high throughput. DynamoDB avoids forcing this choice by supporting low latency and high throughput while offering full consistency. It also offers eventual consistency at lower request cost for those applications that run well with that model. Both consistency models are supported.
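
In the same illustrative sketch, the choice of consistency model is made on each read: ConsistentRead=True asks for the latest committed value, while the default eventually consistent read is cheaper per request.

```python
# Strongly consistent vs. eventually consistent reads of the same item.
# Table and key names continue the illustrative example above.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")
key = {"session_id": {"S": "abc-123"}}

strong = dynamodb.get_item(TableName="user-sessions", Key=key, ConsistentRead=True)
eventual = dynamodb.get_item(TableName="user-sessions", Key=key)  # default: eventual
```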

It is not unusual for a NoSQL store to be able to support high transaction rates. What is somewhat unusual is to be able to scale the provisioned rate up and down while on-line. Achieving that while, at the same time, maintaining synchronous, multi-datacenter redundancy is where I start to get excited.

Clearly nobody wants to run the risk of losing data, but NoSQL systems are scale-first by definition. If the only way to achieve high throughput and scale is to take on risk and not commit the data to persistent storage at commit time, that is exactly what is often done. This is where DynamoDB really shines. When data is sent to DynamoDB, it is committed to persistent and reliable storage before the request is acknowledged. That part is easy to do, but doing it with average low single-digit millisecond latencies is both harder and requires better hardware. Hard disk drives can’t do it, and in-memory systems are not persistent, so flash memory is the most cost-effective solution.

But what if the server to which the data was committed fails, or the storage fails, or the datacenter is destroyed? On most NoSQL systems you would lose your most recent changes. On the better implementations, the data might be saved but could be offline and unavailable. With DynamoDB, if data is committed just as the entire datacenter burns to the ground, the data is safe, and the application can continue to run without negative impact at exactly the same provisioned throughput rate. The loss of an entire datacenter isn’t even inconvenient (unless you work at Amazon :-)) and has no impact on your running application performance.

Combining rock solid synchronous, multi-datacenter redundancy with average latency in the single digits, and throughput scaling to the millions of requests per second is both an excellent engineering challenge and one often not achieved.

Just as I was blown away when I saw it possible to create the world’s 42nd most powerful super computer with a few API calls to AWS (42: the Answer to the Ultimate Question of Life, the Universe and Everything), it is truly cool to see a couple of API calls to DynamoDB be all that it takes to get a scalable, consistent, low-latency, multi-datacenter redundant, NoSQL service configured, operational and online.

In data centers, power conversion is a repetitive, costly and inefficient process. Rule of thumb: The fewer the conversions, the better. It’s also helpful to bring high voltage as close as possible to servers.

This high-scale storage system is based upon HBase and Haystack. HBase is a non-relational, distributed database very similar to Google’s Bigtable. Haystack is a simple file system designed by Facebook for efficient photo storage and delivery. More on Haystack at: Facebook Needle in a Haystack.

In this Facebook Message store, Haystack is used to store attachments and large messages. HBase is used for message metadata, search indexes, and small messages (avoiding the second I/O to Haystack for small messages like most SMS).

We see press releases go by all the time and most of them deserve the yawn they get. But, one caught my interest yesterday. At the PASS Summit conference Microsoft Vice President Ted Kummert announced that Microsoft will be offering a big data solution based upon Hadoop as part of SQL Azure. From the Microsoft press release, “Kummert also announced new investments to help customers manage big data, including an Apache Hadoop-based distribution for Windows Server and Windows Azure and a strategic partnership with Hortonworks Inc.”

This announcement is also a big win for the MapReduce processing model, first invented at Google and published in MapReduce: Simplified Data Processing on Large Clusters. The Apache Hadoop distribution is an open source implementation of MapReduce. Hadoop is incredibly widely used, with Yahoo! running more than 40,000 nodes of Hadoop and their biggest single cluster now at 4,500 servers. Facebook runs a 1,100 node cluster and a second 300 node cluster. LinkedIn runs many clusters, including deployments of 1,200, 580, and 120 nodes. See the Hadoop Powered By page for many more examples.

In the cloud, AWS began offering Elastic MapReduce back in early 2009 and has been steadily expanding the features supported by this offering over the last couple of years, adding support for Reserved Instances, Spot Instances, and Cluster Compute instances (on a 10Gb non-oversubscribed network – MapReduce just loves high-bandwidth inter-node connectivity) and support for more regions, with EMR available in Northern Virginia, Northern California, Ireland, Singapore, and Tokyo.

Microsoft expects to have a pre-production (what they refer to as a “Community Technology Preview”) version of a Hadoop service available by the “end of 2011”. This is interesting for a variety of reasons. First, it’s more evidence of the broad acceptance and applicability of the MapReduce model. What is even more surprising is that Microsoft has decided in this case to base their MapReduce offering upon open source Hadoop rather than the Microsoft internally developed MapReduce service called Cosmos, which is used heavily by the Bing search and advertising teams. The What is Dryad blog entry provides a good description of Cosmos and some of the infrastructure built upon the Cosmos core, including Dryad, DryadLINQ, and SCOPE.

As surprising as it is to see Microsoft planning to offer MapReduce based upon open source rather than upon the internally developed and heavily used Cosmos platform, it’s even more surprising that they hope to contribute changes back to the open source community saying “Microsoft will work closely with the Hadoop community and propose contributions back to the Apache Software Foundation and the Hadoop project.”

Amazon Web Services doesn’t say much about the data centers powering its cloud computing platform. But this week the company held a technology open house in Seattle, where AWS Distinguished Engineer James Hamilton discussed the company’s infrastructure. The presentation (PDF) included an image of a modular data center design used by Amazon, which is the first official acknowledgement that the company uses modular infrastructure.

A look at the Amazon Perdix container, included in a presentation at Amazon Technology Day

Hamilton also shared a factoid that provides a sense of the rapid growth of Amazon’s cloud platform. “Every day Amazon Web Services adds enough new capacity to support all of Amazon.com’s global infrastructure through the company’s first 5 years, when it was a $2.76 billion annual revenue enterprise,” Hamilton states in one of the slides.

Growth Driving Data Center Expansion Plans

Even without the exact numbers, that’s an indicator of how rapidly Amazon’s infrastructure is growing, and why the company has recently begun acquiring additional sites in Ireland, Northern Virginia, and Oregon for data center expansion.

In his research at Microsoft and now at Amazon Web Services, Hamilton has focused on cost models for operating hyper-scale data centers. His presentation at the Amazon open house reviewed cost assumptions for an 8 megawatt data center, which could include 46,000 servers.

Hamilton estimated the cost at $88 million (about $11 million per megawatt) and presented a pie chart outlining monthly operating costs for a facility, which are dominated by the cost of servers (57 percent), followed by power and cooling infrastructure (18 percent) and electric power (13 percent).

These percentages are consistent with Hamilton’s earlier published research on data center costs. His example assumes power costs of roughly 7 cents per kilowatt hour and a conservative Power Usage Effectiveness (PUE), which both suggest that the example data center is a composite of Amazon’s global footprint rather than its best-performing data center.
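
To make those proportions a little more concrete, here is a rough sketch of the monthly electric bill those assumptions imply; the PUE value used below is an illustrative assumption, since the exact figure from the slide isn’t reproduced here.

```python
# Rough monthly power bill for an 8MW critical-load facility at ~7 cents/kWh.
# The PUE here is an illustrative assumption, not the figure from the slide.
critical_load_mw = 8
assumed_pue = 1.45            # assumption for illustration only
price_per_kwh = 0.07          # USD
hours_per_month = 24 * 365 / 12

monthly_kwh = critical_load_mw * 1000 * assumed_pue * hours_per_month
print(f"~${monthly_kwh * price_per_kwh:,.0f} per month for power")

# Facility capital: $88M for 8MW is about $11M per megawatt of critical load.
print(f"${88e6 / critical_load_mw / 1e6:.0f}M per MW")
```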

Hamilton was an early advocate of using shipping containers to deploy large volumes of servers in a tightly-controlled environment, first discussing this approach in a series of 2007 presentations that preceded Microsoft’s decision to use modular units to deploy its cloud computing infrastructure. When Hamilton moved to AWS, it prompted speculation that Amazon might also be using containers.

A slide of a data center from a presentation at the Amazon Technology Open House

At Tuesday’s open house Hamilton displayed a slide of modular data centers, including the Amazon Perdix. It appears to be a custom-built unit that is wider and taller than standard ISO containers. While it’s hard to glean much from the exterior, vents at the side and top suggest cooling is managed in the upper section of the unit, which is air-cooled. Why Perdix? It’s the name of a character in Greek mythology known for inventing useful tools.

Is this Amazon’s current technology? Perhaps not. An Amazon affiliate that builds the company’s data centers has submitted building plans for a new facility in Umatilla, Oregon featuring six modules, according to local media. Plans show the structures will be about 20 feet wide and 108 feet long, situated side by side on the property, according to the East Oregonian.