In a past blog entry, One Size Does Not Fit All, I offered a taxonomy of four different types of structured storage systems, argued that Relational Database Management Systems are not sufficient, and walked through some of the reasons why NoSQL databases have emerged and continue to grow market share quickly. The four database categories I introduced were: 1) features-first, 2) scale-first, 3) simple structured storage, and 4) purpose-optimized stores. RDBMSs own the first category.

DynamoDB targets workloads fitting into the scale-first and simple structured storage categories, where NoSQL database systems have been so popular over the last few years. Let's look at these two categories in more detail.

Scale-First: Scale-first applications are those that absolutely must scale without bound, and being able to do this without restriction is far more important than having richer features. These applications are exemplified by very high scale web sites such as Facebook, MySpace, Gmail, Yahoo, and Amazon.com. Some of these sites actually do make use of relational databases but many do not. The common theme across all of these services is that scale is more important than features and none of them could possibly run on a single RDBMS. As soon as a single RDBMS instance won't handle the workload, there are two broad possibilities: 1) shard the application data over a large number of RDBMS systems, or 2) use a highly scalable key-value store.
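As a minimal sketch of option 1, here is hash-based sharding of application data across a handful of separate database instances (the host names and key format are purely illustrative):

```python
import hashlib

# Each shard is an independent RDBMS instance; these host names are made up.
SHARDS = ["db-host-0", "db-host-1", "db-host-2", "db-host-3"]

def shard_for(user_id: str) -> str:
    """Route every query for a given user to the same database instance."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:12345"))  # e.g. "db-host-2"
```

The application layer then has to handle across shards everything the single RDBMS used to do for it: rebalancing, cross-shard queries, and failures, which is exactly the operational work scale-first systems try to avoid.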

Simple Structured Storage: There are many applications that have a structured storage requirement but really don't need the features, cost, or complexity of an RDBMS. Nor are they focused on the scale required by the scale-first segment. They just need a simple key-value store. A file system or BLOB store is not sufficiently rich, in that simple query and index access is needed, but nothing even close to the full set of RDBMS features is required. Simple, cheap, fast, and low operational burden are the most important requirements of this segment of the market.

The DynamoDB service is a unified, purpose-built hardware platform and software offering. The hardware is based upon a custom server design using flash storage spread over a scalable, high-speed network joining multiple data centers.

DynamoDB supports a provisioned throughput model. A DynamoDB application programmer decides the number of database requests per second their application should be capable of supporting, and DynamoDB automatically spreads the table over an appropriate number of servers. At the same time, it reserves the required network, server, and flash memory capacity to ensure that request rate can be reliably delivered day and night, week after week, and year after year. There is no need to worry about a neighboring application getting busy or running wild and taking all the needed resources: they are reserved and available whenever needed.
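As a rough sketch of what this looks like from the application side, here is a table created with provisioned throughput using the boto3 Python SDK (the table name, key schema, and capacity figures are illustrative, not anything from the original announcement):

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Reserve read and write capacity for the table at creation time.
dynamodb.create_table(
    TableName="GameScores",  # illustrative
    AttributeDefinitions=[{"AttributeName": "UserId", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "UserId", "KeyType": "HASH"}],
    ProvisionedThroughput={
        "ReadCapacityUnits": 1000,   # read capacity units to reserve
        "WriteCapacityUnits": 500,   # write capacity units to reserve
    },
)
```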

The sharding techniques needed to achieve high request rates are well understood industry-wide, but implementing them does take some work. Reliably reserving capacity so it is always there when you need it takes yet more work. Supporting the ability to allocate more resources, or fewer, while online and without disturbing the current request rate takes still more work. DynamoDB makes all of this easy. It supports online scaling from very low transaction rates up to applications requiring millions of requests per second, with no downtime and no disturbance to the currently configured application request rate while resharding. These changes are made online simply by adjusting the DynamoDB provisioned request rate up or down through an API call.
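Under the same assumptions as the sketch above, scaling the provisioned rate of an existing table up or down is a single call, and the table stays online throughout:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Raise (or lower) the reserved throughput of an existing table online;
# DynamoDB reshards behind the scenes without taking the table down.
dynamodb.update_table(
    TableName="GameScores",  # same illustrative table as above
    ProvisionedThroughput={
        "ReadCapacityUnits": 100000,
        "WriteCapacityUnits": 50000,
    },
)
```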

In addition to supporting transparent, online scaling of provisioned request rates up and down over 6+ orders of magnitude with resource reservation, DynamoDB is also both consistent and multi-datacenter redundant. Eventual consistency is a fine programming model for some applications, but it can yield confusing results under some circumstances. For example, if you set a value to 3 and then later set it to 4, then read it back, 3 can be returned. Worse, the value could be set to 4, verified to be 4 by reading it, and yet 3 could be returned later. It's a tough programming model for some applications, and it tends to be overused in an effort to achieve low latency and high throughput. DynamoDB avoids forcing this by supporting low latency and high throughput while offering full consistency. It also offers eventual consistency at lower request cost for those applications that run well with that model. Both consistency models are supported.
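A sketch of choosing between the two models on a per-read basis, continuing with the hypothetical boto3 table from the earlier examples:

```python
import boto3

table = boto3.resource("dynamodb").Table("GameScores")  # hypothetical table

# Strongly consistent read: reflects every previously acknowledged write.
resp = table.get_item(Key={"UserId": "alice"}, ConsistentRead=True)

# Eventually consistent read (the default): charged at a lower request
# cost, but may briefly return a stale value, as in the 3-vs-4 case above.
resp = table.get_item(Key={"UserId": "alice"}, ConsistentRead=False)
```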

It is not unusual for a NoSQL store to be able to support high transaction rates. What is somewhat unusual is to be able to scale the provisioned rate up and down while on-line. Achieving that while, at the same time, maintaining synchronous, multi-datacenter redundancy is where I start to get excited.

Clearly nobody wants to run the risk of losing data, but NoSQL systems are scale-first by definition. If the only way to achieve high throughput and scale is to take on risk and not commit the data to persistent storage at commit time, that is exactly what is often done. This is where DynamoDB really shines. When data is sent to DynamoDB, it is committed to persistent and reliable storage before the request is acknowledged. Again, this is easy to do, but doing it with average latencies in the low single-digit milliseconds is harder and requires better hardware. Hard disk drives can't do it and in-memory systems are not persistent, so flash memory is the most cost-effective solution.
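From the client's point of view, the durable commit is simply folded into the write call: the request does not return until the write has been acknowledged. A small timing sketch, again using the hypothetical boto3 table from earlier:

```python
import time
import boto3

table = boto3.resource("dynamodb").Table("GameScores")  # hypothetical table

start = time.perf_counter()
# put_item returns only after the service acknowledges the write, i.e.
# after the data has been committed to persistent storage.
table.put_item(Item={"UserId": "alice", "TopScore": 4200})
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"write acknowledged after {elapsed_ms:.1f} ms")
```

Note that the measured time also includes client-to-service network latency, so it will be somewhat higher than the service-side figure.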

But what if the server to which the data was committed fails, or the storage fails, or the datacenter is destroyed? On most NoSQL systems you would lose your most recent changes. On the better implementations, the data might be saved but could be offline and unavailable. With DynamoDB, if data is committed just as the entire datacenter burns to the ground, the data is safe, and the application can continue to run without negative impact at exactly the same provisioned throughput rate. The loss of an entire datacenter isn't even inconvenient (unless you work at Amazon :-)) and has no impact on your running application's performance.

Combining rock-solid synchronous multi-datacenter redundancy with average latencies in the single-digit milliseconds and throughput scaling to millions of requests per second is an excellent engineering challenge, and one not often achieved.

Just as I was blown away when I saw it was possible to create the world's 42nd most powerful supercomputer with a few API calls to AWS (42: the Answer to the Ultimate Question of Life, the Universe and Everything), it is truly cool to see that a couple of API calls to DynamoDB is all it takes to get a scalable, consistent, low-latency, multi-datacenter redundant NoSQL service configured, operational, and online.

In May, Google open-sourced a BigTable-inspired key-value database library called LevelDB under a BSD license. It was created by Jeff Dean and Sanjay Ghemawat of the BigTable project at Google. It's available for Unix-based systems, Mac OS X, Windows, and Android.

LevelDB is not a database server like other key-value stores such as Redis or Membase. Instead, it would most likely be used as an embedded database for other applications, much the way SQLite or Berkeley DB are used. The technical advantage of using LevelDB instead of other key-value stores is its support for ordered data. Also, its BSD license is more liberal than Berkeley DB's Sleepycat license.

For example, LevelDB may be used by a web browser to store a cache of recently accessed web pages, or by an operating system to store the list of installed packages and package dependencies, or by an application to store user preference settings. We designed LevelDB to also be useful as a building block for higher-level storage systems. Upcoming versions of the Chrome browser include an implementation of the IndexedDB HTML5 API that is built on top of LevelDB. Google’s Bigtable manages millions of tablets where the contents of a particular tablet are represented by a precursor to LevelDB.
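As a small sketch of embedded use, and of what key ordering buys you, here is LevelDB driven from Python through the third-party plyvel binding (an assumption; the library itself is C++, and the database path and keys below are made up):

```python
import plyvel  # third-party Python binding for the LevelDB library

db = plyvel.DB("/tmp/prefs-db", create_if_missing=True)

# Store a few user-preference style settings as byte-string key-value pairs.
db.put(b"user:alice:lang", b"en")
db.put(b"user:alice:theme", b"dark")
db.put(b"user:bob:theme", b"light")

# LevelDB keeps keys sorted, so scanning one user's settings is just an
# ordered iteration over a key prefix; no separate index is needed.
for key, value in db.iterator(prefix=b"user:alice:"):
    print(key, value)

db.close()
```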

LevelDB isn’t limited to just being used as an embedded database, however. Basho is already exploring the possibility of using LevelDB with Riak as an alternative to Bitcask or InnoDB. The company conducted some benchmarks, which you can find in this blog post.

Big data has become one of the new buzzwords on the Internet. It refers to the massive amounts of data that many modern web services deal with. This post will list some of the more useful software available to web developers for working with big data.

You don’t have to operate at the scale of Google or Facebook to enter into big data territory. Web analytics services, monitoring services (like our very own Pingdom), search engines, etc., all process and store massive amounts of data.

Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools. […] Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data.

At this scale, many traditional approaches for handling and processing data are either impractical or break down completely.


That’s why the web development community has been turning to alternative ways to handle all this data, developing new software that scales to these extremes. You may have heard about NoSQL databases, but that’s just a small piece of the puzzle.

So what are the various ingredients available for handling big data? We’ve divided them into four categories:


Storage and file systems

Databases

Querying and data analysis

Streaming and event processing

We figured this could be a good starting point, and we’re hoping that you’ll help us add to the list in this post by making your own suggestions in the comments. In other words, read the list, and help us add more useful ingredients!

Storage and file systems

When you need to store massive amounts of data, you’ll want a storage solution designed to scale out on multiple servers.

HDFS (Hadoop Distributed File System) – Part of the open source Hadoop framework, HDFS is a distributed, scalable file system inspired by the Google File System. It runs on top of the file system of the underlying OSs and is designed to scale to petabytes of storage. The Hadoop project (you’ll see several of the other components further down) has several high-profile contributors, the main one being Yahoo. Hadoop is used by Yahoo, AOL, eBay, Facebook, IBM, Meebo, Twitter and a large number of other companies and services.

CloudStore (KFS) – An open source implementation of the Google File System from Kosmix. It can be used together with Hadoop and Hypertable. A well-known CloudStore user and contributor is Quantcast.

GlusterFS – A free, scalable, distributed file system developed by Gluster.

Databases

While classics like MySQL are still widely used, there are other options out there that have been designed with “web scalability” in mind, many of them so-called NoSQL databases (speaking of buzzwords…).

HBase – A distributed, fault-tolerant database modeled after Google’s BigTable. It’s part of the Apache Hadoop project, and runs on top of HDFS.

Hypertable – An open source database inspired by Google’s BigTable. A notable Hypertable user is Baidu.

Cassandra – A distributed key-value database originally developed by Facebook, released as open source, and now run under the Apache umbrella. Cassandra is used by Facebook, Digg, Reddit, Twitter and Rackspace, to name a few.

Membase – An open source, distributed, key-value database optimized for interactive web applications, developed by several team members from the famous Memcached project. Users include Zynga and Heroku. A month ago, the Membase project merged with CouchDB, creating a new project called Couchbase.

Querying and data analysis

All that data is of no use without the ability to access, process and analyze it.

Hadoop MapReduce – Open source version of Google’s MapReduce framework for distributed processing of large datasets (a minimal word-count sketch follows this list).

Hive – An open source data warehouse infrastructure with tools for querying and analyzing large datasets in Hadoop. Supports an SQL-like query language called HiveQL.

Pig – A high-level platform and language for processing data with Hadoop. Funny aside: the language itself is called Pig Latin.
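To make the MapReduce entry above a bit more concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are ordinary programs reading stdin and writing stdout and the framework handles the distributed sort-and-shuffle between them (the file layout and the streaming job invocation are omitted):

```python
import sys
from itertools import groupby

def mapper(lines):
    # Emit "word<TAB>1" for every word seen.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Hadoop delivers mapper output to each reducer sorted by key, so
    # consecutive lines with the same word can be summed directly.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    stream = mapper(sys.stdin) if stage == "map" else reducer(sys.stdin)
    for out in stream:
        print(out)
```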

Streaming and event processing

When you have massive amounts of data flowing into your system, you will often want to process and react to this data in real time.

S4 – A general-purpose, distributed, scalable platform for processing continuous streams of data. Developed by Yahoo and released as open source in 2010. It’s apparently not quite ready for prime time yet, although Yahoo is using a version of it internally.

StreamInsight – Microsoft’s entry in the ESP/CEP field, included with SQL Server.

A small aside: when speaking of streaming and event processing, you’ll hear two industry terms repeated over and over again: ESP, Event Stream Processing, and CEP, Complex Event Processing. Just in case you were wondering what those actually stand for.

The Google legacy

It’s interesting how influential Google has been in the big data field in spite of having released very little actual software to the public.

Much of the open source big data movement is centered around Apache’s Hadoop project, which has essentially tried to replicate Google’s internal software based on the various whitepapers Google has made available. (More specifically, Hadoop has replicated GFS, BigTable and MapReduce.)

Here is a list of some of Google’s proprietary software relating to big data:

BigTable – A distributed, high-performance database system built on top of GFS.

MapReduce – A framework for distributed processing of very large data sets.

Pregel – A framework for analyzing large-scale graphs with billions of nodes.

Dremel – Meant as a faster complement to MapReduce, Dremel is a scalable, interactive, ad-hoc query system for large data sets. According to Google, it’s capable of running aggregation queries over trillion-row tables in seconds and scales to thousands of CPUs.

If we may be so bold as to bring out our crystal ball, there will most likely be several open source implementations of Pregel and Dremel available soon. For example, there’s already an OpenDremel project in the works.

Help us add more ingredients!

What excellent big data software did we leave out? Let’s make this post a true resource, so please give us a hand in the comments.

Facebook is working on a real-time analytics dashboard that will let users determine which content on their pages is getting the most attention from visitors. As described in an educational session on Wednesday night in Facebook’s Seattle office, the service, which tracks both impressions and actions for plugins and newsfeeds, should be valuable to companies seeking to maximize the effectiveness of their marketing efforts on the popular social media site. However, the highlight of the session was the infrastructure underlying the forthcoming service.

The session video gives plenty of details, but here are some highlights. The analytics service tracks about 100 different metrics; is built atop HBase, with support from two Facebook-developed tools called pTail and Puma; and it aims for less than 30 seconds of lag time, a goal it has met a majority of the time during testing. It’s interesting that Facebook is becoming such a big user of the Hadoop-based HBase database, but the company line thus far is that the Cassandra NoSQL database it developed a few years ago just can’t hang with HBase when it comes to reliability and performance. HBase also underpins Facebook’s recently launched “social inbox” feature.

In 2010, Apache Cassandra increased its momentum as the leading scalable database. Here is a summary of the notable activity in three areas: code, community and controversy. As always, comments are welcome.

We also saw Lucandra, which implements a Cassandra back end for Lucene and is used in several high volume production sites, grow up into Solandra, embedding Solr and Cassandra in the same JVM for even more performance.

Controversy

Cassandra got a lot of negative publicity when Kevin Rose blamed Cassandra for Digg v4’s teething problems. However, there was no deluge of bug reports coming out of Digg’s Cassandra team, and Digg engineers Arin Sarkissian and Chris Goffinet (now working on Cassandra for Twitter) got on Quora to refute the idea that Cassandra was at fault:

The new version of Digg has a whole new architecture with a bunch of technologies involved. Problem is, over the last few months or so the only technological change we mentioned (blogged about etc) was Cassandra. That made it pretty easy for folks to cling on to it as the "problem".

Mollom is one of those cool SaaS companies every developer dreams of creating when they rack their brains looking for a viable software-as-a-service startup. Mollom profitably runs a useful service—spam filtering—with a small group of geographically distributed developers. Mollom helps protect nearly 40,000 websites from spam, including one of mine, which is where I first learned about Mollom. In a desperate attempt to stop spam on a Drupal site, where every other form of CAPTCHA had failed miserably, I installed Mollom in about 10 minutes and it immediately started working. That’s the out-of-the-box experience I was looking for.

From the time Mollom opened its digital inspection system, they’ve rejected over 373 million spam messages, and in the process they’ve learned that a stunning 90% of all messages are spam. This spam torrent is handled by only two geographically distributed machines handling 100 requests/second, each running a Java application server and Cassandra. So few resources are necessary because they’ve created a very efficient machine learning system. Isn’t that cool? So, how do they do it?

To find out I interviewed Benjamin Schrauwen, cofounder of Mollom, and Johan Vos, Glassfish and Java enterprise expert. Proving software knows no national boundaries, Mollom HQ is located in Belgium (other good things from Belgium: Hercule Poirot, chocolate, waffles).

Serves 40,000 active websites, many of which are very large customers like Sony Music, Warner Brothers, Fox News, and The Economist. A lot of big brands, with big websites, and a lot of comments.

Finds half a million spam messages each day.

Handles 100 API calls per second.

A spam check is low latency, taking between 30 and 50 ms. The slowest checks take around 500 ms, and the 95th percentile latency is 250 ms. It’s really optimized for speed.

Spam classification efficiency is at 99.95%. This means that only 5 in 10,000 spam messages were not caught by Mollom.

Netlog, which is a social networking site in Europe, has their own Mollom setup in their own datacenter. Netlog handles about 4 million messages a day on custom classifiers that are trained on their data.

Mollom is a web service for filtering out various types of spam from user generated content: comments, forum posts, blog posts, polls, contact forms, registration forms, and password request forms. Spam determination is not only based on the posted content, but also on the past activity and reputation of the poster. Mollom’s machine learning algorithms act as your 24×7 digital moderator, so you don’t have to.

Edit Note: This is the fourth of a multi-part series of posts exploring the use cases for NoSQL deployments in the real world. So far, the series has covered case studies on MongoDB, Cassandra, and HBase.


With all the excitement surrounding the relatively recent wave of non-relational – otherwise known as “NoSQL” – databases, it can be hard to separate the hype from the reality. There’s a lot of talk, but how much NoSQL action is there in the real world? In this series, we’ll take a look at some real-world NoSQL deployments.

Netflix provides rent-by-mail and streaming movies in the United States. The shift from mail-order to streaming video had fairly significant implications for Netflix’s application infrastructure. Netflix realized that it would need multiple geographically dispersed data centers and far more processing capacity. Rather than build these new data centers, Netflix decided to migrate its applications to Amazon’s AWS cloud. This allowed the company to concentrate its intellectual efforts on building customer value rather than nationwide data centers.

As a part of this bold move, Netflix migrated core parts of its database from Oracle to Amazon’s SimpleDB data store. This migration is one of the biggest migrations to the cloud yet undertaken, with the Netflix system serving the needs of more than 16 million subscribers and hosting over 100,000 DVD titles.

SimpleDB is a key-value store that runs within the Amazon Web Services (AWS) cloud. It promises reliable and transparently scalable storage together with a flexible schema, and supports either immediate or eventual consistency. SimpleDB is a virtually zero-administration service: there is no database administration involved in scaling the system – storage and compute power are assigned dynamically and automatically by Amazon as the database grows.

Netflix needed to make significant compromises in exchange for the scalability provided by Amazon AWS and SimpleDB. Complex SQL operations, such as joins between tables or aggregate “group by” operations, which would normally be executed within the database, were moved to the application layer. In some cases this required that the data model be de-normalized; data that would be stored in multiple tables in Oracle was flattened into a single SimpleDB structure so that joins could be avoided.
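A toy illustration of that flattening (the attribute names and values are invented, not Netflix’s actual schema): the two-table relational layout on top requires a join at read time, while the denormalized item below carries everything one page view needs under a single key.

```python
# Normalized, relational-style layout: two "tables" joined on customer_id.
customers = {"c42": {"name": "Jane Doe", "plan": "3-at-a-time"}}
rentals = [
    {"customer_id": "c42", "title": "Metropolis", "shipped": "2010-11-02"},
    {"customer_id": "c42", "title": "Sunrise", "shipped": "2010-11-09"},
]

# Denormalized, SimpleDB-style layout: one item per customer, with the
# rental history flattened into a multi-valued attribute so no join is
# needed when the application reads it back.
items = {
    "customer:c42": {
        "name": "Jane Doe",
        "plan": "3-at-a-time",
        "rentals": ["Metropolis|2010-11-02", "Sunrise|2010-11-09"],
    }
}
```

The cost, of course, is that the same fact may now live in several items, and keeping those copies in step becomes the application's job.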

Relational database transactions were dropped in favour of SimpleDB’s optimistic concurrency mechanism, which allows modifications to proceed only if an item is unchanged since it was last accessed. For instance, an attempt to increment a counter (say, the number of rentals for a video) would be rejected if the counter was simultaneously modified by another transaction. Even so, application developers needed to be aware that certain operations (reading a value immediately after modifying it, for instance) might return incorrect or at least unexpected results.
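A minimal sketch of the read-modify-conditionally-write pattern this implies, using a toy in-memory store (the conditional_put method here is hypothetical; SimpleDB exposes the same idea through conditional puts that succeed only if an attribute still holds its expected value):

```python
import threading

class InMemoryStore:
    """Toy stand-in for a key-value store that supports conditional puts."""
    def __init__(self):
        self._lock = threading.Lock()
        self._items = {}

    def put(self, key, item):
        with self._lock:
            self._items[key] = dict(item)

    def get(self, key):
        return dict(self._items[key])

    def conditional_put(self, key, item, expect):
        # Apply the write only if every expected attribute is unchanged.
        with self._lock:
            current = self._items.get(key, {})
            if all(current.get(k) == v for k, v in expect.items()):
                self._items[key] = dict(item)
                return True
            return False

def increment_rental_count(store, key, retries=5):
    # Optimistic concurrency: read, modify, then write back only if the
    # value we read is still current; otherwise re-read and retry.
    for _ in range(retries):
        item = store.get(key)
        expected = item["rental_count"]
        item["rental_count"] = expected + 1
        if store.conditional_put(key, item, expect={"rental_count": expected}):
            return item
    raise RuntimeError("too much write contention; giving up")

store = InMemoryStore()
store.put("video:metropolis", {"rental_count": 0})
print(increment_rental_count(store, "video:metropolis"))  # {'rental_count': 1}
```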

Netflix doesn’t use SimpleDB for all storage; Oracle, MySQL and the Amazon S3 service all form significant parts of the Netflix architecture. Nevertheless, with more than 16 million customers, Netflix has made a significant commitment to a non-relational alternative and one which, it says, allows them to better meet customer and shareholder needs. Netflix has been generous in sharing their experiences in articles such as this one.

Guy Harrison is a director of research and development at Quest Software, and has over 20 years of experience in database design, development, administration, and optimization. He can be found on the internet at www.guyharrison.net, by e-mail at guy.harrison@quest.com, and as @guyharrison on Twitter.