Chapter 1. Introducing Cassandra

If at first the idea is not
absurd, then there is no hope for
it.

—Albert Einstein

Welcome to Cassandra: The Definitive Guide. The
aim of this book is to help developers and database administrators
understand this important new database, explore how it compares to the
relational database management systems we’re used to, and put it to
work in their own environments.

What’s Wrong with Relational Databases?

If I had asked people what
they wanted, they would have said
faster horses.

—Henry Ford

I ask you to consider a certain model for data, invented by a small
team at a company with thousands of employees. It is accessible over a
TCP/IP interface and is available from a variety of languages, including
Java and web services. This model was difficult at first for all but the
most advanced computer scientists to understand, until broader adoption
helped make the concepts clearer. Using the database built around this
model required learning new terms and thinking about data storage in a
different way. But as products sprang up around it, more businesses and
government agencies put it to use, in no small part because it was
fast—capable of processing thousands of operations a second. The revenue
it generated was tremendous.

And then a new model came along.

The new model was threatening, chiefly for two reasons. First, the
new model was very different from the old model, which it pointedly
controverted. It was threatening because it can be hard to understand
something different and new. Ensuing debates can help entrench people
stubbornly further in their views—views that might have been largely
inherited from the climate in which they learned their craft and the
circumstances in which they work. Second, and perhaps more importantly,
the new model was threatening because businesses had made
considerable investments in the old model and were making lots of money
with it. Changing course seemed ridiculous, even impossible.

Of course I’m talking about the Information Management System (IMS)
hierarchical database, invented in 1966 at IBM.

IMS was built for use in the Saturn V moon rocket. Its architect was
Vern Watts, who dedicated his career to it. Many of us are familiar with
IBM’s wildly popular DB2 database, which gets its name as the successor to
DB1, the product built around the hierarchical data model IMS.
IMS was released in 1968, and subsequently enjoyed success in Customer
Information Control System (CICS) and other applications. It is still used
today.

But in the years following the invention of IMS, the new model, the
disruptive model, the threatening model, was the relational
database.

In his 1970 paper “A Relational Model of Data for Large Shared Data
Banks,” Dr. Edgar F. Codd advanced his theory of the relational model for
data while working at IBM’s San Jose research laboratory. This paper, still
available at http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf,
became the foundational work for relational database management systems.

Codd’s work was antithetical to the hierarchical structure of IMS.
Understanding and working with a relational database required learning new
terms that must have sounded very strange indeed to users of IMS. It
presented certain advantages over its predecessor, in part because giants
are almost always standing on the shoulders of other giants.

While these ideas and their application have evolved in four
decades, the relational database is still clearly one of the most
successful software applications in history. It’s used in the form of
Microsoft Access in sole proprietorships, and in giant multinational
corporations with clusters of hundreds of finely tuned instances
representing multiterabyte data warehouses. Relational databases store
invoices, customer records, product catalogues, accounting ledgers, user
authentication schemes—the very world, it might appear. There is no
question that the relational database is a key facet of the modern
technology and business landscape, and one that will be with us in its
various forms for many years to come, as will IMS in its various forms.
The relational model presented an alternative to IMS, and each has its
uses.

So the short answer to the question, “What’s wrong with relational
databases?” is “Nothing.”

There is, however, a rather longer answer that I gently encourage
you to consider. This answer takes the long view, which says that every
once in a while an idea is born that ostensibly changes things, and
engenders a revolution of sorts. And yet, in another way, such
revolutions, viewed structurally, are simply history’s business as usual.
IMS, RDBMS, NoSQL. The horse, the car, the plane. They each build on prior
art, they each attempt to solve certain problems, and so they’re each good
at certain things—and less good at others. They each coexist, even
now.

So let’s examine for a moment why, at this point, we might consider
an alternative to the relational database, just as Codd himself four
decades ago looked at the Information Management System and thought that
maybe it wasn’t the only legitimate way of organizing information and
solving data problems, and that maybe, for certain problems, it might
prove fruitful to consider an alternative.

We encounter scalability problems when our relational applications
become successful and usage goes up. Joins are inherent in any relatively
normalized relational database of even modest size, and joins can be slow.
The way that databases gain consistency is typically through the use of
transactions, which require locking some portion of the database so it’s
not available to other clients. This can become untenable under very heavy
loads, as the locks mean that competing users start queuing up, waiting
for their turn to read or write the data.

We typically address these problems in one or more of the following
ways, sometimes in this order:

Throw hardware at the problem by adding more memory, adding
faster processors, and upgrading disks. This is known as
vertical scaling. This can relieve you for a
time.

When the problems arise again, the answer appears to be similar:
now that one box is maxed out, you add hardware in the form of
additional boxes in a database cluster. Now you have the problem of
data replication and consistency during regular usage and in failover
scenarios. You didn’t have that problem before.

Now we need to update the configuration of the database
management system. This might mean optimizing the channels the
database uses to write to the underlying filesystem. We turn off
logging or journaling, which frequently is not a desirable (or, depending on your
situation, legal) option.

Having put what attention we could into the database system, we
turn to our application. We try to improve our indexes. We optimize
the queries. But presumably at this scale we weren’t wholly ignorant
of index and query optimization, and already had them in pretty good
shape. So this becomes a painful process of picking through the data
access code to find any opportunities for fine tuning. This might
include reducing or reorganizing joins, throwing out
resource-intensive features such as XML processing within a stored
procedure, and so forth. Of course, presumably we were doing that XML
processing for a reason, so if we have to do it somewhere, we move
that problem to the application layer, hoping to solve it there and
crossing our fingers that we don’t break something else in the
meantime.

We employ a caching layer. For larger systems, this might
include distributed caches such
as memcached, EHCache, Oracle Coherence, or other related products.
Now we have a consistency problem between updates in the cache and
updates in the database, which
is exacerbated over a cluster.

We turn our attention to the database again and decide that, now
that the application is built and we understand the primary query
paths, we can duplicate some of the data to make it look more like the
queries that access it. This process, called denormalization, is
antithetical to the five normal forms that characterize the relational
model, and violates Codd’s 12 Commandments for relational data. We
remind ourselves that we live in this world, and not in some
theoretical cloud, and then undertake to do what we must to make the
application start responding at acceptable levels again, even if it’s
no longer “pure.”
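
To make the denormalization step concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customer and orders tables and the rename_customer helper are hypothetical names invented for illustration. The point is only that once a value is copied to speed up reads, every write has to keep all the copies in step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    -- Denormalized: customer_name is copied into each order to avoid a join.
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        customer_name TEXT,
        total INTEGER
    );
    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO orders VALUES (100, 1, 'Acme', 250), (101, 1, 'Acme', 75);
""")


def rename_customer(customer_id, new_name):
    """Update every copy of the name inside a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE customer SET name = ? WHERE id = ?",
                     (new_name, customer_id))
        conn.execute("UPDATE orders SET customer_name = ? WHERE customer_id = ?",
                     (new_name, customer_id))


def order_names():
    """The read path needs no join, which is the reason for the duplication."""
    rows = conn.execute("SELECT customer_name FROM orders ORDER BY id")
    return [name for (name,) in rows]


rename_customer(1, "Acme Corp")
```

The read is cheaper, but the write now touches two tables, and forgetting the second UPDATE would silently leave stale names behind.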

I imagine that this sounds familiar to you. At web scale, engineers
have started to wonder whether this situation isn’t similar to Henry
Ford’s assertion that at a certain point, it’s not simply a faster horse
that you want. And they’ve done some impressive, interesting work.

We must therefore begin here in recognition that the relational
model is simply a model. That is,
it’s intended to be a useful way of looking at the world, applicable to
certain problems. It does not purport to be exhaustive, closing the case
on all other ways of representing data, never again to be examined,
leaving no room for alternatives. If we take the long view of history, Dr.
Codd’s model was a rather disruptive one in its time. It was new, with
strange new vocabulary and terms such as “tuples”—familiar words used in a
new and different manner. The relational model was held up to suspicion,
and doubtless suffered its vehement detractors. It encountered opposition
even in the form of Dr. Codd’s own employer, IBM, which had a very
lucrative product set around IMS and didn’t need a young upstart cutting
into its pie.

But the relational model now arguably enjoys the best seat in the
house within the data world. SQL is widely supported and well understood.
It is taught in introductory university courses. There are free databases
that come installed and ready to use with a $4.95 monthly web hosting
plan. Often the database we end up using is dictated to us by
architectural standards within our organization. Even absent such
standards, it’s prudent to learn whatever your organization already has
for a database platform. Our colleagues in development and infrastructure
have considerable hard-won knowledge.

If by nothing more than osmosis—or inertia—we have learned over the
years that a relational database is a one-size-fits-all
solution.

So perhaps the real question is not, “What’s wrong with relational
databases?” but rather, “What problem do you have?”

That is, you want to ensure that your solution matches the problem
that you have. There are certain problems that relational databases solve
very well.

If massive, elastic scalability is not an issue for you, the
trade-offs in relative complexity of a system such as Cassandra may simply
not be worth it. No proponent of Cassandra that I know of is asking anyone
to throw out everything they’ve learned about relational databases,
surrender their years of hard-won knowledge around such systems, and
unnecessarily jeopardize their employer’s carefully constructed systems in
favor of the flavor of the month.

Relational data has served all of us developers and DBAs well. But
the explosion of the Web, and in particular social networks, means a
corresponding explosion in the sheer volume of data we must deal with.
When Tim Berners-Lee first worked on the Web in the early 1990s, it was
for the purpose of exchanging scientific documents between PhDs at a
physics laboratory. Now, of course, the Web has become so ubiquitous that
it’s used by everyone, from those same scientists to legions of
five-year-olds exchanging emoticons about kittens. That means in part that
it must support enormous volumes of data; the fact that it does stands as
a monument to the ingenious architecture of the Web.

But some of this infrastructure is starting to bend under the
weight.

In 1966, a company like IBM was in a position to really make people
listen to their innovations. They had the problems, and they had the brain
power to solve them. As we enter the
second decade of the 21st century, we’re starting to see similar
innovations, even from young companies such as Facebook and
Twitter.

So perhaps the real question, then, is not “What problem do I have?”
but rather, “What kinds of things would I do with data if it wasn’t a
problem?” What if you could easily achieve fault tolerance, availability
across multiple data centers, consistency that you tune, and massive
scalability even to the hundreds of terabytes, all from a client language
of your choosing? Perhaps, you say, you don’t need that kind of
availability or that level of scalability. And you know best. You’re
certainly right, in fact, because if your current database didn’t suit
your current database needs, you’d have a nonfunctioning system.

It is not my intention to convince you by clever argument to adopt a
non-relational database such as Apache Cassandra. It is only my intention
to present what Cassandra can do and how it does it so that you can make
an informed decision and get started working with it in practical ways if
you find it applies. Only you know what your data needs are. I do not ask
you to reconsider your database—unless you’re miserable with your current
database, or you can’t scale how you need to already, or your data model
isn’t mapping to your application in a way that’s flexible enough for you.
I don’t ask you to consider your database, but rather to consider your
organization, its dreams for the future, and its emerging problems. Would
you collect more information about your business objects if you
could?

Don’t ask how to make Cassandra fit into your existing environment.
Ask what kinds of data problems you’d like to have instead of the ones you
have today. Ask what new kinds of data you would like. What understanding
of your organization would you like to have, if only you could enable
it?

A Quick Review of Relational Databases

Though you are likely familiar with them, let’s briefly turn our
attention to some of the foundational concepts in relational databases.
This will give us a basis on which to consider more recent advances in
thought around the trade-offs inherent in distributed data systems,
especially very large distributed data systems, such as those that are
required at web scale.

RDBMS: The Awesome and the Not-So-Much

There are many reasons that the relational database has become so
overwhelmingly popular over the last four decades. An important one is
the Structured Query Language (SQL), which is feature-rich and uses a
simple, declarative syntax. SQL was first officially adopted as an ANSI
standard in 1986; since that time it’s gone through several revisions
and has also been extended with vendor proprietary syntax such as
Microsoft’s T-SQL and Oracle’s PL/SQL to provide additional
implementation-specific features.

SQL is powerful for a variety of reasons. It allows the user to
represent complex relationships with the data, using statements that
form the Data Manipulation Language (DML) to insert, select, update,
delete, truncate, and merge data. You can perform a rich variety of
operations using functions based on relational algebra to find a maximum
or minimum value in a set, for example, or to filter and order results.
SQL statements support grouping aggregate values and executing summary
functions. SQL provides a means of directly creating, altering, and
dropping schema structures at runtime using Data Definition Language
(DDL). SQL also allows you to grant and revoke rights for users and
groups of users using the same syntax.
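
As a small illustration of these capabilities, the following sketch runs DDL, DML, and a grouped summary query through Python's built-in sqlite3 module. The invoice table and its contents are made up for the example, and SQLite is used only because it needs no setup. It does not support GRANT or REVOKE, so those are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: create a schema structure at runtime.
conn.execute(
    "CREATE TABLE invoice (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER)")

# DML: insert rows, then update one of them.
conn.executemany(
    "INSERT INTO invoice (customer, amount) VALUES (?, ?)",
    [("acme", 120), ("acme", 80), ("globex", 300)])
conn.execute(
    "UPDATE invoice SET amount = 100 WHERE customer = 'acme' AND amount = 80")

# Grouping, summary functions, filtering, and ordering in one declarative statement.
summary = conn.execute("""
    SELECT customer, COUNT(*), SUM(amount), MAX(amount)
    FROM invoice
    GROUP BY customer
    HAVING SUM(amount) > 100
    ORDER BY customer
""").fetchall()
```

Note that the query says what result is wanted, not how to compute it. That declarative quality is a large part of SQL's appeal.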

SQL is easy to use. The basic syntax can be learned quickly, and
conceptually SQL and RDBMS offer a low barrier to entry. Junior
developers can become proficient readily, and as is often the case in an
industry beset by rapid changes, tight deadlines, and exploding budgets,
ease of use can be very important. And it’s not just the syntax that’s
easy to use; there are many robust tools that include intuitive
graphical interfaces for viewing and working with your database.

In part because it’s a standard, SQL allows you to easily
integrate your RDBMS with a wide variety of systems. All you need is a
driver for your application language, and you’re off to the races in a
very portable way. If you decide to change your application
implementation language (or your RDBMS vendor), you can often do that
painlessly, assuming you haven’t backed yourself into a corner using
lots of proprietary extensions.

Transactions, ACID-ity, and two-phase commit

In addition to the features mentioned already, RDBMS and SQL
also support transactions. A database transaction
is, as Jim Gray puts it, “a transformation of state” that has the ACID
properties (see http://research.microsoft.com/en-us/um/people/gray/papers/theTransactionConcept.pdf).
A key feature of transactions is that they execute virtually at first,
allowing the programmer to undo (using ROLLBACK) any changes that may
have gone awry during execution; if all has gone well, the transaction
can be reliably committed. The debate about support for transactions
comes up very quickly as a sore spot in conversations around
non-relational data stores, so let’s take a moment to revisit what
this really means.

ACID is an acronym for Atomic, Consistent, Isolated, Durable,
which are the gauges we can use to assess that a transaction has
executed properly and that it was successful:

Atomic

Atomic means “all or nothing”; that is, when a transaction
is executed, every update within the transaction must succeed in
order for it to be called successful. There is no partial failure where
one update was successful and another related update failed. The
common example here is with monetary transfers at an ATM: the
transfer requires subtracting money from one account and adding
it to another account. This operation cannot be subdivided; both
updates must succeed.

Consistent

Consistent means that data moves from one correct state to
another correct state, with no possibility that readers could
view different values that don’t make sense together. For
example, if a transaction attempts to delete a Customer and her
Order history, it cannot leave Order rows that reference the
deleted customer’s primary key; this is an inconsistent state
that would cause errors if someone tried to read those Order
records.

Isolated

Isolated means that transactions executing concurrently
will not become entangled with each other; they each execute in
their own space. That is, if two different transactions attempt
to modify the same data at the same time, then one of them will
have to wait for the other to complete.

Durable

Once a transaction has succeeded, the changes will not be
lost. This doesn’t imply another transaction won’t later modify
the same data; it just means that writers can be confident that
the changes are available for the next transaction to work with
as necessary.
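
The monetary transfer example can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions: the account table, the transfer helper, and the rule that a balance may not go negative are all invented for the example. What it shows is atomicity and rollback: either both updates are committed or neither is.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src))
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst))
            # Hypothetical business rule, checked after both writes.
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balance of every account as a dict."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # violates the rule and is rolled back
```

After the failed transfer, neither account shows any trace of the attempted 500 unit movement, even though both UPDATE statements had already run.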

On the surface, these properties seem so obviously desirable as
to not even merit conversation. Presumably no one who runs a database
would suggest that data updates don’t have to endure for some length
of time; that’s the very point of making updates—that they’re there for others to
read. However, a more subtle examination might lead us to want to find
a way to tune these properties a bit and control them slightly. There
is, as they say, no free lunch on the Internet, and once we see how
we’re paying for our transactions, we may start to wonder whether
there’s an alternative.

Transactions become difficult under heavy load. When you first
attempt to horizontally scale a relational database, making it
distributed, you must now account for distributed
transactions, where the transaction isn’t simply operating
inside a single table or a single database, but is spread across
multiple systems. In order to continue to honor the ACID properties of
transactions, you now need a transaction manager to orchestrate across
the multiple nodes.

In order to account for successful completion across multiple
hosts, the idea of a two-phase commit (sometimes referred to as “2PC”)
is introduced. But then, because two-phase commit locks all associated
resources, it is useful only for operations that can complete very
quickly. Although it may often be the case that your distributed
operations can complete in sub-second time, it is certainly not always
the case. Some use cases require coordination between multiple hosts
that you may not control yourself. Operations coordinating several
different but related activities can take hours to update.

Two-phase commit blocks; that is, clients
(“competing consumers”) must wait for a prior transaction to finish
before they can access the blocked resource. The protocol will wait
for a node to respond, even if it has died. It’s possible to avoid
waiting forever in this event, because a timeout can be set that
allows the transaction coordinator node to decide that the node isn’t
going to respond and that it should abort the transaction. However, an
indefinite wait is still possible with 2PC; that’s because a node can
send a message to the transaction coordinator node agreeing that it’s
OK for the coordinator to commit the entire transaction. The node will
then wait for the coordinator to send a commit response (or a rollback
response if, say, a different node can’t commit); if the coordinator
is down in this scenario, that node conceivably will wait
forever.
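
The protocol described here can be sketched as a toy, in-memory simulation. This is not how any particular transaction manager is implemented; the Participant class and two_phase_commit function are hypothetical, and a dead node is modeled simply as a vote of None, standing in for a timeout.

```python
class Participant:
    """A hypothetical resource manager taking part in a two-phase commit."""

    def __init__(self, name, can_commit=True, alive=True):
        self.name = name
        self.can_commit = can_commit
        self.alive = alive
        self.state = "idle"

    def prepare(self):
        """Phase 1: vote. None stands in for a node that timed out."""
        if not self.alive:
            return None
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        """Phase 2, success path."""
        if self.alive:
            self.state = "committed"

    def rollback(self):
        """Phase 2, failure path."""
        if self.alive:
            self.state = "aborted"


def two_phase_commit(participants):
    """Commit only if every participant votes yes; otherwise roll everyone back."""
    votes = [p.prepare() for p in participants]
    if all(vote is True for vote in votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "aborted"


# The blocking case: a node votes yes, then the coordinator disappears.
stuck = Participant("stuck")
stuck.prepare()  # no commit or rollback ever arrives; state stays "prepared"
```

The stuck participant at the end is the scenario described above: having promised to commit, it cannot safely decide on its own, so it holds its locks until the coordinator returns.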

So in order to account for these shortcomings in two-phase
commit of distributed transactions, the database world turned to the
idea of compensation. Compensation, often used in
web services, means in simple terms that the operation is immediately
committed, and then in the event that some error is reported, a new
operation is invoked to restore proper state.

There are a few basic, well-known patterns for compensatory
action that architects frequently have to consider as an alternative
to two-phase commit. These include writing off the transaction if it
fails, that is, deciding to discard erroneous transactions and reconcile
later. Another alternative is to retry failed operations later on
notification. In a reservation system or a stock sales ticker, these are
not likely to meet your requirements. For other kinds of applications,
such as billing or ticketing applications, they can be acceptable.
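
As a rough sketch of compensation, the following Python function commits each step immediately and, if a later step fails, invokes the compensating operations for the steps already completed, in reverse order. The run_with_compensation helper and the order-processing steps are hypothetical and chosen only to illustrate the idea.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse."""
    completed = []
    for action, compensate in steps:
        try:
            action()  # takes effect immediately; there is no global lock
        except Exception:
            for undo in reversed(completed):
                undo()  # a new operation that restores proper state
            return False
        completed.append(compensate)
    return True


ledger = []


def fail_shipping():
    raise RuntimeError("carrier unavailable")


result = run_with_compensation([
    (lambda: ledger.append("charge card"), lambda: ledger.append("refund card")),
    (lambda: ledger.append("reserve stock"), lambda: ledger.append("release stock")),
    (fail_shipping, lambda: None),
])
```

Unlike a rollback, the intermediate states were visible to other readers while they lasted: the card really was charged before it was refunded. Whether that is acceptable depends on the application.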

Note

Gregor Hohpe, a Google architect, wrote a wonderful and
often-cited blog entry called “Starbucks Does Not Use Two-Phase
Commit.” It shows in real-world terms how difficult it is to scale
two-phase commit and highlights some of the alternatives that are
mentioned here. Check it out at http://www.eaipatterns.com/ramblings/18_starbucks.html.
It’s an easy, fun, and enlightening read.

The problems that 2PC introduces for application developers
include loss of availability and higher latency during partial
failures. Neither of these is desirable. So once you’ve had the good
fortune of being successful enough to necessitate scaling your
database past a single machine, you now have to figure out how to
handle transactions across multiple machines and still make the ACID
properties apply. Whether you have 10 or 100 or 1,000 database
machines, atomicity is still required in transactions as if you were
working on a single node. But it’s now a much, much bigger pill to
swallow.

Schema

One often-lauded feature of relational database systems is the
rich schemas they afford. You can represent your domain objects in a
relational model. A whole industry has sprung up around (expensive)
tools such as the CA ERWin Data Modeler to support this effort. In
order to create a properly normalized schema, however, you are forced
to create tables that don’t exist as business objects in your domain.
For example, a schema for a university database might require a
Student table and a Course table. But because of the “many-to-many”
relationship here (one student can take many courses at the same time,
and one course has many students at the same time), you have to create
a join table. This pollutes a pristine data model, where we’d prefer
to just have students and courses. It also forces us to create more
complex SQL statements to join these tables together. The join
statements, in turn, can be slow.
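
Here is a minimal sketch of the university example using Python's built-in sqlite3 module. The table and column names, the sample rows, and the courses_for helper are assumptions made for illustration. Notice that the enrollment table corresponds to nothing in the domain; it exists only to record the many-to-many relationship, and answering a simple question requires two joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (id INTEGER PRIMARY KEY, title TEXT);
    -- The join table: not a business object, only a record of who takes what.
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(id),
        course_id  INTEGER REFERENCES course(id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO student VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO course  VALUES (10, 'Databases'), (20, 'Networks');
    INSERT INTO enrollment VALUES (1, 10), (1, 20), (2, 10);
""")


def courses_for(name):
    """List the course titles a student takes, which requires two joins."""
    rows = conn.execute("""
        SELECT c.title
        FROM student s
        JOIN enrollment e ON e.student_id = s.id
        JOIN course c     ON c.id = e.course_id
        WHERE s.name = ?
        ORDER BY c.title
    """, (name,))
    return [title for (title,) in rows]
```

For example, courses_for("Ada") walks from student through enrollment to course to produce her two course titles.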

Again, in a system of modest size, this isn’t much of a problem.
But complex queries and multiple joins can become burdensomely slow
once you have a large number of rows in many tables to handle.

Finally, not all schemas map well to the relational model. One
type of system that has risen in popularity in the last decade is the
complex event processing system, which represents state changes in a
very fast stream. It’s often useful to contextualize events at runtime
against other events that might be related in order to infer some
conclusion to support business decision making. Although event streams
could be represented in terms of a relational database, it is an
uncomfortable stretch.

And if you’re an application developer, you’ll no doubt be
familiar with the many object-relational mapping (ORM) frameworks that
have sprung up in recent years to help ease the difficulty in mapping
application objects to a relational model. Again, for small systems,
ORM can be a relief. But it also introduces new problems of its own,
such as extended memory requirements, and it often pollutes the
application code with increasingly unwieldy mapping code. Here’s an
example of a Java method using Hibernate to “ease the burden” of having
to write the SQL code:

Is it certain that we’ve done anything but move the problem
here? Of course, with some systems, such as those that make extensive
use of document exchange, as with services or XML-based applications,
there are not always clear mappings to a relational database. This
exacerbates the problem.

Sharding and shared-nothing architecture

If you can’t split it, you can’t scale
it.

—Randy Shoup, Distinguished Architect,
eBay

Another way to attempt to scale a relational database is to
introduce sharding to your architecture. This has
been used to good effect at large websites such as eBay, which
supports billions of SQL queries a day, and in other Web 2.0
applications. The idea here is that you split the data so that instead
of hosting all of it on a single server or replicating all of the data on all of
the servers in a cluster, you divide up portions of the data
horizontally and host them each separately.

For example, consider a large customer table in a relational
database. The least disruptive thing (for the programming staff,
anyway) is to vertically scale by adding CPU, adding memory, and
getting faster hard drives, but if you continue to be successful and
add more customers, at some point (perhaps into the tens of millions
of rows), you’ll likely have to start thinking about how you can add
more machines. When you do so, do you just copy the data so that all
of the machines have it? Or do you instead divide up that single
customer table so that each database has only some of the records,
with their order preserved? Then, when clients execute queries, they
put load only on the machine that has the record they’re looking for,
with no load on the other machines.

It seems clear that in order to shard, you need to find a good
key by which to order your records. For example, you could divide your
customer records across 26 machines, one for each letter of the
alphabet, with each hosting only the records for customers whose last
names start with that particular letter. It’s likely this is not a
good strategy, however—there probably aren’t many last names that
begin with “Q” or “Z,” so those machines will sit idle while the “J,”
“M,” and “S” machines spike. You could instead shard according to some
other attribute, such as phone number, “member since” date, or the name of
the customer’s state. It all
depends on how your specific data is likely to be distributed.
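
The skew in the alphabetical scheme is easy to see in a few lines of Python. The list of last names below is invented for illustration and is far too small to be representative; the sketch only shows how one would measure the load per shard.

```python
from collections import Counter


def shard_by_initial(last_name):
    """Naive scheme from the text: one shard per first letter of the last name."""
    return last_name[0].upper()


# Hypothetical sample of customer last names.
names = ["Smith", "Sanchez", "Stone", "Miller", "Moore", "Jones", "Johnson", "Quinn"]

# Count how many records land on each shard.
load = Counter(shard_by_initial(name) for name in names)
```

Even in this tiny sample the "S" shard holds three records while "Z" holds none, which is the imbalance described above.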

There are three basic strategies for determining shard
structure:

Feature-based shard or functional segmentation

This is the approach taken by Randy Shoup, Distinguished
Architect at eBay, who in 2006 helped bring their architecture
into maturity to support many billions of queries per day. Using
this strategy, the data is split not by dividing records in a
single table (as in the customer example discussed earlier), but
rather by splitting into separate databases the features that
don’t overlap with each other very much. For example, at eBay,
the users are in one shard, and the items for sale are in
another. At Flixster, movie ratings are in one shard and
comments are in another. This approach depends on understanding
your domain so that you can segment data cleanly.

Key-based sharding

In this approach, you find a key in your data that will
evenly distribute it across shards. So instead of simply storing
one letter of the alphabet for each server as in the (naive and
improper) earlier example, you use a one-way hash on a key data
element and distribute data across machines according to the
hash. It is common in this strategy to find time-based or
numeric keys to hash on.

Lookup table

In this approach, one of the nodes in the cluster acts as
a “yellow pages” directory and looks up which node has the data
you’re trying to access. This has two obvious disadvantages. The
first is that you’ll take a performance hit every time you have
to go through the lookup table as an additional hop. The second
is that the lookup table becomes not only a bottleneck but also a
single point of failure.
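
Key-based sharding, the second strategy above, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the scheme any particular product uses: the shard_for function is hypothetical, SHA-256 is simply a convenient stable hash from the standard library, and the shard is chosen by taking the hash modulo the number of shards.

```python
import hashlib
from collections import Counter


def shard_for(key, num_shards):
    """Map a key to a shard index using a stable one-way hash."""
    # Python's built-in hash() is salted per process for strings, so it would
    # send the same key to different shards on different runs; use hashlib.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical customer keys, spread across four shards.
distribution = Counter(shard_for(f"customer-{i}", 4) for i in range(10000))
```

With a hash, sequential or alphabetically clustered keys spread roughly evenly across the shards. One limitation of the simple modulo approach is that changing the number of shards reassigns most keys, which is part of why maintaining shards by hand is painful.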

Note

Sharding can minimize contention depending on your strategy and
allows you not just to scale horizontally, but then to scale more
precisely, as you can add power to the particular shards that need
it.

Sharding could be termed a kind of “shared-nothing” architecture
that’s specific to databases. A shared-nothing
architecture is one in which there is no centralized (shared) state,
but each node in a distributed system is independent, so there is no
client contention for shared resources. The term was coined by
Michael Stonebraker at the University of California at Berkeley in
his 1986 paper “The Case for Shared Nothing.”

Shared Nothing was more recently popularized by Google, which
has written systems such as its Bigtable database and its MapReduce
implementation that do not share state, and are therefore capable of
near-infinite scaling. The Cassandra database is a shared-nothing
architecture, as it has no central controller and no notion of
master/slave; all of its nodes are the same.

Note

You can read the 1986 paper “The Case for Shared Nothing”
online at http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf.
It’s only a few pages. If you take a look, you’ll see that many of
the features of shared-nothing distributed data architecture, such
as ease of high availability and the ability to scale to a very
large number of machines, are the very things that Cassandra excels
at.

MongoDB also provides auto-sharding capabilities to manage
failover and node balancing. That many nonrelational databases offer
this automatically and out of the box is very handy; creating and
maintaining custom data shards by hand is a wicked proposition. It’s
good to understand sharding in terms of data architecture in general,
but especially in terms of Cassandra, as it can take
an approach similar to key-based sharding to distribute data across
nodes, but does so automatically.

Summary

In summary, relational databases are very good at solving
certain data storage problems, but because of their focus, they also
can create problems of their own when it’s time to scale. Then, you
often need to find a way to get rid of your joins, which means
denormalizing the data, which means maintaining multiple copies of
data and seriously disrupting your design, both in the database and in
your application. Further, you almost certainly need to find a way
around distributed transactions, which will quickly become a
bottleneck. These compensatory actions are not directly supported in
any but the most expensive RDBMS. And even if you can write such a
huge check, you still need to carefully choose partitioning keys to
the point where you can never entirely ignore the
limitation.

Perhaps more importantly, as we see some of the limitations of
RDBMS and consequently some of the strategies that architects have
used to mitigate their scaling issues, a picture slowly starts to
emerge. It’s a picture that makes some NoSQL solutions seem perhaps
less radical and less scary than we may have thought at first, and
more like a natural expression and encapsulation of some of the work
that was already being done to manage very large databases.

Web Scale

An invention has to make sense in the world in which it is
finished, not the world in which it is started.

—Ray Kurzweil

Because of some of the inherent design decisions in RDBMS, it is
not always as easy to scale as some other, more recent possibilities
that take the structure of the Web into consideration. But it’s not only
the structure of the Web we need to consider, but also its phenomenal
growth, because as more and more data becomes available, we need
architectures that allow our organizations to take advantage of this
data in near real time to support decision making and to offer new and
more powerful features and capabilities to our customers.

Note

It has been said, though it is hard to verify, that the
17th-century English poet John Milton had actually read every
published book on the face of the earth. Milton knew many languages
(he was even learning Navajo at the time of his death), and given that
the total number of published books at that time was in the thousands,
this would have been possible. The size of the world’s data stores
has grown somewhat since then.

In 2006, the amount of data on the Internet was approximately
166 exabytes (166EB). In 2010, that number reached nearly 1,000
exabytes. An exabyte is one quintillion bytes, or one million
terabytes. To put this statistic in perspective, 1EB is roughly the
equivalent of 50,000 years of DVD-quality video. 166EB is
approximately three million times the amount of information
contained in all the books ever written.

Wal-Mart’s database of customer transactions is reputed to
have stored 110 terabytes in 2000, recording tens of millions of
transactions per day. By 2004, it had grown to half a
petabyte.

The movie Avatar required 1PB storage
space, or the equivalent of a single MP3 song—if that MP3 were 32
years long (source: http://bit.ly/736XCz).

As of May 2010, Google was provisioning 100,000 Android phones
every day, all of which have Internet access as a foundational
service.

In 1998, the number of email accounts was approximately 253
million. By 2010, that number was closer to 2 billion.

As you can see, there is great variety to the kinds of data that
need to be stored, processed, and queried, and some variety to the
businesses that use such data. Consider not only customer data at
familiar retailers or suppliers, and not only digital video content, but
also the required move to digital television and the explosive growth of
email, messaging, mobile phones, RFID, Voice Over IP (VoIP) usage, and
more. We now have Blu-ray players that stream movies and music. As we
begin departing from physical consumer media storage, the companies that
provide that content—and the third-party value-add businesses built
around them—will require very scalable data solutions. Consider too that
as a typical business application developer or database administrator,
we may be used to thinking of relational databases as the center of our
universe. You might then be surprised to learn that within corporations,
around 80% of data is unstructured.

Or perhaps you think the kind of scale afforded by NoSQL solutions
such as Cassandra doesn’t apply to you. And maybe it doesn’t. It’s very
possible that you simply don’t have a problem that Cassandra can help
you with. But I’m not asking you to envision your database and its data
as they exist today and figure out ways to migrate to Cassandra. That
would be a very difficult exercise, with a payoff that might be hard to
see. It’s almost analytic that the database you have today is exactly
the right one for your application of today. But if you could
incorporate a wider array of rich data sets to help improve your
applications, what kinds of qualities would you then be looking for in a
database? The question becomes what kind of application would you want
to have if durability, elastic scalability, vast storage, and
blazing-fast writes weren’t a problem?

In a world now working at web scale and looking to the future,
Apache Cassandra might be one part of the answer.

The Cassandra Elevator Pitch

Hollywood screenwriters and software startups are often advised to
have their “elevator pitch” ready. This is a summary of exactly what their
product is all about—concise, clear, and brief enough to deliver in just a
minute or two, in the lucky event that they find themselves sharing an
elevator with an executive or agent or investor who might consider funding
their project. Cassandra has a compelling story, so let’s boil it down to
an elevator pitch that you can present to your manager or colleagues
should the occasion arise.

Cassandra in 50 Words or Less

“Apache Cassandra is an open source, distributed, decentralized,
elastically scalable, highly available, fault-tolerant, tuneably
consistent, column-oriented database that bases its distribution design
on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at
Facebook, it is now used at some of the most popular sites on the Web.”
That’s exactly 50 words.

Of course, if you were to recite that to your boss in the
elevator, you’d probably get a blank look in return. So let’s break down
the key points in the following sections.

Distributed and Decentralized

Cassandra is distributed, which means that it
is capable of running on multiple machines
while appearing to users as a unified whole. In fact, there is
little point in running a single Cassandra node. Although you can do it,
and that’s acceptable for getting up to speed on how it works, you’ll
quickly find that you need multiple machines to realize any real
benefit from running Cassandra. Much of its design and code base is
specifically engineered toward not only making it work across many
different machines, but also for optimizing performance across multiple
data center racks, and even for a single Cassandra cluster running
across geographically dispersed data centers. You can confidently write
data to anywhere in the cluster and Cassandra will get it.

Once you start to scale many other data stores (MySQL, Bigtable),
some nodes need to be set up as masters in order to organize other
nodes, which are set up as slaves. Cassandra, however, is decentralized,
meaning that every node is identical; no Cassandra node performs certain
organizing operations distinct from any other node. Instead, Cassandra features a peer-to-peer
protocol and uses gossip to maintain and keep in sync a list of nodes
that are alive or dead.
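The idea of gossip can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's actual protocol: each node keeps a view mapping node names to heartbeat counters, and in every round it bumps its own counter and merges views with one randomly chosen peer. The function names and the six node names are hypothetical.

```python
import random

def gossip_round(views, rng):
    """Each node exchanges its view with one random peer; both keep the newest heartbeats."""
    names = sorted(views)
    for name in names:
        views[name][name] += 1  # the node bumps its own heartbeat
        peer = rng.choice([n for n in names if n != name])
        merged = dict(views[peer])
        for node, beat in views[name].items():
            merged[node] = max(beat, merged.get(node, 0))
        views[name] = dict(merged)
        views[peer] = dict(merged)

def run_until_converged(node_names, seed=0, max_rounds=1000):
    """Gossip until every node has heard of every other node."""
    rng = random.Random(seed)
    # Initially, each node knows only about itself.
    views = {name: {name: 0} for name in node_names}
    for round_no in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(set(view) == set(node_names) for view in views.values()):
            return round_no, views
    return None, views

rounds, views = run_until_converged(["n1", "n2", "n3", "n4", "n5", "n6"])
print("rounds needed:", rounds)
```

No node is special here: membership information spreads without a coordinator. In a scheme like this, a node whose heartbeat stops advancing in everyone's view can be suspected dead, although the failure-detection step is left out of the sketch.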

The fact that Cassandra is decentralized
means that there is no single point of failure. All of the nodes in a
Cassandra cluster function exactly the same. This is sometimes referred
to as “server symmetry.” Because they are all doing the same thing, by
definition there can’t be a special host that is coordinating
activities, as with the master/slave setup that you see in MySQL,
Bigtable, and so many others.

In many distributed data solutions (such as RDBMS clusters), you
set up multiple copies of data on different servers in a process called
replication, which copies the data to multiple machines so that they can
all serve simultaneous requests and improve performance. Typically this
process is not decentralized, as in Cassandra, but is rather performed
by defining a master/slave relationship. That is,
all of the servers in this kind of cluster don’t function in the same
way. You configure your cluster by designating one server as the master
and others as slaves. The master acts as the authoritative source of the data, and operates in a
unidirectional relationship with the slave nodes, which must synchronize
their copies. If the master node fails, the whole database is in
jeopardy. The decentralized design is therefore one of the keys to
Cassandra’s high availability.
Note that while we frequently understand master/slave replication in the
RDBMS world, there are NoSQL databases such as MongoDB that follow the
master/slave scheme as well.

Decentralization, therefore, has two key advantages: it’s simpler
to use than master/slave, and it helps you avoid outages. It can be
easier to operate and maintain a decentralized store than a master/slave
store because all nodes are the same. That means that you don’t need any
special knowledge to scale; setting up 50 nodes isn’t much different
from setting up one. There’s next to no configuration required to
support it. Moreover, in a master/slave setup, the master can become a
single point of failure (SPOF). To avoid this, you often need to add
some complexity to the environment in the form of multiple masters.
Because all of the replicas in Cassandra are identical, failures of a
node won’t disrupt service.

In short, because Cassandra is distributed and decentralized,
there is no single point of failure, which supports high
availability.

Elastic Scalability

Scalability is an architectural feature of a system that can
continue serving a greater number of requests with little degradation in
performance. Vertical scaling—simply adding more hardware capacity and
memory to your existing machine—is the easiest way to achieve this.
Horizontal scaling means adding more machines that have all or some of
the data on them so that no one machine has to bear the entire burden of
serving requests. But then the software itself must have an internal
mechanism for keeping its data in sync with the other nodes in the
cluster.

Elastic scalability refers to a special
property of horizontal scalability. It means that your cluster can seamlessly scale up and
scale back down. To do this, the cluster must be able to accept new
nodes that can begin participating by getting a copy of some or all of
the data and start serving new user requests without major disruption or
reconfiguration of the entire cluster. You don’t have to restart your
process. You don’t have to change your application queries. You don’t
have to manually rebalance the data yourself. Just add another
machine—Cassandra will find it and start sending it work.

Scaling down, of course, means removing some of the processing
capacity from your cluster. You might have to do this if you move parts
of your application to another platform, or if your application loses
users and you need to start selling off hardware. Let’s hope that
doesn’t happen. But if it does, you won’t need to upset the entire apple
cart to scale back.
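One technique commonly used to make this kind of growth cheap is consistent hashing, in which nodes and keys are placed on the same hash ring and each key belongs to the next node clockwise. The sketch below is an illustration only; this section does not describe Cassandra's own mechanism, and the `Ring` class, `token` function, and node names are hypothetical.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

class Ring:
    def __init__(self, nodes):
        self._tokens = sorted((token(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self._tokens, (token(node), node))

    def owner(self, key):
        # First node at or after the key's position, wrapping around the ring.
        i = bisect.bisect_right(self._tokens, (token(key), ""))
        return self._tokens[i % len(self._tokens)][1]

ring = Ring(["node-a", "node-b", "node-c"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}
ring.add("node-d")  # scale up by one machine
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved, all to node-d")
```

The point of the design is that adding a node only takes over one arc of the ring: every key that moves goes to the new node, and keys owned by the other nodes stay where they are. Compare this with hash-modulo placement, where changing the node count reshuffles most keys.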

High Availability and Fault Tolerance

In general architecture terms, the availability of a system is
measured according to its ability to fulfill requests. But computers can
experience all manner of failure, from hardware component failure to
network disruption to corruption. Any computer is susceptible to these
kinds of failure. There are of course very sophisticated (and often
prohibitively expensive) computers that can themselves mitigate many of
these circumstances, as they include internal hardware redundancies and
facilities to send notification of
failure events and hot swap components. But anyone can accidentally
break an Ethernet cable, and catastrophic events can beset a single data
center. So for a system to be highly available, it must typically
include multiple networked computers, and the software they’re running must then be
capable of operating in a cluster and have some facility for recognizing
node failures and failing over requests to another part of the
system.

Cassandra is highly available. You can replace failed nodes in the
cluster with no downtime, and you can replicate data to multiple data
centers to offer improved local performance and prevent downtime if one
data center experiences a catastrophe such as fire or flood.

Tuneable Consistency

Consistency essentially means that a read
always returns the most recently written value. Consider two customers who are
attempting to put the same item into their shopping carts on an
ecommerce site. If I place the last item in stock into my cart an
instant after you do, you should get the item added to your cart, and I
should be informed that the item is no longer available for purchase.
This is guaranteed to happen when the state of a write is consistent
among all nodes that have that data.

But there’s no free lunch, and as we’ll see later, scaling data
stores means making certain trade-offs between data consistency, node
availability, and partition tolerance. Cassandra is frequently called
“eventually consistent,” which is a bit misleading. Out of the box,
Cassandra trades some consistency in order to achieve total
availability. But Cassandra is more accurately termed “tuneably
consistent,” which means it allows you
to easily decide the level of consistency you require, in
balance with the level of availability.

Let’s take a moment to unpack this, as the term “eventual
consistency” has caused some uproar in the industry. Some practitioners
hesitate to use a system that is described as “eventually
consistent.”

For detractors of eventual consistency, the broad argument goes
something like this: eventual consistency is maybe OK for social web
applications where data doesn’t really matter.
After all, you’re just posting to mom what little Billy ate for
breakfast, and if it gets lost, it doesn’t really matter. But the data
I have is actually really important, and it’s ridiculous to think
that I could allow eventual consistency in my model.

Set aside the fact that all of the most popular web applications
(Amazon, Facebook, Google, Twitter) are using this model, and that
perhaps there’s something to it. Presumably such data is very important
indeed to the companies running these applications, because that data is their
primary product, and they are multibillion-dollar companies with
billions of users to satisfy in a sharply competitive world. It may be
possible to gain guaranteed, immediate, and perfect consistency
throughout a highly trafficked system running in parallel on a variety
of networks, but if you want clients to get their results sometime this
year, it’s a very tricky proposition.

The detractors claim that some Big Data databases such as
Cassandra have merely eventual consistency, and that all other
distributed systems have strict consistency. As
with so many things in the world, however, the reality is not so black
and white, and the binary opposition between consistent and
not-consistent is not truly reflected in practice. There are instead
degrees of consistency, and in the real world they
are very susceptible to external circumstance.

Eventual consistency is one of several consistency models
available to architects. Let’s take a look at these models so we can
understand the trade-offs:

Strict consistency

This is sometimes called sequential consistency, and is the
most stringent level of consistency. It requires that any read
will always return the most recently written value. That sounds
perfect, and it’s exactly what I’m looking for. I’ll take it!
However, upon closer examination, what do we find? What precisely
is meant by “most recently written”? Most recently to whom? In one
single-processor machine, this is no problem to observe, as the
sequence of operations is known to the one clock. But in a system
executing across a variety of geographically dispersed data
centers, it becomes much more slippery. Achieving this implies
some sort of global clock that is capable of timestamping all
operations, regardless of the location of the data or the user
requesting it or how many (possibly disparate) services are
required to determine the response.

Causal consistency

This is a slightly weaker form of strict consistency. It
does away with the fantasy of the single global clock that can
magically synchronize all operations without creating an
unbearable bottleneck. Instead of relying on timestamps, causal
consistency takes a more semantic approach, attempting to
determine the cause of events to create some consistency in their
order. It means that writes that are potentially related must be
read in sequence. If two different, unrelated operations suddenly
write to the same field, then those writes are inferred not to be
causally related. But if one write occurs after another, we might
infer that they are causally
related. Causal consistency dictates that causal writes must be
read in sequence.

Weak (eventual) consistency

Eventual consistency means on the surface that all updates
will propagate throughout all of the replicas in a distributed
system, but that this may take some time. Eventually, all replicas
will be consistent.

Eventual consistency becomes suddenly very attractive when you
consider what is required to achieve stronger forms of
consistency.

When considering consistency, availability, and partition
tolerance, we can achieve only two of these goals in a given distributed
system (we explore the CAP Theorem in the section Brewer’s CAP Theorem). At the center of the problem is data update
replication. To achieve a strict consistency, all update operations will
be performed synchronously, meaning that they must block, locking all
replicas until the operation is complete, and forcing competing clients
to wait. A side effect of such a design is that during a failure, some
of the data will be entirely unavailable. As Amazon CTO Werner Vogels puts it, “rather
than dealing with the uncertainty of the correctness of an answer, the
data is made unavailable until it is absolutely certain that it is
correct” (“Dynamo: Amazon’s Highly Available Key-Value Store”: [http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html],
207).

We could alternatively take an optimistic approach to replication,
propagating updates to all replicas in the background in order to avoid
blowing up on the client. The difficulty this approach presents is that
now we are forced into the situation of detecting and resolving
conflicts. A design approach must decide whether to resolve these
conflicts at one of two possible times: during reads or during writes.
That is, a distributed database designer must choose to make the system
either always readable or always writable.

Dynamo and Cassandra choose to be always writable, opting to defer
the complexity of reconciliation to read operations, and realize
tremendous performance gains. The alternative is to reject updates
amidst network and server failures.
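The "always writable, reconcile on read" choice can be sketched as follows. The reconciliation policy used here, keeping the value with the newest timestamp, is an assumption made for the sketch, as are the `Versioned` type and the function names; real systems may use other policies.

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    timestamp: int  # hypothetical logical write time

def write(replicas, reachable, key, value, timestamp):
    """Accept the write on whichever replicas are reachable; never reject it."""
    for index in reachable:
        replicas[index][key] = Versioned(value, timestamp)

def read_with_repair(replicas, key):
    """Return the newest value seen and copy it to any stale replica."""
    found = [replica[key] for replica in replicas if key in replica]
    if not found:
        return None
    newest = max(found, key=lambda v: v.timestamp)
    for replica in replicas:
        if key not in replica or replica[key].timestamp < newest.timestamp:
            replica[key] = newest
    return newest.value

replicas = [{}, {}, {}]
write(replicas, [0, 1, 2], "cart", "empty", timestamp=1)
write(replicas, [0], "cart", "1 item", timestamp=2)  # replicas 1 and 2 unreachable
print(read_with_repair(replicas, "cart"))
```

The second write succeeds even though two replicas missed it. The cost is paid on the read, which has to compare versions and bring the stale replicas up to date.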

In Cassandra, consistency is not an all-or-nothing proposition, so
we might more accurately term it “tuneable consistency” because the
client can control the number of replicas to block on for all updates.
This is done by setting the consistency level against the replication
factor.

The replication factor lets you decide how
much you want to pay in performance to gain more consistency. You set
the replication factor to the number of nodes in the cluster you want
the updates to propagate to (remember that an update means any add, update, or delete
operation).

The consistency level is a setting that
clients must specify on every operation and that allows you to decide
how many replicas in the cluster must acknowledge a write operation or
respond to a read operation in order to be considered successful. That’s
the part where Cassandra has pushed the decision for determining
consistency out to the client.

So if you like, you could set the consistency level to a number
equal to the replication factor, and gain stronger consistency at the
cost of synchronous blocking operations that wait for all nodes to be
updated and declare success before returning. This is not often done in
practice with Cassandra, however, for reasons that should be clear (it
defeats the availability goal, would impact performance, and generally
goes against the grain of why you’d want to use Cassandra in the first
place). Conversely, if the client sets the
consistency level to a value less than the replication factor,
the update is considered successful as long as that many replicas
acknowledge it, even if some nodes are down.
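The interplay between the two settings comes down to simple arithmetic, sketched here in Python. The function `write_succeeds` and its parameters are hypothetical names for illustration, and `nodes_down` counts only the replicas of the data in question that are unavailable.

```python
def write_succeeds(replication_factor, consistency_level, nodes_down):
    """A write succeeds when enough live replicas can acknowledge it."""
    if not 1 <= consistency_level <= replication_factor:
        raise ValueError("consistency level must be between 1 and the replication factor")
    live_replicas = replication_factor - nodes_down
    return live_replicas >= consistency_level

# Replication factor 3, with one replica down:
print(write_succeeds(3, 3, nodes_down=1))  # waiting for all replicas fails
print(write_succeeds(3, 2, nodes_down=1))  # waiting for two of three succeeds
print(write_succeeds(3, 1, nodes_down=2))  # waiting for one survives two failures
```

Raising the consistency level toward the replication factor buys stronger consistency but makes each operation more sensitive to node failures; lowering it does the opposite.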

Brewer’s CAP Theorem

In order to understand Cassandra’s design and its label as an
“eventually consistent” database, we need to understand the CAP theorem.
The CAP theorem is sometimes called Brewer’s theorem after its author,
Eric Brewer.

While working at University of California at Berkeley, Eric Brewer
posited his CAP theorem in 2000 at the ACM Symposium on the Principles
of Distributed Computing. The theorem states that within a large-scale
distributed data system, there are three requirements that have a
relationship of sliding dependency: Consistency, Availability, and
Partition Tolerance.

Consistency

All database clients will read the same value for the same
query, even given concurrent updates.

Availability

All database clients will always be able to read and write
data.

Partition Tolerance

The database can be split into multiple machines; it can
continue functioning in the face of network segmentation
breaks.

Brewer’s theorem is that in any given system, you can strongly
support only two of the three. This is analogous to the saying you may
have heard in software development: “You can have it good, you can have
it fast, you can have it cheap: pick two.”

We have to choose between them because of this sliding mutual
dependency. The more consistency you demand from your system, for
example, the less partition-tolerant you’re likely to be able to make
it, unless you make some concessions around availability.

The CAP theorem was formally proved to be true by Seth Gilbert and
Nancy Lynch of MIT in 2002. In distributed systems, however, it is very
likely that you will have network partitioning, and that at some point,
machines will fail and cause others to become unreachable. Packet loss,
too, is nearly inevitable. This leads us to the conclusion that a
distributed system must do its best to continue operating in the face of
network partitions (to be Partition-Tolerant), leaving us with only two
real options to choose from: Availability and Consistency.

Figure 1-1 illustrates visually that there
is no overlapping segment where all three are obtainable.

Figure 1-1. CAP Theorem indicates that you can realize only two of these
properties at once

It might prove useful at this point to see a graphical depiction
of where each of the nonrelational
data stores we’ll look at falls within the CAP spectrum. The graphic in
Figure 1-2 was inspired by a slide in a 2009
talk given by Dwight Merriman, CEO and founder of MongoDB, to the MySQL
User Group in New York City (you can watch it online at http://bit.ly/7r6kRg). However, I have modified the
placement of some systems based on my research.

Figure 1-2 shows the general focus of
some of the different databases we discuss in this chapter. Note that
placement of the databases in this chart could change based on
configuration. As Stu Hood points out, a distributed MySQL database can
count as a consistent system only if you’re using Google’s synchronous
replication patches; otherwise, it can only be Available and
Partition-Tolerant (AP).

It’s interesting to note that the design of the system around CAP
placement is independent of the orientation of the data storage
mechanism; for example, the CP edge is populated by graph databases and
document-oriented databases alike.

Figure 1-2. Where different databases appear on the CAP continuum

In this depiction, relational databases are on the line between
Consistency and Availability, which means that they can fail in the
event of a network failure (including a cable breaking). This is
typically achieved by defining a single master server, which could
itself go down, or an array of servers that simply don’t have sufficient
mechanisms built in to continue functioning in the case of network
partitions.

Graph databases such as Neo4J and the set of databases derived at
least in part from the design of Google’s Bigtable database (such as
MongoDB, HBase, Hypertable, and Redis) all are focused slightly less on
Availability and more on ensuring Consistency and Partition
Tolerance.

Note

If you’re interested in the properties of other Big Data or
NoSQL databases, see this book’s Appendix A.

Finally, the databases derived from Amazon’s Dynamo design include
Cassandra, Project Voldemort, CouchDB, and Riak. These are more focused
on Availability and Partition-Tolerance. However, this does not mean
that they dismiss Consistency as unimportant, any more than Bigtable
dismisses Availability. According to the Bigtable paper, the average
percentage of server hours that “some data” was unavailable is 0.0047%
(section 4), so this is relative, as we’re talking about very robust
systems already. If you think of each of these letters (C, A, P) as
knobs you can tune to arrive at the system you want, Dynamo derivatives
are intended for employment in the many use cases where “eventual
consistency” is tolerable and where “eventual” is a matter of
milliseconds, read repairs mean that reads will return consistent
values, and you can achieve strong consistency if you want to.

So what does it mean in practical terms to support only two of the
three facets of CAP?

CA

To primarily support Consistency and Availability means that
you’re likely using two-phase commit for distributed transactions.
It means that the system will block when a network partition
occurs, so it may be that your system is limited to a single data
center cluster in an attempt to mitigate this. If your application
needs only this level of scale, this is easy to manage and allows
you to rely on familiar, simple structures.

CP

To primarily support Consistency and Partition Tolerance,
you may try to advance your
architecture by setting up data shards in order to scale. Your
data will be consistent, but you still run the risk of some data
becoming unavailable if nodes fail.

AP

To primarily support Availability and Partition Tolerance,
your system may return inaccurate data, but the system will always
be available, even in the face of network partitioning. DNS is
perhaps the most popular example of a system that is massively
scalable, highly available, and partition-tolerant.
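The difference between the CP and AP choices shows up only when a partition occurs. The toy model below, with two replicas and a hypothetical `write_during_partition` function, is a deliberately simplified sketch of that trade-off and not a description of any real database.

```python
class Replica:
    def __init__(self):
        self.value = None

def write_during_partition(mode, local, remote, value, partitioned):
    """Return True if the write is accepted."""
    if not partitioned:
        local.value = remote.value = value  # both copies updated together
        return True
    if mode == "CP":
        return False  # refuse the write rather than let the copies diverge
    if mode == "AP":
        local.value = value  # accept locally; the remote copy is now stale
        return True
    raise ValueError(f"unknown mode: {mode}")

cp_local, cp_remote = Replica(), Replica()
cp_ok = write_during_partition("CP", cp_local, cp_remote, "v1", partitioned=True)

ap_local, ap_remote = Replica(), Replica()
ap_ok = write_during_partition("AP", ap_local, ap_remote, "v1", partitioned=True)

print("CP accepted:", cp_ok, "copies agree:", cp_local.value == cp_remote.value)
print("AP accepted:", ap_ok, "copies agree:", ap_local.value == ap_remote.value)
```

The CP system stays consistent by becoming unavailable for writes, while the AP system stays available and must reconcile the diverged copies once the partition heals.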

Note

Note that this depiction is intended to offer an overview that
helps draw distinctions between the broader contours in these systems;
it is not strictly precise. For example, it’s not entirely clear where
Google’s Bigtable should be
placed on such a continuum. The Google paper describes Bigtable as
“highly available,” but later goes on to say that if Chubby (the
Bigtable persistent lock service) “becomes unavailable for an extended
period of time [caused by Chubby outages or network issues], Bigtable
becomes unavailable” (section 4). On the matter of data reads, the
paper says that “we do not consider the possibility of multiple copies
of the same data, possibly in alternate forms due to views or
indices.” Finally, the paper indicates that “centralized control and
Byzantine fault tolerance are not Bigtable goals” (section 10). Given
such variable information, you can see that determining where a
database falls on this sliding scale is not an exact
science.

Row-Oriented

Cassandra is frequently referred to as a “column-oriented”
database, which is not incorrect. It’s not relational, and it does
represent its data structures in sparse multidimensional hash tables. “Sparse”
means that for any given row you can have one or more columns, but each
row doesn’t need to have all the same columns as other rows like it (as
in a relational model). Each row has a unique key, which makes its data
accessible. So although it’s not wrong to say that Cassandra is columnar
or column-oriented, it might be more helpful to think of it as an
indexed, row-oriented store, as we examine more thoroughly in Chapter 3. I list the data orientation as a feature,
because there are several data models that are easy to visualize and use
in a nonrelational model; it’s a weird mixture of laziness and possibly
inviting far more work than necessary to just assume that the relational
model is always best, regardless of your application.

Cassandra stores data in what can be thought of for now as a
multidimensional hash table. That means you don’t have to decide ahead
of time precisely what your data structure must look like, or what
fields your records will need. This can be useful if you’re in startup
mode and are adding or changing features with some frequency. It is also
attractive if you need to support an Agile development methodology and
aren’t free to take months for up-front analysis. If your business
changes and you later need to add or remove fields on the fly
without disrupting service, go ahead; Cassandra lets you.
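A sparse, multidimensional hash table can be pictured with nested Python dictionaries. This is only a mental model: the `users` data, its column names, and the `get_column` helper are made up for illustration and are not Cassandra's API.

```python
# Row key -> {column name -> value}. Rows need not share the same columns.
users = {
    "jsmith": {"email": "jsmith@example.com", "phone": "555-0100"},
    "bjones": {"email": "bjones@example.com"},
}

# Adding a new column to one row requires no schema change.
users["bjones"]["twitter"] = "@bjones"

def get_column(column_family, row_key, column, default=None):
    """Look up one column of one row; absent columns simply return the default."""
    return column_family.get(row_key, {}).get(column, default)

print(get_column(users, "bjones", "twitter"))
print(get_column(users, "jsmith", "twitter"))  # this row never defined that column
```

Each row is reached through its unique key, and a column that a row doesn't use takes up no space in that row, which is what "sparse" means here.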

That’s not to say that you don’t have to think about your data,
though. On the contrary, Cassandra requires a shift in how you think
about it. Instead of designing a pristine data model and then designing
queries around the model as in RDBMS, you are free to think of your
queries first, and then provide the data that answers them.

Schema-Free

Cassandra requires you to define an outer container, called a
keyspace, that contains column families. The keyspace is essentially
just a logical namespace to hold column families and certain
configuration properties. The column families are names for associated
data and a sort order. Beyond that, the data tables are sparse, so you
can just start adding data to it, using the columns that you want;
there’s no need to define your columns ahead of time. Instead of
modeling data up front using expensive data modeling tools and then
writing queries with complex join statements, Cassandra asks you to
model the queries you want, and then provide the data around
them.

High Performance

Cassandra was designed specifically from the ground up to take
full advantage of multiprocessor/multicore machines, and to
run across many dozens of these machines housed in multiple data
centers. It scales consistently and seamlessly to hundreds of terabytes.
Cassandra has been shown to perform exceptionally well under heavy load.
It can consistently show very fast throughput for writes per second on a
basic commodity workstation. As you add more servers, you can maintain
all of Cassandra’s desirable properties without sacrificing
performance.

Where Did Cassandra Come From?

The Cassandra data store is an open source Apache project available
at http://cassandra.apache.org. Cassandra originated
at Facebook in 2007 to solve that company’s inbox search problem, in which
they had to deal with large volumes of data in a way that was difficult to
scale with traditional methods. Specifically, the team had requirements to
handle huge volumes of data in the form of message copies, reverse indices
of messages, and many random reads and many simultaneous random
writes.

The team was led by Jeff Hammerbacher, with Avinash Lakshman,
Karthik Ranganathan, and Prashant Malik, a Facebook engineer on the
Search Team, as key engineers. The code was
released as an open source Google Code project in July 2008. During its
tenure as a Google Code project in 2008, the code was updateable only by
Facebook engineers, and little community was built around it as a result.
So in March 2009 it was moved to an Apache Incubator project, and on
February 17, 2010 it was voted into a top-level project.

Cassandra today presents a kind of paradox: it feels new and
radical, and yet it’s solidly rooted in many standard, traditional
computer science concepts and maxims that successful predecessors have
already institutionalized. Cassandra is a realist’s kind of database; it doesn’t depart from the
relational model to be a fun art project or experiment for smart
developers. It was created specifically to solve a real-world problem that
existing tools weren’t able to solve. It acknowledges the limitations of
prior methods and faces our new world of big data head-on.

How Did Cassandra Get Its Name?

I’m a little surprised how often people ask me where the database
got its name. It’s not the first thing I think of when I hear about a
project. But it is interesting, and in the case of this database, it’s
felicitously meaningful.

In Greek mythology, Cassandra was the daughter of King Priam and
Queen Hecuba of Troy. Cassandra was so beautiful that the god Apollo
gave her the ability to see the future. But when she refused his amorous
advances, he cursed her such that she would still be able to accurately
predict everything that would happen—but no one would believe her.
Cassandra foresaw the destruction of her city of Troy, but was powerless
to stop it. The Cassandra distributed database is named for her. I
speculate that it is also named as a kind of joke on the Oracle at
Delphi, another seer for whom a database is named.

Use Cases for Cassandra

We have now unpacked the elevator pitch and have an understanding of
Cassandra’s advantages. Despite Cassandra’s sophisticated design and smart
features, it is not the right tool for every job. So in this section let’s
take a quick look at what kind of projects Cassandra is a good fit
for.

Large Deployments

You probably don’t drive a semi truck to pick up your dry
cleaning; semis aren’t well suited for that sort of task. Lots of
careful engineering has gone into Cassandra’s high availability,
tuneable consistency, peer-to-peer protocol, and seamless scaling, which
are its main selling points. None of these qualities is even meaningful
in a single-node deployment, let alone able to realize its full
potential.

There are, however, a wide variety of situations where a
single-node relational database is all we may need. So do some
measuring. Consider your expected traffic, throughput needs, and SLAs.
There are no hard and fast rules here, but if you expect that you can
reliably serve traffic with an acceptable level of performance with just
a few relational databases, it might be a better choice to do so, simply
because RDBMSs are easier to run on a single machine and are more
familiar.

If you think you’ll need at least several nodes to support your
efforts, however, Cassandra might be a good fit. If your application is
expected to require dozens of nodes, Cassandra might be a great
fit.

Lots of Writes, Statistics, and Analysis

Consider your application from the perspective of the ratio of
reads to writes. Cassandra is optimized for excellent throughput on
writes.

Many of the early production deployments of Cassandra involve
storing user activity updates, social network usage,
recommendations/reviews, and application statistics. These are strong
use cases for Cassandra because they involve lots of writing with less
predictable read operations, and because updates can occur unevenly with
sudden spikes. In fact, the ability to handle application workloads that
require high performance at significant write volumes with many
concurrent client threads is one of the primary features of
Cassandra.

According to the project wiki, Cassandra has been used to create a
variety of applications, including a windowed time-series store, an
inverted index for document searching, and a distributed job priority
queue.

Geographical Distribution

Cassandra has out-of-the-box support for geographical distribution
of data. You can easily configure Cassandra to replicate data across
multiple data centers. If you have a globally deployed application that
could see a performance benefit from putting the data near the user,
Cassandra could be a great fit.

Evolving Applications

If your application is evolving rapidly and you’re in “startup
mode,” Cassandra might be a good fit given its schema-free data model.
This makes it easy to keep your database
in step with application changes as you rapidly
deploy.

Who Is Using Cassandra?

Cassandra is still in its early stages in many ways, not yet having seen
its 1.0 release at the time of this writing. There are few easy, graphical
tools to help manage it, and the community has not yet settled certain key
internal and external design questions, which continue to be revisited. But what
does it say about the promise, usefulness, and stability of a data store
that even in its early stages is being used in production by many large,
well-known companies?

Note

It is a logical fallacy, informally called the Bandwagon Fallacy,
to argue that something is “true” just because it is growing in
popularity. Cassandra is without a doubt enjoying skyrocketing growth
in popularity, especially over the past year or so. Still, my point here
is that the number of successful production deployments at a variety of
companies for a variety of purposes is sufficient to suggest its
usefulness and readiness.

The list of companies using Cassandra is growing. These companies
include:

Twitter is using Cassandra for analytics. In a much-publicized
blog post (at http://engineering.twitter.com/2010/07/cassandra-at-twitter-today.html),
Twitter’s primary Cassandra engineer, Ryan King, explained that
Twitter had decided against using Cassandra as its primary store for
tweets, as originally planned, but would instead use it in production
for several different things: for real-time analytics, for geolocation
and places of interest data, and for data mining over the entire user
store.

Mahalo uses it for its primary near-time data store.

Facebook still uses it for inbox search, though they are using a
proprietary fork.

Digg uses it for its primary near-time data store.

Rackspace uses it for its cloud service, monitoring, and
logging.

Reddit uses it as a persistent cache.

Cloudkick uses it for monitoring statistics and
analytics.

Ooyala uses it to store and serve near real-time video analytics
data.

SimpleGeo uses it as the main data store for its real-time
location infrastructure.

Onespot uses it for a subset of its main data store.

Cassandra is also being used by Cisco and Platform64, and is
starting to see use at Comcast and bee.tv for personalized television
streaming to the Web and to mobile devices. There are others. The bottom
line is that the uses are real. A wide variety of companies are finding
use cases for Cassandra and seeing success with it. As of this writing,
the largest known Cassandra installation is at Facebook, where they have
more than 150TB of data on more than 100 machines.

Many more companies are currently evaluating Cassandra for
production use in different projects, and a services company called
Riptano, cofounded by Jonathan Ellis, the Apache Project Chair for
Cassandra, was started in April of 2010. As more features are added and
better tooling and support options are rolled out, anticipate even broader
adoption.

Summary

In this chapter, we’ve taken an introductory look at Cassandra’s
defining characteristics, history, and major features. We have seen which
major companies are using it and
what they’re using it for. We also examined the evolution of
important contributions to the database field in order to
gain a historical view of Cassandra’s value proposition.
