This announcement comes shortly after Facebook launched 60 social apps that focused mostly on cooking, eating, travel, running and reviewing movies. With this bevy of now 80+ social apps, Facebook hopes that its users will think about the platform more as a space for lifestreaming everything rather than just a place for connecting with friends and family members.

Of course, if you ask Facebook what it thinks Facebook should be used for, it will not give an answer. Use it for whatever you want. Install these apps if you feel like it. It’s Facebook’s world, sure, but users do have the ability to shape it, or opt out completely.

One of the apps is viral content site Buzzfeed, which is a shoo-in for Facebook. Content from Buzzfeed does remarkably well on Facebook, especially if it involves adorable pups like this lovely collection of smiling Corgis.

Social entertainment app GetGlue joins Facebook as well. Users of GetGlue already have the ability to share content to Facebook, so this integration will only strengthen the relationship between the two social platforms.

In 2010, Apache Cassandra increased its momentum as the leading scalable database. Here is a summary of the notable activity in three areas: code, community and controversy. As always, comments are welcome.

We also saw Lucandra, which implements a Cassandra back end for Lucene and is used in several high volume production sites, grow up into Solandra, embedding Solr and Cassandra in the same JVM for even more performance.

Controversy

Cassandra got a lot of negative publicity when Kevin Rose blamed Cassandra for Digg v4’s teething problems. However, there was no deluge of bug reports coming out of Digg’s Cassandra team, and Digg engineers Arin Sarkissian and Chris Goffinet (now working on Cassandra for Twitter) got on Quora to refute the idea that Cassandra was at fault:

The new version of Digg has a whole new architecture with a bunch of technologies involved. Problem is, over the last few months or so the only technological change we mentioned (blogged about etc) was Cassandra. That made it pretty easy for folks to cling on to it as the "problem".

Cassandra was a tragic figure in Greek myth — she could hear the future and thus was able to foretell what was coming next (usually death and destruction). It’s no surprise that no one wanted her hanging around. It’s ironic that an open source NoSQL software of the same name has often found itself amidst controversy. Today, Cassandra was blamed for scaling (and availability) problems at Digg, which led to the yet-unconfirmed departure of Digg VP of Engineering John Quinn, who was a big champion of Cassandra at Digg.

This is not the first time Cassandra, which was created inside Facebook and later open sourced, has taken a beating. Back in July, Twitter reversed its plans to move from MySQL to Cassandra for storing its tweets. Comments by Digg founder Kevin Rose as he tries to explain some problems on Digg’s new site aren’t helping either. But a call to Matt Pfeil, CEO of Riptano, an Austin, Texas-based startup, put things in perspective. Riptano is building its business providing service and eventually an easy-to-implement version of Cassandra for companies (see my video interview with Pfeil here). Pfeil said that Riptano is working with Digg and noted that he would be “shocked” if Digg abandoned Cassandra.

When asked if the problems Digg has had with its upgrade stemmed from Cassandra, Pfeil said, “We’ve reached out to Digg to ID what those problems are. I don’t know the full extent of them, and am learning more from them about their situation. We know Cassandra can scale to levels that are equal to or greater than a Digg is putting on it and I have full faith in Cassandra, but there are these little knobs that need to be tuned and you have to know where they are.”

For Pfeil this could be an opportunity, simply because helping find and turn “those little knobs” is what Riptano was formed to do. He said Riptano has been involved with Digg since around April, which was soon after Digg announced its plans to use Cassandra. And while Digg may be able to blame Cassandra for some glitches, the database technology still seems to be on the upswing. Today, Quest, an enterprise software-database support company, decided to support Cassandra through a partnership with Riptano, and companies such as Cisco, Ooyala and Rackspace are also using it.

As Pfeil points out, Cassandra is still new, having been open sourced in 2008. “Cassandra has come a long way, especially in the last year or so … there is a lot to be done before it is close to where it will compare in production environments to something like MySQL, but we’re getting close.” So maybe unlike the Greek prophetess, the database technology will be able to rehabilitate its reputation.

As everyone probably knows by now, Cassandra originated at Facebook as a solution for inbox search and was then open sourced under the ASF umbrella and an Apache license. Since then, Twitter, Digg, Reddit and quite a few others have started using it, but not much has been heard from Facebook.

Fact_1: Cassandra’s multi-datacenter replication is one of its earliest features and is by far the most battle-tested in the NoSQL space. Facebook had Cassandra deployed on east and west coast datacenters since before open sourcing it. SimpleGeo’s Cassandra cluster spans 3 EC2 availability zones, and Digg is also deployed on both coasts. Claims that this can’t possibly work are an excellent sign that you’re reading an article by someone who doesn’t know what he’s talking about.

Fiction_2: "It’s impossible to tell when [Cassandra] replicas will be up-to-date."

Fact_2: Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor), to use the Dynamo vocabulary. If you do writes and reads both with QUORUM, for example, you can expect data consistency as soon as there are enough reachable nodes for a quorum. Cassandra also provides read repair and anti-entropy, so that even reads at ConsistencyLevel.ONE will be consistent after either of these events.
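
The R + W > N rule can be checked by brute force. The sketch below is only an illustration of the arithmetic, not Cassandra code: the function names `overlaps_always` and `quorum` are made up for this example, and replicas are modeled as plain integers. It enumerates every possible set of W replicas that acknowledged a write and every possible set of R replicas consulted by a read, and reports whether the two sets are guaranteed to share at least one replica.

```python
from itertools import combinations


def overlaps_always(n: int, r: int, w: int) -> bool:
    """True if every read set of size r shares a replica with every write set of size w."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


N = 3  # replication factor
for r, w in [(quorum(N), quorum(N)), (1, N), (1, 1)]:
    print(
        f"N={N} R={r} W={w}: R+W>N is {r + w > N}, "
        f"overlap guaranteed: {overlaps_always(N, r, w)}"
    )
```

With N = 3, quorum reads plus quorum writes (2 + 2) and ONE reads plus ALL writes (1 + 3) always overlap, so a read is guaranteed to reach at least one replica holding the latest write. ONE plus ONE (1 + 1) does not, which is the case where read repair and anti-entropy are needed to converge.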

Fact_3: Although popularity has never been a good metric for determining correctness, it’s true that when using bleeding edge technology, it’s good to have company. As I write this late at night (in the USA), there are 175 people in the Cassandra IRC channel, 60 in the HBase one, 32 in Riak’s, and 15 in Voldemort’s. (Six months ago, the numbers were 90, 45, and 12 for Cassandra, HBase, and Voldemort. I did not hang out in #riak yet then.) Mailing list participation tells a similar story. It’s also interesting that the creators of Thrudb and dynomite are both using Cassandra now, indicating that the predicted NoSQL consolidation is beginning.

Fiction_5: Cassandra cannot support Hadoop, or supporting tools such as Pig.

Fact_5: It has always been straightforward to send the output of Hadoop jobs to Cassandra, and Facebook, Digg, and others have been using Hadoop like this as a Cassandra bulk-loader for over a year. For 0.6, I contributed a Hadoop InputFormat and related code to let Hadoop jobs process data from Cassandra as well, while cooperating with Hadoop to keep processing on the nodes that actually hold the data. Stu Hood then contributed a Pig LoadFunc, also in 0.6.

Fact_6: Unlike some NoSQL databases (notably MongoDB and HBase), Cassandra offers full single-server durability. Relying on replication is not sufficient for can’t-afford-to-lose-data scenarios; if your data center loses power, you are highly likely to lose data if you are not syncing to disk no matter how many replicas you have, and if you run large systems in production long enough, you will realize that power outages through some combination of equipment failure and human error are not occurrences you can ignore. But with its fsync’d commitlog design, Cassandra can protect you against that scenario too. What to do after your data is saved, e.g. backups and snapshots, is outside of my scope here but covered in the operations wiki page.
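
The idea behind an fsync’d commit log can be sketched in a few lines. This is a toy illustration under stated assumptions, not Cassandra’s implementation: the `CommitLog` class, the tab-separated record format, and the `replay` helper are all hypothetical, and Cassandra’s real on-disk format and sync policy are not described here. The point it shows is the ordering: a write is appended to the log and forced to disk before it is applied in memory, so the in-memory state can be rebuilt after a crash.

```python
import os
import tempfile


class CommitLog:
    """Append-only log; every record is forced to disk before append() returns."""

    def __init__(self, path: str) -> None:
        self._fh = open(path, "ab")

    def append(self, key: str, value: str) -> None:
        record = f"{key}\t{value}\n".encode("utf-8")
        self._fh.write(record)
        self._fh.flush()              # push Python's buffer to the OS
        os.fsync(self._fh.fileno())   # ask the OS to push its cache to the device

    def close(self) -> None:
        self._fh.close()


def replay(path: str) -> dict:
    """Rebuild the in-memory table by re-applying every logged write in order."""
    table = {}
    with open(path, "rb") as fh:
        for line in fh:
            key, value = line.decode("utf-8").rstrip("\n").split("\t", 1)
            table[key] = value
    return table


log_path = os.path.join(tempfile.mkdtemp(), "commitlog.bin")
log = CommitLog(log_path)
memtable = {}
for k, v in [("user:1", "alice"), ("user:2", "bob"), ("user:1", "carol")]:
    log.append(k, v)   # durable on disk first
    memtable[k] = v    # only then visible in memory
log.close()

recovered = replay(log_path)  # what a restart after power loss would see
print(recovered == memtable)
```

Skipping the `os.fsync` call would leave acknowledged writes sitting in the operating system’s cache, which is exactly the window in which a power failure loses data regardless of how many replicas exist.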

Ryan King explained in an interview with Alex Popescu why Twitter is moving to Cassandra for tweet storage, and why they selected Cassandra over the alternatives. My experience is that the more someone understands large systems and the problems you can run into with them from an operational standpoint, the more likely they are to choose Cassandra when doing this kind of evaluation. Ryan’s list of criteria is worth checking out.

Om Malik quoted extensively from the Digg announcement and from Rackspace engineer Stu Hood, who explained Cassandra’s appeal: "Over the Bigtable clones, Cassandra has huge high-availability advantages, and no single point of failure. When compared to the Dynamo adherents, Cassandra has the advantage of a more advanced datamodel, allowing for a single row to contain billions of column/value pairs: enough to fill a machine. You also get efficient range queries for the top level key, and even within your values."
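
The data model Hood describes, a single row holding many column/value pairs with efficient range queries inside it, can be pictured as a map kept sorted by column name. The sketch below is a toy model only: the `WideRow` class and its `insert` and `slice` methods are hypothetical names for this illustration and are not Cassandra’s API. It keeps column names in sorted order so that a range of columns can be located by binary search instead of a full scan.

```python
import bisect


class WideRow:
    """One row whose columns are kept sorted by name for range slices."""

    def __init__(self) -> None:
        self._names = []    # sorted column names
        self._values = {}   # column name -> value

    def insert(self, name, value) -> None:
        if name not in self._values:
            bisect.insort(self._names, name)  # keep names sorted on insert
        self._values[name] = value

    def slice(self, start, finish):
        """Return (name, value) pairs with start <= name <= finish, in order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


# Columns named by timestamp, inserted out of order.
row = WideRow()
for ts, reading in [(1005, 0.7), (1001, 0.2), (1003, 0.5), (1009, 0.9)]:
    row.insert(ts, reading)

result = row.slice(1002, 1006)
print(result)
```

Because the columns are ordered, locating the slice boundaries costs a pair of binary searches no matter how many columns the row holds.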

The Twitter and Digg news kicked off a lot of publicity, including a lot of "me too" articles but some interesting ones, including a highscalability post wondering if this was the end of the MySQL + memcached era. If not quite yet the end, then the beginning of it. As Ian Eure from Digg said, "If you’re deploying memcache on top of your database, you’re inventing your own ad-hoc, difficult to maintain NoSQL system." Possibly the best commentary on this idea is Dare Obasanjo’s, who explained "Digg’s usage of Cassandra actually serves as a rebuttal to [an article claiming SQL scales just fine] since they couldn’t feasibly get what they want with either horizontal or vertical scaling of their relational database-based solution."

CloudKick blogged about how they use Cassandra for time series data, including a sketch of their data model. CloudKick migrated from PostgreSQL, skewering the theory you will sometimes see proffered that "only MySQL users are migrating to NoSQL, not people who use [my favorite vendor’s relational database]."

The FightMyMonster team switched from HBase to Cassandra after concluding that "HBase is more suitable for data warehousing, and large scale data processing and analysis… and Cassandra is more suitable for real time transaction processing and the serving of interactive data." Dominic covers CAP, architecture considerations, benchmarks, map/reduce, and durability in explaining his conclusion.

Eric Peters gave a talk on Cassandra use at his company, Frugal Mechanic, at the Seattle Tech Startups Meetup. This was interesting not because Frugal Mechanic is a big name but because it’s not. I haven’t seen Eric’s name on the Cassandra mailing lists at all, but there he was deploying it and giving a talk on it, showing that Cassandra is starting to move beyond early adopters. (And, just maybe, that our documentation is improving. :)

Finally, Eric Florenzano has a live demo up now of Cassandra running a Twitter clone at twissandra.com, with source at github, as an example of how to use Cassandra’s data model. If you’re interested in the nuts and bolts of how to build an app on Cassandra, you should check it out.

The last six months have been exciting for Digg’s engineering team. We’re working on a soup-to-nuts rewrite. Not only are we rewriting all our application code, but we’re also rolling out a new client and server architecture. And if that doesn’t sound like a big enough challenge, we’re replacing most of our infrastructure components and moving away from LAMP.

Perhaps our most significant infrastructure change is abandoning MySQL in favor of a NoSQL alternative. To someone like me who’s been building systems almost exclusively on relational databases for almost 20 years, this feels like a bold move.

Our primary motivation for moving away from MySQL is the increasing difficulty of building a high performance, write-intensive application on a data set that is growing quickly, with no end in sight. This growth has forced us into horizontal and vertical partitioning strategies that have eliminated most of the value of a relational database, while still incurring all the overhead.

As our system grows, it’s important for us to span multiple data centers for redundancy and network performance and to add capacity or replace failed nodes with no downtime. We plan to continue using commodity hardware, and to continue assuming that it will fail regularly. All of this is increasingly difficult with MySQL.

Digg is committed to the use and development of open source software and we’re keen to avoid the cost of proprietary large-scale storage solutions. We were inspired by Google and Amazon’s broad use of their non-relational BigTable and Dynamo systems. We evaluated all the usual open source NoSQL suspects. After considerable debate, we decided to go with Cassandra.

Simplistically, Cassandra is a distributed database with a BigTable data model running on a Dynamo-like infrastructure. It is column-oriented and allows for the storage of relatively structured data. It has a fully decentralized model; every node is identical and there is no single point of failure. It’s also extremely fault tolerant; data is replicated to multiple nodes and across data centers. Cassandra is also very elastic; read and write throughput increase linearly as new machines are added.
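
One way to picture a decentralized, Dynamo-style layout is a hash ring on which every node plays the same role. The sketch below is a simplified assumption for illustration, not Cassandra’s actual partitioner or replication strategy: the `Ring` class, the `token` function, the MD5-based hashing, the node names, and the "next nodes clockwise" placement rule are all chosen for this example. Each key hashes to a position on the ring and is stored on the next `replication_factor` nodes, so any node can work out where a key lives without consulting a master.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an assumption here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Identical nodes placed on a hash ring; no coordinator or master."""

    def __init__(self, nodes, replication_factor: int) -> None:
        if replication_factor > len(nodes):
            raise ValueError("replication factor exceeds node count")
        self.rf = replication_factor
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key: str):
        """Nodes holding key: the first node at or after its token, then the next ones clockwise."""
        start = bisect.bisect_left(self._tokens, token(key)) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.rf)
        ]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = Ring(nodes, replication_factor=3)
owners = ring.replicas("story:42")
print(owners)

# Losing one node leaves the surviving copies where they were.
dead = owners[0]
smaller = Ring([n for n in nodes if n != dead], replication_factor=3)
print(smaller.replicas("story:42"))
```

In this model, removing a node leaves the other two copies of a key in place and shifts only one new copy to the next node on the ring, which matches the goal of replacing failed nodes without downtime.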

At the time of writing, we’ve reimplemented most of Digg’s functionality using Cassandra as our primary datastore. We’ve supplemented Cassandra-based indexing using full text, relational and graph indexing systems. We’re getting used to dealing with eventual consistency.