It's worth reading Nathan Marz's book Big Data. It's an easy read, though as a MEAP it has a few glaring typos.

It goes into the Lambda architecture in depth, along with the thought processes underpinning it.

The message is that it's not just the tech underpinning what Twitter does; it's the manner in which they use it.

As the blog posts suggest, they are trying to solve old problems by coming up with new blends of existing solutions. What they are doing goes far beyond what SQL Server can do, and yet the "One DB to rule them all" approach really is where SQL Server has been moving.

I took an interesting one-day course from DataStax on Apache Cassandra - a big data solution used by Netflix, Comcast, eBay, and dozens of other companies.

Like what Twitter uses, it's designed to be easily scaled horizontally and provides great performance, but at the cost of data duplication and strict consistency; you get eventual consistency, which is fine for social media, but not so hot for banks or financial institutions.
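To make "eventual consistency" concrete, here's a toy sketch in Python. This is not Cassandra's actual replication protocol (the class and function names are made up for illustration); it just shows the core idea: a write accepted by one replica is only later propagated to the others, so a read in between can be stale, but the replicas do converge.

```python
class Replica:
    """A toy replica: last-write-wins per key, like Cassandra's cell timestamps."""
    def __init__(self):
        self.data = {}  # key -> (value, timestamp)

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        if current is None or timestamp >= current[1]:
            self.data[key] = (value, timestamp)

    def read(self, key):
        entry = self.data.get(key)
        return entry[0] if entry else None

def anti_entropy(replicas):
    """Repair pass: every replica pushes what it knows to every other replica."""
    for src in replicas:
        for dst in replicas:
            for key, (value, ts) in src.data.items():
                dst.write(key, value, ts)

replicas = [Replica() for _ in range(3)]

# The write lands on only one replica (think consistency level ONE)
replicas[0].write("handle", "@big_data_fan", timestamp=1)

# Immediately afterwards, another replica still returns stale data
print(replicas[1].read("handle"))  # stale: the write hasn't propagated yet

anti_entropy(replicas)

# After the repair pass, all replicas agree: eventual consistency
print([r.read("handle") for r in replicas])
```

For a social feed, briefly serving the stale value is harmless; for an account balance it isn't, which is exactly the trade-off described above.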

DataStax sells a commercial version (support, extra tools), so I was expecting some bias, but what struck me as amusing (and a little over the top) was the presenter calling RDBMSs "legacy" all day long.

Cobol is a legacy language - it's been superseded by languages that can do everything Cobol can do, and more.

Cassandra, and other big-data databases, can't do everything that SQL Server, Oracle, MySQL, etc., can do. In fact, big-data databases are kind of crappy at a lot of what an RDBMS can do with ease. I'm confident Netflix doesn't use Cassandra as its billing database, for example. But maybe they use Cassandra to see if there is a correlation between the movies and TV shows you watch and your credit card of choice.

Fortunately, both solutions play well with each other, just as an OLTP system integrates well with a data warehouse.


I've found that the techy types at the NoSQL vendors are usually pretty honest about what fits their use case and what doesn't.

The marketing types are another matter entirely. One marketing type extolled the virtues of their distro of Hadoop, stating that they could ingest 500GB in half an hour and that this was way beyond the capabilities of a traditional RDBMS. Clearly they hadn't seen the Microsoft article on loading 1TB in 30 minutes.

One of the SQLCat labs at SQL Bits (Liverpool) actually went into the details of how this was done, and it was remarkably straightforward.

Of course, the Hadoop marketeer didn't give details of what that 500GB looked like. Was it a simple log file or a complex structured JSON document? We'll never know!

Then there is the irritating tendency to demonstrate the NoSQL product's superiority by using an awful schema in an RDBMS to show that the RDBMS can't cope. That's like making Usain Bolt wear diving boots and claiming it proves he's a lousy sprinter.
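The diving-boots effect is easy to reproduce. Here's a small sketch using Python's built-in sqlite3 (the tables and data are invented for illustration): the same rows stored in a deliberately awful one-column schema can only be queried with string surgery and a full scan, while ordinary columns plus an index get an index seek.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# "Diving boots" schema: everything crammed into one delimited text column
cur.execute("CREATE TABLE orders_bad (row_data TEXT)")
# A sensible schema: real columns plus an index
cur.execute("CREATE TABLE orders_good (customer_id INTEGER, amount REAL)")
cur.execute("CREATE INDEX idx_customer ON orders_good (customer_id)")

for i in range(1000):
    cur.execute("INSERT INTO orders_bad VALUES (?)", (f"{i % 50}|{i * 1.5}",))
    cur.execute("INSERT INTO orders_good VALUES (?, ?)", (i % 50, i * 1.5))

# The bad schema forces string matching and a full table scan...
plan_bad = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders_bad "
    "WHERE row_data LIKE '7|%'").fetchall()
# ...while the good schema gets an index seek
plan_good = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders_good "
    "WHERE customer_id = 7").fetchall()

print(plan_bad[0][-1])   # a SCAN over the whole table
print(plan_good[0][-1])  # a SEARCH using idx_customer
```

Benchmark the first schema and the RDBMS looks hopeless; benchmark the second and it's fine. The difference is the schema, not the engine.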

NoSQL databases are the digital equivalent of a basement-level archive with pallets full of boxes, more or less organized. They typically can't function as a primary database of record or even an operational data store. It's the transactional or metadata layer that "rules" all the other disparate data stores. For example, on the Hadoop platform, HCatalog, the metadata layer that maps files to logical table names, must itself be backed by a relational database like MySQL or SQL Server.
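A toy sketch of that idea, using sqlite3 as a stand-in for the metastore (this is not HCatalog's real schema; the table, paths, and formats are invented): the big files live elsewhere, but the map from logical table names to physical storage is small, heavily constrained, and naturally relational.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""
    CREATE TABLE catalog (
        table_name  TEXT PRIMARY KEY,  -- logical name queries refer to
        hdfs_path   TEXT NOT NULL,     -- where the raw files actually live
        file_format TEXT NOT NULL      -- how to deserialize them
    )""")
cur.execute("INSERT INTO catalog VALUES ('clicks', '/data/clicks/2014/', 'SequenceFile')")
cur.execute("INSERT INTO catalog VALUES ('emails', '/data/emails/', 'TextFile')")

# A query engine first resolves the logical name to physical storage,
# then goes off and reads the files themselves
path, fmt = cur.execute(
    "SELECT hdfs_path, file_format FROM catalog "
    "WHERE table_name = 'clicks'").fetchone()
print(path, fmt)
```

The primary key and NOT NULL constraints are exactly the kind of guarantees the "pallets of boxes" layer can't give you, which is why the ruling layer stays relational.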

Unfortunately, there are many IT shops that treat their relational database like a digital landfill: no thought for normalization, constraints, or optimization. Archived emails, website clicks, basically anything consisting of 0s and 1s gets dumped into the landfill, which gets so large it needs its own zip code.

When the database starts crashing, they ask themselves: "Do we need a bigger dump truck for all this?"