Friday, December 14, 2012

Comparison of Popular NoSQL Databases (MongoDB, CouchDB, HBase, Neo4j, Cassandra)

There have been many SQL databases over the years, but I personally feel the long reign of SQL is coming to an end as everyone moves into the era of Big Data. As experts say, SQL databases are not the best fit for Big Data, and NoSQL databases came into the picture as a better fit, providing more flexibility in storing data.
I just want to compare a few popular NoSQL databases that are available at this point of time. A few well-known NoSQL databases are MongoDB, CouchDB, HBase, Neo4j, and Cassandra.

NoSQL databases differ from each other far more than SQL databases differ among themselves, so it is one's own responsibility to choose the appropriate NoSQL database for their application based on their use case. Let's do a quick comparison of these databases.

MongoDB

Best used: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If you need good performance on a big DB. If you wanted CouchDB, but your data changes too much, filling up disks.

For example: For most things that you would do with MySQL or PostgreSQL, but having predefined columns really holds you back.
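To make the "dynamic queries plus indexes" point concrete, here is a minimal sketch using the MongoDB Java driver of that era. The database, collection, and field names here are hypothetical, and it assumes a mongod running on localhost.

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.Mongo;

public class MongoQueryDemo {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017);    // default mongod port
        DB db = mongo.getDB("testdb");                  // hypothetical database
        DBCollection users = db.getCollection("users"); // hypothetical collection

        // No predefined columns: each document can have its own shape.
        users.insert(new BasicDBObject("name", "asha").append("age", 29)
                .append("city", "Hyderabad"));

        // Define a plain index on a field instead of writing a map/reduce view.
        users.ensureIndex(new BasicDBObject("age", 1));

        // Dynamic query built at runtime: age > 25, no schema change needed.
        DBCursor cursor = users.find(
                new BasicDBObject("age", new BasicDBObject("$gt", 25)));
        while (cursor.hasNext()) {
            System.out.println(cursor.next());
        }
        mongo.close();
    }
}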

Cassandra

Written in: Java

Main point: Best of BigTable and Dynamo

License: Apache

Protocol: Custom, binary (Thrift)

Tunable trade-offs for distribution and replication (N, R, W; see the quorum sketch after this list)

Querying by column, range of keys

BigTable-like features: columns, column families

Has secondary indices

Writes are much faster than reads (!)

Map/reduce possible with Apache Hadoop

All nodes are similar, as opposed to Hadoop/HBase
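The tunable (N, R, W) trade-off above follows the Dynamo-style quorum rule: a read is guaranteed to see the latest write only when the read and write replica sets overlap, i.e. R + W > N. Below is a minimal Java sketch of that rule; the class and method names are purely illustrative.

/** Sketch of the Dynamo-style quorum rule behind Cassandra's tunable consistency. */
public class QuorumCheck {
    // N = replication factor, R = replicas that must ack a read,
    // W = replicas that must ack a write (chosen per request in Cassandra).
    static boolean isStronglyConsistent(int n, int r, int w) {
        // If the read and write quorums overlap, every read sees the latest write.
        return r + w > n;
    }

    public static void main(String[] args) {
        System.out.println(isStronglyConsistent(3, 2, 2)); // true: QUORUM reads and writes
        System.out.println(isStronglyConsistent(3, 1, 1)); // false: ONE/ONE favours latency
    }
}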

Best used: When you write more than you read (logging). If every component of the system must be in Java. ("No one gets fired for choosing Apache's stuff.")

For example: Banking and the financial industry (though not necessarily for financial transactions; these industries are much bigger than that). Writes are faster than reads, so one natural niche is real-time data analysis.
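As a taste of a write-heavy logging workload, here is a rough sketch using Hector, a popular Java client for Cassandra at the time. The cluster, keyspace, and column family names are assumptions, and the keyspace and column family must already exist.

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class CassandraLogWriter {
    public static void main(String[] args) {
        // Connect via Thrift (port 9160 was the default at the time).
        Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
        Keyspace ksp = HFactory.createKeyspace("Logs", cluster);  // assumed keyspace
        Mutator<String> mutator = HFactory.createMutator(ksp, StringSerializer.get());

        // One row per day, one column per event: appends are cheap,
        // which is why write-heavy logging suits Cassandra well.
        String rowKey = "2012-12-14";
        mutator.insert(rowKey, "events",                          // assumed column family
                HFactory.createStringColumn("12:00:01-app1", "user login"));
        mutator.insert(rowKey, "events",
                HFactory.createStringColumn("12:00:02-app1", "page view"));

        cluster.getConnectionManager().shutdown();
    }
}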

HBase

Written in: Java

Main point: Billions of rows X millions of columns

License: Apache

Protocol: HTTP/REST (also Thrift)

Modeled after Google's BigTable

Uses Hadoop's HDFS as storage

Map/reduce with Hadoop

Query predicate push down via server side scan and get filters (see the filter example after this list)

Optimizations for real time queries

A high performance Thrift gateway

HTTP supports XML, Protobuf, and binary

Cascading, Hive, and Pig source and sink modules

JRuby-based (JIRB) shell

Rolling restart for configuration changes and minor upgrades

Random access performance is like MySQL

A cluster consists of several different types of nodes
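To ground the "server side filter" item above, here is a small sketch with the HBase Java client of that generation. The table, column family, and qualifier names are made up, and the table is assumed to already exist with a column family named "cf".

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseFilterDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable");  // assumed, pre-existing table

        // Write one cell: row key, column family, qualifier, value.
        Put put = new Put(Bytes.toBytes("row-001"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("status"), Bytes.toBytes("200"));
        table.put(put);

        // Predicate push down: the filter runs inside the region servers,
        // so only matching rows cross the network.
        Scan scan = new Scan();
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("cf"), Bytes.toBytes("status"),
                CompareOp.EQUAL, Bytes.toBytes("200")));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            System.out.println(r);
        }
        scanner.close();
        table.close();
    }
}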

Best used: Hadoop is probably still the best way to run Map/Reduce jobs on huge datasets. Best if you use the Hadoop/HDFS stack already.
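And since the natural fit here is Map/Reduce over HDFS-backed tables, below is a rough sketch of a map-only row-counting job using HBase's TableMapReduceUtil. The table name is an assumption; note that HBase actually ships a ready-made RowCounter tool that works along these lines.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HBaseRowCount {
    static class CountMapper extends TableMapper<NullWritable, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result columns, Context context) {
            // One map() call per table row; count through a job counter.
            context.getCounter("demo", "rows").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hbase-row-count");
        job.setJarByClass(HBaseRowCount.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // batch rows per RPC for scan throughput
        scan.setCacheBlocks(false);  // keep full-table MR scans out of the block cache

        TableMapReduceUtil.initTableMapperJob(
                "webtable", scan, CountMapper.class,   // "webtable" is assumed
                NullWritable.class, LongWritable.class, job);
        job.setNumReduceTasks(0);                      // map-only job
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}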