Friday, September 18, 2015

MongoDB

Document oriented - related data is aggregated into a minimal number of documents.

Ad hoc queries - searches by field, by range and by regular expression are supported.

Indexing - any field in the document can be indexed.

Replication - high availability is supported by maintaining replicas of data in more than one replica set member. The data is eventually consistent between the replica members.

Load Balancing - uses sharding (a shard is a master with one or more slaves) to distribute the data split into ranges (based on shard key) between multiple shards.

File storage - GridFS, a built-in feature of MongoDB, stores a file not as a single document but as multiple chunks, each kept as its own document, which can in turn be distributed across shards. Plugins for NGINX and lighttpd use GridFS.

Aggregation - similar to the SQL GROUP BY clause.

Capped collections - store data in insertion order and, once the specified size is reached, behave like a circular queue by overwriting the oldest entries. Similar to RRD (round-robin database).

Server-side JavaScript execution - is supported.
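
The circular-queue behaviour of capped collections mentioned above can be illustrated with a small sketch. This is not MongoDB code: the class name CappedCollection and the max_docs parameter are made up for illustration, and the bound here is a document count, whereas a real capped collection is bounded primarily by its size in bytes.

```python
from collections import deque


class CappedCollection:
    """Toy stand-in for a capped collection, bounded by document count."""

    def __init__(self, max_docs):
        # deque(maxlen=...) discards the oldest item once it is full
        self._docs = deque(maxlen=max_docs)

    def insert(self, doc):
        self._docs.append(doc)

    def find(self):
        # Documents come back in insertion order
        return list(self._docs)


log = CappedCollection(max_docs=3)
for i in range(5):
    log.insert({"seq": i})

print(log.find())  # the two oldest documents have been overwritten
```

After five inserts into a collection capped at three documents, only the three most recent remain, still in insertion order.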

MongoDB is one of the top-performing NoSQL databases. Some benchmarks have reported MongoDB performance to be better than that of some other NoSQL DBs by as much as 25x.

Eventual consistency - at any given moment some nodes in a NoSQL cluster may be missing recent data, because a write may not be propagated while the network is broken or a node is down. Once the node is back up and the network is working, the data is propagated, so eventually all nodes hold the same data.

Document - a BSON document with a dynamic schema. This means that documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.
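
As a rough illustration of what a dynamic schema allows, here are two documents that could sit in the same collection, written as plain Python dicts. The collection name users and all field names and values are invented for this example.

```python
# Two documents that could live side by side in one collection
users = [
    {"_id": 1, "name": "Asha", "phone": "555-0100"},
    {
        "_id": 2,
        "name": "Ben",
        "phone": ["555-0101", "555-0102"],  # same field, different type
        "email": "ben@example.com",         # field absent from the first document
    },
]

field_sets = [set(doc) for doc in users]
phone_types = [type(doc["phone"]).__name__ for doc in users]

print(field_sets[0] == field_sets[1])  # False: the sets of fields differ
print(phone_types)                     # ['str', 'list']
```

The application reading these documents has to cope with both shapes, since nothing in the database enforces one.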

Replication: the client application interacts with the primary node by default, and the primary node then replicates the data to the secondary nodes.

All write operations go to the primary.

Reads can be served by either the primary or a secondary node.

Replication - keeps multiple copies of the same data for HA and failover. One node is primary and the others are secondary. At least three members are recommended for a replica set, so that a majority can still elect a primary if one member fails.

Writes always go to the primary node in a replica set.

Within the replica set there is some delay before writes to the primary node are replicated to the secondaries. Your application may want to wait until a write has been replicated. This is controlled by the w option, which sets how many members must acknowledge the write. Drivers set w:1 by default, meaning only the primary acknowledges.

MongoDB keeps the data in memory and flushes it to disk periodically. If the application wants to wait until the write has been recorded in the on-disk journal, it can set j:1 (where j stands for journal). Note that j:1 is not part of the usual driver default of w:1 and has to be requested explicitly.

Another setting is w:majority, which means the write should propagate to a majority of the nodes in the replica set. These w and j settings are called write concerns.
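
To make the w setting concrete, here is a minimal sketch of the acknowledgement rule it implies. The function names majority and is_acknowledged are hypothetical, not part of any driver. The sketch also simplifies by counting every member equally and does not model the j flag.

```python
def majority(members):
    """Smallest number of members that is more than half of the set."""
    return members // 2 + 1


def is_acknowledged(w, acks, members):
    """Return True once enough members have confirmed the write.

    w       -- an integer, or the string "majority"
    acks    -- members (primary included) that have applied the write
    members -- total members in the replica set
    """
    required = majority(members) if w == "majority" else w
    return acks >= required


# Three-member replica set, write applied on the primary only so far
print(is_acknowledged(1, acks=1, members=3))           # True
print(is_acknowledged("majority", acks=1, members=3))  # False
# One secondary has caught up
print(is_acknowledged("majority", acks=2, members=3))  # True
```

With w:1 the application gets its answer as soon as the primary has the write, while w:majority makes it wait for at least one secondary in a three-member set.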

An application can set its read preference to read from a secondary. A secondary's data may be stale, so this is not recommended when reads must reflect the latest writes.
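
The staleness risk can be seen in a toy model of one primary and one secondary. This is only a sketch: the ReplicaSet class and its replicate method are invented here, and replication is triggered by hand rather than happening continuously in the background.

```python
class ReplicaSet:
    """Toy primary/secondary pair with explicit, delayed replication."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self._pending = []  # writes not yet copied to the secondary

    def write(self, key, value):
        # Writes always land on the primary first
        self.primary[key] = value
        self._pending.append((key, value))

    def replicate(self):
        # Apply pending writes in order, as a secondary catching up would
        for key, value in self._pending:
            self.secondary[key] = value
        self._pending.clear()

    def read(self, key, from_secondary=False):
        node = self.secondary if from_secondary else self.primary
        return node.get(key)


rs = ReplicaSet()
rs.write("stock", 5)
stale = rs.read("stock", from_secondary=True)      # secondary has not caught up
fresh = rs.read("stock")                           # primary already has the write
rs.replicate()
caught_up = rs.read("stock", from_secondary=True)  # now consistent

print(stale, fresh, caught_up)
```

Until replication runs, the secondary returns nothing for a key the primary already holds, which is the window in which a secondary read is stale.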

Sharding - is a solution for horizontal scaling of MongoDB. More shards (each of which is in turn a replica set) can be added depending on the load on the system. A shard key needs to be an indexed key in the collection and should be present in all documents. The key need not be unique. It is used to identify the right shard to send the data to for persistence. Within the shard, replication creates copies of the data on the replica set member nodes. So sharding splits the data based on a shard key in the document. It's like storing data in a hashmap: the better the key selection, the more evenly the data will be divided among the shards.
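
The effect of key selection can be sketched with a simple range-based lookup. The boundaries, shard names and the two candidate keys (a user id and an age) are all assumptions made for this illustration, and real chunk ranges are managed by MongoDB itself rather than hard-coded.

```python
import bisect
from collections import Counter

# Exclusive upper bounds of the key ranges owned by the first two shards;
# the last shard takes everything from the final boundary upwards.
BOUNDARIES = [1000, 2000]
SHARDS = ["shard0", "shard1", "shard2"]


def shard_for(key):
    """Pick the shard whose key range contains the given key."""
    return SHARDS[bisect.bisect_right(BOUNDARIES, key)]


# A key whose values spread across all ranges divides the data evenly
well_spread = Counter(shard_for(user_id) for user_id in range(3000))

# A key whose values all fall into one range sends everything to one shard
skewed = Counter(shard_for(age) for age in range(18, 90))

print(well_spread)
print(skewed)
```

With the user id as key each shard receives a third of the documents, while with age as key every document lands on shard0.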

Sharding is controlled via the mongos router. The application connects to mongos, which listens on a port (27017, for example) and knows from the shard key which shard to insert the data into. For a read/find operation that does not filter on the shard key, mongos queries the primary node (of the replica set) in each shard and collates the results; a query that includes the shard key can be sent to the relevant shard only.
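
The difference between a query that carries the shard key and one that does not can be sketched as follows. The Router class is a toy stand-in for mongos, each shard is a plain list, and placing integer keys by modulo is an assumption made purely to keep the example short.

```python
class Router:
    """Toy stand-in for mongos; each shard is a plain list of documents."""

    def __init__(self, shard_count, shard_key):
        self.shards = [[] for _ in range(shard_count)]
        self.shard_key = shard_key

    def _index_for(self, key_value):
        # Assumption: integer keys placed by modulo, for illustration only
        return key_value % len(self.shards)

    def insert(self, doc):
        self.shards[self._index_for(doc[self.shard_key])].append(doc)

    def find(self, query):
        """Return (matching documents, number of shards contacted)."""
        if self.shard_key in query:
            # Targeted: the shard key tells us the one shard to ask
            targets = [self.shards[self._index_for(query[self.shard_key])]]
        else:
            # Scatter-gather: ask every shard and collate the results
            targets = self.shards
        hits = [
            doc
            for shard in targets
            for doc in shard
            if all(doc.get(k) == v for k, v in query.items())
        ]
        return hits, len(targets)


router = Router(shard_count=3, shard_key="user_id")
for uid, city in [(1, "Pune"), (2, "Oslo"), (3, "Pune"), (4, "Lima")]:
    router.insert({"user_id": uid, "city": city})

by_key, contacted_by_key = router.find({"user_id": 3})
by_city, contacted_by_city = router.find({"city": "Pune"})
print(contacted_by_key, contacted_by_city)
```

The lookup by user_id touches one shard, while the lookup by city has to visit all three before the results can be merged.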

Experiences:

Liked:

Easy to set up and develop against in multiple languages, like Java, Python etc. Good for building POC applications.

Stable enough.

Querying is very powerful.

Pretty performant for queries and inserts. If the schema consists mostly of embedded documents, it is worth using.

JSON structures can model
quite complex objects.

A schema-less DB can be useful (at least in theory) as it can reduce the pain of migration. You still need to write the migration scripts, but you do not have to worry about altering the schema: there is no DDL or schema definition in MongoDB, the schema is realized at runtime, and documents within the same collection can be quite dissimilar.

Disliked:

By design, referential integrity is not supported:

So there is no cascade deletion; we need to handle it in the application.
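
A rough sketch of what handling cascade deletion in the application looks like is given here, with two dicts standing in for collections. The posts and comments collections, the post_id reference field and the delete_post_cascade helper are all hypothetical.

```python
# In-memory stand-ins for two collections; each comment references a post
posts = {
    1: {"_id": 1, "title": "Hello"},
    2: {"_id": 2, "title": "World"},
}
comments = {
    10: {"_id": 10, "post_id": 1, "text": "first"},
    11: {"_id": 11, "post_id": 1, "text": "second"},
    12: {"_id": 12, "post_id": 2, "text": "third"},
}


def delete_post_cascade(post_id):
    """Delete a post and every comment that references it.

    Returns the number of comments removed. The database will not do this
    for us, so the application has to find the dependants itself.
    """
    orphaned = [cid for cid, c in comments.items() if c["post_id"] == post_id]
    for cid in orphaned:
        del comments[cid]
    posts.pop(post_id, None)
    return len(orphaned)


removed = delete_post_cascade(1)
print(removed, sorted(posts), sorted(comments))
```

Against a real database the two deletions would be separate operations on separate collections, so a failure between them can leave orphaned comments behind.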

It is very hard to design a schema with only embedded documents. Most of the time we end up with documents that hold references or links, which is akin to relations in an RDBMS. This comes at a price: building a transaction with rollback is too much work in the application, because MongoDB only guarantees atomicity within a single document's boundary. So when most of your documents hold references instead of embedded documents, consider using an RDBMS.