Contents

What is NoSQL?

What is a document-oriented database?

MongoDB

Installation

Basic usage

Index support

Real world applications

Storing files in MongoDB

Map-Reduce

Auto-sharding

Conclusion and future work

What is NoSQL?

NoSQL is a kind of database that, unlike most relational databases, does not provide a SQL interface to manipulate data. NoSQL databases usually organize the data in structures other than tables.

NoSQL databases are divided into three categories: column-oriented, key-value pairs, and document-oriented databases. This article focuses on document-oriented databases, as they seem to be the best solution for many Web sites.

SQL based relational databases do not scale well when they are distributed over multiple cluster nodes. Data partitioning is not easy to implement when the applications use join queries and transactions.

NoSQL databases are not new. Actually, there were key-value pair based databases before relational databases became popular.

What is a document-oriented database?

For document-oriented databases, a document is a data structure that has a variable number of properties. Each property has a value that can be a scalar (number, string, etc.) or a vector (array or object).

You can see a document as an object or an associative array like in PHP. To understand this concept better, here is the definition of a "person" document:
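A minimal sketch of what such a document could look like as a PHP associative array. The property names and values are hypothetical, chosen only to match the properties mentioned later in this article (sessions, addresses, address2).

```php
<?php
// Hypothetical "person" document; every name and value is illustrative.
$person = array(
    'name'      => 'Ann Example',
    'email'     => 'ann@example.com',
    'sessions'  => 0,                       // scalar property
    'tags'      => array('php', 'mongodb'), // array property
    'addresses' => array(                   // list of sub-documents
        array('address' => '1 Main Street', 'city' => 'Lisbon', 'country' => 'PT'),
        array('address' => '2 Side Street', 'city' => 'Porto',  'country' => 'PT'),
    ),
);

// Properties are read like any nested PHP array.
echo $person['addresses'][0]['city'], "\n"; // prints "Lisbon"
```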

Documents do not have a predefined schema like relational database tables. Documents can have multiple properties. Documents can also be grouped in collections. The term collection is used from now on to distinguish these groups from the tables used by relational databases to store records with a fixed number of fields.

Another important characteristic of documents is that they can have sub-documents. Sub-documents are used in place of the parent-child tables used with relational databases.

MongoDB

MongoDB is a very interesting document-oriented database implementation for several reasons:

It uses JSON, instead of XML

It is fast, as it is written in C++

Supports index definitions

Provides an easy to use query interface, very similar to some database abstraction layers

Supports operations with sub-documents

Provides a native PHP extension

Supports auto-sharding

Supports map-reduce for data transformation

Installation

Installing the MongoDB server is very easy, but it is outside the scope of this article to describe the process. You can grab the latest version of its source code and build the MongoDB server following the instructions in the README file. You can also download and install prepackaged binaries for your platform.

The PHP MongoDB client extension can be installed using the pecl command:

pecl install mongo

Basic usage

Connecting to a database

As explained above, MongoDB manages documents, which for PHP developers are like simple associative arrays. This means that all operations with MongoDB are defined using arrays, even for queries.

The code for establishing a database connection is very similar to the code for connecting to other types of databases:

Connecting to a database server at localhost, port 27017:

$connection = new Mongo();

Connecting to a remote host with optional custom port:

$connection = new Mongo( "192.168.2.1" );

$connection = new Mongo( "192.168.2.1:65432" );

When the connection to the MongoDB server is established, it is necessary to select a database to work with. If the database does not exist yet, it is created once data is first stored in it. Currently there are two ways to select a database:

$db = $connection->selectDB('dbname');

$db = $connection->dbname;

Then it is necessary to select a collection to work with, like you would pick a table to work with when using relational databases.

$collection = $db->selectCollection('people');

or simply

$collection = $db->people;

Inserting new documents

The collection object should be used to perform basic operations to manipulate its information. For instance, if you want to store information about a person, you use code like this:
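The following is a sketch rather than the original listing. It assumes a driver release in which the second parameter of insert is an options array with a safe entry (earlier releases took the boolean directly), and the storePerson helper and the person values are hypothetical.

```php
<?php
// Hypothetical person document to store.
$person = array(
    'name'     => 'Ann Example',
    'email'    => 'ann@example.com',
    'sessions' => 0,
);

// Ask the client library to wait for the server to confirm the write.
$safe_insert = true;

/**
 * Stores a document in the given collection.
 * $collection is expected to be a MongoCollection, such as $db->people.
 */
function storePerson($collection, array $person, $safe_insert)
{
    // With the safe option enabled, a failed insert raises an exception.
    $collection->insert($person, array('safe' => $safe_insert));

    // The driver is documented to add an _id entry when the document lacks one.
    return $person;
}
```

With a connection open, it would be called as storePerson($db->people, $person, $safe_insert), wrapped in a try/catch block when $safe_insert is true.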

The insert function accepts a $safe_insert parameter. It is meant to make the MongoDB client library wait for the request to finish, so it is possible to determine whether it succeeded or not.

With safe insert enabled, an exception is thrown if anything goes wrong. If the safe insert parameter is not used, it is the same as setting it to false. In this case, the insert call returns immediately, which is fast, but you do not know immediately whether the operation succeeded. Still, this possibility may be useful when a lot of records need to be inserted.

Also note that the person document is passed by reference, so the MongoDB client can set its _id entry to return the newly created record identifier.

Updating existing documents

Updating documents is a bit more complicated and tends to be confusing in the beginning. If you submit a regular document, it replaces the whole document definition. To change only part of a document, MongoDB supports special properties that work as modifier operations.

Suppose you update the information of a person just to change some properties, for instance incrementing the sessions property, adding the property address2 to the first address and deleting the second address property. It can be done as follows.

First, it is necessary to define a filter to tell MongoDB to update just a specific document.
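A sketch of such an update, using the $inc, $set and $unset modifiers. It reads "delete the second address property" as removing the address property of the second address entry, which is an interpretation. The filter value and the updatePerson helper are hypothetical.

```php
<?php
// Filter selecting the single document to change (hypothetical e-mail value).
$filter = array('email' => 'ann@example.com');

// Modifier operations: only the listed properties are touched.
// Single quotes keep PHP from treating $inc, $set and $unset as variables.
$update = array(
    '$inc'   => array('sessions' => 1),                         // increment a counter
    '$set'   => array('addresses.0.address2' => 'Apartment 4'), // add to first address
    '$unset' => array('addresses.1.address' => 1),              // drop from second address
);

/**
 * Applies the modifiers to the first document matching the filter.
 * $collection is expected to be a MongoCollection.
 */
function updatePerson($collection, array $filter, array $update)
{
    return $collection->update($filter, $update);
}
```

The dot notation (addresses.0.address2) addresses a property inside the sub-document at a given array position.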

MongoDB also supports multiple in-place updates, like relational databases do, which means that it can update all documents that match given criteria. For that it is necessary to set the option multiple to true.
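As a sketch, a multiple update could look like the following. The filter, the modifier and the updateAll helper are hypothetical; the multiple entry is passed in the options array of update.

```php
<?php
// Every person with at least one address in this (hypothetical) country.
$filter = array('addresses.country' => 'PT');

// Reset the session counter of all matching documents.
$update = array('$set' => array('sessions' => 0));

// Without this option only the first matching document is updated.
$options = array('multiple' => true);

/**
 * Updates all documents matching the filter.
 * $collection is expected to be a MongoCollection.
 */
function updateAll($collection, array $filter, array $update, array $options)
{
    return $collection->update($filter, $update, $options);
}
```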

Retrieving documents

Retrieving one or more documents that match given criteria requires defining a condition filter using query selectors, as you may see in the following examples:
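The queries below are illustrative sketches, not the original listings. The property names and values are hypothetical, and printNames is a helper introduced only for this example.

```php
<?php
// 1. Exact match on one property.
$query1 = array('name' => 'Ann Example');

// 2. Comparison selector: people with more than 10 sessions.
$query2 = array('sessions' => array('$gt' => 10));

// 3. Sub-document property, using dot notation and the $in selector.
$query3 = array('addresses.country' => array('$in' => array('PT', 'ES')));

// 4. Two properties combined; both conditions must hold.
$query4 = array(
    'addresses.country' => 'PT',
    'sessions'          => array('$gte' => 5),
);

/**
 * Prints the name of every matching person.
 * $collection is expected to be a MongoCollection.
 */
function printNames($collection, array $query)
{
    $cursor = $collection->find($query); // only defines the query
    foreach ($cursor as $person) {       // results are fetched here
        echo $person['name'], "\n";
    }
}
```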

One important detail worth mentioning is that queries are not executed until the result is actually requested. For example, when the results are iterated with a foreach loop, the query is executed when the loop starts.

This is a nice feature because it allows adding options to the cursor object used to retrieve the results, right after defining the query but before executing it. For instance, you can set options to perform result set pagination, or to retrieve the number of matching documents.
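A sketch of pagination with cursor options, assuming the sort, skip, limit and count methods of the cursor object. The pageOffset and fetchPage helpers and the name sort key are hypothetical.

```php
<?php
/**
 * Number of documents to skip to reach a 1-based page number.
 */
function pageOffset($page, $pageSize)
{
    return ($page - 1) * $pageSize;
}

/**
 * Returns one page of matching documents plus the total match count.
 * $collection is expected to be a MongoCollection.
 */
function fetchPage($collection, array $query, $page, $pageSize)
{
    // Options are added before the query is executed.
    $cursor = $collection->find($query)
        ->sort(array('name' => 1))
        ->skip(pageOffset($page, $pageSize))
        ->limit($pageSize);

    return array(
        'total'     => $cursor->count(),           // counts all matches by default
        'documents' => iterator_to_array($cursor), // executes the query
    );
}
```

For instance, fetchPage($db->people, array(), 3, 10) would skip the first 20 documents and return at most 10.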

Grouping allows you to define MongoDB server side functions written in JavaScript that perform operations on the properties of each group. It is a bit more flexible than SQL because you can perform many types of operations with grouped values, but it is a bit harder than SQL to perform simple grouping operations like SUM(), AVG(), etc.

Here is an example of how to retrieve the countries of a list of addresses and the number of times each country appears in the matching addresses.
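A sketch using the group method of the collection object. It assumes a hypothetical collection of address documents with country and city properties, and a driver release that takes the filter in a condition entry of the options array. The countCountries helper is also hypothetical.

```php
<?php
// Property to group by.
$keys = array('country' => 1);

// Starting value of the aggregate kept for each group.
$initial = array('count' => 0);

// JavaScript run on the server for every document of a group.
$reduceSource = 'function (doc, out) { out.count++; }';

// Only addresses matching this filter are grouped (hypothetical value).
$condition = array('city' => 'Lisbon');

/**
 * Returns one entry per country with its number of occurrences.
 * $collection is expected to be a MongoCollection.
 */
function countCountries($collection, array $keys, array $initial, $reduceSource, array $condition)
{
    $result = $collection->group(
        $keys,
        $initial,
        new MongoCode($reduceSource),
        array('condition' => $condition)
    );

    return $result['retval']; // list of array('country' => ..., 'count' => ...)
}
```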

Deleting documents

Deleting a document is very similar to retrieving or updating documents.

$filter = array('field' => 'foo');

$collection->remove($filter);

Be careful. By default, all documents that match the given criteria will be deleted. If you just want to delete the first document that matches the criteria, pass true as the second parameter of the remove function.
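A sketch of deleting just one document. It assumes a driver release in which the second parameter of remove is an options array with a justOne entry rather than a plain boolean. The removeFirstMatch helper is hypothetical.

```php
<?php
$filter = array('field' => 'foo');

// Stop after the first matching document is deleted.
$options = array('justOne' => true);

/**
 * Deletes at most one document matching the filter.
 * $collection is expected to be a MongoCollection.
 */
function removeFirstMatch($collection, array $filter, array $options)
{
    return $collection->remove($filter, $options);
}
```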

Index support

A very important feature that might influence your decision to choose MongoDB over other similar document-oriented databases is the support for indexes, which are very similar to relational database table indexes. Not all document-oriented databases provide built-in index support.

With MongoDB you can create indexes to avoid full collection scans during searches, like relational databases can use indexes to avoid full table scans. This allows accelerating queries for documents matching conditions that involve indexed properties.

For instance, if you want to have a unique index on the e-mail address property, it can be defined like this:
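A sketch using the ensureIndex method with the unique and background options. The email property name and the createIndex helper are hypothetical.

```php
<?php
// Properties that make up the index; 1 means ascending order.
$keys = array('email' => 1);

$options = array(
    'unique'     => true, // reject documents with a duplicate e-mail
    'background' => true, // build the index without blocking the caller
);

/**
 * Creates the index if it does not exist yet.
 * $collection is expected to be a MongoCollection.
 */
function createIndex($collection, array $keys, array $options)
{
    return $collection->ensureIndex($keys, $options);
}
```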

The first array parameter describes all the properties that should be part of the index. It may be just one property or several properties.

By default, index creation is a synchronous operation, but it may be a good idea to let the indexes be created in the background if the document count is very large. This is done by enabling the background option when the index is defined.

Having indexes with just one property may not be useful enough. Here follows an example of how to accelerate the fourth query example above by defining an index on two properties.
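A sketch of an index on two properties. It assumes the query in question filters on a country inside the addresses sub-documents and on the sessions counter; both property names and the createCompoundIndex helper are hypothetical.

```php
<?php
// The order of the properties matters: the index is sorted by the
// first property, then by the second one within equal first values.
$keys = array(
    'addresses.country' => 1,
    'sessions'          => 1,
);

/**
 * Creates an index covering both properties.
 * $collection is expected to be a MongoCollection.
 */
function createCompoundIndex($collection, array $keys)
{
    return $collection->ensureIndex($keys);
}
```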

Real world applications

Some developers may be afraid to try a new type of database because it works differently from others that they have worked with before.

Learning new things in theory is different from learning how to use them in practice. Therefore, this section explains how to develop real world applications with MongoDB in comparison to using a SQL based relational database, like MySQL for instance, so you get familiar with the differences between the two approaches.

For instance, let's say you want to build a blog system with users, posts and comments. You would implement it by defining a table schema like this when using a relational database.
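To keep all examples in one language, the relational layout is outlined here as a PHP array of table and column names instead of SQL statements. The table and column names are hypothetical.

```php
<?php
// Hypothetical relational layout: table name => column names.
// Columns ending in _id are foreign keys pointing to another table.
$tables = array(
    'users'    => array('id', 'username', 'name'),
    'posts'    => array('id', 'user_id', 'title', 'body', 'created'),
    'comments' => array('id', 'post_id', 'user_id', 'body', 'created'),
);

// Print one line per table, for instance "users(id, username, name)".
foreach ($tables as $table => $columns) {
    echo $table, '(', implode(', ', $columns), ")\n";
}
```

Showing a post with its comments requires joining the three tables through the user_id and post_id columns.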

The equivalent document definition that represents the same structures in MongoDB would be defined like this:
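A sketch of one possible document layout, in which comments are stored as sub-documents of the post they belong to. All property names and values are hypothetical, and the dates are plain Unix timestamps to keep the example free of driver classes.

```php
<?php
// A user document, stored in a "users" collection.
$user = array(
    'username' => 'ann',
    'name'     => 'Ann Example',
);

// A post document, stored in a "posts" collection.
// The comments live inside the post instead of in a separate table.
$post = array(
    'author'   => 'ann',
    'title'    => 'Hello MongoDB',
    'body'     => 'First post stored as a document.',
    'created'  => 1271641080,
    'comments' => array(
        array('author' => 'bob',   'body' => 'Nice post.',  'created' => 1271644680),
        array('author' => 'carol', 'body' => 'Thanks Ann.', 'created' => 1271648280),
    ),
);

// A post and all its comments are available without any join.
echo count($post['comments']), " comments\n"; // prints "2 comments"
```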

Map-Reduce

Map-Reduce operations are meant to manipulate large sets of information. The map operation applies a function to every document and produces a new set of key-value pairs. The reduce operation takes the map function results and applies another function to return a single result per key.

MongoDB Map-Reduce functions can be applied to a collection for data transformation, in a way very similar to Hadoop.

When the map process is finished, the result is sorted and grouped by key values. For every result key, the reduce function is called with two parameters: the key and an array with all the values.
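The phases described above can be imitated in plain PHP on an in-memory array. This is only an illustration of the idea, not how MongoDB implements it; the mapReduce, emitSessions and sumValues functions and the sample data are hypothetical.

```php
<?php
/**
 * In-memory imitation of the map, group and reduce phases.
 * $map returns a list of array(key, value) pairs for one document;
 * $reduce receives a key and the array of all values emitted for it.
 */
function mapReduce(array $documents, $map, $reduce)
{
    // Map phase: collect the emitted values under their key.
    $grouped = array();
    foreach ($documents as $document) {
        foreach (call_user_func($map, $document) as $pair) {
            list($key, $value) = $pair;
            $grouped[$key][] = $value;
        }
    }

    // Sort by key, then reduce each group to a single result.
    ksort($grouped);
    $results = array();
    foreach ($grouped as $key => $values) {
        $results[$key] = call_user_func($reduce, $key, $values);
    }
    return $results;
}

// Emits one (country, sessions) pair per person.
function emitSessions(array $person)
{
    return array(array($person['country'], $person['sessions']));
}

// Adds up all the values emitted for one key.
function sumValues($key, array $values)
{
    return array_sum($values);
}

$people = array(
    array('country' => 'PT', 'sessions' => 3),
    array('country' => 'ES', 'sessions' => 5),
    array('country' => 'PT', 'sessions' => 4),
);

$totals = mapReduce($people, 'emitSessions', 'sumValues');
print_r($totals); // ES => 5, PT => 7
```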

To understand better how this works, let's suppose that we have the blog post document defined above, and you would like every post to have a list of tags. If you need to get statistics about tags, you just need to count them like this.
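A sketch of a tag count using the mapreduce database command. It assumes a posts collection whose documents have a tags array, and it writes the results to a collection named tag_counts. The collection names and the runTagCount helper are hypothetical.

```php
<?php
// JavaScript map function: emits (tag, 1) for every tag of a post.
$mapSource = 'function () {
    if (!this.tags) { return; }
    this.tags.forEach(function (tag) { emit(tag, 1); });
}';

// JavaScript reduce function: adds up the values emitted for one tag.
$reduceSource = 'function (key, values) {
    var total = 0;
    values.forEach(function (value) { total += value; });
    return total;
}';

// The command name must be the first entry of the array.
$command = array(
    'mapreduce' => 'posts',      // input collection (assumed name)
    'out'       => 'tag_counts', // output collection (assumed name)
);

/**
 * Runs the map-reduce job on the server.
 * $db is expected to be a MongoDB database object.
 */
function runTagCount($db, array $command, $mapSource, $reduceSource)
{
    $command['map']    = new MongoCode($mapSource);
    $command['reduce'] = new MongoCode($reduceSource);
    return $db->command($command);
}
```

After the job finishes, each document of the output collection holds a tag in its _id property and the tag count in its value property.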

If MongoDB runs in a sharded environment, the data processing functions will run in parallel on all shards.

Keep in mind that map-reduce processing is often slow. Its purpose is to distribute the processing of large data sets among multiple servers. So, if you have many servers, you can split the processing and achieve the result in less time than if it were done by a single server.

In any case, it is recommended that map-reduce processes be run in the background, as they often take a long time to finish. That makes them a perfect candidate for an asynchronous job managed, for instance, by Gearman.

Auto-sharding

Sharding was mentioned several times above, but you may not be familiar with the concept.

Data sharding is a database technique meant to distribute the data across multiple servers.

This is a diagram that represents how MongoDB works in a sharded environment, so you have an idea of what happens when you use sharding.

Conclusion and future work

This article introduced a new type of database that will probably change the way Web development is done.

Currently, I am working on an ActiveRecord framework for MongoDB that will make it easier to deal with MongoDB objects. It will be published on the PHPClasses site soon. I am also working on a stream wrapper to make it easier to store and retrieve files from MongoDB, as if they were regular files.
