Data Science Laboratory System – Document Store Databases

A Document Store Database (DSD) is similar to a Relational Database Management System (RDBMS), except that a DSD allows for semi-structured data and for sharding a single database across multiple machines. So when or why would you choose a document database over a relational one? Buck Woody has the answer and an example using the DSD MongoDB on his lab system.

My plan is to set up a system that allows me to install and test various methods to store, process and deliver data. These systems range from simple text manipulation to Relational Databases and distributed file and compute environments. Where possible, I plan to install and configure the platforms and code locally. The outline of the series so far looks like this:

I’ll repeat a disclaimer I’ve made in the previous articles - I do this in each one because it informs how you should read the series:

This information is not an endorsement or recommendation to use any particular vendor, software or platform; it is an explanation of the factors that influenced my choices. You can choose any other platform, cloud provider or other software that you like - the only requirement is that it fits your needs. As I always say – use what works for you. You can examine the choices I’ve made here, change the decisions to fit your needs and come up with your own system. The choices here are illustrative only, and not meant to sell you on a software package or vendor.

In this article, I’ll explain my choices for working with a Document Store Database. I’ll explain briefly the concepts of this type of database, and then methods to use and manage them, along with my choice of the software for my lab system.

Concepts

In concept and “feel”, a Document Store Database is very similar to a Relational Database Management System, but with two broad differences. Firstly, a Document Store Database allows us to work with semi-structured data and, secondly, we can "shard" a single database across multiple machines in order to scale. That being said, a Document Store Database can also use a single node, as in the example system I’ll set up for this article.

A Document Store Database (DSD) stores collections of related documents. A collection is loosely analogous to a table in an RDBMS, and you can define a set of indexes for each collection. Each "row", or "record", in the collection is a document, and each document comprises a set of "field and value pairs". Therefore, information stored within a DSD does have a structure, for at least part of the data, although that structure can be different in each “row”, and can have embedded values within a single Collection. There are text versions of these documents, such as variations of XML and JSON, and binary versions, such as Microsoft Word files, although binary data is often streamed as characters rather than dealt with as a binary object.

To work with each document, the DSD system needs a key value to reach it. This key is treated more as an address than a pure key, and each DSD system handles this a little differently. In the system I'll show in this article, the key is a long, automatically generated unique identifier (an ObjectId), similar in spirit to a GUID.
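As a rough illustration of why the key is more than an opaque value: in MongoDB the generated key is a 12-byte ObjectId, and its first four bytes hold the creation time in seconds since the Unix epoch. The sketch below decodes that timestamp in plain JavaScript. The sample value and the function name objectIdTimestamp are my own choices for illustration, not part of the lab system.

```javascript
// Decode the creation time stored in the first 4 bytes of an ObjectId.
// objectIdTimestamp is a hypothetical helper name used only for this sketch.
function objectIdTimestamp(hexId) {
  if (!/^[0-9a-f]{24}$/i.test(hexId)) {
    throw new Error("expected a 24-character hexadecimal ObjectId string");
  }
  // The first 8 hex characters are the 4-byte timestamp, in seconds.
  const seconds = parseInt(hexId.substring(0, 8), 16);
  return new Date(seconds * 1000);
}

// Sample ObjectId value, written as its 24-character hex string.
const sampleId = "507c7f79bcf86cd7994f6c0e";
const createdAt = objectIdTimestamp(sampleId);
console.log(createdAt.toISOString());
```

Because the timestamp leads the value, keys generated later tend to sort after keys generated earlier.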

The Key and the document information that follows make up the fields in the system. In other words, technically, there are only two "fields" in a document database, a Key and the elements that follow the key. However, as noted above, the attributes within the document information are sometimes referred to as its fields. This concept is similar to the Key/Value Pair systems I described in the last article.

Each document can be different, so the information might be "sparse" in the sense of fields. This flexible structure is one of the primary advantages of working with a document database system, since it allows you to change a schema rapidly, or to define a schema per row, as the project progresses.
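To make the idea of "sparse" fields concrete, here is a minimal sketch in plain JavaScript. The array stands in for a single collection, and every name and value in it is made up for illustration. It lists the union of fields across the documents and shows which fields each document leaves out.

```javascript
// Stand-in for one collection; all names and values here are hypothetical.
const collection = [
  { _id: 1, FirstName: "Buck", LastName: "Woody" },
  { _id: 2, FirstName: "Marjorie", LastName: "Woody", Email: "marjorie@example.com" },
  { _id: 3, Company: "Example Corp", Phone: "555-0100" },
];

// Each document carries its own set of fields.
const fieldsPerDocument = collection.map((doc) => Object.keys(doc));

// The union of all fields is what a relational table would need as columns.
const allFields = [...new Set(fieldsPerDocument.flat())];

// For each document, the fields it does not define at all.
const missingFields = collection.map((doc) =>
  allFields.filter((field) => !(field in doc))
);

console.log(allFields);
console.log(missingFields);
```

In a relational table those missing fields would be NULL columns in every row; here they are simply absent from the document.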

Much like the Key/Value Pair systems, calling a particular system a "pure" document database might provoke a little argument. Several systems work primarily as a document database system, including Couchbase and RavenDB, and others, such as Cassandra (more often classed as a wide-column store) and even Lotus Notes, can be treated as a document store database.

So when or why would you choose a document database over a relational one? Both store data, offer access for developers, and have wide adoption, good documentation and even paid support. Both types can represent almost any given data structure. The answer lies in three areas: speed of data query access (and in some cases insert operations), flexibility to change structure within the same data space (or database, or domain, based on which term fits the given situation) and scalability.

One of the most powerful arguments for using a document database system, and in particular MongoDB (covered in more detail in the next section), is how well and how easily it scales horizontally. In many relational systems, the method for gaining more speed and data volume is to increase resources such as CPU, storage and network bandwidth on a single system. Using MongoDB, you can add more servers, and use the built-in Map/Reduce features to query across multiple machines in a single go. I'll explain Map/Reduce in a future article, but for now you can picture several servers that hold parts of your entire dataset. The overall system "knows" where the data is held, and sends the particular code to the system that holds it. After the particular system processes the data, it sends the results on to be combined with the data from the other systems, and the result is further processed if needed.
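The picture above can be sketched in a few lines of plain JavaScript. This is only an illustration of the pattern, not MongoDB's implementation: the three arrays stand in for servers that each hold part of one dataset, and the data and the function names (mapFn, reduceFn, runOnShard, combine) are assumptions of mine.

```javascript
// Hypothetical data: three "shards", each holding part of one collection.
const shards = [
  [{ LastName: "Woody" }, { LastName: "Smith" }],
  [{ LastName: "Woody" }],
  [{ LastName: "Jones" }, { LastName: "Smith" }],
];

// Map step: emits a [key, value] pair for each document.
const mapFn = (doc) => [doc.LastName, 1];

// Reduce step: combines all values emitted for one key.
const reduceFn = (key, values) => values.reduce((sum, n) => sum + n, 0);

// Group [key, value] pairs by key, then reduce each group.
function reducePairs(pairs) {
  const grouped = new Map();
  for (const [key, value] of pairs) {
    if (!grouped.has(key)) grouped.set(key, []);
    grouped.get(key).push(value);
  }
  return [...grouped].map(([key, values]) => [key, reduceFn(key, values)]);
}

// Each shard processes only its own documents and returns partial results.
const runOnShard = (docs) => reducePairs(docs.map(mapFn));

// The partial results from every shard are combined by reducing them again.
const combine = (partials) => Object.fromEntries(reducePairs(partials.flat()));

const counts = combine(shards.map(runOnShard));
console.log(counts);
```

The reduce function is applied twice, once on each shard and once on the combined partial results, which is why it has to give the same answer however the values are grouped.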

MongoDB in particular is quite fast, allowing for rapid inserts and queries. To gain that speed, some relaxing of consistency is needed, and that is something you have to balance: not every data requirement has a high need for consistency, although some do.

MongoDB, along with most of the "NoSQL" variants I've described in this series, allows for a separate set of data columns, or attributes, per row, meaning that a developer can create a data layout per insertion if desired. This capability allows for a simpler path to coding, although time will tell if that code is easy to maintain.

Rationale

The DSD I'll use on my lab system is the popular and widely used MongoDB (mongo from “humongous”, meaning very large). True to the promise of its name, even CERN used it in collecting huge amounts of data from the Large Hadron Collider system. It performs well, with multiple indexes possible, and has great documentation. There are issues that have cropped up with it, however, so as with all software, research is necessary, such as creating a lab system and trying things out for yourself, to ensure you know the system's capabilities and constraints.

MongoDB supports ad-hoc queries, load balancing and replication, and it can index any field, including through secondary indexes. It uses an Application Programming Interface (API) for queries and also supports Map/Reduce patterns, which I'll cover in another article. It has a REST interface, which makes it available to almost any platform, from laptops to mobile devices.

A MongoDB installation can contain many databases, similar to an RDBMS. Each database stores collections ("tables"), each comprising a set of documents ("rows", or "records"). One document within a collection can have a totally different structure from another, so the comparison with tables and rows in an RDBMS is only approximate.

MongoDB databases use an internal storage format called BSON, which is similar to JavaScript Object Notation (JSON). JSON is a human-readable text format, with some light data type enforcement. Here’s one example of a JSON document. Note the nested (embedded) address within the record:
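As a minimal sketch of such a document, the JavaScript below builds a record with an embedded address and prints it as JSON text. Every field name and value here is made up for illustration.

```javascript
// A hypothetical record; the Address field is itself a small document.
const person = {
  FirstName: "Buck",
  LastName: "Woody",
  Address: {
    Street: "123 Example Street",
    City: "Seattle",
    State: "WA",
  },
  Phones: ["555-0100", "555-0101"],
};

// Serialize to human-readable JSON text, indented by two spaces.
const jsonText = JSON.stringify(person, null, 2);
console.log(jsonText);

// Parsing the text gives back the same nested structure.
const roundTripped = JSON.parse(jsonText);
console.log(roundTripped.Address.City);
```

The address travels with the record instead of living in a separate table joined by a key.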

BSON, or Binary JSON, is designed to be compact and quick to traverse because it’s encoded in a binary format, although an individual record can be slightly longer than its JSON text. This is because the format carries “header” type information, such as a length prefix and other metadata, to make scans and seeks faster. It also has more types, such as Date, to make working within a database more efficient.
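To show where the extra bytes come from, here is a rough sketch that hand-encodes a tiny document following the layout in the BSON specification: a 4-byte total length, then for each string field a type byte, the field name, a 4-byte string length and the string itself, then a closing zero byte. It handles only flat documents with string values, and encodeStringDoc is a name I made up for the sketch. A real driver does far more.

```javascript
// Encode a flat document whose values are all strings (BSON type 0x02).
// encodeStringDoc is a hypothetical helper, not part of any MongoDB driver.
function encodeStringDoc(doc) {
  const parts = [];
  for (const [name, value] of Object.entries(doc)) {
    const nameBytes = Buffer.from(name + "\0", "utf8");   // null-terminated name
    const valueBytes = Buffer.from(value + "\0", "utf8"); // null-terminated string
    const valueLength = Buffer.alloc(4);
    valueLength.writeInt32LE(valueBytes.length, 0);       // length prefix for the string
    parts.push(Buffer.from([0x02]), nameBytes, valueLength, valueBytes);
  }
  const body = Buffer.concat([...parts, Buffer.from([0x00])]); // document terminator
  const header = Buffer.alloc(4);
  header.writeInt32LE(body.length + 4, 0);                // total size, header included
  return Buffer.concat([header, body]);
}

const sample = { hello: "world" };
const bsonBytes = encodeStringDoc(sample);
const jsonByteCount = Buffer.byteLength(JSON.stringify(sample), "utf8");

console.log(bsonBytes.toString("hex"));
console.log("BSON bytes:", bsonBytes.length, "JSON bytes:", jsonByteCount);
```

The length prefixes are what allow a reader to skip over a field or a whole document without parsing its contents.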

To access MongoDB data, you can use APIs for various languages such as JavaScript, C#, Python, and a host of others. There’s a command-line tool to work with the system, and it’s much the same on Windows and Linux.

Installation and Examples

I navigated to the MongoDB downloads page and then selected the 64-bit version for Windows Server 2008 R2, for my test lab. Note that MongoDB also installs on Windows 7 or even XP if you need it there. This Win2K8R2+ version has enhancements that my Windows Server 2012 lab system can take advantage of.

Installation is actually quite simple, consisting of only two steps. The first is to download and extract the ZIP file into a directory of your choosing. I’ll use the c:\mongodb directory on my local lab system, and e:\mongodb on my Windows Azure IaaS VM. Whenever you are using a cloud provider for your VMs, it’s usually a best practice to install only operating system binaries on the C:\ drive, temporary files on another (D: is a temporary drive on Windows Azure) and then any user programs and data on yet another drive.

The second step is to create a separate directory where the databases will live, much like you would as a best practice with an RDBMS. Using yet another drive is a great idea, but in my case I’ll create a directory off of the root - c:\data\db and e:\data\db for my local and Azure labs, respectively.

That’s it! There are no other installation steps (unless you want the MongoDB process to run in the background as a Windows Service). You can simply type c:\mongodb\bin\mongod.exe and you’re off and running.

I do want the system available as a service, but I’ll set it to manual. To run MongoDB as a Windows Service, you’ll need to create a mongod.cfg text file, defining various parameters such as logging and data directories. You can read more about use of configuration files here: http://docs.mongodb.org/manual/reference/configuration-options/.

After I created a simple CFG file, I typed the following commands, the first to set up a logging directory (so I can see the output if the startup fails) and the second to install MongoDB as a service. Note my drive letters and directories here, and note that I run this as an Administrator account so that I have proper permissions to install a service:

md C:\mongodb\log

C:\mongodb\bin\mongod.exe --config C:\mongodb\mongod.cfg --install

Now it’s simply a matter of using the Control Panel Services applet or typing this command, which I’ll do now to ensure the service is started:

net start MongoDB

Now it’s time to set up an example database, enter some documents ("records"), and query those records out. Before you start, if you have a SQL background, I recommend checking out this comparison chart so that the terms will map to what you already know: http://docs.mongodb.org/manual/reference/sql-comparison/

The command-line interface to MongoDB is called mongo - and it starts in the bin directory where you extracted the MongoDB ZIP file:

C:\mongodb\bin\mongo

Once inside, you can see the databases installed (if any) with the show command:

> show dbs

It’s interesting to see the commands in one CMD shell and their interactions with the server in another.

You can get a list of commands with the help function:

> help

Creating a database is simply a matter of using it; if it does not exist, the use command switches to it anyway, and MongoDB creates it when you first store data in it:

> use labdemo

To list the Collections (tables) in the database, use the show command again:

> show collections

Of course I haven’t created any yet, so here I’ll switch from commands to the JavaScript insert() function to create some data, with the system-generated key. This is done with the db. prefix on the Collection name, followed by the function you want to perform on that Collection. In this example, I’ll name my Collection labdemocol. The JSON-like format for each document looks similar to what I explained earlier:

> db.labdemocol.insert({FirstName: "Buck", LastName: "Woody"})

> db.labdemocol.insert({FirstName: "Marjorie", LastName: "Woody"})

To return all records in the collection, call the find() function with no arguments:

> db.labdemocol.find()

So far, working with MongoDB is quite simple. When you want to go beyond inserting or deleting a record, or searching for a record with a specific set of requirements, it can quickly start to feel different from a domain-specific language like SQL. For instance, searching for Buck in my sample Collection isn’t too difficult to do, using the find() function and the element name along with a desired value:

> db.labdemocol.find({FirstName: "Buck"})

But finding a name that starts with B requires some knowledge of JavaScript programming, at least in the mongo interface. Using Perl-compatible regular expressions (PCRE), finding everyone whose first name starts with the letter B looks more like this:

> db.labdemocol.find({FirstName: /^B/})

And finding data with operators such as greater-than or less-than uses the parameters $gt and $lt, something you might not find as natural to type. You can learn more about the query language operators here: http://docs.mongodb.org/manual/reference/operator/
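To show how these operators read, here is a small sketch in plain JavaScript that mimics a tiny subset of the query behavior over an in-memory array. It is not MongoDB's implementation. Visits is a hypothetical numeric field that does not exist in my sample data, the third record is made up, and the matches and find helpers are names of my own. In the mongo shell, the equivalent range query would take the form db.labdemocol.find({Visits: {$gt: 10, $lt: 100}}).

```javascript
// Stand-in for a collection; Visits is a hypothetical field added for this sketch.
const docs = [
  { FirstName: "Buck", LastName: "Woody", Visits: 42 },
  { FirstName: "Marjorie", LastName: "Woody", Visits: 7 },
  { FirstName: "Bill", LastName: "Example", Visits: 120 },
];

// Test one document against a query that uses equality, a RegExp, $gt or $lt.
function matches(doc, query) {
  return Object.entries(query).every(([field, condition]) => {
    const value = doc[field];
    if (condition instanceof RegExp) {
      return typeof value === "string" && condition.test(value);
    }
    if (condition !== null && typeof condition === "object") {
      // Every operator in the condition must hold for the field's value.
      return Object.entries(condition).every(([operator, operand]) => {
        if (operator === "$gt") return value > operand;
        if (operator === "$lt") return value < operand;
        throw new Error("unsupported operator: " + operator);
      });
    }
    return value === condition;
  });
}

const find = (query) => docs.filter((doc) => matches(doc, query));

console.log(find({ Visits: { $gt: 10 } }).map((d) => d.FirstName));
console.log(find({ Visits: { $gt: 10, $lt: 100 } }).map((d) => d.FirstName));
console.log(find({ FirstName: /^B/, LastName: "Woody" }).map((d) => d.FirstName));
```

The query is itself a document: the field name is the key, and the operator and its operand are nested inside as another field and value pair.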

Installing and working with MongoDB fulfills one of the primary goals of my test lab system in that it allows me to experiment with different ways of working with data, regardless of interface or language I use to access it.

References:

There’s an online environment you can use (click the “Try It Out” button) to run a tutorial on MongoDB without installation, right in the browser: http://www.mongodb.org/#.

Buck Woody has been working with Information Technology since 1981. He has worked for the U.S. Air Force, at an IBM reseller as technical support, and for NASA as well as U.S. Space Command as an IT contractor. He has worked in almost all IT positions, from computer repair technician to system and database administrator, and from network technician to IT Manager, and with multiple platforms as a Data Professional. He has been a DBA and Database Developer on systems ranging from Oracle running on a VAX to SQL Server and DB2 installations.
He has been a Simple-Talk DBA of the Day