Archive

I read an article today with a laundry list of 10 reasons MongoDB didn’t work out for someone. The list didn’t make a lot of sense to me since there were absolutely no details about why MongoDB didn’t work out. No comments allowed on the article either. In my experience over the past two years with MongoDB, these deficiencies don’t really exist, so I thought I’d debunk them, and leave the article open for comments in case there might be more to the author’s story.

MongoDB logging: it logs at –logpath, or the mongod output. You can adjust verbosity, query profiling, etc.

Slow query optimization: .explain() is a very good friend to understand what indexes are being used on a “slow” query. Also, profiling can show you the queries that are performing slowly. A new addition to improving query performance is the touch command, which can keep data & indexes into memory for better performance.

Init scripts: my development team maintains a Javascript file for initializing the database on each deployment. It’s run as a parameter to the mongo client process. I find this to be very flexible, and being Javascript, much nicer for writing imperative logic than in a SQL script.

Graphing: I’m guessing this refers to generating charts and graphs. The lack of built in tools here doesn’t come as a surprise – it’s expected that your application will want to do this. The database stores data and provides a means for querying, and rendering images is something to be done on the client side.

Sharding (and rebalancing) strategy: You have to think through sharing, it’s not going to be something you should jump into. I don’t really see this as a MongoDB problem – the same challenges exist no matter how you choose to shard any database. The selection of shard key is crucial for sharding success.

Backups: there are many, many options, here. For my needs, since I’m running with a replica set, it so take a secondary offline and copy the files. This causes no downtime and has no impact on the other members of replica set. There are also options for performing hot backups using mongodump, using OS snapshots, file system snapshots, and so on. http://docs.mongodb.org/manual/administration/backups/

Restoration: the complexity of restoration depends in many ways on the complexity of backup, but if you do a hot backup with mongodump, you can do mongorestore. If you copy from a secondary, take one node offline, copy the data files into place, tell the other nodes to freeze or stepDown using rs.freeze and rs.stepDown, then start the restored node. It will become primary because the other nodes are forced to be secondaries.

50 other things: read the documentation. This goes along with any database solution. MongoDB has a lot of it, and also a very active and helpful community.

The MongoDB oplog is a rolling log of the most recent operations that occurred in a MongoDB instance. It’s intended for replication so that multiple data nodes can follow the oplog and immediately get updates, however, your application code has access to it as well. This means you can “tail the oplog” and your application code is notified as soon as an insert, update, or delete operation occurs anywhere on the database. This is very cool stuff, so cool, in fact, that it’s nice to be able to enable it even if you aren’t using a multiple node replica sets. Replica sets provide high availability and disaster recovery, but you might not need that or have the extra server resources for it – maybe you just want real time notification. Here is how you enable the oplog on a standalone MongoDB instance.

In the /etc/mongodb.conf file on Debian or similar file on RHEL, you should set two options and then restart your MongoDB instance.

replSet=rs0
oplogSize=100

Or you can set these at the command line, as you might on Windows:

--replSet rs0 --oplogSize 100

The first setting, “replSet rs0” tells you this will be a replica set node, which will allow you to run “rs.initiate()” from the mongo shell. Doing so makes this into a single node replica set. The name “rs0” is complete arbitrary – call it whatever you want. The next option, “oplogSize 100” limits the oplog to 100 MB. You can leave this option off and it will default to using 5% of your free disk space. You’d want to support a huge oplog if your need multiple data nodes to survive an outtage of several hours or days and be able to catch back up without needing a full resync. However, if you’re running a single node just to get an oplog and real time notifications, you can cap it at 100 MB, or maybe much less.

After setting these options and starting your MongoDB instance, connect to the shell, change to the local database, and run “rs.initiate()”. You’ll now have an oplog and to tail for real time notifications.

I’m starting a new app today and building out the data layer with MongoDB as my database. The app uses a collection from the USDA, that I thought makes a good sample for getting started with the “Load” portion of ETL into MongoDB.

Step 1 – Define a Class for the data

Although not absolutely necessary as you could build a raw BSON document directly from XML, you kind of miss out on some of the C# driver’s niceties if you do. Looking at the raw data, I came up with this class, along with a constructor that takes an XElement to handle the parsing. Strict DTO people might move that parsing to a function within the ETL process…up to you. The only MongoDB specific code here is the BsonId attribute, which I’ll put on the FoodCode property – a unique ID from the source system.

You might notice I’m using float for my decimal values. That’s all the accuracy I need, but it does lose some precision. I’m rounding the data when I use it so it won’t really matter, but if your needs differ, choose a different numeric type.

Step 2 – Function for reading the XML file

This is a pretty small data file, only about 750 records, but loading it all into memory at once is a waste. I want to load the “Food_Display_Row” XML elements one at a time, convert to a Food object, store in MongoDB, and move on to the next. It’s a job for a streaming API and an iterator, powered by “yield return” to get one XElement at a time:

Step 3 – Pull it all together and load the data

With the pieces in place, the load process is pretty simple. Connect to the server, get the database (MongoDB creates it on first use), get the collection (MongoDB also creates the collection), and use the iterator to read the XML file, load each element into a Food object and insert into the MongoDB collection. At the end, we have a MongoDB database with a collection of data from the food guide pyramid.

My favorite thing about this is that I never had to leave C# to create the database, parse the source XML, or load the data. I don’t have to run a separate ETL process or use management tools to configure my database schema. It’s a simple, self-contained solution.

My second favorite part is that I ran all of this under Ubuntu and Mono. It should work just as well under Windows and .NET, but life is better running under an open source software stack.

I hope you find this helpful if you’re getting started with MongoDB and want a little data to play with.