I read an article today with a laundry list of 10 reasons MongoDB didn’t work out for someone. The list didn’t make a lot of sense to me since there were absolutely no details about why MongoDB didn’t work out. No comments allowed on the article either. In my experience over the past two years with MongoDB, these deficiencies don’t really exist, so I thought I’d debunk them, and leave the article open for comments in case there might be more to the author’s story.

MongoDB logging: mongod logs to the path given by --logpath, or to its standard output if none is set. You can adjust log verbosity, enable query profiling, and so on.
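As a quick illustration, here is roughly how you might start mongod with a log file and extra verbosity (the paths are my own examples, not from the original article):

```shell
# Log to a file, append across restarts, and bump verbosity one level
mongod --dbpath /data/db \
       --logpath /var/log/mongodb/mongod.log \
       --logappend \
       -v
```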

Slow query optimization: .explain() is a very good friend for understanding which indexes are being used by a “slow” query. The profiler can also show you which queries are performing slowly. A newer addition for improving query performance is the touch command, which loads a collection’s data and indexes into memory ahead of time.
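In the mongo shell that looks something like this (the collection and field names here are invented for illustration):

```javascript
// See which index, if any, the query uses
db.orders.find({ status: "pending" }).explain()

// Profile level 1: log operations slower than 100 ms to db.system.profile
db.setProfilingLevel(1, 100)
db.system.profile.find().sort({ ts: -1 }).limit(5)

// Preload a collection's data and indexes into memory (MongoDB 2.2+)
db.runCommand({ touch: "orders", data: true, index: true })
```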

Init scripts: my development team maintains a Javascript file for initializing the database on each deployment. It’s passed as an argument to the mongo client process. I find this to be very flexible, and being Javascript, much nicer for writing imperative logic than a SQL script.
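A minimal sketch of what such an init script might look like (the database, collection, and field names are mine, not the team’s actual script):

```javascript
// init.js -- run with: mongo mydb init.js
// Idempotent setup: ensure indexes exist and seed reference data

db.users.ensureIndex({ email: 1 }, { unique: true });

// Imperative logic is easy here -- only seed if the collection is empty
if (db.settings.count() === 0) {
    db.settings.insert({ _id: "defaults", theme: "light" });
}
```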

Graphing: I’m guessing this refers to generating charts and graphs. The lack of built-in tools here doesn’t come as a surprise – it’s expected that your application will want to do this. The database stores data and provides a means for querying, and rendering images is something to be done on the client side.

Sharding (and rebalancing) strategy: you have to think sharding through; it’s not something to jump into blindly. I don’t really see this as a MongoDB problem – the same challenges exist no matter how you choose to shard any database. The selection of a shard key is crucial for sharding success.
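For concreteness, choosing a shard key happens in the mongo shell, connected to a mongos (the database and key below are hypothetical examples):

```javascript
// Enable sharding for a database, then shard a collection
sh.enableSharding("mydb")

// A compound key like this spreads writes across users while
// keeping each user's documents together for range queries
sh.shardCollection("mydb.events", { userId: 1, ts: 1 })
```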

Backups: there are many, many options here. For my needs, since I’m running with a replica set, it’s as simple as taking a secondary offline and copying its data files. This causes no downtime and has no impact on the other members of the replica set. There are also options for performing hot backups with mongodump, file system or OS-level snapshots, and so on. http://docs.mongodb.org/manual/administration/backups/
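The two approaches mentioned above look roughly like this (hosts and paths are illustrative):

```shell
# Hot backup of a running server with mongodump
mongodump --host localhost:27017 --out /backups/$(date +%F)

# Or: stop mongod on a secondary, then copy its data files
rsync -a /data/db/ /backups/secondary-copy/
```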

Restoration: the complexity of restoration largely mirrors the complexity of the backup. If you took a hot backup with mongodump, you restore with mongorestore. If you copied the files from a secondary: take one node offline, copy the data files into place, tell the other nodes not to seek election using rs.freeze or rs.stepDown, then start the restored node. It becomes primary because the other nodes are forced to remain secondaries.
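The freeze/stepDown step runs in the mongo shell on the other members (the durations in seconds below are examples only):

```javascript
// On a secondary: don't seek election for the next 120 seconds
rs.freeze(120)

// On the current primary: step down and stay ineligible for 60 seconds
rs.stepDown(60)
```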

50 other things: read the documentation. This goes along with any database solution. MongoDB has a lot of it, and also a very active and helpful community.

I’m starting a new app today and building out the data layer with MongoDB as my database. The app uses a collection of data from the USDA, which I thought would make a good sample for getting started with the “Load” portion of ETL into MongoDB.

Step 1 – Define a Class for the data

Although it’s not absolutely necessary – you could build a raw BSON document directly from XML – you kind of miss out on some of the C# driver’s niceties if you do. Looking at the raw data, I came up with this class, along with a constructor that takes an XElement to handle the parsing. Strict DTO people might move that parsing to a function within the ETL process…up to you. The only MongoDB specific code here is the BsonId attribute, which I’ll put on the FoodCode property – a unique ID from the source system.
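A sketch of what such a class might look like – the element names are my guesses at the USDA food display columns, not necessarily the author’s exact class:

```csharp
using System.Xml.Linq;
using MongoDB.Bson.Serialization.Attributes;

public class Food
{
    // Unique ID from the source system, used as the document _id
    [BsonId]
    public int FoodCode { get; set; }

    public string DisplayName { get; set; }
    public float PortionAmount { get; set; }
    public string PortionDisplayName { get; set; }
    public float Calories { get; set; }

    public Food() { }

    // Parse a single <Food_Display_Row> element from the USDA file
    public Food(XElement row)
    {
        FoodCode = (int)row.Element("Food_Code");
        DisplayName = (string)row.Element("Display_Name");
        PortionAmount = (float)row.Element("Portion_Amount");
        PortionDisplayName = (string)row.Element("Portion_Display_Name");
        Calories = (float)row.Element("Calories");
    }
}
```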

You might notice I’m using float for my decimal values. That’s all the accuracy I need, but it does lose some precision. I’m rounding the data when I use it so it won’t really matter, but if your needs differ, choose a different numeric type.

Step 2 – Function for reading the XML file

This is a pretty small data file, only about 750 records, but loading it all into memory at once is a waste. I want to load the “Food_Display_Row” XML elements one at a time, convert to a Food object, store in MongoDB, and move on to the next. It’s a job for a streaming API and an iterator, powered by “yield return” to get one XElement at a time:
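The streaming pattern described above is the standard XmlReader + XNode.ReadFrom combination; a sketch, with the method name my own invention:

```csharp
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

static IEnumerable<XElement> StreamRows(string path, string elementName)
{
    using (var reader = XmlReader.Create(path))
    {
        reader.MoveToContent();
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element &&
                reader.Name == elementName)
            {
                // Materialize only this element, yield it, then
                // continue scanning -- the full file is never in memory
                yield return (XElement)XNode.ReadFrom(reader);
            }
        }
    }
}
```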

Step 3 – Pull it all together and load the data

With the pieces in place, the load process is pretty simple. Connect to the server, get the database (MongoDB creates it on first use), get the collection (MongoDB also creates the collection), and use the iterator to read the XML file, load each element into a Food object and insert into the MongoDB collection. At the end, we have a MongoDB database with a collection of data from the food guide pyramid.
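Putting it together looks roughly like this with the legacy 1.x C# driver that was current at the time; StreamRows stands in for the Step 2 iterator, and the database, collection, and file names are illustrative:

```csharp
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost");
var server = client.GetServer();
var db = server.GetDatabase("food");           // created on first use
var foods = db.GetCollection<Food>("foods");   // likewise

foreach (var row in StreamRows("Food_Display_Table.xml", "Food_Display_Row"))
{
    foods.Insert(new Food(row));
}
```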

My favorite thing about this is that I never had to leave C# to create the database, parse the source XML, or load the data. I don’t have to run a separate ETL process or use management tools to configure my database schema. It’s a simple, self-contained solution.

My second favorite part is that I ran all of this under Ubuntu and Mono. It should work just as well under Windows and .NET, but life is better running under an open source software stack.

I hope you find this helpful if you’re getting started with MongoDB and want a little data to play with.