How This Web Site Uses MongoDB

Warning: this post is intended for developers. It gets a bit
technical!

Sometimes I'm asked what platform we're running the Business
Insider on. Well, we're using LAMP, of
course: Linux, Apache, Mongo, PHP. After I get
past defending our choice of PHP to the haters (you know who you
are!), people ask what the M stands for. Our database is not
MySQL, but Mongo.

So what's Mongo?

MongoDB is an open-source,
non-relational database that combines three key qualities: it is
scalable, schemaless, and queryable. It has native drivers for
pretty much every major language, and a small but growing
community.

Mongo's design trades off a few traditional features of databases
(notably joins and transactions) in order to achieve much better
performance. It is perhaps most comparable to CouchDB for its JSON
document-oriented approach, but has much better querying
capabilities: you can do dynamic queries without pregenerating
expensive views.

So Mongo occupies a sweet spot for powering web apps.

Full disclosure: TBI and 10gen,
the developers of MongoDB,
share certain investors and board members. (Specifically,
10gen was also founded by our co-founder, Dwight Merriman, and
Dwight still owns stock in both companies.) In early 2008, 10gen
assisted in the development of what was then Alleyinsider, and the
site adopted MongoDB at that time. But that was all before I got here.
We continue to use MongoDB today, despite rearchitecting the rest
of the platform, because I believe it's the best technology for
us.

Here's why and how we use it:

It's Scalable

TBI gets fairly high traffic, and we're growing quickly. On a
typical business day we serve upwards of 600k pageviews, and
we're rapidly closing in on the 1m mark. We have three
load-balanced Apache webservers, but our database is just running
on a single box. (We do have a slave, but use it only as a hot
backup.) Our one box isn't running anywhere near total capacity
despite fielding a few hundred reads and writes per second.
Typically, even when the site's busy, we're running at under 5%
of total CPU time.

When we do eventually need to scale up further, Mongo has
automatic sharding features to distribute data and load across
multiple boxes. We don't need these features yet, but it's good
to know they exist.

Document-oriented, not relational

RDBMSs were invented in the 1970s, long before object-oriented
programming and dynamic scripting languages became popular. By
now, we're all accustomed to the process of translating our
code's data structures back and forth to the tables in our
database, but it doesn't have to be that way.

Rather than rows in a table, Mongo stores documents in
collections. Documents are (slightly enhanced) JSON
objects, so you can stash much more complex structured data in a
single document than you can store in a table row. You get to use
natural data structures: arrays, objects, dictionaries. Data
modelling becomes a much more natural process.

Embedding objects

Our data modelling approach is different -- instead of using
multiple tables and joining them together with foreign keys, we
can embed objects within a single document.

For example, each post on our site is a document, much as a post
would be a row in a table in a MySQL-based system. But comments
are different: we embed them directly
within the post document as an array of objects. All of the
comment data, including the text of each comment, information on
who posted it, and the thumbs up/thumbs down voting, is stored
directly within the post document.

When our code pulls up a post like this one, the database doesn't
have to query over a separate comments table. The comments are
right there as part of the post object, ready to be displayed.
This is faster, and makes intuitive sense.
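
As a rough sketch of the idea, here is what such a post document
might look like as a plain Python dictionary. The field names
(title, comments, author, thumbs_up, thumbs_down) are made up for
illustration and are not necessarily the ones the site uses.

```python
# Hypothetical shape of a post document with its comments embedded.
# All field names are illustrative assumptions, not the real schema.
post = {
    "title": "How This Web Site Uses MongoDB",
    "categories": ["MongoDB"],
    "comments": [
        {"author": "alice", "text": "Nice writeup.", "thumbs_up": 3, "thumbs_down": 0},
        {"author": "bob", "text": "Why not MySQL?", "thumbs_up": 1, "thumbs_down": 2},
    ],
}


def render_comments(post):
    """Return one display line per embedded comment; no second query needed."""
    return [
        f'{c["author"]}: {c["text"]} (+{c["thumbs_up"]}/-{c["thumbs_down"]})'
        for c in post["comments"]
    ]


for line in render_comments(post):
    print(line)
```

Once the post has been fetched, everything needed to render the
comment thread is already in hand.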

No Object-Relational Mappers

And we don't have to use an ORM. ORMs have been described as "The
Vietnam of Computer Science" for a reason. Our code winds up
simpler, since we don't need to introduce an artificial layer of
abstraction.

(We do use a light wrapper library I've written for PHP called
SimpleMongoPhp --
you're welcome to make use of it if it helps you. There are other
similar libraries for PHP and other languages,
including plugins to popular existing frameworks.)

Schemaless

If you've ever managed a medium-sized site, you know what a pain
making changes to your schema can be. On a large dataset, you can
lock the database for a long time doing an alter, meaning you
have to schedule downtime concurrent with code releases.
Rollbacks can be even worse. Even though some frameworks and
libraries will help you a little, it's a big deployment problem.

Plus you have the day-to-day nuisance of managing a schema:
making sure your dev database has the changes, as well as all of
your individual developers' versions of their databases.

We don't have to deal with that using Mongo. There's no
database-enforced schema, so when we make a big change (like
adding thumbs-ups to the comments, as we did last month), we can
easily make it in a backwards-compatible way. We just make sure
the code handles the case where a field isn't set.
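
A minimal sketch of that kind of backwards-compatible read, in
Python for brevity. The thumbs_up field name and the helper are
assumptions for illustration; the point is only that old documents
without the field keep working.

```python
def thumbs_up_count(comment):
    """Read the (hypothetical) thumbs_up field, treating a missing field as zero."""
    return comment.get("thumbs_up", 0)


# A comment saved before the feature shipped has no thumbs_up field at all.
old_comment = {"author": "alice", "text": "Saved before the feature existed."}
new_comment = {"author": "bob", "text": "Saved after.", "thumbs_up": 4}

print(thumbs_up_count(old_comment))
print(thumbs_up_count(new_comment))
```

No migration is needed here: old comments simply read as having
zero thumbs-ups until someone votes on them.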

Life isn't perfect: once in a while, we still have to do a data
migration that goes along with a code release. Rather than an
alter, maybe we need to do some kind of transformation of
pre-existing data. But it doesn't happen nearly as often: maybe
about a tenth as often as it did when I used MySQL.

Tagging

Mongo is excellent at indexed "tag" type queries. If we have a
post tagged Apple and iPhone, we
store that internally as an array of strings:

{ categories: ['Apple', 'iPhone'] }

Mongo can index that field and understand that it needs to search
the contents of the array, so we can query for all posts tagged
"iPhone" very easily.

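To make the array-matching behaviour concrete, here is a small
pure-Python sketch that mimics it in memory. It is a simplified
model, not how Mongo is implemented, and the helper name
matches_tag and the sample posts are hypothetical. Against a real
database, the equivalent query document would be
{"categories": "iPhone"}.

```python
def matches_tag(doc, field, value):
    """Simplified model of an equality match: true if the field equals the
    value, or if the field is an array that contains the value."""
    stored = doc.get(field)
    if isinstance(stored, list):
        return value in stored
    return stored == value


# Hypothetical sample data; only the categories field matters here.
posts = [
    {"title": "iPhone sales jump", "categories": ["Apple", "iPhone"]},
    {"title": "New MacBook", "categories": ["Apple"]},
    {"title": "Untagged post"},
]

iphone_posts = [p["title"] for p in posts if matches_tag(p, "categories", "iPhone")]
print(iphone_posts)
```

The sketch scans every post; the benefit of the index in Mongo is
that it avoids exactly that scan.
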
Caching

It's still useful to have a caching layer, and so we do -- we use
memcached. But we
do a lot less caching than we would on a MySQL database. Mongo is
very fast at retrieving individual objects, so we don't need to
cache individual posts. The post you're looking at right now is
not cached; it's being pulled live from the database. That
doesn't kill Mongo because it will generally keep that document
in memory.

But we do still do some caching on more complex queries. For
example, all of our homepages are on a three-minute cache delay,
as is the "Most Popular" listing on the sidebar to your right. But
because Mongo is usually going to be as fast as memcached for
retrieving individual documents, a lot of common situations that
I used to cache don't have to be anymore.
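
The three-minute delay can be sketched as a small time-to-live
cache. This is an in-process stand-in written for illustration,
not memcached itself; the class, the build_homepage function, and
the cache key are all hypothetical.

```python
import time


class TTLCache:
    """Tiny in-process cache with a per-entry time-to-live, in seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value, or call compute() and cache the result."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self.ttl, value)
        return value


def build_homepage():
    """Stand-in for the expensive multi-document query behind a homepage."""
    build_homepage.calls += 1
    return f"homepage v{build_homepage.calls}"


build_homepage.calls = 0

cache = TTLCache(ttl_seconds=180)  # three minutes, as in the text
first = cache.get_or_compute("home", build_homepage)
second = cache.get_or_compute("home", build_homepage)
print(first, second, build_homepage.calls)
```

The second lookup is served from the cache, so the expensive
function runs only once within the three-minute window.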

And Mongo itself can be used as an effective caching
layer. If your collection is small, Mongo will keep it entirely
in memory and performance will be comparable to a cache. We do
this with our "settings" collection, which stores dynamically
customizable options, like which ads are currently turned on, and
the "Hot" links below our nav bar.

Real-Time Analytics

Like many sites, we use Google
Analytics and other packages to get detailed information about
our traffic. But things move too fast to wait until the next day
for data. We have statistics pages that provide up-to-the-second
live data on what's happening on our site: what pages people are
looking at and clicking on right now. Editors can use
that data throughout the day for instant feedback on what they're
doing.

Mongo is ideally suited to these real-time analytics. Our
internal tracker does between 3 and 8 upserts on the database per
pageview, and Mongo handles these without any trouble.
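
An upsert updates the matching document if it exists and inserts
it otherwise. The sketch below mimics that with an in-memory
dictionary so it runs without a database. The function and field
names (track_pageview, views) are hypothetical, and the real
tracker's documents are not shown in this post. With a driver such
as PyMongo, the corresponding call would be roughly
update_one({"_id": url}, {"$inc": {"views": 1}}, upsert=True).

```python
def upsert_increment(collection, key, field, amount=1):
    """Mimic an upsert with an increment: create the document if it is
    missing, then add `amount` to `field` (a missing field counts as zero)."""
    doc = collection.setdefault(key, {"_id": key})
    doc[field] = doc.get(field, 0) + amount
    return doc


def track_pageview(collection, url):
    """Record one pageview for a URL (hypothetical tracker step)."""
    return upsert_increment(collection, url, "views")


stats = {}  # stand-in for a stats collection, keyed by URL
track_pageview(stats, "/mongodb-post")
track_pageview(stats, "/mongodb-post")
track_pageview(stats, "/other-post")
print(stats["/mongodb-post"]["views"])
```

The caller never has to check whether a counter exists before
incrementing it, which keeps the per-pageview work to a single
write per counter.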

This is such an ideal use of Mongo that it'd be a great way to
augment a lot of sites that plan to use an RDBMS for the
foreseeable future. Just get a spare box, throw Mongo on it, and
have Mongo power your real-time analytics. There's more about
this topic
here.

Image Storage

Mongo can store binary data in the database, so that we don't
have to deal with the common hassle of having files in the
filesystem and metadata in the database. Using its GridFS API, we
can easily stash all of our images on the site in Mongo.

We do use a CDN (CDNetworks) in front of our images, but on the
occasions we've taken the CDN off, Mongo has performed fine
serving the images.

Why Not Use Mongo?

Mongo's pretty great in general, and it's become my default
choice for a datastore in a web app. But there are a few things
it's not great at, at least right now:

it lacks transactions, so if you're a bank, I wouldn't use
it.

it doesn't support SQL, so if you have a legacy codebase that
relies on SQL, or if you need to use some of the more complex
collation/grouping features of SQL-based DBs, Mongo may not be
the best choice

it doesn't have any built-in revisioning like CouchDB

it doesn't have real fulltext searching features (slow
regexes and tag-stemming are the best you can do)