Replication Internals

[Image: a displacer beast, which seemed related (it's sort of in two places at the same time).]

This is the first in a three-part series on how replication works.

Replication gives you hot backups, read scaling, and all sorts of other goodness. If you know how it works you can get a lot more out of it, from how it should be configured to what you should monitor to using it directly in your applications. So, how does it work?

MongoDB’s replication is actually very simple: the master keeps a collection that describes writes and the slaves query that collection. This collection is called the oplog (short for “operation log”).

The oplog

Each write (insert, update, or delete) creates a document in the oplog collection, as long as replication is enabled (MongoDB won't bother keeping an oplog if replication isn't on). So, to see the oplog in action, start by running the database with the --replSet option:

$ ./mongod --replSet funWithOplogs

Now, when you do operations, you'll be able to see them in the oplog. Let's start out by initializing our replica set:
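The original shell output isn't reproduced here, but a session along these lines (timestamps and ObjectIds are illustrative placeholders) produces the kinds of entries described next:

```
> rs.initiate()
> use test
> db.foo.insert({x : 1})
> db.foo.update({x : 1}, {$set : {y : 2}})
> db.foo.remove({x : 1})
> use local
> db.oplog.rs.find()
{ "ts" : ..., "op" : "i", "ns" : "test.foo", "o" : { "_id" : ObjectId("..."), "x" : 1 } }
{ "ts" : ..., "op" : "u", "ns" : "test.foo", "o2" : { "_id" : ObjectId("...") }, "o" : { "$set" : { "y" : 2 } } }
{ "ts" : ..., "op" : "d", "ns" : "test.foo", "o" : { "_id" : ObjectId("...") } }
```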

You can see that each operation now has an ns field: “test.foo”. There are also three operations represented (the op field), corresponding to the three types of writes mentioned earlier: i for inserts, u for updates, and d for deletes.

The o field contains the document to insert or the criteria for the update or remove. Notice that the update has two o fields (o and o2): o2 gives the update criteria and o gives the modifications (equivalent to update()‘s second argument).
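As a quick reference, the op-code mapping can be captured in a few lines of JavaScript. The helper name here is mine, for illustration only, not part of any driver:

```javascript
// Map an oplog "op" code to a human-readable name.
// (opName is a hypothetical helper, not a MongoDB API.)
function opName(op) {
  var names = { i: "insert", u: "update", d: "delete" };
  return names[op] || "unknown";
}

// For an update entry, o2 holds the match criteria and o the modifications.
var updateEntry = {
  op: "u",
  ns: "test.foo",
  o2: { _id: 1 },          // which document to update
  o: { $set: { y: 2 } }    // what to change (update()'s second argument)
};

console.log(opName(updateEntry.op)); // "update"
```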

Using this information

MongoDB doesn’t yet have triggers, but applications can hook into this collection if they’re interested in doing something every time a document is deleted (or updated, or inserted, etc.). Part three of this series will elaborate on this idea.
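The general pattern is to open a tailable cursor on the oplog and react to each entry. A rough sketch with the node-mongodb-native driver follows; the option names and callback style varied across driver versions, so treat this as pseudocode for the pattern rather than a drop-in implementation:

```
// Sketch: tail the oplog and react to deletes on one namespace.
// Assumes a replica-set mongod on localhost; API details vary by driver version.
var MongoClient = require("mongodb").MongoClient;

MongoClient.connect("mongodb://localhost:27017/local", function (err, db) {
  if (err) throw err;
  var cursor = db.collection("oplog.rs").find(
    { ns: "test.foo" },
    { tailable: true, awaitdata: true, numberOfRetries: -1 }
  );
  cursor.each(function (err, entry) {
    if (err || !entry) return;
    if (entry.op === "d") {
      console.log("document deleted:", entry.o);
    }
  });
});
```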

No, at the moment it’ll actually log an update op. This is just an implementation detail, though, not by design. It may actually change to not being logged in 2.4, which is having a lot of the update code rewritten.

Thanks for the quick reply, Kristina! I was hoping to use this data for tracking changes to documents over a period and ignoring updates that were effectively no-ops (nothing modified). Perhaps I can work out a way to determine this using the current output.

OK. I’m concerned that there’s some chance that the document may have changed in the time since the log entry was created and the process that is recording the event reads it. If so then the underlying document may no longer be current.

I set up a test to listen to the oplog for changes to a certain collection. It’s working fine, but I notice there is a delay of about 1 second maybe 80% of the time. The other times it executes quickly, completing in under 100ms. Is there some delay on the oplog generation? Is it polling to generate the entries? Is there some configuration I could use to decrease the lag, and would that have other penalties, such as overtaxing the CPU?

I am using the node.js driver. This is my implementation, though I don’t imagine it will work outside of my application environment:

I’ll release this as a simple watcher library once I get it cleaned up.

First, it might be the rate you’re writing to the oplog: oplog queries will hang around for a while waiting for results. Second, you might want to use the “oplog replay” option (no idea what it’s called in CoffeeScript), which makes querying the oplog more efficient.

I changed the tailableRetryInterval in my code to 100ms, and that reduced the delay. So it looks like the tailable cursor is actually polling for results. Hmm… not ideal. I’ll take this up on the node-mongodb-native google group though, as it seems it’s an implementation detail of the driver.

Ok, so I actually found an answer. If you initialize the cursor with ‘awaitdata: true’ then it will rely on Mongo server functionality to push out the new data. Here’s an example from the tests in node-mongodb-native:
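The snippet from the driver's tests isn't preserved here, but the relevant cursor options looked roughly like this (from memory; check the node-mongodb-native tests for the exact form in your driver version):

```
collection.find({}, { tailable: true, awaitdata: true, numberOfRetries: -1 })
```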

I think I’ll have to ask Christian more specifically if his driver handles that… I grepped the code and cannot find it, but maybe he is passing all the options through directly. What specifically does this flag do? I can’t seem to find much documentation on it, though I see it mentioned by Scott here:

Yes, that’s the flag I’m talking about. The oplog doesn’t have any indexes, so when you query mongod has to scan every document. The oplog replay flag makes the query start at the latest document and jump back by 200MB at a time to try to find a timestamp earlier than the one you’re querying for. Then, once it’s found the right oplog segment to search, it’ll move forward one document at a time. It makes querying the oplog for a particular timestamp much faster.
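In the mongo shell, a query using that flag looks something like this (the timestamp value is illustrative):

```
> db.oplog.rs.find({ ts: { $gt: Timestamp(1346282900, 1) } }).addOption(DBQuery.Option.oplogReplay)
```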