Kuba Suder's blog on Mac & iOS development

This week I went to Berlin for 2 days to attend the MongoBerlin conference organized by 10gen, the creators of MongoDB. I took a lot of notes from the presentations, and I figured this may be useful for someone if they missed a presentation (or just couldn’t go to the conference).

Feel free to correct me if I got something wrong :)

“BRAINREPUBLIC: A high-scale web-application using MongoDB and RabbitMQ”

MongoDB vs CouchDB:

CouchDB – slower (because it uses a HTTP API), easier replication

MongoDB – faster

they needed high performance, so they chose MongoDB

Mongo has a rich query API that lets you do more than Couch’s map/reduce

they use RabbitMQ, Solr, Varnish proxies and Heartbeat

MongoDB is a swiss knife of NoSQL, very fast, reliable and very easy to use

NoSQL doesn’t mean you can be lazy about proper application and database design

current problems with MongoDB:

quite slow replication speed (someone from the audience said that replica sets are fast)

indexes can be very slow if they don’t fit into memory

simple authentication – sometimes too simple

map/reduce is single-threaded (locks the machine)

“MongoDB Internals: How It Works”

what does an insert do:

constructs Object ID (includes timestamp, machine/process ID, counter); two clients at the same time can’t generate the same ID, and IDs always increase with time, so order by id = order by time

converts data to BSON

sends a TCP message (using MongoDB Wire Protocol)

query:

returned as a cursor: first n results + cursor id, and then you can get next results using the cursor (n depends on the size of documents)

contrary to relational databases, cursors are not associated with sockets, and can be shared between connections; server cleans up cursors periodically if you don’t use them (this gives more flexibility in connection management on the client side)

authentication – should rarely be required, it’s recommended to protect the database at the network level (using firewalls)

query optimizer:

it’s empirical, i.e. at first it tries all possible ways to get the results, and then remembers which one works best (it runs all algorithms in parallel and finishes as soon as one of them finishes), then reuses that knowledge in future requests

if the selected algorithm becomes very slow, it tries all possible ways again

so first time a query is called, it might be quite slow

on the other hand, if something changes later, e.g. an index becomes slow, Mongo will work around that

commands:

return short values as results, not actual data (there’s a limit on how much data you can return from a command)

internally, they’re performed as finds on $cmd collection

all commands can be invoked via runCommand with the name as one of parameters

safe mode:

normally, commands return as soon as the request is written to the socket

if you want to wait until it arrives on the other side and is executed, you use safe mode

it uses getlasterror command, which waits until the operation is executed, and returns an error if there was an error, and nothing if the operation succeeded

getpreviouserror – returns last error that actually happened (not necessarily from last request)

database files:

data and indexes are stored in foo.0, foo.1 etc.

metadata about namespaces (collections) is stored in foo.ns

database files are preallocated (so don’t go crazy if the files get big – they may be mostly empty)

one set of files per database, all in one directory (there’s an option to put them in separate directories)

$freelist namespace – list of areas that can be reused (because data was deleted)

document record contains:

header (includes: size, links to previous/next, info about how fast the record grows to estimate how much padding is needed)

BSON data

padding

it’s good to design schema so that the document size doesn’t change very often, because then the server has to move the data often

capped collection:

collection with a limited size

if it hits the limit, it deletes the oldest documents

useful for logs

“Indexing and Query Optimizer”

it’s often difficult to guess up front what indexes are needed, you need to know your actual, real query patterns

a collection may have 64 indexes, but a query can only use one at a time

indexes aren’t free – they add linear cost to all inserts and updates

for compound indexes, order of keys is important

the indexer is smart enough to handle cases where a field is a number in some documents, and string in others, and optimizer will use that (for arrays, it adds one entry for each element to the index)

if you query by regexp, it can use the index on that parameter, but only if it’s a left-anchored regexp (/…/)

geospatial search is a special case, it can only use indexes and won’t run without them

in what queries indexes can’t be used:

negations

$mod

most regexps

JavaScript calls ($where, map/reduce)

in general, if you can imagine how an index is used, it will be, and if you can’t, it won’t be

db.collection.find(…).explain – explain what exactly the query does:

cursor:

BtreeCursor – means it uses an index

BasicCursor – table scan, doesn’t use an index

nscanned – number of documents scanned

if something is slow, and you think it has an index, check this first

make sure this isn’t the initial query run, where it tries all possible ways to run a query

query profiler – logs all queries slower than a given limit to the log

how to pick a limit: get some statistics of query run times, if you see two maximums on the graph and a minimum between them, pick that minimum

if you really want to force Mongo to use an index:

find(…).hint({ x: 1 }) to use an index on x

find(…).hint({ $natural: 1 }) to disable indexes

this should almost never be necessary…

“Sharding Internals”

when to use sharding – when you:

have too much data to fit on one box

have too big indexes to fit into memory on one box

want to divide write load on many boxes to make writes faster

want to divide reads on many boxes and keep consistency (if you don’t need full consistency, just add more slaves)

auto balancing:

the shards will automatically rebalance themselves in the background with no downtime when you add more nodes

you can also add sharding seamlessly without downtime to non-sharded master/slave or replica set; everything is consistent as if it was a single server

this won’t work if you want to change the sharding key on existing shards – too much data to move

it’s up to you to decide how to partition the data, because it all depends on the kind of data you have

examples:

user profiles – shard by some kind of unique id (so lookup by id will go to single shard; lookup by other things, e.g. location, will have to check all shards)

user activity, timeline – shard by user id, so user’s timeline will load from a single shard

photos, like flickr – shard by photo id, because very often you’ll know the photo id

logging:

you could shard by date, but then all currently written logs will go to a single server and will put all load on that single server

you can divide by source machine, but then searching for all recent logs will have to access all shards

you could also divide by log type (e.g. mysql logs, apache logs)

in general, for writes it’s good if they’re divided across shards, and for reads it’s good if they read from one shard

always think what exactly is the bottleneck (data, speed), why do you want to shard, and how the system is going to be used

shard config servers:

they store all metadata about shards

they’re very important – if all of them go down, it’s very difficult to restore the database (not impossible, but very time consuming)

they use two-phase commits

you should have at least 3 of them

if one goes down, metadata becomes temporarily read only, so you can’t add/modify shards during that time (but you can still add data)

they require very little resources, so can be installed on existing machines

each shard is a master/slave pair or a replica set (or just a master, but that’s a bad idea)

mongos – sharding router:

acts as a frontend of the databases for the application

clients can’t tell the difference between a regular mongo database and a sharding router, they access it in exactly the same way

redirects all requests to the shards (using the metadata from a config server)

you can have as many as you want

one solution is to have one on every application server – then there’s no extra network load, because every application server accesses its own mongos on the same machine

chunk migrating:

moving chunks (parts of a collection) to a different shard

during the copy, the origin keeps a log of all operations, because clients can still do writes in the meantime

migration is finished when the data is copied and all new operations are applied (when the target catches up with the changes on the source)

balancing: server automatically detects when a shard has too much data and moves some chunks to other shards

usually only one or a few biggest collections are sharded, and the non-sharded data is kept on one shard which is automatically selected by the system as primary

but you can start with 2 or 3 (one box can hold master for one shard and slave for another)

“Replication Internals”

oplog:

it’s a capped collection that stores recent operations on master

it should be large enough to hold all operations for some period of time – how much exactly depends on the speed of adding new data

on master/slave setups oplog is kept only on master, and on replica sets – on all nodes

syncing: if slave is empty, it first copies all data from master, then applies all the changes from oplog that happened in the meantime

some operations are converted when added to oplog, to guarantee that they are idempotent:

incs are translated to updates with $set

deletes are divided to one per record (because it won’t be possible to match the same set of records again)

no-op operations are added to oplog in regular intervals to ensure that it isn’t paged out to disk

—fastsync

assumes that data on the slave is identical to the master data

you can use that if you copy all the data manually, e.g. by sending whole drives to another data center

FedEx delivers replicated data at about 11 Mb/s ;)

with capped collections, it’s possible to do a ‘blocking read’ that will wait until new data is available (like tail -f)

replica pair: something old that was replaced by replica sets, don’t use that

replica sets:

instead of master and slave, we have primary and secondary, because they’re dynamic

when primary is down, for a short time writes are blocked, and remaining nodes have an election process to find out which one has the newest data, and that node becomes a new primary

some really new data, that wasn’t written to any secondary nodes, may be lost

when old primary reconnects, it marks all changes that didn’t make it into the new primary as invalid and logs them somewhere

there may be some problems with determining if the primary is dead or just inaccessible – may be solved by setting vote strengths, adding extra nodes, or adding arbiters (nodes that don’t store any data and only take part in the election process)

“Scaling with MongoDB”

often you can change performance by orders of magnitude only by rethinking the schema

embedding:

it’s great for read performance if you need to read the main object and all embedded objects together (there’s one request to load all objects)

but writes can be slow if you’re adding embedded objects all the time, because very often the document will have to be moved somewhere else on the disk

if the updates to embedded objects are e.g. 1-2 per second then it will copy the data all the time

indexing: if you have an index on (A, B) then you don’t need an index on A

sometimes you need to think hard if having an index will bring better performance than not having it – because huge indexes that swap out constantly are very bad for performance

right balanced indexes:

index is right balanced if all the new elements go to the right side (when they’re sorted by date or id)

in such index, usually only the right side of the index needs to be kept in memory, even if the whole index is huge

it might sometimes make sense to add a new key just to have a right balanced index

use this for data that is mostly accessed only for time, e.g. recent photos, messages, etc.

working set: you need to figure out how long a user stays on the site and how much data they’re going to need during that time (and how many users come simultaneously)

horizontal scaling:

usually read scaling is most important, because most people just read the data, and only some of them modify it

simplest way: master + one or more slaves; slaves' data may be inconsistent at times

for data size and write scaling: add shards

typical setup for sharding: 9 servers (3 masters, 6 slaves)

for sharding, choose such keys that are more or less evenly distributed in the collection