NoSQL’s great, but bring your A game

MongoDB might be a popular choice among NoSQL databases, but it’s not perfect, at least out of the box. At last week’s MongoSV conference in Santa Clara, Calif., users from companies including Disney, Foursquare and Wordnik shared their experiences with the product. The common theme: NoSQL is necessary for a lot of use cases, but it’s not for companies afraid of hard work.

If you’re in the cloud, avoid the disk

According to Wordnik technical co-founder and vice president of engineering Tony Tam, unless you’re willing to spend beaucoup dollars on buying and operating physical infrastructure, cloud computing is probably necessary to match the scalability of NoSQL databases.

Wordnik actually launched on Amazon Web Services using MySQL, Tam explained, but the database hit a wall at around a billion records. Wordnik switched to MongoDB, which solved the scaling problem but introduced disk I/O problems that caused a major performance slowdown. So Wordnik ported everything onto some big physical servers, which drastically improved performance.

And then came the scalability problem again, only this time in terms of infrastructure. So, it was back to the cloud. But this time, Wordnik got smart and tuned the application to account for the strengths and weaknesses of MongoDB (“Your app should be smarter than your database,” Tam said), and tuned MongoDB to account for the strengths and weaknesses of the cloud.

Among his observations was that in the cloud, virtual disks have virtual performance, “meaning it’s not really there.” Luckily, he said, you can design to take advantage of virtual RAM. It will fill up fast if you let it, though, and there’s trouble brewing if requests start hitting the disk. “If you hit indexes on disk,” he warned, “mute your pager.”
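Tam’s warning boils down to a capacity check: do the indexes and the frequently read data fit in memory with room to spare? A minimal sketch of that arithmetic follows. The function name, the 20 percent headroom and the sizes are illustrative assumptions, not figures from Wordnik.

```python
GIB = 1024 ** 3


def fits_in_ram(index_bytes, hot_data_bytes, ram_bytes, headroom=0.2):
    """Return True if indexes plus hot data fit in RAM after reserving headroom.

    headroom is the fraction of RAM kept free for the OS and for growth
    (an assumed safety margin, not a MongoDB setting).
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = ram_bytes * (1 - headroom)
    return index_bytes + hot_data_bytes <= usable


# Hypothetical server with 16 GiB of RAM and 6 GiB of indexes.
print(fits_in_ram(6 * GIB, 8 * GIB, 16 * GIB))  # False: 14 GiB needed, 12.8 GiB usable
print(fits_in_ram(6 * GIB, 4 * GIB, 16 * GIB))  # True: 10 GiB needed, 12.8 GiB usable
```

When the check fails, the options the speakers described are the usual ones: add memory, shard the data, or trim what the application keeps hot.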

Foursquare’s Cooper Bethea echoed much of Tam’s sentiment, noting that “for us, paging the disk is really bad.” Because Foursquare works its servers so hard, he said, high latency and error counts start occurring as soon as the disk is invoked. Foursquare does use disk in the form of Amazon Elastic Block Store (EBS), but it’s only for backup.

EBS also brings along issues of its own. At least once a day, Bethea said, queued reads and writes to EBS start backing up excessively, and the only solution is to “kill it with fire.” What that means changes depending on the problem, but it generally means stopping the MongoDB process and rebuilding the affected replica set from scratch.

Monitor everything

Curt Stevens of the Disney Interactive Media Group explained how his team monitors the large MongoDB deployment that underpins Disney’s online games. MongoDB has its own tool, the MongoDB Monitoring Service (MMS), that Stevens said he swears by, but it isn’t always enough. It shows traffic and performance patterns over time, which is helpful but only a starting point.

Once a problem is discovered, figuring out the underlying cause is “like CSI on your data.” Sometimes, an instance just needs to be sharded, he explained. Other times, the code could be buggy. One time, Stevens added, they found that a poorly performing app didn’t have database issues at all, but was split across two data centers that were experiencing WAN issues.

Oh, and just monitoring everything isn’t enough when you’re talking about a large-scale system, Stevens said. You have to have alerts in place to tell you when something’s wrong, and you have to monitor the monitors. If MMS or any other monitoring tool goes down, you might think everything is just fine while the kids trying to have a magical Disney experience online are paying the price.
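One common way to “monitor the monitors” is a heartbeat check: each monitoring tool reports in periodically, and a separate watcher flags any tool that has gone quiet. The sketch below illustrates that idea only. It is not how Disney or MMS does it, and the monitor names and the two-minute threshold are made up for the example.

```python
import time


def stale_monitors(last_heartbeat, max_age_seconds, now=None):
    """Return the sorted names of monitors whose last heartbeat is too old.

    last_heartbeat maps a monitor name to the Unix timestamp of its most
    recent check-in. now defaults to the current time.
    """
    if now is None:
        now = time.time()
    return sorted(
        name
        for name, timestamp in last_heartbeat.items()
        if now - timestamp > max_age_seconds
    )


# Hypothetical check-in times, in seconds since the epoch.
heartbeats = {"mms_agent": 1000.0, "alerting": 1290.0}
print(stale_monitors(heartbeats, max_age_seconds=120, now=1300.0))  # ['mms_agent']
```

The watcher itself should run somewhere independent of the tools it watches, so that a single outage cannot silence both.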

By the numbers

If you’re wondering what kind of performance and scalability requirements pushed these companies to adopt MongoDB, and then to customize it so heavily, here are some statistics:

Foursquare: 15 million users; 8 production MongoDB clusters; 8 shards of user data; 12 shards of check-in data; ~250 updates per second on user database, with maximum output of 46 MBps; ~80 check-ins per second on check-in database, with maximum output of 45 MBps; up to 2,500 HTTP queries per second.

Wordnik: Tens of billions of documents with more always being added; more than 20 million REST API calls per day; mapping layer supports 35,000 records per second.
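To put the daily figure on the same footing as the per-second numbers, it helps to convert it. The short calculation below is a back-of-the-envelope sketch: it treats the 20 million calls as evenly spread across the day, which real traffic is not, so peak rates would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_per_second(calls_per_day):
    """Convert a daily call count to an average per-second rate."""
    return calls_per_day / SECONDS_PER_DAY


# Wordnik's reported floor of 20 million REST API calls per day.
print(round(average_per_second(20_000_000)))  # 231
```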