Using Riak for Ranking Collection

At SEOmoz, one of the features of our PRO campaign application is the weekly retrieval of rankings for all of your campaign's keywords and search engines, across you and your three competitors. This adds up to a lot of rankings very quickly. For instance, if you have 300 keywords, 3 engines and 3 competitors, that works out to 3,600 individual data points (300 keywords x 3 engines x 4 sites, i.e. you plus 3 competitors) every week. That's a lot of data!

Across all our campaigns, this works out to somewhere in the neighborhood of 2.5 million+ SERPs collected a week. This number is trending up as we add more campaigns and see the average number of keywords per campaign go up. Yet rankings collection is just the beginning: we still need to store all this data in a scalable, easy-to-manage format, and be able to retrieve it quickly and in meaningful ways for our customers. No small feat.

Our first system to handle the challenge of rankings was a homegrown MySQL-based solution baked directly into our Rails app. Every night a cron would run a script to enqueue collection jobs for all the rankings to be collected that day. Once the collections were finished, rankings for customer campaigns would be extracted and the results written to a small set of ever-expanding MySQL tables, the largest of which was well over 1TB in size after less than a year.

Kate has some details on the difficulties we had with this system in her Scaling the F*@# out of a Rails App post, but needless to say performance and management of the old system were becoming quite the headache. The decision was then made to create "Rankings 2.0" as a replacement for the existing system. My coworker Myron and I have been working on the new rankings system for quite a few months now. It's too big a system to cover in full in just one post, so instead I want to focus on our choice of Riak as the main data store.

In this post I’ll cover:

What our criteria were for the rankings data store and why we chose Riak.

Our initial experience learning about and getting familiar with how to develop with Riak.

How we use Riak to collect rankings.

Some pain points and gotchas we found along the way.

Why we chose Riak

After running the MySQL-based system for a while, we had become quite familiar with its pain points, and those became the core things we wanted to fix with the new system. Things like:

It was hard to scale. Storage and processing were slow and expensive.

Maintaining high availability while doing things like schema migrations or hardware/software updates was hard.

There were a lot of possible data stores to consider: MySQL, flat files, HBase, Riak, Cassandra, etc. We eventually narrowed the possibilities down to Cassandra and Riak.

Before I talk about why we chose Riak, I want to explain why we didn't choose Cassandra and why MySQL was never really in the running as a possibility.

MySQL

We have lots of data and would quickly need to grow beyond a single MySQL server into something more distributed. The options for doing that boil down to some sort of sharding/partitioning scheme, which is operationally taxing and complicated to maintain. There is also the issue that migrations of large tables in MySQL take forever, so you have to engineer either coordinated outages or some sort of out-of-band table rebuild to deal with them. Maintaining high availability under those circumstances is quite hard.

Cassandra

We actually have operations experience with Cassandra (it powers our crawl service database) and it has lots of desirable properties (distributed, fault tolerant, etc) that would work well for the Ranking service. But its storage model – sorted rows of columns (I’m hugely simplifying here) – benefits contiguous reads and range queries on data that you know the sorting and structure of ahead of time.

This storage format works well for the crawl service, where the entire data graph is built off of one piece of immutable campaign data (the campaign's domain) and where the possible sorts/filters are well known and easy to model. Rankings data, however, is much more fluid and dynamic than crawl data. There is an intractably large number of possible keyword, engine and competitor combinations a customer may configure, and a customer may change their campaign configuration at any time. The "latest ranks" of these combinations, as well as the sets of rankings customers care about, are constantly changing.

In addition, the UI for rankings sorting, filtering and pagination is much more complicated than for crawl data. We bring in your Google Analytics traffic, allow filtering to specific keyword labels, offer four flavors of sorting, and so on. Trying to model all of that with Cassandra would have been a nightmare. For more information, Basho, the makers of Riak, offer an (obviously biased) comparison of Riak to Cassandra on their website.

What is Riak?

So we're going to be using Riak…but what is it? Well, Riak's tag line is an "open source, highly scalable, fault-tolerant distributed database." Under the covers you'll find a Dynamo-based distributed architecture, pluggable data backends, HTTP and Protocol Buffers interfaces, and map/reduce functions written in Erlang or JavaScript.

Riak itself is written in Erlang, and makes use of Erlang’s OTP platform, which provides primitives for inter-node communication, message queues, failure detectors and supervision trees (to name a few). These primitives, plus a loosely coupled architecture, make Riak very fault tolerant and in practice quite difficult to completely crash.

If you're familiar with Eric Brewer's CAP theorem, Riak falls on the "AP" end of things. It values availability and partition tolerance over strict consistency. Instead, Riak promises "eventual consistency," achieved through vector clocks and by allowing clients to write their own conflict resolution logic when vector clock resolution fails[1].

Riak's data model is very simple: buckets, keys and documents. Documents can be any kind of data (even binary), which you access (for all CRUD operations) via the document's key. Buckets conceptually "hold" documents, but really they are just a namespace that keys can be grouped under. Riak uses a consistent hashing algorithm to determine which nodes[2] hold your data. Riak does not sort your data on disk or offer any kind of specific data locality benefits – it's purely constant-time random access.
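To make the model concrete, here is a minimal sketch of basic CRUD against Riak using the riak-client Ruby gem (the client underneath Ripple). The bucket name, key format and document shape are made up for illustration, and client defaults may differ by version.

```ruby
require 'riak'

# Connect to a local Riak node (default host/port of a stock install).
client = Riak::Client.new

# Buckets are just namespaces; there is no schema to declare up front.
bucket = client.bucket('serps')

# Create/update: documents are addressed purely by key.
doc = bucket.new('google|en-US|seo software|2011-10-17')
doc.content_type = 'application/json'
doc.data = { 'keyword' => 'seo software', 'results' => [] }
doc.store

# Read: constant-time random access by key -- no range scans.
fetched = bucket.get('google|en-US|seo software|2011-10-17')
puts fetched.data['keyword']

# Delete.
bucket.delete('google|en-US|seo software|2011-10-17')
```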

When storing any particular document, Riak can replicate the document to n other nodes in the ring[3]. This value is called the "replication factor," commonly referred to simply as n. Replicating your data has a few added benefits:

As long as you have at least n nodes in your ring, you can lose n – 1 nodes without any data loss. If the lost nodes are unrecoverable and you have to replace them with nodes having empty data directories, the new, empty nodes repopulate their missing data via a process called "read repair"[4].

Because any node in the ring can service a request for any key, you maintain high availability and are still able to service all requests for all documents, again as long as at least num_nodes – (n – 1) nodes remain in your ring.

When a node goes down, the remaining nodes in the ring detect that it is no longer available and start accepting writes on its behalf. When the offline node eventually re-joins the ring, the nodes that had accepted its writes send the newly joined node its missing data. This process is called hinted handoff.

There is a lot more to the Riak data model and replication, including tunable consistency parameters on both reads and writes, "virtual nodes," "links," read/write quorum, etc. I haven't covered it all in this blog post since I wanted to focus on our actual use of Riak rather than its technical underpinnings. If you'd like to learn more, I strongly encourage you to check out the Riak Wiki.
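As a rough illustration of those tunable knobs, here is how the replication factor and per-request quorum values can be set from the Ruby client. The bucket, key and values are examples for the sketch, not our production settings, and option names may vary slightly between riak-client versions.

```ruby
require 'riak'

client = Riak::Client.new
bucket = client.bucket('rankings')

# The replication factor (n_val) is a bucket property; 3 is Riak's default.
bucket.props = bucket.props.merge('n_val' => 3)

# Writes can wait for acknowledgement from w of the n replicas...
doc = bucket.new('example-key')
doc.content_type = 'application/json'
doc.data = { 'rank' => 1 }
doc.store(:w => 2)

# ...and reads can require r replicas to respond before returning.
bucket.get('example-key', :r => 2)
```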

Why Riak Works for Us

Riak solves a lot of the pain points of the old rankings system:

Easy to scale. Just add new nodes.

Low operational overhead.

Fast (both writes and reads).

Has map/reduce built in.

The last point, built-in map/reduce, is a big win for us. When we looked at how we wanted to store and retrieve rankings data for our customers, we knew precalculating reports made the most sense. No matter what data store we chose, we'd be doing some sort of map/reduce-esque process over all of a user's rankings to generate their report.

With databases like Cassandra, the current best practice for map/reduce is to plug in external map/reduce software (like Hadoop). With Riak, map/reduce comes built in for free, which is really nice. Riak also has built-in search via Lucene as another way to query your data store, which may become useful for us in the future.

One other huge benefit of the way Riak does map/reduce is that it leverages data locality: the map phase runs in parallel on the nodes that actually hold the data, and a job can be fed just the keys it needs rather than scanning the whole data set every time, as more traditional map/reduce processes tend to do.
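For a sense of what this looks like from the client side, here is a sketch of a map/reduce job submitted through riak-client. The bucket and keys are illustrative and not our actual schema; the map phase uses riak_kv_mapreduce:map_object_value, one of the Erlang map functions that ships with Riak, which simply returns each document's value.

```ruby
require 'riak'

client = Riak::Client.new

# Feed specific keys into the job; the map phase then runs on the
# nodes that hold those documents, in parallel.
mr = Riak::MapReduce.new(client)
mr.add('ranking_lists', 'google|en-US|seo software|example.com')
mr.add('ranking_lists', 'bing|en-US|seo software|example.com')

# Built-in Erlang map function that returns each object's value.
mr.map(['riak_kv_mapreduce', 'map_object_value'], :keep => true)

results = mr.run
```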

Initial Experience with Riak

Getting going with Riak was quite easy. The Riak wiki is full of helpful, well-written and clear documentation. Installation and configuration are relatively painless, so you can be up and running very quickly. Simple concepts, like creating, updating and deleting keys, were easy to understand, and Myron and I were able to integrate Riak into our dev cycle without much trouble.

That said, there were numerous concepts and Riak features that were not so easy to understand/use:

When we wanted to move beyond relying on vector clocks for conflict resolution and actually resolve conflicts ourselves, there wasn't a whole lot of guidance. It took some tinkering to figure out (see the sketch after this list).

Early documentation on writing map/reduce queries (especially in Erlang) was sparse. It covered some simple examples, but a lot of the details, best practices and help on how to debug errors were lacking.

Additionally, once we got off the beaten path, the documentation for Ripple, the official Ruby client for Riak, got a little sparse as well.

To troubleshoot some issues, we needed to learn some Erlang; until we did, a lot of things didn't make sense. I eventually bought the Pragmatic Programmers book on Erlang, which was hugely helpful.
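On the conflict resolution point: the general shape we arrived at is that with allow_mult enabled on a bucket, a fetch can return several sibling versions of a document, and the application is responsible for merging them and writing the result back. Below is a minimal, hypothetical sketch of what such merge logic might look like for rankings-style data. The field names are made up, and the client calls for fetching siblings and storing the winner are omitted here since they vary by riak-client version.

```ruby
# Hypothetical merge of sibling payloads for a rankings-style document:
# union the ranking entries across siblings and, where the same
# engine/keyword/url triple appears more than once, keep the most
# recently collected entry.
def resolve_rankings(sibling_datas)
  sibling_datas
    .flat_map { |data| data.fetch('rankings', []) }
    .group_by { |r| [r['engine'], r['keyword'], r['url']] }
    .map      { |_, dupes| dupes.max_by { |r| r['collected_on'] } }
end

siblings = [
  { 'rankings' => [{ 'engine' => 'google', 'keyword' => 'seo', 'url' => 'a.com', 'rank' => 3, 'collected_on' => '2011-10-10' }] },
  { 'rankings' => [{ 'engine' => 'google', 'keyword' => 'seo', 'url' => 'a.com', 'rank' => 2, 'collected_on' => '2011-10-17' }] }
]

merged = { 'rankings' => resolve_rankings(siblings) }
# The merged document is then stored back to Riak, which collapses the siblings.
```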

We ended up being heavy contributors to Ripple, with Myron being added to the core team. Basho was even nice enough to fly Myron and the other core contributors to Basho's west coast office for a hackathon to bring Riak closer to 1.0.

Overall, however, I will say that our experience with Riak and its community has been overwhelmingly positive. We have had a few issues with Riak in production (more on this later), but community help has always been quick and useful. Basho has spent a ton of time improving the wiki and other documentation, which should make starting out with Riak much easier in the future.

How we Use Riak

Our usage of Riak for storing rankings data is actually quite simple. There are a lot of implementation details that make this process possible, but conceptually, at a high level, it's quite easy to understand.

All user campaigns are mapped to a "Subscription" in our rankings service. A Subscription holds a listing of the keywords, search engines and URL fragments (root domains, subdomains, subfolders) a campaign cares about.

Nightly, we collect rankings from the search engines for the subscriptions scheduled to have their rankings collected that day. These SERPs are serialized to JSON and stored in Riak, with each document keyed by a string combining engine, locale, keyword and date collected.
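A sketch of how such a composite key might be built and used when persisting a SERP follows. The delimiter, field order, bucket name and JSON shape here are illustrative, not our exact format.

```ruby
require 'riak'

# Hypothetical helper: compose the document key from its natural identifiers.
def serp_key(engine, locale, keyword, collected_on)
  [engine, locale, keyword, collected_on].join('|')
end

client = Riak::Client.new
serps  = client.bucket('serps')

key = serp_key('google', 'en-US', 'seo software', '2011-10-17')

doc = serps.new(key)
doc.content_type = 'application/json'
doc.data = {
  'engine'  => 'google',
  'keyword' => 'seo software',
  'results' => [{ 'position' => 1, 'url' => 'http://www.example.com/' }]
}
doc.store
```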

For every SERP that is collected, we spawn two new jobs. One job collects Linkscape data for every URL seen in the SERP. The other analyzes the SERP on behalf of the Subscription, looking for any URLs that rank for the subscription’s URL fragments.

The results of the SERP analysis are saved into another Riak document called a Ranking List, keyed by a string combining engine, locale, keyword and URL fragment. Keying Ranking Lists this way allows us to compute every Ranking List document key for any given Subscription.

Once all of a subscription’s rankings have been analyzed, we run a map/reduce job across all of the possible Ranking Lists for a Subscription and generate a “Recent Rankings Report” that gets cached in Riak.
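Because Ranking List keys are derived entirely from a subscription's own configuration, the set of keys to feed into the report job can be computed without ever listing a bucket. Here is a rough sketch of that enumeration; the subscription fields, bucket name and key format are hypothetical, and the real job's map/reduce phases are written in Erlang rather than the built-in function used below.

```ruby
require 'riak'

# Hypothetical subscription configuration.
subscription = {
  'keywords'  => ['seo software', 'link building'],
  'engines'   => [%w[google en-US], %w[bing en-US]],
  'fragments' => ['example.com', 'blog.example.com']
}

client = Riak::Client.new
mr = Riak::MapReduce.new(client)

# Every engine/keyword/fragment combination maps directly to a
# Ranking List document key, so we can enumerate them all up front.
subscription['engines'].each do |engine, locale|
  subscription['keywords'].each do |keyword|
    subscription['fragments'].each do |fragment|
      mr.add('ranking_lists', [engine, locale, keyword, fragment].join('|'))
    end
  end
end

# Map runs on the nodes holding the data; results feed the cached report.
mr.map(['riak_kv_mapreduce', 'map_object_value'], :keep => true)
report_rows = mr.run
```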

Our PRO web application then downloads this cached recent rankings report (a JSON document) and transfers the data from the report into its local relational database for quick and easy sorting, filtering and pagination.

If a user requests a "ranking history" page (the one with the ranking graph at the top), we run an on-demand map/reduce over the 1 to 4 relevant Ranking List documents for the subscription. We can do the ranking history map/reduce job in real time because it's much quicker than the full Recent Rankings map/reduce job, which can go over tens of thousands of Ranking List documents.

Issues and Gotchas

Things have certainly not been all smooth sailing with our usage of Riak in production. The majority of the issues have been things outside of Riak's control, but there were still some problems/bugs in Riak that we ran into. A lot of these have been fixed in the latest release of Riak (1.0); I've noted where that's the case.

Basho advertised a list_keys operation to traverse all keys in a bucket. This was an in-memory O(n) traversal, which was slow, but we were planning on using it sparingly for small buckets (like subscriptions). It turns out that because a bucket is basically just a conceptual namespace prefix, list_keys actually goes over every single key in the entire database. The recommendation now is to never list keys in production. Instead, we created a "schedule" document for each possible collection schedule (there are only a few), and read the subscriptions due for rankings collection from those schedule documents.
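A rough sketch of that schedule-document workaround follows; the bucket names, keys and document shape are illustrative, and the enqueue helper is hypothetical. The idea is simply that membership is maintained at write time so the nightly job never has to enumerate keys.

```ruby
require 'riak'

client    = Riak::Client.new
schedules = client.bucket('schedules')

# One small, pre-created document per collection schedule, e.g. 'monday'.
schedule = schedules.get('monday')

# Registering a subscription is a read-modify-write on that document
# (real code would also handle concurrent updates / siblings).
schedule.data['subscription_keys'] |= ['subscription-1234']
schedule.store

# The nightly job reads the schedule document directly -- no key listing.
schedules.get('monday').data['subscription_keys'].each do |key|
  # enqueue_collection_job(key)  # hypothetical enqueue helper
end
```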

Riak also has a concept of key filters, which allows you to ask for all keys in a bucket that meet certain characteristics. This was a useful feature for locating all Subscription documents meeting certain scheduling criteria. Unfortunately, key filters use list_keys under the covers and, as we learned above, that is not a good idea in production.

Riak has a concept of Links, which let you create one-way links between documents and also "walk the links" of a document, which basically does map/reduce under the covers to return linked-to documents meeting certain filtering criteria. At one point in our design we were using this feature heavily, with some documents having hundreds or thousands of links. Links end up in HTTP headers, on which a lot of existing HTTP software (like HAProxy or Nginx) puts a size restriction, so we had to abandon our use of links.

We managed to cause Riak to segfault when running JavaScript-powered map/reduce queries; specifically, the crash was in erlang_js_drv.so. Riak still uses the SpiderMonkey JavaScript engine, which is pretty ancient and has caused issues for other Riak users, so consider this a heads up. We've heard of other users who abandoned JavaScript map/reduce as well. We ended up writing our map/reduce queries in Erlang to get around this problem.

The rankings service was originally running on Ubuntu 10.04, which has issues when running under Xen (we were on EC2). Our main problem was that EC2 nodes would randomly restart for no apparent reason, almost as if someone had pulled the power cord. When that happened, it left some Bitcask files (Bitcask is the default data backend for Riak) in a truncated/errored state. Now, Bitcask is supposed to be tolerant of this since it only writes to one file at a time in an append-only fashion, but because of the sudden truncation of the Bitcask files we ran into a Bitcask bug and had to restore from our backups. This restart, truncation, error, restore cycle happened three times before we could upgrade our servers to 11.04, which totally fixed the restart problem.

When you restore from a backup, any writes made to the restored node since the backup are lost. This isn't terrible, as other nodes have replicas of the data (remember the n value from above), but the data is not automagically restored. Data is only restored if a read is made on a key that the restored node is supposed to have. When that happens, Riak notices the data is missing from the restored node and restores it.

The lack of automagic restoration isn't a big deal either, as you still have enough nodes with the data to meet quorum and service requests. The problem is that when doing map/reduce, Riak does a bunch of internal reads using an r value of 1. This means that if Riak happens to read from the restored node for a key that hasn't been read-repaired yet, you get a "not found." (Keep in mind this was only the case when running map/reduce with Erlang, not JavaScript.) To work around it, we would manually re-read any documents reported as "not found" during the Erlang map/reduce query to verify whether they actually were missing.
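A sketch of that verification pass is below. How the missing keys surface from the Erlang job is left out, and the exception class used for a missing key may vary by riak-client version and transport; a normal read uses the full read quorum, so it also triggers read repair on a restored node that is missing its replica.

```ruby
require 'riak'

client = Riak::Client.new
bucket = client.bucket('ranking_lists')

# Keys the Erlang map/reduce job reported as "not found" (illustrative).
missing_keys = ['google|en-US|seo software|example.com']

actually_missing = missing_keys.select do |key|
  begin
    bucket.get(key)   # full-quorum read; repairs the restored node's replica
    false             # found after all -- the map/reduce "not found" was spurious
  rescue Riak::FailedRequest
    true              # genuinely gone
  end
end
```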

I/O performance, especially disk I/O, is very important for Riak. We originally ran Riak in EC2, but found the disks to be too slow for our workload and ended up moving the cluster to physical hardware to get better performance. On EC2, consistent I/O wait of 30-40% is not uncommon. It's also worth noting that Basho says, "Riak will run best when not virtualized," but they do offer some tips if you still want to run it virtualized.

When you add a new node to the cluster in Riak 0.14.2, the new node claims a portion of the ring and immediately starts servicing requests for those keys. However, this happens before the node has actually received all the data for those keys, so the node can serve phantom "not found" results.

Phantom not-founds shouldn't be a problem if you meet read quorum (other nodes still have the data), except that while a new node is being added to the ring, Riak may actually shuffle data to/from any node, not just the one being added. So it's entirely possible to end up with only one (or in rare cases zero) of the nodes that claim a key actually having the data. Even a read with an r value of 1 won't help you, because of an internal optimization Riak makes called "basic quorum." This mailing list thread has all the gory details. Needless to say, the new clustering code in Riak 1.0 does not have this problem, and adding nodes in that version should Just Work™.

On Riak 0.14.2, nodes are configured by default with a "map/reduce queue." If a node receives a request for a map/reduce job and does not have the capacity to service it, the job is placed in the queue to be processed later. This queue has given us operational trouble, occasionally causing request times on nodes to spike until we manually truncate it. Talking on IRC with Basho employees and other users, we've been told the map/reduce queue is a "Bad Citizen." The queue has been removed in Riak 1.0, but for now, when we've had problems, we've stopped the node, truncated the queue and started the node back up again.

Conclusions

In spite of the issues and gotchas we encountered, overall we are extremely happy with Riak and how well it has worked for our project. It has solved some very pressing scaling issues with the original rankings system and given us a solid, flexible platform on which to build other rankings-related features in the future. Riak isn't necessarily the right choice for every data model or workload, but if you're looking for a scalable, fault-tolerant database with a document-based data model, I would strongly encourage you to consider Riak.

[1] Clients may only write their own conflict resolution if a bucket has its “allow_mult” property set to true. Under this scenario, Riak creates “sibling” documents when it cannot determine proper conflict resolution via vector clocks. When allow_mult is set to false, and Riak cannot determine conflict resolution via vector clocks, it resorts to using document timestamps.

[3] Really, the document is replicated to n other virtual nodes. Riak attempts not to put virtual nodes that hold replicas of the same document on the same physical node, but this is not a 100% guarantee.

[4] Read repair is not automatic; it is only triggered when a read is requested for a document that the node is missing. One suggested tactic is to "force reads" on missing documents if you would like to restore them.

Comments

Link please? I'm not seeing anything in Wiki saying it was removed. Perhaps removed from Ripple?

Jon L.

It appears you’re still running 0.14.2, despite 1.0.1 being out (just this week tho). Considering the variety of notable issues that you mentioned in your article alone that exist with the 0.14 line, is there some deeper reason why you haven’t yet migrated to 1.0+ versions of Riak?

jeff

Jon,

Thanks for the comment. I misread a note I saw about key filters and it has not actually been removed from 1.0. I’ve edited the blog post to reflect that – thanks for the catch!

As to why we haven’t upgraded to 1.0 yet – it’s certainly very high on our list of things to do, we’ve just been busy with other more pressing Riak operations things. We spent the last couple weeks migrating to real hardware from EC2 and also just added a new node over the weekend to our cluster, which was a high priority due to disk usage. Now that those tasks are completed, moving to 1.0 is next on our list.

Joe Van Dyk

On ec2, were you using ebs or instance storage?

I’ve found instance storage to be much faster and reliable.

jeff

Hey Joe,

We used instance storage, which is faster and more reliable than EBS. We did hear of people using EBS in a RAID configuration but did not attempt that.

Peter

Interesting read, and thanks for posting your findings.

Did you also consider CouchDB or MongoDB? Couch in particular sounds like it should’ve been a good match for your needs.

