Category Archives: API

Hey Moz fans, Brad here! I’m the technical lead in charge of the Mozscape API, and I have some exciting news to share with you today.

Supercharged Mozscape!

We recently released a brand new, shiny version of the Mozscape API code. I wouldn’t fault you if you didn’t notice — the change was meant to be functionally transparent.

We sure noticed the difference, though!

This release is a complete port from the old version of the API code (which was written in C++) to a new, leaner version written in Python. We knew that we could reduce the size of our codebase this way, but another primary motivation was increased performance.

Yes, you read that right — we ported code from C++ to Python in order to improve performance.

Not something you hear every day, right? After all, C++ is known for its lean object code and its highly optimized standard libraries. Python is an interpreted, high-level language which usually languishes in comparative performance testing. All this is very true, in general, but this project taught us some interesting lessons.

First, let’s take a look at how the two codebases compare. If you’re unfamiliar with Mozscape, the four endpoints we expose are URL Metrics, Top Pages, Links, and Anchor Text. In our benchmarks, we found measurable improvements to three of the four calls:

Anchor Text average response time improved by a modest 4.5%

Links average response time improved by an impressive 12%

Top Pages average response time improved by an incredible 21%!

We also measured performance on batched URL Metrics calls, which let you POST up to 200 URLs to the endpoint and get bulk metrics data back (more about batched requests). These figures represent the maximum batch size of 200 URLs, requesting a wide assortment of columns. These are IO-expensive calls, and we saw an amazing 26% improvement in response times!
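For context, here's a minimal sketch of what a batched URL Metrics call looks like from the client side. The endpoint path, column bitmask, and auth parameters below are illustrative assumptions rather than exact Mozscape values; consult the API docs for the real signing scheme.

```python
# A minimal sketch of a batched URL Metrics call. Endpoint, column bitmask,
# and auth parameters are illustrative assumptions, not exact Mozscape specs.
import requests

urls = ["http://www.example.com/", "http://moz.com/"]  # up to 200 URLs per batch

response = requests.post(
    "http://lsapi.seomoz.com/linkscape/url-metrics/",   # assumed endpoint path
    params={
        "Cols": 103079217188,        # assumed bitmask selecting the desired columns
        "AccessID": "member-xxxx",   # assumed auth parameters
        "Expires": 1325394000,
        "Signature": "...",
    },
    json=urls,
    timeout=30,
)
response.raise_for_status()
for url, metrics in zip(urls, response.json()):
    print(url, metrics)
```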

These performance improvements mean that Mozscape will scale even better as our user base grows. We can also handle traffic spikes with fewer operational headaches that could wake us up in the wee hours of the morning. All very good things!

Mozscape is now easier than ever for our engineers to maintain. The old C++ codebase was a massive, 13,000-line goliath that could be kindly described as “spaghetti code” and less kindly described as “entirely unintelligible.” Python’s expressiveness through concise code is one of my favorite features of the language. We created a functionally equivalent port in just 3,000 lines of Python!

Lessons learned

First, we discovered the joy of greenlets. Python does support native threads, but greenlets are far more lightweight. They're also superior to the threading module when code is IO-bound rather than CPU-bound, which fits our profile nicely. The library intelligently switches contexts from one greenlet to another and continues to execute Python code even if one greenlet is blocked by, say, an IO-heavy instruction. Especially when coupled with gevent, greenlets are an extremely powerful tool that should be in your Python toolbox.
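To make that concrete, here's a minimal, self-contained sketch of IO-bound work with gevent. The URLs are placeholders; this is not Mozscape's actual code.

```python
# A minimal sketch of gevent-driven greenlets for IO-bound work.
# The URLs below are placeholders; this is not Mozscape's actual code.
import gevent
from gevent import monkey

monkey.patch_all()  # make socket operations cooperative so blocked greenlets yield

import requests  # imported after patching so its sockets cooperate with gevent


def fetch(url):
    # While this greenlet waits on the network, gevent switches to the others.
    return url, len(requests.get(url, timeout=10).content)


urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
jobs = [gevent.spawn(fetch, u) for u in urls]
gevent.joinall(jobs, timeout=30)

for job in jobs:
    if job.successful():
        print(job.value)
```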

For the fun of it, I made a little debug flag that disables gevent in the code. The performance dropped to an unusable level — a result I expected, since Mozscape has an index that is hundreds of gigabytes in size and requires a decent amount of IO to serve requests.

Also, we’ve proven that tight code in a high-level language can still beat loosey-goosey code in a speedy language. Any time you port 13,000 lines of code to a functionally equivalent 3,000-line codebase, you can bet pretty safely on the improved efficiency of the latter, regardless of the languages you choose.

Recently in SEOmoz Engineering, the Linkscape team had the opportunity to evaluate Membase, a distributed caching technology similar to Memcache. Linkscape’s continued popularity had confronted us with a new problem: what is the proper way to cache data in the cloud? The journey didn’t lead us where we expected, but the road to discovery is always interesting!

OUR PROBLEM

Linkscape, the driving force behind Open Site Explorer, is a large read-only database that runs in Amazon’s cloud. It has almost no user state to manage, so the metaphor of a paging cache is handy for thinking about its API. The bulk of the data is kept in “blocks” in S3, compressed and indexed in a certain way. At request time, a few blocks are pulled from S3, decompressed and examined for the desired records.

The block organization tries to take advantage of locality in requests, so it makes sense to cache entire blocks at a time. This can be a boon to performance, but it comes at a price. The Internet is not equally interesting in all places, and as a reflection of the Internet, neither is Linkscape. Some blocks are vastly more popular than others, because they contain information about popular sites. Unless special care is taken, these “hotspots” can be bad for performance: an unexpectedly disproportionate amount of resources is sometimes needed to handle them.

Linkscape is happily growing, and with that come growing pains. In this story, those pains took the form of increased throttling from S3. Our hotspots, which we were essentially passing straight on to S3, had finally grown large enough to aggravate S3 into occasional bursts of rate-limited service denial errors. It was time to shape up our cranky access pattern.

Currently, Memcache is our chosen caching technology; we like its simplicity and maturity. Each of our (100 or so) API nodes runs a local, isolated Memcache installation, and only S3 requests originating from that node pass through it. This unclustered arrangement is clearly redundant (in a bad way), but fast, since only cache misses require actual network activity.

Before we found ourselves in the hotspot hot seat, this was a reasonable trade-off: the cache miss penalty was the latency of an S3 request. Now, however, the miss penalty was that same latency plus the amortized risk of exponential back-off, should our request be denied.

Suddenly the image of a distributed cluster of Memcache instances starts looking mighty fine. Since the penalty of a cache miss is becoming more severe, a reasonable tactic is to reduce the chance of misses altogether. Clever eviction schemes aside, the way one achieves that is with a bigger cache. Our Memcache instances were oblivious of each other: a miss in one might be a hit in another, but since the two instances don’t “talk”, they’d never know.

To be fair, Memcache does support a limited form of distributed caching (after all, the word “distributed” appears in its tagline). This feature, however, amounts to the observation that the client could keep track of everything. By having a list of all nodes in the caching cluster and sticking to some scheme for assigning keys to them, a smart client could treat several nodes as one large cache. Many do. It just doesn’t sound that great to us.
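For readers unfamiliar with that client-side scheme, here's a rough sketch of the idea: a simplified hash ring that assigns keys to nodes. It illustrates the "smart client" approach described above, not the exact algorithm libmemcached uses.

```python
# A simplified sketch of client-side key distribution across memcache nodes.
# Real clients (e.g., libmemcached) use more robust consistent-hashing schemes;
# this just illustrates the "smart client" idea.
import bisect
import hashlib


class HashRing:
    def __init__(self, nodes, replicas=100):
        # Place each node at several points on a ring to smooth out distribution.
        self.ring = sorted(
            (self._hash("%s-%d" % (node, i)), node)
            for node in nodes
            for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise around the ring to the first node at or after the key's hash.
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]


ring = HashRing(["cache-01:11211", "cache-02:11211", "cache-03:11211"])
print(ring.node_for("block:12345"))  # every client maps this key to the same node
```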

For one thing, we had tried it before, way back when. Managing the client was all right, but we found our (somewhat naive) approach led to a lot of network “cross-talk”, to the point that it had a negative performance impact. Additionally, in our new situation, I feared we'd only end up moving the hotspot network activity from S3 to the cache nodes: without some kind of key replication, it seems likely that some nodes would be pounded an order of magnitude harder than others.

OUR SOLUTION

Enter Membase. Like Memcache, it is a typeless key-value store meant for “predictable, low-latency, random access to data with high sustained throughput.” Additionally, its notion of “distributed” is a bit more hands-on: using a flavor of consistent hashing called “vbuckets”, it will handle server management as well as providing a configurable degree of replication. Thus hot (and cold) blocks could be shared between (say) 3 cache nodes to spread the heat around. Finally, the coup de grâce: “It is protocol-compatible with Memcached (both text and binary protocols), so if an application is already using Memcached, Membase can be dropped in without any change to application code or client configuration.” Rock on!

After grabbing the stock Debian package and poking around the command-line tool's help, I found the first bit of fine print: Membase will happily speak modern versions of the Memcache protocol. We, however, were holding onto a moldy old libmemcached-0.26 due to an as-yet undiagnosed bug and/or feature present in later versions, including the 0.44 version that Membase is apparently coded against. Thus we got (somewhat) mysterious segfaults immediately after initial memcache client usage. This can't reasonably be held against Membase as a fault, and now that we had a defensible reason for upgrading, we went about tracing the bug and replacing the client library (which is itself a tale for another time).

Having migrated to a modern libmemcached client version, the next logical step was to swap out our unclustered, isolated Memcache instances for unclustered, isolated Membase instances on a set of testbed nodes. This wouldn't garner any of the benefits of a distributed cache, but if we could demonstrate a working system in this state, the problem would be reduced essentially to configuration.

As promised, Membase supplies a means for Memcache clients to connect to Membase on Memcache’s usual port, while transparently adding in its consistent hashing and replication features. It turns out this is realized with a proxy process called “moxi”, which takes the role of the “smart client” envisioned in Memcache’s original distributed architecture. This proxy fired up along with the instance proper in the stock daemon control script, so our API process was indeed able to start up and begin going through its paces according to our warmup script.

OUR NOT-SO-HAPPY ENDING

During the warmup script, funny things began happening. The cached blocks, arriving compressed from S3, began failing their checksums. On fresh machines, with cold caches, the error would appear slowly and non-deterministically. On warm machines the error would be swift and assured. What happened?

An obvious thing to check was the data itself. When we fetched it manually from S3, the checksums were fine, and different from the failing checksums the API process complained about. Dusting off the ol' packet sniffer turned up something unfortunate: pairs of cache insertions and probes where one set of data goes in, and another set of slightly different data comes out. The change was usually restricted to only a few bytes, but it was there. Some aspect of how we were using the client was causing corruption somewhere in the Membase stack.
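As an illustration of the kind of check that catches this class of bug, here's a minimal, hypothetical sketch of verifying a block's checksum on the way out of the cache. The key scheme and client setup are assumptions for illustration; this is not the actual Linkscape code.

```python
# A minimal, hypothetical sketch of checksumming cached blocks to detect corruption.
# The key scheme and client setup are illustrative, not Linkscape's actual code.
import hashlib

import memcache  # python-memcached client

mc = memcache.Client(["127.0.0.1:11211"])


def put_block(block_id, data):
    # Store the compressed block alongside its checksum.
    mc.set("block:%s" % block_id, data)
    mc.set("cksum:%s" % block_id, hashlib.sha1(data).hexdigest())


def get_block(block_id):
    data = mc.get("block:%s" % block_id)
    expected = mc.get("cksum:%s" % block_id)
    if data is None or expected is None:
        return None  # cache miss: caller falls back to S3
    if hashlib.sha1(data).hexdigest() != expected:
        raise ValueError("cached block %s failed its checksum" % block_id)
    return data
```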

After hammering our isolated Membase with many flavors of traffic via simple testing scripts, we concluded that the corruption seems to hit only our specific usage pattern. Lucky us. This also made it quite difficult to help the Membase team diagnose the issue, as there was no obvious way to communicate a reproducing script. In the end, we decided that it wasn't for us, and instead built a Varnish proxy cluster to stand between our API machines and S3, an arrangement that has been serving us well ever since.

Jeff Barr (Senior Web Services Evangelist, Amazon) opened the event by speaking about the present and future of AWS. He was also generous enough to give every attendee a free copy of his new book, “Host Your Web Site In The Cloud: Amazon Web Services Made Easy”. I don't work intimately with a web stack these days (the examples in the book are mostly PHP), but this seems like a nice resource to have around. I am a little wary of the shelf life of the information it contains (AWS prices, at least, will change, if not APIs and best practices; this is the kind of stuff I usually Google for).

Next up was Tobias Kunze Briseno, who spoke at length about his company’s product Makara, which provides an auto-scaling PaaS that can be layered on top of several cloud providers, including AWS. I was personally a little disappointed that most of the presentation seemed like a pitch for the product, instead of useful details and best practices related to AWS.

Finally, I stepped up and spoke about SEOmoz’s use of AWS, specifically

Last week we tried to roll out an update to our Linkscape index. In the process we kept running into problems with our deployment. Namely, the new machines weren’t performing up to par and we couldn’t tell why.

At a high level, our API works by loading an index onto an EBS volume on EC2, then using that index to pull data from S3. In the past we have always warmed up our caches by running a slew of queries against the whole API, and have generally had no issues. This time, however, we found that the EBS volumes took a bit longer to load into memory, and we had to expand our cache warm-ups to exercise all types of API calls.

Here is a description of the problem, a little bit about our API design, and some of our learnings from this experience.

First, a little background….

LSAPI (Linkscape API) requests are basically database queries. However, the database is not a classic RDBMS speaking SQL, but something we created ourselves (that tends to fall more into the NoSQL camp). Like a SQL database, though, it relies heavily on disk and on disk performance. Speed drives nearly every aspect of the general architecture.

As far as our API is concerned, you can think of requests as being served in two steps:

1. The first step is to identify a set of records satisfying the request’s criteria. This is like SQL running a select and coming back with a list of primary keys.

2. The second step is to take our record identifiers (RIDs, we literally call them “rids”) and load the data attached to each. This is formatted and sent back to the client.

One quirk (read: “optimization”) of our architecture is that steps (1) and (2) are actually performed by different HTTP requests. Once (1) has identified some RIDs, it makes a new LSAPI call back to itself to perform (2). This was done to take advantage of parallelism: many RIDs could be fetched by many separate servers performing (2) at the same time.
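To illustrate the fan-out, here's a rough sketch of step (1) issuing step (2) requests in parallel. The endpoint and batch size are hypothetical, not the real LSAPI internals.

```python
# A rough, hypothetical sketch of step (1) fanning RID lookups out to step (2)
# over HTTP in parallel. Endpoint and batch size are illustrative only.
from concurrent.futures import ThreadPoolExecutor

import requests

RECORD_ENDPOINT = "http://localhost:8080/internal/records"  # assumed internal endpoint
BATCH_SIZE = 50


def fetch_records(rid_batch):
    # Each worker asks another API server to resolve a batch of RIDs (step 2).
    resp = requests.post(RECORD_ENDPOINT, json=rid_batch, timeout=10)
    resp.raise_for_status()
    return resp.json()


def resolve_rids(rids):
    batches = [rids[i:i + BATCH_SIZE] for i in range(0, len(rids), BATCH_SIZE)]
    records = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        for batch_result in pool.map(fetch_records, batches):
            records.extend(batch_result)
    return records
```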

(1) is accomplished with explicit index files, using a tool called BDB (Berkeley DB). A BDB index is a file that essentially contains a binary search tree keyed on some criteria of our choosing. The leaves of the tree contain RIDs that tell us which literal files may contain the key in question. Nearly all the options we offer in our API calls, for example the scopes and sorts in the ‘/links’ call (read more about those here), are each realized with their own BDB index. Therefore, each new LSAPI index includes a set of about 20 BDB files, which add up to a total of about half a terabyte. We keep each LSAPI index’s BDB files together in one EBS snapshot.
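As a flavor of what such an index looks like in code, here's a tiny sketch of building and probing a BDB B-tree file using the bsddb3 Python bindings. The key scheme is made up for illustration; it is not the real Linkscape layout.

```python
# A tiny sketch of a BDB (Berkeley DB) B-tree index mapping a key to RIDs.
# The key scheme below is made up for illustration, not Linkscape's layout.
from bsddb3 import db

index = db.DB()
index.open("links_by_source.bdb", None, db.DB_BTREE, db.DB_CREATE)

# Store RIDs under a key derived from the criterion we want to scope/sort by.
index.put(b"moz.com|page_authority", b"rid:1048576,rid:2097152")

# At request time, look up the key and get back the RIDs to resolve in step (2).
rids = index.get(b"moz.com|page_authority")
print(rids)

index.close()
```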

Once we use the BDB index files to identify a set of interesting RIDs, we need to resolve them in (2). If the BDB indexes we use to organize RIDs take 500 gigabytes themselves, the *data* those RIDs identify must, in total, be gigantic. So, we keep it in S3. I would say the best metaphor for this is how a disk cache pages blocks in and out of memory from larger, slower storage. The problem is, that isn’t a metaphor; it is exactly what is happening, just on a much larger scale. When the LSAPI server wishes to see the data attached to a RID, it first checks a local cache (implemented with a tool called Memcache). If it doesn’t find it there, it constructs the location where that record lives in S3, based on its RID, and fetches it over the network. Once retrieved, the record is cached and used.
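Here's a condensed sketch of that read-through pattern. The bucket name, key layout, and client setup are assumptions for illustration; this is not the actual LSAPI code.

```python
# A condensed, hypothetical sketch of the read-through cache: check memcache
# first, and on a miss derive the S3 location from the RID and fetch it.
# Bucket name and key layout are illustrative assumptions.
import boto
import memcache

mc = memcache.Client(["127.0.0.1:11211"])
bucket = boto.connect_s3().get_bucket("linkscape-records")  # assumed bucket name


def s3_path_for(rid):
    # Assumed layout: records are grouped into blocks addressed by RID prefix.
    return "blocks/%08d" % (rid // 10000)


def get_record_block(rid):
    cache_key = "rid-block:%d" % (rid // 10000)
    block = mc.get(cache_key)
    if block is None:                      # cache miss: go over the network to S3
        key = bucket.get_key(s3_path_for(rid))
        block = key.get_contents_as_string()
        mc.set(cache_key, block)           # warm the cache for the next request
    return block
```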

Point of interest about EBS volumes: they act similarly to our caching arrangement, in that their contents are not all locally present when they are “attached” to an EC2 instance. When the machine tries to access a part of the volume that is absent, it is first loaded over the network from wherever Amazon keeps it, and it is retained locally from then on.

So, we have several layers of caching here. BDB files will be lazily pulled over the network (because they live in an EBS volume) the first few times they are accessed. Once available on the local disk, they will be cached in RAM as they are read (the BDB tool does this for us, which is one reason we use it). “Records” identified by RIDs will first be transferred from S3 when they are accessed. They are then kept in Memcache, which lives in RAM.

When a request happens to use BDB indices and records that are already in RAM, things go quite fast. When they are not, the request runs an order of magnitude slower; so slow that the client’s HTTP connection risks timing out (or, equivalently, so slow that the client risks concluding our service is broken).
What precisely happens when this slowness occurs depends on which cache failed to have the interesting data (this event is called a “cache miss”).

When we’re performing step (1), there is the potential that a piece of the EBS volume holding a BDB we want must be loaded over the network, which is slow. When this is the case, the client’s request simply seems to hang, like loading a slow web page.

When we’re performing step (2), the risk is that an interesting RID hasn’t been fetched from S3 yet, which is slow. The gag here is that step (2), as mentioned above, is performed in a *new* HTTP call invoked by the server handling (1). So (1)’s request, in addition to the client’s, risks timing out. When (1)’s call to (2) times out, (1) gives up and returns an error to the client (I’m simplifying things; I believe there are some retries, but these can only mitigate, not fix, the problem).

Either way, when deploying a new LSAPI index, it is critical that these delays are avoided, or else they will manifest as we saw them: a bunch of errors appearing out of the ether only when real load is placed on the new index.

We can’t fix the fact that the network is not as fast as RAM. We can only hope to make it less relevant, by trying to ensure that interesting things are *already* in their relevant caches before the first client request ever comes in. What’s the easiest way to do that? Make a bunch of requests ourselves first! If we have a good idea of what a bag of typical client requests looks like, then we can make similar initial requests to hopefully force interesting things into cache. This is called “warming up” the cache(s). Similarly a cache that is empty is called “cold” and one that has been in use for a while is “warm”.

Part of our bootstrap process for LSAPI nodes was (is) dedicated to warming the caches; however, it was flawed. The sample set of warmup requests was not sufficiently like typical client requests. The tricky point we missed was this: there are multiple BDB files, and each of them lives relatively isolated in its own area of the EBS volume. Therefore, it is not enough to exercise the new LSAPI node with a wide variety of domains and URLs; we must also exercise a wide variety of BDB index files, to ensure they *all* get at least partially paged in.
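Here's a minimal sketch of the kind of warmup pass that covers every index file. The paths and read size are illustrative assumptions; the real warmup also replays a varied set of representative API requests.

```python
# A minimal sketch of paging every BDB index file in from the EBS volume by
# reading through each one. Paths and chunk size are illustrative assumptions;
# the real warmup also replays a varied set of representative API requests.
import glob
import os

CHUNK = 1024 * 1024  # read 1 MB at a time


def warm_file(path):
    # Sequentially reading the file forces EBS to pull its blocks over the
    # network now, instead of during a live client request.
    with open(path, "rb") as f:
        while f.read(CHUNK):
            pass


for bdb_path in sorted(glob.glob("/mnt/lsapi-index/*.bdb")):
    print("warming %s (%d bytes)" % (bdb_path, os.path.getsize(bdb_path)))
    warm_file(bdb_path)
```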

Once our sample requests were made more varied to cover this, our deployment problems disappeared. The warmup script itself also now takes substantially longer to run (perhaps an hour), but this just means it’s working.

The big takeaway:
If you are using EBS volumes to hold index-like data, make sure your cache warm-up scripts exercise some data from each index file on the volume. That way you know everything will be locally present when needed.

(Special thanks to Phil, who worked tirelessly to figure out the issue and share his learnings with the rest of us.)


Welcome to our dev blog!

This blog is written by members of the Moz engineering team and covers topics that interest us.