As we’ve recently shared, Buffer is moving towards a Service Oriented Architecture and moving the links counting logic into its own service. The links service accounts for a large portion of all Buffer API traffic and at the time of writing contains almost 600 million records.

Starting Out

When I started working on the links service project, the production traffic was being served with a PHP backend and a Mongo database. There was also a caching layer stored in memcached.

Requirements

Aside from moving the links counting logic into its own service, we wanted to make sure the new service met the following requirements:

High Availability

The system should be able to tolerate a partial failure. If one links service replica goes down, the system should still service requests on the other replicas. This means that building a stateless, or even more specifically, conforming to the 12 factor app principles is key.

Handle High Throughput

After collecting data from the original links service we found an average of 400 req/sec with bursts upwards of 700 req/sec. On the write side we observed 8 inserts/sec from the workers. Indicating a heavy skew towards read. We should not only be able to handle the current workload, but handle future workload as it increases (making the assumption that this feature gets utilized more over time).

Maintain Historical Data

The links service collection contains time series information that could be utilized in future features. We wanted to preserve the data even though we wouldn’t need it directly for the counts.

Iteration 1: Amazon RDS (Aurora) + Node + Redis

Amazon RDS (Aurora) is a MySQL compatible database that has some nice features like auto-scaling and a simple interface to increase compute power. We decided to go with RDS – Aurora for a number of reasons:

– High throughput
– Data autoscaling
– MySQL in local dev environment

Node almost needs no introduction. It’s server side JS backed by a powerful packaging system. It’s an async powerhouse driven by its event loop. We chose Node for the following reasons:

Redis is an in-memory data store that can be used as a database, cache or message broker. It has capabilities to operate in a cluster with replica sets, which is a huge plus if you’re looking for high availability. We decided Redis was ideal because:

Combining these technologies together, we decided on an architecture that looked like this:

After backfilling the records and setting indexes we ran some test queries.

SELECT count(*) WHERE url = 'buffer.com'

The idea here is to count the number of times buffer.com has been shared. This works well so long as the counts for given urls are relatively low (< 200K).

When running a count on a url with a result in the millions, queries were taking 20+ seconds _even with the URL field indexed_. This was especially concerning, since there are a couple URLs with counts in the 80-90 million range. Running EXPLAIN on the query shed some light onto the problem:

SELECT count(*) WHERE url = 'popularurl.com'

--

4200000

EXPLAIN SELECT count(*) WHERE url = 'popularurl.com'

--

key | rows
----------
url | 5000000

Explain returned that in order to calculate the exact count it needed to scan through 5,000,000 records. Granted this is better than a full table scan of 600,000,000 records, but is still going to take some time.

With larger datasets, MySQL is really good at finding a needle in the haystack (one person’s profile data for instance). But it’s not great at answering questions like “how many pieces of straw are in the haystack?”.

We decided to keep iterating.

Iteration 2: Elasticsearch

Elasticsearch is a search engine built on Lucene. Elasticsearch search allows you to distribute data across shards and shards are distributed across nodes (along with replicas). So when you run a count query, the query is sent to each shard (and replica). When each shard has completed the count, the counts are aggregated on one node. This style of counting is very much like a map/reduce job on other frameworks, which is very powerful because the load is distributed across nodes.

After backfilling we ran some test queries using the Node count API. Initial results were very promising, as we where seeing response times in the 100 – 200 ms range for all urls tested. So we decided to roll out the new service at 1% of the production traffic.

Things were looking great at 1%, so we stepped up to 10% and then 50%. Oops, we broke it. Requests jumped up to 15 seconds (error timeout) and the containers on Kubernetes started running out of memory and eventually would get killed. After some investigation, we determined that Elasticsearch was the bottleneck. We started trying to optimize the query by using filters so we could utilize the Elasticsearch cache. At this point the thought occurred that this might not be the best solution for our workload. “We’re at 50% of our current workload and we’re already doing query optimization”, which seemed troubling since we wanted to build something that could carry us into the future. But there was still more unknowns to explore before giving up on Elasticsearch.

The next thing we tried was to add another replica to the links index. We were looking to increase the read capacity of the cluster and the Elasticsearch docs call out that one strategy is to increase the number of replicas. However, adding the replicas caused the queries to return more slowly.

Let’s say for a minute you’ve got one node, one shard and zero replicas. Now let’s say you run enough queries on the index to use 100% of the resources on the node.

Now lets say you set the replicas on the index to 1 and run the same queries as with the previous experiment. Elasticsearch is going to run queries on both the shard and the replica, effectively splitting the node’s resources. Keep in mind with the previous experiment caused the node to consume 100% of its resources. Splitting the resources under heavy load would cause the queries to take longer.

Keeping the replicas the same and increasing the number nodes would have been the better solution in our case. However the Elasticsearch cluster was already large enough to be more expensive than our Mongo database. It became clear at this point that changing replicas and nodes would not be enough to carry us confidently into the future.

So again, we decided to iterate.

Iteration 3: Redis + S3

Our strategy of storing the links in a collection had proven to not be scalable. Every count on the links collection was doing a massive amount of work. So we decided to rethink our data structure to better suit our workload.

We ended up decided to use a simple hash data structure to store counts:

{
"buffer.com": 5000,
"popularurl.com": 4000000
}

As with any decision with data structure there’s some tradeoffs you have to make. When you store links in this way it’s more difficult to recover from error states because you’ve got no history. With a counter you have a snapshot of the world, rather than the historical view for all time. To compensate for this drawback we decided to archive the links in S3 with the previous data structure, that way we could backfill another service or recover the links service in the event of a catastrophic failure.

What you get in return is the ability to leverage work you’ve already done, rather than repeating large amounts of work over and over again. Your current snapshot is a superposition of all previous states of the system. To update the snapshot, a simple increment operation is enough to merge all previous states with the current change – meaning that both reads are writes are very fast!

We decided to use Redis as our main datastore, with a simplified architecture:

After backfilling we ran some test queries and were blown away by the performance – sub millisecond response times! We decided to move forward with 1% of production traffic.

Things were looking great at 1%, so we increased traffic to 10%, then 50%, then 100%. Even with 100% of production traffic, average response times stayed below one millisecond.

At this point we were all convinced we had applied the right data structures and tech stack to support our link service customers well into the future. We even had some options that were unavailable in the past, things like realtime counts (we cache responses with our CDN for 10 minutes). We could also use the historical S3 data to backfill future services to support feature we could only dream of. So many possibilities!

Reflecting

Sometimes things look very simple on the surface, even if you’ve done similar projects. It’s only when you’ve dug deep into the implementation and ran experiments (likely in production) that you find out if something is going to work. I think it’s one of the paradoxes that make building software interesting. Challenging assumptions until you figure out whats lurking beneath the surface (thar be dragons). It’s been an epic journey building out the Buffer links service, and I hope that sharing these experiences with others will be helpful.

Great questions! Our working set is roughly 50 GB in Redis (the S3 dataset is much larger since it contains each link share and metadata – closer to 200 GB). Currently it takes ~8 hours to recalculate and load the stats from S3.

EJ

Did you look at using a MySQL trigger to update a summary table with the current count? That would keep everything in one place but speed up this particular query.

Interesting suggestion! That would work well if we needed to run queries against the historical data. But since we only needed the counts we ended up backing up the data in S3 since it was less expensive. How scalable are MySQL triggers? Is that something that could work in a MySQL cluster?

EJ

They run inside the MySQL engine, so they are as scalable as MySQL. You would pay the cost of two inserts, but your queries would be lightening fast.

stevesun21

There is another option for this scenario, MySQL + Redis.
The pesude code would like the following:

1: insert link (eg: https://buffer.com) record to mysql
2: increase count for the link in Redis by one

If the insert mysql record failed, then, the increase count will be never triggered.
If the increase count failed, then delete (manually rollback) the record inserted before.

More consistency way, you can rely on mysql transaction lock, but, I would say that not necessary based on the use case and odds of failure happened at manully rollback (delete link record).

That would be a great way to keep consistency between counts and historical data! (That’s a nice simple design too) With our use case, it turned out we only needed the counts. Keeping the historical data around in S3 in is a lot less expensive than the MySQL instances it would have taken to keep it all around. In the future we might backfill a MySQL or even an Elasticsearch instance for other applications.