Scaling to Billions of Requests a Day with AWS

Our startup, Branch, unveiled our deep linking tool to the world in September 2014, and less than two months later our infrastructure was growing quickly. The Node.js application servers were at 80% CPU, and our PostgreSQL RDS instance looked like it didn’t have much more room to grow. Meanwhile, traffic was doubling every couple of weeks. We had to figure out a pathway to scale quickly, or we’d be dead.

About Our Service

Branch is a rapidly growing service that offers deep linking technology for mobile app developers through a URL that sits in front of every HTTP link for a business. Branch sends users who click links to the exact location of the developer’s choosing, either to a website or to a location in the app, and then provides analytics on the clicks.

We developed our service as a way to tackle the increasing complexities of mobile web and native app navigation. When native apps entered the picture, linking became much more challenging because you could no longer simply load a website at the click of a link. Mobile applications require a separate installation, and it is a complex, multi-step process unique to each platform. Users can potentially end up in five or more destinations depending on their operating systems and devices. Using Branch links and our SDKs gives developers seamless and configurable destinations around their websites and apps.

Scaling Branch has been particularly challenging. Most of our requests require real-time responses because there are users depending on the query result. For example, our SDKs are initialized every time the app opens to return a link referrer to the app if there is one available. Developers often block their user experience waiting for us to return. Because of this, success rate and uptime are core product requirements.

A Look at Branch’s Current Tech Stack

The SDKs are easy to add to existing mobile projects, and our dashboard walks a developer through a few simple steps to get an application set up to track basic attribution. We’ve built SDKs across seven different platforms (including iOS, Android, Web, and React Native), each with its own quirks. However, given that this post is focused on infrastructure, we’ll gloss over these complexities and give you a quick summary.

From the server side, three Node.js applications handle the user-facing traffic: our API, our link service, and our dashboard. Due to the substantial complexities involved with each application, we’ve isolated the services to help with reliability. Once the SDK is integrated, the mobile app makes a few REST API calls to the Branch API (api.branch.io) for every app or web action that occurs. The API service handles requests from the SDKs, such as tracking app opens, installations, closes, and link creation. Our link service sits in front of all link traffic, directing users to the best destinations depending on their context and history. It manages all browser cookies to ensure consistency. Finally, the dashboard at dashboard.branch.io is the site our partners use to configure their apps, view analytics, and set up links.

These apps are served in AWS by Ubuntu instances behind unique Elastic Load Balancing (ELB) load balancers. The Node.js app handles much of the business logic, such as determining whether a user should be deep linked to a specific place in an app, go to a website, or head to the App Store. The business logic is unique to each user, but we’ve tried to make the services as stateless as possible. The Node.js app therefore depends on a number of Node.js, Java, and Clojure services to assist with its decision making. Load balancing to the individual services is done with a cluster of Nginx instances.
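As a rough illustration of the kind of routing decision described above, here is a minimal sketch in Python (the real service is Node.js). The `ClickContext` fields, the `choose_destination` function, and the rules themselves are assumptions made for this example and are not Branch’s actual logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClickContext:
    """Hypothetical view of what is known about a link click."""
    platform: str                   # "ios", "android", or "desktop"
    app_installed: bool             # assumed to be inferred from cookies/history
    deep_link_path: Optional[str]   # in-app location, if the link carries one


def choose_destination(ctx: ClickContext) -> str:
    """Pick an in-app location, the app store, or the website (assumed rules)."""
    if ctx.platform == "desktop":
        return "website"
    if ctx.app_installed:
        # Open the app, at a specific location when the link carries one.
        return f"app:{ctx.deep_link_path or '/'}"
    # Mobile user without the app: send them to install it.
    return "app_store"
```

Because the function depends only on its input, it can run on any server behind the load balancer, which is the property the stateless design is after.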

We use Kubernetes, an open source cluster management tool, for dynamically scaling our core services. Our primary databases are Amazon RDS for PostgreSQL and Aerospike. We typically feed analytics through a pipeline that flows from an Apache Kafka log to Amazon S3 to Spark with Python.

Learnings and Surprises

Scaling the Branch service has been one of the most challenging and rewarding experiences of my career. With an Ops team of only two or three infrastructure engineers, we’ve had a lot of late nights, rough weekends, disappointments, and spectacular successes. We realize that despite how crazy it’s been so far, it’s only just beginning, and we’re excited for what’s ahead.

We are heavy users of Amazon RDS for PostgreSQL, and have had the opportunity to work closely with the RDS team. We have dozens of databases, ranging from small and medium instances up to the largest r3.8xlarge. Here are some of the pitfalls we’ve encountered and what we learned.

We had moderate database experience on the team, but none with PostgreSQL and none with Amazon RDS. I was immediately impressed with the feature set of RDS: online backups and on-demand snapshots, automatic failover with Multi-AZ support, automatic upgrades, and easy instance re-sizing. It’s also a fully managed service with technical support. Amazon RDS is a powerful toolset that we continue to use.

On my third day on the job, we accidentally did a TRUNCATE TABLE on a production Postgres instance. Naturally, Multi-AZ support didn’t help here because the failover DB was also truncated. The latest automated snapshot was more than twelve hours old.

What were the available recovery options?

1. Restore the data from backup, but lose twelve hours of data.

2. Attempt RDS-supported PostgreSQL point-in-time recovery to restore the data, right up to the point before the truncation happened.

The second option is a no-brainer from a data recovery perspective, right? Let’s do it. Fire it up. We would accept losing data while the recovery was in progress because we would make the recovered DB the primary DB once it reached the specified time, and the new DB wouldn’t collect new data while being recovered.

A half hour into the recovery process came the sobering realization that we had no idea how long the process would take. It could be hours or even days. We knew that RDS would take the latest snapshot, then roll forward using the Write Ahead Log (WAL) to the specified time. But replaying twelve hours of WAL could easily take as long as the original twelve hours of writes, or longer, because RDS has to retrieve all the logs before it can apply them.

OK, new multi-step plan:

1. Create a new primary DB from the latest snapshot because snapshots come online within minutes.

2. Let the other point-in-time recovery proceed on a secondary DB.

3. While point-in-time recovery is in progress, prepare scripts to pull data from the truncated primary DB, starting at the truncation time.

4. After the point-in-time recovery is finished, use another script to pull data from point-in-time recovery between the snapshot time and the truncation time.

5. Import data from steps 3 and 4 back into the new primary DB.
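The row selection in steps 3 through 5 can be sketched as follows. This is an illustration only: it assumes every row carries an `id` and a `created_at` timestamp, and the `rows_to_import` function and row shape are hypothetical rather than the scripts we actually ran.

```python
from datetime import datetime
from typing import Dict, List

Row = Dict[str, object]  # hypothetical row shape: {"id": ..., "created_at": datetime}


def rows_to_import(pitr_rows: List[Row], truncated_rows: List[Row],
                   snapshot_time: datetime, truncate_time: datetime) -> List[Row]:
    """Select the rows missing from a primary restored from the snapshot."""
    # Step 4: rows written after the snapshot, up to the truncation,
    # exist only in the point-in-time recovery copy.
    from_pitr = [r for r in pitr_rows
                 if snapshot_time < r["created_at"] <= truncate_time]
    # Step 3: rows written after the truncation landed in the truncated DB.
    from_truncated = [r for r in truncated_rows
                      if r["created_at"] > truncate_time]
    # Step 5: de-duplicate by primary key before importing.
    merged = {r["id"]: r for r in from_pitr + from_truncated}
    return sorted(merged.values(), key=lambda r: r["created_at"])
```

A time-window merge like this is only safe when rows are rarely updated or deleted after they are inserted; otherwise the copies can disagree about a row’s latest state.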

Overall, it mostly worked. Because the access patterns on the tables are mostly INSERT with very little UPDATE or DELETE, logical corruption was minimal. Despite that risk, we consciously chose availability over some data latency and some data corruption. Additionally, not having any foreign key constraints made the dump and import process a lot simpler.

Sharding Isn’t Always the Solution

After we felt that we had stretched an individual RDS instance to its capacity, we split the database and the API servers into shards to reduce the load on those components. Fortunately, the application data was isolated in ways that allowed us to easily shard by customer ID. This also allowed us to maintain a logical sharding scheme while offering the flexibility of physical separation, for single-tenancy, depending on the customer. From an operations perspective, RDS PostgreSQL DB replicas made resharding reasonably easy, and we did this several times to buy ourselves more time. This was not a cheap solution: we were cautiously overprovisioned, which meant that subsequent replicas and shards were also overprovisioned. The extremely rapid growth of our user base and platform did not afford us the time to optimize our provisioning for our per-shard workload.
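One common way to keep a logical sharding scheme separate from physical placement is sketched below. The shard count, host names, and the `DEDICATED_HOSTS` override table are all hypothetical, and the hashing scheme is an assumption for illustration rather than a description of our production setup.

```python
import hashlib

NUM_LOGICAL_SHARDS = 64  # assumed; fixed so customer placement is stable

# Hypothetical mapping of logical shard ranges to physical databases.
# Resharding changes only this table, not the hash function.
PHYSICAL_HOSTS = {
    range(0, 32): "db-shard-a.example.internal",
    range(32, 64): "db-shard-b.example.internal",
}

# Hypothetical single-tenant overrides for customers on dedicated hardware.
DEDICATED_HOSTS = {"big-customer": "db-dedicated-1.example.internal"}


def logical_shard(customer_id: str) -> int:
    """Map a customer ID to a stable logical shard number."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_LOGICAL_SHARDS


def physical_host(customer_id: str) -> str:
    """Resolve the database host that holds a customer's data."""
    if customer_id in DEDICATED_HOSTS:
        return DEDICATED_HOSTS[customer_id]
    shard = logical_shard(customer_id)
    for shard_range, host in PHYSICAL_HOSTS.items():
        if shard in shard_range:
            return host
    raise LookupError(f"no host for logical shard {shard}")
```

With more logical shards than physical hosts, splitting a host means promoting a replica and reassigning a range in the table, while customers keep the same logical shard.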

With a renewed focus on stability and infrastructure investment, we have spent a lot of time designing a more scalable solution for the longer term. Here are some of the unfortunate consequences of sharding:

● Capacity planning for n shards takes n times more work than a single shard. Capacity planning involves some guesswork and is easy to get wrong, especially in an immature, high-growth environment. Because it’s so unpredictable, it is usually better to overprovision than to come up short. With n shards comes n ways to provision incorrectly.

● Hotspots are hotter and less predictable. Because you’ve isolated a fraction of your entire traffic to an individual shard, a burst of traffic from an individual customer will be larger relative to the sharded traffic, making it more difficult to deal with.

● Resharding is painful. Shard maintenance can be made manageable with an investment in tools and processes, but there are still many edge cases to handle, especially if resharding has to be done online without any service interruptions.
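The hotspot effect in the second bullet is simple arithmetic, sketched here with made-up numbers. The `burst_ratio` function is illustrative and assumes traffic is spread evenly across shards, which real traffic rarely is.

```python
def burst_ratio(total_rps: float, num_shards: int, burst_rps: float) -> float:
    """Return a customer's burst as a fraction of one shard's normal load.

    Assumes traffic is spread evenly across shards, which is a simplification.
    """
    per_shard_rps = total_rps / num_shards
    return burst_rps / per_shard_rps


# Illustrative numbers only: 10,000 req/s overall and a 500 req/s burst.
unsharded = burst_ratio(10_000, 1, 500)  # burst relative to the whole system
sharded = burst_ratio(10_000, 8, 500)    # same burst relative to one of 8 shards
```

With these numbers, the burst is 5% of the total load on a single database but 40% of the normal load on one of eight shards.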

After some trying times, we felt that the DB I/O was the problem, so we upgraded some databases from General Purpose SSD volumes to Provisioned IOPS (PIOPS) SSD volumes at the maximum IOPS setting. Our DB was the most expensive that money could buy, an r3.8xlarge with 3 TB of storage running 30k PIOPS with Multi-AZ. For us, this was far too expensive a solution to be worthwhile.

A General Purpose SSD volume provides 1 IOPS per GB bursting to 3, so for a 3 TB disk it’s 3k IOPS, bursting to 9k IOPS. A PIOPS SSD volume has between 3 and 10 IOPS per GB, giving our 3 TB disk 9k to 30k PIOPS. The upgrade from General Purpose to PIOPS for a Single-AZ host cost about $4k/month, and nearly double that for Multi-AZ. The more cost-effective solution is to increase the size of the disks from 3 TB to 6 TB. That gives the volume 6k IOPS bursting to 18k IOPS for only 2x the storage cost, plus you get the extra disk space.
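The arithmetic above can be reproduced with a few lines of Python. The per-GB ratios are the ones quoted in this post, not values taken from current AWS documentation, and the function names are made up for this sketch.

```python
GB_PER_TB = 1000  # the post rounds 1 TB to 1,000 GB


def general_purpose_iops(size_tb: int) -> tuple:
    """Return (baseline, burst) IOPS using the post's 1 and 3 IOPS-per-GB figures."""
    size_gb = size_tb * GB_PER_TB
    return (1 * size_gb, 3 * size_gb)


def provisioned_iops_range(size_tb: int) -> tuple:
    """Return (min, max) PIOPS using the post's 3 to 10 IOPS-per-GB figures."""
    size_gb = size_tb * GB_PER_TB
    return (3 * size_gb, 10 * size_gb)
```

For example, `general_purpose_iops(6)` returns `(6000, 18000)`, which overlaps much of the 9k to 30k range that `provisioned_iops_range(3)` gives for a 3 TB PIOPS volume.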

One downside is that a Postgres volume cannot be shrunk online, so you’re adding in this new cost forever. An additional issue is that the time it takes to upgrade or downgrade the IOPS is significant. For our large volumes, it takes about 24 hours for an IOPS change. During that time, the I/O on the primary is impacted, and you are not allowed to make any other changes like instance type changes or network changes, and you can’t even reboot. This is scary if you think the performance impact of the PIOPS change might cause DB issues: You could be in a state where your DB has terrible performance, but you can’t make any change for an extended period of time. Increasing the size of a volume has no such problem. It immediately gives you more disk space and I/O without impact. Only in the most extreme IOPS deficit should you pay for PIOPS, and even then you still need to be careful about the transition period.

Things that Worked for Us

The first big gain we saw was the slow migration from the monolithic Node.js app to a more service-oriented architecture. Moving logic and components out of our primary Node.js app and into separate services allowed services to scale separately and made the Node.js app less of a single point of failure. It also made it easier to reason about caching and explore alternative data stores, while simultaneously optimizing our RDS PostgreSQL usage through query tuning, better indexing, and denormalization.

The other big gain we had was to migrate latency-critical data out of PostgreSQL and into Aerospike. While breaking out services, we migrated many services to use Aerospike instead of PostgreSQL. This both reduced load on PostgreSQL and increased the scalability of the services because of Aerospike’s horizontal scalability. Moving from a SQL DB to a key-value store like Aerospike requires compromises in schema design and query flexibility, but the online operation was critical, and Aerospike has exceeded expectations in latency and reliability.
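To show the kind of schema compromise this involves, here is a minimal sketch in which a plain Python dict stands in for the key-value store. It does not use the Aerospike client API, and the record fields and function names are hypothetical.

```python
from typing import Dict, Optional

# A plain dict stands in for the key-value store; this is not the Aerospike client API.
store: Dict[str, dict] = {}


def link_key(app_id: str, link_id: str) -> str:
    """Build the single lookup key; every read must know both parts."""
    return f"{app_id}:{link_id}"


def put_link(app_id: str, link_id: str, destination: str, campaign: str) -> None:
    """Write one denormalized record holding everything a click needs."""
    store[link_key(app_id, link_id)] = {
        "destination": destination,
        "campaign": campaign,  # copied in, where SQL would JOIN to a campaigns table
    }


def get_link(app_id: str, link_id: str) -> Optional[dict]:
    """One key lookup, with no joins or secondary filters."""
    return store.get(link_key(app_id, link_id))
```

The read path becomes a single lookup, but questions such as "all links for a campaign" no longer have a cheap answer, which is the loss of query flexibility mentioned above.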

Lastly, we set a key objective to reduce synchronous operations on PostgreSQL. We moved many operations from synchronous PostgreSQL operations to our offline, queue-based analytics layer. One of our heaviest operations involves conversion funnel queries, and these were moved from direct PostgreSQL queries to our Spark-based analytics query system. Analytics operations now require more steps to process through our pipeline and are less readily accessible than PostgreSQL, but it’s an easy tradeoff for greater uptime and reliability.
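The general pattern of replacing a synchronous write with a queued one looks roughly like this. It is a single-process sketch using Python’s standard `queue` and `threading` modules as a stand-in for a real log such as Kafka, and the handler and worker names are hypothetical.

```python
import queue
import threading

events = queue.Queue()
processed = []  # stands in for the offline analytics store


def handle_request(event: dict) -> str:
    """Request path: enqueue and return immediately, with no database write."""
    events.put(event)
    return "ok"


def analytics_worker() -> None:
    """Offline path: drain the queue and process events in the background."""
    while True:
        event = events.get()
        if event is None:  # sentinel tells the worker to stop
            events.task_done()
            break
        processed.append(event)
        events.task_done()


worker = threading.Thread(target=analytics_worker, daemon=True)
worker.start()
handle_request({"type": "open", "app_id": "app1"})
events.put(None)
events.join()  # wait until everything queued so far is processed
```

The request handler’s latency no longer depends on the database, at the cost of the data becoming visible only after the worker catches up.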

Takeaways

Building things to scale is a challenge, but doing it while managing an existing system that is rapidly growing is an even greater challenge. I like to think of it like changing the wheels on an accelerating car. We’ve been successfully maintaining four nines of success rate on our API and five nines on the link service, all while continuing to grow server traffic at around 13% per month.
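For a sense of scale, the figures above can be turned into concrete numbers with a short calculation. The function names are made up for this sketch, and it assumes a constant growth rate, which real traffic does not follow exactly.

```python
import math


def failures_per_billion(success_rate: float) -> float:
    """Failed requests expected per one billion requests."""
    return (1.0 - success_rate) * 1_000_000_000


def yearly_growth_factor(monthly_growth: float) -> float:
    """Compound a monthly growth rate over twelve months."""
    return (1.0 + monthly_growth) ** 12


def months_to_double(monthly_growth: float) -> float:
    """Months for traffic to double at a constant monthly growth rate."""
    return math.log(2) / math.log(1.0 + monthly_growth)
```

Four nines works out to about 100,000 failed requests per billion and five nines to about 10,000. At 13% per month, traffic doubles roughly every five to six months and grows a little more than fourfold in a year.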

It’s been a fantastic learning experience for the team, and we look forward to the next set of challenges.

What do you think? Do these sound like interesting problems? Look at our jobs page and apply, and don’t worry if your first week is a little rough.