For years, I didn't understand protected methods. Not because I didn't care to, but because I couldn't see the practicality. I'd been writing what I thought was quality production software and never needed them. Not once. It also didn't help that most explanations of protected methods evoked flashbacks of my worst classes in college when I realized mid-semester I had no idea what was going on. I'm not sure if that was my fault or the professor's.

Definitions of protected usually go like this: "protected methods can be accessed by other classes in the same package as well as by subclasses of its class in a different package." Uh, what?

I picked a particularly obscure definition above, but it was the third hit on Google for "ruby protected methods."

Let's get one thing out of the way early. I'm not saying you shouldn't use protected methods. I'm saying you shouldn't "use" them. As in, deliberately use them with foresight. That's why I put the word in quotes. There are perfectly valid use cases for protected methods, and I'll illuminate one, but this tool should be employed as a refactoring clarification and nothing else.

Let me show you what I mean.

Say we have a Student class. Each student has a first name, last name, and the ability to provide their full name. Because knowing strictly a first or a last name is potentially ambiguous, a student only knows how to answer by their full name, so the first and last name are private methods.

Along come the professors and they want to check attendance. They plan to call attendance in alphabetical order by the students' last names. They've asked our company, Good Enough Software LLC, to find a way to sort the students by last name. We promptly tell the professors that we only have access to the students' full names. The professors quickly retort, "don't care, make it good enough."

We got this.

Since we can't call the private method #last_name, sorting by last name is a tricky task. We can't just write the following, where a classroom can sort its students:
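A minimal sketch of that naive attempt might look like this. The two-argument constructor and the use of #sort_by are my assumptions; the only thing taken from the description above is that #full_name is public while #first_name and #last_name are private.

```ruby
class Student
  def initialize(first_name, last_name)
    @first_name = first_name
    @last_name = last_name
  end

  # The only way a student answers: by full name
  def full_name
    "#{first_name} #{last_name}"
  end

  private

  attr_reader :first_name, :last_name
end

class Classroom
  def initialize(students)
    @students = students
  end

  def alphabetized_students
    # Raises NoMethodError: #last_name is private to each Student
    students.sort_by { |student| student.last_name }
  end

  private

  attr_reader :students
end
```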

Protected methods can't help us here. This code will not work unless the #last_name method is made public. We don't want to introduce ambiguity, so we can't make #last_name public.

We need to refactor, eventually to protected methods.

This is why I say don't "use" protected methods. Using protected methods during the first iteration of a class is like grabbing your sledgehammer because you heard there would be nails. You show up only to realize the thing you'll be hammering is your grandma's antique birdbox. Inappropriate use of protected methods dilutes the intention of the object's API, damaging its comprehensibility. When reading code that utilizes protected methods, I want to be able to assume there is an explicit need for it; public and private would not suffice. Unfortunately, this is seldom the case.

We should never write new code with protected methods. There's simply not a strong case for it.

But they are helpful here. If we instead compare the two student objects directly with the spaceship operator (<=>), then we can let the student objects compare themselves using #last_name. Since private methods are accessible by the object that owns them, maybe that will work? Let's try.

We want the Classroom class to look like the following, comparing student objects instead of the last name of each student.

class Classroom
  def initialize(students)
    @students = students
  end

  def alphabetized_students
    students.sort do |one, two|
      one <=> two
    end
  end

  private

  # Reader used by #alphabetized_students
  attr_reader :students
end

The block passed to #sort above only reproduces its default behavior, so we can update the code to eliminate the block:

class Classroom
  def initialize(students)
    @students = students
  end

  def alphabetized_students
    students.sort
  end

  private

  # Reader used by #alphabetized_students
  attr_reader :students
end

We now introduce the spaceship operator on the Student class to compare last names of students.
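A sketch of that first attempt, assuming a two-argument constructor (my assumption, not something the description specifies) and with #last_name still private:

```ruby
class Student
  def initialize(first_name, last_name)
    @first_name = first_name
    @last_name = last_name
  end

  def full_name
    "#{first_name} #{last_name}"
  end

  # Compare two students by last name.
  def <=>(other)
    # The implicit-receiver call on the left is allowed. The explicit
    # call on `other` is not, because #last_name is private.
    last_name <=> other.last_name
  end

  private

  attr_reader :first_name, :last_name
end
```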

This code still won't run. The implicit call to last_name works, but the explicit call to other.last_name is attempting to call a private method on the other student object. Only now can protected methods save our metaphorical bacon.

Let's update the Student class to make #last_name protected. This will allow our spaceship method to call other.last_name, because the other object is also a Student.
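Here is a sketch of the protected version, again assuming a two-argument constructor for illustration. Only the visibility of #last_name changes; #first_name stays private.

```ruby
class Student
  def initialize(first_name, last_name)
    @first_name = first_name
    @last_name = last_name
  end

  def full_name
    "#{first_name} #{last_name}"
  end

  # Works now: `other` is also a Student, so its protected
  # #last_name may be called from inside this method.
  def <=>(other)
    last_name <=> other.last_name
  end

  protected

  attr_reader :last_name

  private

  attr_reader :first_name
end

students = [Student.new("Alan", "Turing"), Student.new("Ada", "Lovelace")]
students.sort.map(&:full_name) # => ["Ada Lovelace", "Alan Turing"]
```

Calling students.first.last_name from outside the class still raises NoMethodError, so the public API is unchanged.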

So this is why I say we shouldn't "use" protected methods as a general purpose tool. It's strictly a refactoring clarification for cases where we'd like to provide some utility without exposing additional API to the outside world. In our case, we'd like to compare two students without exposing the #last_name method publicly.

Have you heard of page objects? They're awesome. I'll refer to them as POs. They were conceived as a set of guidelines for organizing the actions a user takes within an application, and they work quite well. There are a few shortcomings with POs, however, namely the guidelines (or lack thereof) around how to handle pieces of the app that are shared across pages. That's where components are useful.

A component is a piece of a page; a full page is composed of zero or more components. Alongside components, a page can also have unique segments that do not fit well into a component.

On the modern web, components are more than a visual abstraction. Web components are increasing in usage as frameworks like Angular, Ember and React advocate their adoption to properly encapsulate HTML, CSS and JavaScript. If you're already organizing your front-end code into components, this article will feel like a natural fit. Uncoincidentally, the behavioral encapsulation of components within acceptance tests is often the same behavioral encapsulation of components in the front-end code. But I'm getting a little ahead of myself...

Let's quickly recap POs. POs date back to 2004, when they were originally called WindowDrivers. Selenium WebDriver popularized the technique under the name Page Objects. Martin Fowler wrote about his latest approach to POs in 2013. There's even some interesting academic research on the impacts of POs. Generally speaking, a single PO represents a single page being tested. It knows the details of interacting with that page, for example, how to find an element to click.

Acceptance tests have two primary categories of events: actions and assertions. Actions are the interactions with the browser. Assertions are checks that the browser is in the correct state. The community prefers that POs perform actions on the page, yet do not make assertions. Assertions should reside in the test itself.

describe 'the profile page' do
  let(:profile_page) { Page::Profile.new }

  before do
    profile_page.navigate('mikepack_')
  end

  it 'allows me to navigate to the followers page' do
    profile_page.navigate_to_followers
    expect(current_path).to eq('/mikepack_/followers')
  end

  it 'allows me to write a new tweet' do
    profile_page.open_tweetbox
    expect(page).to have_content('Compose new Tweet')
  end
end

That's pretty much all a PO does. For me, there are a few outstanding questions at this point, but we've largely showcased the pattern. To highlight where POs start breaking down, let's model the "followers" page using a PO.
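A rough sketch of such a followers PO, shown next to a profile PO for comparison. To keep the sketch runnable without a browser, the Capybara session is injected through the constructor (my assumption; the POs in this article mix in Capybara::DSL instead). The URL paths and the 'Followers' link label are inferred from the test above, not confirmed.

```ruby
module Page
  class Profile
    # Falls back to the live Capybara session when none is given
    def initialize(session = Capybara.current_session)
      @session = session
    end

    def navigate(handle)
      @session.visit("/#{handle}")
    end

    def navigate_to_followers
      @session.click_link('Followers')
    end

    def open_tweetbox
      @session.click_button('Tweet')
    end
  end

  class Followers
    def initialize(session = Capybara.current_session)
      @session = session
    end

    def navigate(handle)
      @session.visit("/#{handle}/followers")
    end

    # Duplicated from Page::Profile: the tweetbox lives on both pages
    def open_tweetbox
      @session.click_button('Tweet')
    end
  end
end
```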

Uh oh, we've encountered our first problem: a user can create a tweet from both the main profile page and from the followers page. We need to share the #open_tweetbox action between these two pages. The conventional wisdom here is to create another "tweetbox page", like the following. We'll move the #open_tweetbox method into the new PO and out of the other POs, and rename it to #open.

module Page
  class Tweetbox
    include Capybara::DSL

    def open
      click_button 'Tweet'
    end
  end
end

Our test for the profile page now incorporates the new Tweetbox PO and our code is a whole lot more DRY.

describe 'the profile page' do
  let(:profile_page) { Page::Profile.new }
  let(:tweetbox_page) { Page::Tweetbox.new } # New code

  before do
    # Original setup remains the same
  end

  it 'allows me to navigate to the followers page' do
    # Original test remains the same
  end

  it 'allows me to write a new tweet' do
    tweetbox_page.open
    expect(page).to have_content('Compose new Tweet')
  end
end

We're now up against another conundrum: if both the profile page and the followers page have the ability to compose a new tweet, do we duplicate the test for composing a tweet in both pages? Do we put it in one page and not the other? How do we choose which page?

This is where components enter the scene. In fact, we almost have a component already: Page::Tweetbox. I dislike the conventional wisdom to make any portion of a page another PO, like we did with Page::Tweetbox. In my opinion, POs should represent full pages. I believe that whole pages and portions of pages (i.e., components) carry significantly different semantics. We should treat POs and components differently, even though their implementations are mostly consistent. Let's talk about the differences.

Here are my guidelines for page and component objects:

If it's shared across pages, it's a component.

Pages have URLs, components don't.

Pages have assertions, components don't.

Let's address these individually.

If it's shared across pages, it's a component.

Let's refactor the Page::Tweetbox object into a component. The following snippet simply changes the name from Page::Tweetbox to Component::Tweetbox. It doesn't answer a majority of our questions, but it's a necessary starting place.

module Component
  class Tweetbox
    include Capybara::DSL

    def open
      click_button 'Tweet'
    end
  end
end

In the tests, instead of using the sub-page object, Page::Tweetbox, we would now instantiate the Component::Tweetbox component.

Pages have URLs, components don't.

This is an important distinction as it allows us to build better tools around pages. If we have a base Page class, we can begin to support the notion of a URL. Below we'll add a simple DSL for declaring a page's URL, a reusable #navigate method, and the ability to assert that a page is the current page.

class Page
  include Capybara::DSL

  # Our mini DSL for declaring a URL
  def self.url(url)
    @url = url
  end

  # We're supporting both static and dynamic URLs, so assume
  # it's a dynamic URL if the PO is instantiated with an arg
  def initialize(*args)
    if args.count > 0
      # We're initializing the page for a specific object
      @url = self.class.instance_variable_get(:@url).(*args)
    end
  end

  # Our reusable navigate method for all pages
  def navigate(*args)
    page.visit url(*args)
  end

  # A predicate we can use to check if a PO is the current page.
  # It returns true or false; RSpec's be_the_current_page matcher
  # turns that into the assertion.
  def the_current_page?
    current_path == url
  end

  private

  # Helper method for calculating the URL
  def url(*args)
    return @url if @url
    url = self.class.instance_variable_get(:@url)
    url.respond_to?(:call) ? url.(*args) : url
  end
end

Our profile and followers POs can now use the base class we just defined. Let's update them. Below, we use the mini DSL for declaring a URL at the top. This DSL supports passing lambdas to accommodate a PO that has a dynamic URL. We can remove the #navigate method from both POs, and use the one in the Page base class.
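A sketch of what the updated POs might look like. So that the snippet runs on its own, it starts with a condensed stand-in for the Page base class (no Capybara, and #url left public for demonstration). The URL patterns are inferred from the tests, and Page::Home is a hypothetical page added only to show a static URL.

```ruby
# Condensed stand-in for the Page base class, for illustration only
class Page
  def self.url(url)
    @url = url
  end

  def initialize(*args)
    @url = self.class.instance_variable_get(:@url).(*args) if args.any?
  end

  def url(*args)
    return @url if @url
    declared = self.class.instance_variable_get(:@url)
    declared.respond_to?(:call) ? declared.(*args) : declared
  end
end

class Page::Profile < Page
  # Dynamic URL: built from the handle passed in
  url ->(handle) { "/#{handle}" }
  # Page actions such as #navigate_to_followers omitted
end

class Page::Followers < Page
  url ->(handle) { "/#{handle}/followers" }
end

# Hypothetical page with a static URL
class Page::Home < Page
  url '/'
end

Page::Followers.new('mikepack_').url # => "/mikepack_/followers"
Page::Profile.new.url('mikepack_')   # => "/mikepack_"
Page::Home.new.url                   # => "/"
```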

Below, the test now uses the updated PO APIs. I'm excluding the component test for creating a new tweet, but I'll begin addressing it shortly.

describe 'the profile page' do
  let(:profile_page) { Page::Profile.new }

  before do
    profile_page.navigate('mikepack_')
  end

  it 'allows me to navigate to the followers page' do
    profile_page.navigate_to_followers
    expect(Page::Followers.new('mikepack_')).to be_the_current_page
  end
end

There are a few things happening in the above test. First, we are not hardcoding URLs in the tests themselves. In the initial example, the URL of the profile page and the URL of the followers page were hardcoded and therefore not reusable across tests. By putting the URL in the PO, we can encapsulate the URL.

Second, we're using the URL within a profile_page PO to navigate to the user's profile page. In our test setup, we tell the browser to navigate to a URL, but we only specify a handle. Since our Page base class supports lambdas to generate URLs, we can dynamically create a URL based off the handle.

Third, we assert that the followers page is the current page, using a little RSpec magic. When making the assertion #be_the_current_page, RSpec will call the method #the_current_page? on whatever object the assertion is being made on. In this case, it's a new instance of Page::Followers. #the_current_page? is expected to return true or false, and our version of it uses the URL specified in the PO to check against the current browser's URL. Below, I've copied the relevant code from the Page base class that fulfills this assertion.

def the_current_page?
  # Return true or false; the RSpec matcher does the asserting
  current_path == url
end

This is how we can provide better URL support for POs. Naturally, portions of a page do not have URLs, so components do not have URLs. (If you're being pedantic, a portion of a page can be linked with a fragment identifier, but these almost always link to copy within the page, not specific functionality.)

Pages have assertions, components don't.

The conventional wisdom suggests that POs should not make assertions on the page. They should be used exclusively for performing actions. Having built large systems around POs, I have found no evidence that this is a worthwhile rule. Subjectively, I've noticed an increase in the expressivity of tests which make assertions on POs. Objectively, and more importantly, making assertions on POs lets us reuse aspects of a PO, like DOM selectors, between actions and assertions. Reusing code between actions and assertions is essential to keeping the test suite DRY and loosely coupled. Without making assertions, knowledge about a page is not well-encapsulated within a PO and is strewn throughout the test suite.

But there is one aspect of assertion-free objects that I do embrace, and this brings us back around to addressing how we manage components.

Components should not make assertions. Component objects must exist so that we can fully test our application, but the desire to make assertions on them should lead us down a different path. The following is an acceptable use of components, as we use it to perform actions exclusively. Here, we assume three methods exist on the tweetbox component that allow us to publish a tweet.

describe 'the profile page' do
  let(:profile_page) { Page::Profile.new }
  let(:tweetbox) { Component::Tweetbox.new }

  before do
    profile_page.navigate('mikepack_')
  end

  it 'shows a tweet immediately after publishing' do
    # These three actions could be wrapped up into one helper action
    # eg #publish_tweet(content)
    tweetbox.open
    tweetbox.write('What a nice day!')
    tweetbox.submit
    expect(profile_page).to have_tweet('What a nice day!')
  end
end

In the above example, we use the tweetbox component to perform actions on the page and the profile PO to make assertions about the page. We've introduced a #have_tweet assertion that should know in which part of the page to find tweets and scope the assertion to that DOM selector.
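As a sketch of how that assertion could be backed: RSpec's have_tweet predicate matcher calls #has_tweet? on the PO, so the PO only needs to define that predicate. The '.tweets' selector is hypothetical, and the session is injected here (rather than mixed in via Capybara::DSL) purely so the sketch can run without a browser.

```ruby
# Stand-in for the Page base class, so the sketch runs on its own
class Page; end

class Page::Profile < Page
  TWEETS_SELECTOR = '.tweets' # hypothetical selector for the tweet list

  # Falls back to the live Capybara session when none is given
  def initialize(session = Capybara.current_session)
    @session = session
  end

  # `expect(profile_page).to have_tweet(...)` makes RSpec call this.
  # The check is scoped to the part of the page that lists tweets.
  def has_tweet?(content)
    @session.find(TWEETS_SELECTOR).has_content?(content)
  end
end
```

Because the selector lives in the PO, any action that also needs the tweet list can reuse it instead of repeating it in the tests.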

Now, to showcase how not to use components, we just need to revisit our very first test. This test makes assertions about the contents of the tweetbox component. I've copied it below for ease of reference.

describe 'the profile page' do
  let(:profile_page) { Page::Profile.new }

  before do
    profile_page.navigate('mikepack_')
  end

  it 'allows me to write a new tweet' do
    profile_page.open_tweetbox
    expect(page).to have_content('Compose new Tweet')
  end
end

After converting this test to use the tweetbox component, it would look like the following.

describe 'the profile page' do
  let(:profile_page) { Page::Profile.new }
  let(:tweetbox) { Component::Tweetbox.new }

  before do
    profile_page.navigate('mikepack_')
  end

  it 'allows me to write a new tweet' do
    tweetbox.open
    expect(tweetbox).to have_content('Compose new Tweet')
  end
end

Not good. We're making an assertion on the tweetbox component.

Why not make assertions on components? Practically, there's nothing stopping you, but you'll still have to answer the question: "of all the pages that use this component, which page should I make the assertions on?" If you choose one page over another, gaps in test coverage will persist. If you choose all pages that contain that component, the suite will be unnecessarily slow.

The inclination to make assertions on components stems from the dynamic nature of those components. In the case of the tweetbox component, pressing the "new tweet" button enacts the dynamic behavior of the component. Pressing this button shows a modal and a form for composing a tweet. The dynamic behavior of a component is realized with JavaScript, and should therefore be tested with JavaScript. By testing with JavaScript, there is a single testing entryway into the component and we'll more rigorously cover the component's edge cases.

Below is an equivalent JavaScript test for asserting the same behavior as the test above. You could use Teaspoon as an easy way to integrate JavaScript tests into your Rails environment. I'm also using the Mocha test framework, with the Chai assertion library.

By testing within JavaScript, we now have a clear point for making assertions. There is no more confusion about where a component should be tested. We continue to use components alongside POs to perform actions in our acceptance suite, but we do not make assertions on them. These tests will run significantly faster than anything we attempt in Capybara, and we're moving the testing logic closer to the code under test.

Wrapping up

Unsurprisingly, if you're using web components or following a component-based structure within your HTML and CSS, component-based acceptance testing is a natural fit. You'll find that components in your tests map closely to components in your markup. This creates more consistency and predictability when maintaining the test suite and forges a shared lexicon between engineering teams.

Your mileage may vary, but I've found this page and component structure to ease the organizational decisions necessary in every acceptance suite. Using the three simple guidelines discussed in this article, your team can make significant strides towards a higher quality suite. Happy testing!

At Ello, we were blindsided by the amount of traffic we were receiving. Right time, right place, I guess. One week, we're seeing a few thousand daily sessions. The following week, a few million. This surge of users meant the software we built was contorted in directions we never thought possible.

Like anything viral, there's a massive influx of interest for a relatively short period of time, followed by a slow decline, leaving a wake of destruction as the subject settles into its new mold. Ello has since settled, so what better time than now to document some of the lessons learned while scaling during those critical weeks of virality. I want to ensure these lessons are not merely light takeaways but rather tangible advice that you can apply if you're ever fortunate/unfortunate enough to be put in a similar situation. As such, parts of this article will be specific to Ello and may not apply in other domains.

Lesson 1: Move the graph

One of our first scaling hurdles involved managing the graph of relationships between users. We didn't just intuitively say, "oh, the graph is slow," but it didn't take much prodding either. We're on a standard Rails stack, using Heroku and Postgres. We have a table called relationships which stores all data about how users are tied together. Have you friended, blocked, or unfriended someone? It's all stored in the relationships table.

We're building a social network. By definition, our relationships table is one of the hottest tables we have. How many people are you following in total? How many in friends? How many in noise? Who should be notified when you create a post? All of these questions rely on the relationships table for answers. Answers to these questions will be cached by Postgres, so only the initial query incurs the cost of calculating the results. Subsequent queries are fast. But Postgres' query cache alone becomes meager at scale. As a user on a new social network, accumulating relationships is a regular activity. Every new relationship formed busts Postgres' cache for queries on that data. This was a high read, high write table.

Since we're on Heroku, we had the phenomenal Heroku Postgres tools at our disposal. When thrown into the fire, one of the best extinguishers was heroku pg:outliers. This command illuminates the top 10 slowest queries. All 10 of ours were associated with the relationships table. We had all the right indexes in place, yet some queries were taking up to 10 seconds to produce results.

Resolving a problem like this is application specific, but in our case the best option was to denormalize the relationship data into a datastore that could more easily answer our pertinent and frequent questions about the social graph. We chose Redis. It was a bit of a knee-jerk reaction at the time but a technique we've had success with in the past. Only after having implemented this, did we stumble upon a reassuring article outlining how Pinterest uses Redis for their graph. To be clear, we didn't move the data entirely, we provided an additional layer of caching. All data is still stored in Postgres for durability and is cached in Redis for speed. In the event of a catastrophe, the Redis data can be rebuilt at any time.

We moved all of our hot queries against the relationships table into Redis. Since "followers" and "followings" are displayed on every profile and a count(*) was our top outlier, our first step was to cache these values in Redis counters. We used Redis Objects to make this simple and elegant. Any time a new relationship was created or destroyed, these counters are incremented and decremented. When looking at another user's profile, to render the UI we needed to answer the question "are you following this user? If so, in the friends or noise bucket?" To answer this and similar questions, we cached the user IDs of all people who you had in your friends bucket, your noise bucket, and the union of both.
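To make the shape of that cache concrete, here is a plain-Ruby stand-in. It makes no Redis calls and is not the Redis Objects API; hashes play the role of Redis counters and Ruby Sets the role of Redis sets. The class and method names are hypothetical.

```ruby
require 'set'

class GraphCache
  def initialize
    @followers_count = Hash.new(0) # stand-in for a Redis counter per user
    @following_count = Hash.new(0)
    # One set of user IDs per bucket, per user
    @buckets = Hash.new { |hash, id| hash[id] = { friends: Set.new, noise: Set.new } }
  end

  def follow(follower_id, followed_id, bucket)
    return if bucket_for(follower_id, followed_id) # already following
    @buckets[follower_id].fetch(bucket) << followed_id
    @following_count[follower_id] += 1
    @followers_count[followed_id] += 1
  end

  def unfollow(follower_id, followed_id)
    bucket = bucket_for(follower_id, followed_id)
    return unless bucket
    @buckets[follower_id][bucket].delete(followed_id)
    @following_count[follower_id] -= 1
    @followers_count[followed_id] -= 1
  end

  # "Are you following this user? If so, in friends or noise?"
  def bucket_for(follower_id, followed_id)
    @buckets[follower_id].find { |_, ids| ids.include?(followed_id) }&.first
  end

  def followers_count(user_id)
    @followers_count[user_id]
  end

  def following_count(user_id)
    @following_count[user_id]
  end
end
```

Rendering a profile then needs only counter reads and a set membership check, not a count(*) over the relationships table.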

With our graph data in Redis, we can now query the graph in ways that would be prohibitively expensive with Postgres. In particular, we use it to influence our recommendation system. "Give me all the users that are being followed by people I'm following, but I'm not yet following." Using Redis set intersections, unions, and diffs, we can begin to derive new and interesting uses of the same data.
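The recommendation query can be sketched with Ruby's Set standing in for Redis sets; the union and difference below correspond to what Redis offers as SUNION and SDIFF. The data and the method name are made up for illustration.

```ruby
require 'set'

# user ID => set of user IDs that user follows (illustrative data)
following = {
  1 => Set[2, 3],
  2 => Set[3, 4],
  3 => Set[1, 5]
}

def recommendations_for(user_id, following)
  mine = following.fetch(user_id, Set.new)
  # Everyone followed by the people I follow (a union of sets)
  candidates = mine.reduce(Set.new) { |acc, id| acc | following.fetch(id, Set.new) }
  # Minus the people I already follow, and minus myself (a set difference)
  candidates - mine - Set[user_id]
end

recommendations_for(1, following) # => a Set containing 4 and 5
```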

The real lesson here is this: every product has a core pillar that supports the core offering. Ello's is a social graph. When your core pillar begins to buckle under its own weight, it is critical to cache that data (or move it entirely) and continue providing your core offering.

Lesson 2: Create indexes early, or you're screwed

No really, you'll be chasing down these indexes for months. The previous section outlined how we scaled our relationships table. This and subsequent sections detail how we scaled our activities table, the denormalized table that runs everyone's main activity feed. The activity feed contains any posts that people you follow have created, notifications for when someone follows you, notifications for mentions, and the like. Everything that you need to be notified about ends up in this table, and we forgot some indexes.

Prior to Ello, I fell into the camp that created indexes only once the data proved them necessary. Sure, you can predict usage patterns, but since indexes can consume a lot of memory, I would have rather created them when I knew they were necessary. Big mistake here.

The first type of index that we forgot was just a plain old btree on a field that was queried regularly. An index like this can be created easily if nobody is writing to the table or downtime is feasible. This is high availability scaling, so downtime is not an option, and everything was writing to this table. Since the activity table was experiencing extremely high writes, concurrently building these indexes would never finish. While an index is being built concurrently (that is, without downtime), new records in the table are also added to the index. If the speed by which new records are added outpaces the speed by which Postgres can index hundreds of millions of existing rows, you're shit out of luck.

The solution? If downtime is not an option, you'll have to build a chokepoint in your application. All writes to a particular table must be funneled through this chokepoint so that if you want to stop writes, you constrict the chokepoint. In our case, we are using Sidekiq. We use Sidekiq jobs as our chokepoint, which means that if we ever want to stop all writes to the activities table, we spin down all Sidekiq workers for the queue that pertains to activity writes. Unworked jobs would get backed up and remain idle until we spun the workers back up, hence preventing writes to the activities table. Doing this for a couple minutes endowed Postgres with enough breathing room to work hard on building the index from existing records. Since Sidekiq jobs run asynchronously, this should have little impact on users. In our case, the worst that would happen is a user creates a post, refreshes the page, and sees that the post is not there because the activity record was not yet created. It's a tradeoff we made to keep the app highly available.
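The chokepoint idea can be sketched in plain Ruby. This is an in-process stand-in for a job queue whose workers can be spun down and back up; it does not use Sidekiq's API, and the class and method names are hypothetical.

```ruby
class WriteChokepoint
  # The block performs the actual write (e.g. an INSERT into activities)
  def initialize(&writer)
    @writer = writer
    @pending = []
    @paused = false
  end

  # Every write to the table goes through here
  def enqueue(record)
    @pending << record
    drain unless @paused
  end

  # "Spin down the workers": jobs pile up, nothing is written
  def pause
    @paused = true
  end

  # "Spin the workers back up": work through the backlog
  def resume
    @paused = false
    drain
  end

  def backlog_size
    @pending.size
  end

  private

  def drain
    @writer.call(@pending.shift) until @pending.empty?
  end
end

activities = []
chokepoint = WriteChokepoint.new { |record| activities << record }
chokepoint.enqueue(:post_created)  # written immediately
chokepoint.pause                   # give the index build room to breathe
chokepoint.enqueue(:user_followed) # backed up, not yet written
chokepoint.resume                  # backlog is written now
```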

Situations like this are actually not the worst of it. The absolute worst is when you forget a unique index. Now your data is corrupt. We forgot a unique index, oops. When the level of concurrency necessary to run a rapidly scaling app reaches a point where you can't decipher whether a job is good or fallacious, you need to rely on your database's ACID characteristics. This is why Postgres is awesome: something will happen once and only once regardless of concurrency. If two jobs try to accomplish the same thing in parallel, Postgres will ensure only one of them wins. Only if you have a unique index.

An astute reader might ask, "well, why would two jobs try to accomplish the same thing?" Let me quickly explain. It all stems from one bad piece of data, that when used, creates more bad data. For example, we didn't have a unique index on the relationships table. So, I could technically follow another user twice. When the user I follow creates a new post and it becomes time to ask, "who should receive this post in their feed?", if you're relying on the relationships table to answer that question, you're relying on bad data. The system will now create two duplicate activities. This is just one reason for duplicate jobs. Others include computers being stupid, computers failing, and computers trying to fix their own stupidity and failures. Fixing the source of the bad data, the non-unique relationships, was a great launching point towards stability.

So many of our scaling choices were derived from not having a unique index. It was crippling. Firstly, you can't create a unique index with non-unique values in the table. Just won't happen. You need to first remove duplicates, which is terrifying. You're deleting data, and you better hope you're caffeinated enough to do it correctly. I also recommend 48 hours of sleep before attempting. What constitutes a duplicate depends on the data, but this Postgres wiki page on deleting duplicates is an excellent resource for finding them.

So, you delete duplicates, great. What about the time between deleting duplicates and adding a unique index? If any duplicates were added in the meantime, the index won't build. So, you start from square one. Delete duplicates. Did the index build? No? Delete duplicates.

Lesson 3: Sharding is cool, but not that cool

We sharded ourselves five times. Tee hee hee. Laugh it up. Sharding is so cool and webscale, we did it five times. I mentioned earlier that a lot of our scaling choices derived from the lack of a unique index. It took us two months to build a unique index on the activities table. At the point when the index was built, there were about a billion records in the table. Sharding would reduce the write traffic to each database and ease the pain of most tasks, including building a unique index.

For completeness, I want to define sharding. Like most things in software, sharding has conflated definitions, but here's mine. Sharding is the process of taking one large thing and breaking it into smaller pieces. We had one large, 750M-record activities table that was becoming unwieldy. Prior to breaking down the activities table, we moved it out of our primary database (with users, posts, etc) into its own database, also a form of sharding. Moving it to a different database is what I'll call horizontally sharding; breaking up a single table is vertically sharding, or partitioning. (Many sources use these two terms the other way around, so check which meaning is intended when reading elsewhere.) We received recommendations from highly respected parties to think about vertically sharding when our table reached 100GB of data. We had about 200GB. We don't follow rules well.

When sharding, say we have database A that is progressively slowing and needs to be broken down. Before sharding, all users, whether their ID modulus 2 is 0 or 1, have their data in database A. After sharding, we want users with IDs modulus 0 to continue going to database A and users with IDs modulus 1 to go to a new database B. That way, we can spread the load between multiple databases and they will each grow at roughly half the speed. The general sharding process is this: set up a new replica/follower database B, stop all writes to A, sever the replica (A and B are now two exact duplicate dbs), update the shard configuration so some data goes to A and some to B, resume writes, prune antiquated data from both A and B.
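The routing rule itself is tiny. A sketch, with hypothetical shard names standing in for real connection details:

```ruby
SHARDS = %w[database_a database_b].freeze # hypothetical shard names

# Pick a shard from the user ID: ID modulus the number of shards
def shard_for(user_id, shards = SHARDS)
  shards[user_id % shards.size]
end

shard_for(42) # => "database_a" (42 % 2 == 0)
shard_for(7)  # => "database_b" (7 % 2 == 1)

# Before sharding there is a single shard, so everyone lands on A
shard_for(7, %w[database_a]) # => "database_a"
```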

So cool, I love sharding.

Many highly respected and extremely expensive people told us we needed to shard. We trusted them. We planned out multiple approaches to sharding and converged on the technique outlined here, sharding by user ID. What nobody cared to consider was what would happen after we've sharded. We thought there was a pot of gold. Nope.

We sharded for two reasons: so we didn't hit a ceiling while vertically scaling our Postgres boxes, and so our queries would perform better because we had less data in each shard after the prune step. Let's address the prune step.

In the example above, since data for users with ID modulus 1 are no longer being stored or referenced in database A, we can safely remove all of their data. You're going to need a second pair of underwear. The simplified query for pruning database A is, "delete all records for users with ID modulus 1". The inverse is done on database B. In our case, we ended up removing almost exactly half of the records for each additional shard we created. This was our plan: if each time we shard, the databases store half the data, we need half the Postgres box to serve the same data.

Imagine we have four records in database A before sharding and pruning: [ W | X | Y | Z ]. After sharding and pruning, database A might look like this: [ W | | Y | ]. Database B might look like this: [ | X | | Z ]. Notice the gaps. This equates to hard disk fragmentation. This started biting us in the ass and would have likely made our lives hell if we didn't already have other tricks up our sleeves.
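The gaps can be pictured with a toy model. An array stands in for the on-disk order of rows and nil for a hole left by a deleted row; this is only an illustration, not how Postgres lays out pages, and the user IDs are made up.

```ruby
Record = Struct.new(:name, :user_id)

# Rows in "on-disk" order before sharding
rows = [
  Record.new('W', 0),
  Record.new('X', 1),
  Record.new('Y', 0),
  Record.new('Z', 1)
]

# Pruning leaves a hole wherever another shard's row used to live
def prune(rows, shard_index, shard_count)
  rows.map { |row| row if row.user_id % shard_count == shard_index }
end

database_a = prune(rows, 0, 2)
database_b = prune(rows, 1, 2)

database_a.map { |row| row&.name } # => ["W", nil, "Y", nil]
database_b.map { |row| row&.name } # => [nil, "X", nil, "Z"]
```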

Say database A looks like this: [ W | | Y | ]. When I ask "give me all records for user ID 0", it should return W and Y. But W and Y are not in contiguous places on disk. So in order to service this query, Postgres must first move the disk to W, then move the disk to Y, skipping over the gaps in between. If W and Y lived next to each other on disk, the disk would not have to work so hard to fetch both records. The more work to be done, the longer the query.

Generally, when new data is added to the table, it's put in contiguous slots at the end of the disk (regardless of the user ID). We then ran a VACUUM ANALYZE on the table. Postgres now says, "oh, there's space between W and Y, I can put new data there!" So when new data is added and then fetched, Postgres needs to spin all the way back to the beginning of the disk to fetch some records, while other records for the same user are at the end of disk. Fragmentation coupled with running a VACUUM ANALYZE put us up shit creek. Users with a lot of activity simply couldn't load their feeds. The only sanctioned way to fix fragmentation is hours of downtime.

Ok, I hope you're still with me. The solution and lesson here are important. Firstly, if our Postgres boxes were on SSDs, maybe fragmentation wouldn't have been such a big deal. We weren't on SSDs. The solution for us was to build a covering index so that we could service index-only scans. Effectively, what this means is that all fields used to filter and fetch data from a table must be stored in an index. If it's all in the index, Postgres does not need to go to disk for the data. So we added a covering index for our hottest query and saw about a 100x improvement on average, up to 7000x improvements for users with a lot of activity.
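Here is a small, runnable sketch of the covering index idea. It uses SQLite through Python's built-in `sqlite3` module only because that is self-contained; it is not our Postgres setup, and the `activities` columns, the index name and the query are all hypothetical. In Postgres, the analogous thing to look for in `EXPLAIN` output is an "Index Only Scan".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activities ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER NOT NULL,"
    " created_at INTEGER NOT NULL,"
    " subject_id INTEGER NOT NULL,"
    " body TEXT)"
)
conn.executemany(
    "INSERT INTO activities (user_id, created_at, subject_id, body) VALUES (?, ?, ?, ?)",
    [(i % 50, i, i * 7, "payload") for i in range(5000)],
)

# Hypothetical "hottest query": the newest feed entries for one user.
HOT_QUERY = (
    "SELECT created_at, subject_id FROM activities "
    "WHERE user_id = ? ORDER BY created_at DESC LIMIT 20"
)


def plan(sql):
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (7,)).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail


plan_before = plan(HOT_QUERY)

# Every column the query filters on, sorts by, or returns lives in the index,
# so the table itself never has to be read.
conn.execute(
    "CREATE INDEX idx_activities_feed ON activities (user_id, created_at, subject_id)"
)
plan_after = plan(HOT_QUERY)
feed = conn.execute(HOT_QUERY, (7,)).fetchall()

print(plan_before)
print(plan_after)
```

The second plan should mention a covering index, while the first one has to scan the table. The trade-off is a larger index that must be kept up to date on every write.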

The lesson here is twofold. Serving data from memory is orders of magnitude faster than serving from disk. Be leery of serving data from disk at scale. The second lesson is equally important. We probably should have just scaled vertically as much as possible. Webscale was too sexy to avoid. "Shard all the things" is the meme I'm looking for. Sharding was challenging and a better long-term solution, but had we applied a covering index for the entire table before doing any vertical sharding, I believe we could have saved tons of time and stress by simply adding more RAM as our database grew.

Lesson 4: Don't create bottlenecks, or do

Early on, we made a decision that would have a profound effect on how we scaled the platform. You could either see it as a terrible decision or a swift kick in the ass. We chose to create an Ello user that everyone automatically followed when they joined the network. It's pretty much the MySpace Tom of Ello. The intention was good: use the Ello user for announcements and interesting posts curated from the network, by the network. The problem is most of our scaling problems originated from this user.

All of the scaling issues that would have been irrelevant for months or years were staring us right in the face within the first month of having a significant user base. Because everyone automatically followed the Ello user, just about all users would receive any content posted from that account. In effect, millions of records would be created every time the Ello user posted. This continues to be both a blessing and a curse. Database contention? Ello user is probably posting. Backed up queues? Ello user is probably posting. Luckily we control this account, and we actually had to disable it until sharding was complete and unique indexes were built.
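To get a feel for the write amplification, here is a rough sketch of fanning a single post out to every follower. The follower counts, the batch size and the `fan_out` helper are all made up for illustration; this is not how our delivery code is actually written.

```python
from itertools import islice


def fan_out(post_id, follower_ids, batch_size=1000):
    """Yield batches of (follower_id, post_id) feed rows, one row per follower."""
    followers = iter(follower_ids)
    while True:
        batch = [(follower_id, post_id) for follower_id in islice(followers, batch_size)]
        if not batch:
            return
        yield batch


def count_writes(post_id, follower_ids):
    """Return (rows, batches) needed to deliver one post."""
    rows = batches = 0
    for batch in fan_out(post_id, follower_ids):
        rows += len(batch)
        batches += 1
    return rows, batches


typical = count_writes(post_id=1, follower_ids=range(150))       # made-up typical account
ello = count_writes(post_id=2, follower_ids=range(1_000_000))    # made-up "everyone follows"

print(typical)
print(ello)
```

One post from an ordinary account is a single small batch. One post from an account that everyone follows turns into a row per user, and every one of those rows has to pass through the queues and the database.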

What seemed like a benign addition at the time ended up having prodigious impacts on how we scale the platform. Posting to the Ello account puts more load on the system than anything else, and we use this to keep tabs on our future scaling plans. Culturally, it's important for us to be able to post from the Ello account. Technically, it's a huge burden. It means that we need to scale the platform in accordance with one user, which is silly. But in retrospect it's a godsend for keeping us on our toes and being proactive about scaling.

It makes me wonder if on future projects, it would be a good idea to implement the equivalent of the Ello user. Through induction of pain, we have a better infrastructure. So the lesson here is: if the platform must stay ahead of impending scaling challenges, it's probably a good idea to self-inflict the problems early and often.

Lesson 5: It always takes 10 times longer

In the above sections, I managed to breeze through some difficult scaling lessons. Caching, sharding and optimizing are non-trivial engineering objectives. Thus far, I've been taken aback by just how difficult these endeavors end up being in practice.

Take caching the graph in Redis as an example. Going into it, it felt like something that could have been accomplished in a few days. The data's there, all we need to do is put it in Redis, and start directing traffic to Redis. Great, so step one is to write the scripts to migrate the data, that's easy. Step two is to populate the data in Redis. Oh, that's right, there are tens of millions of records that we're caching in multiple ways. Well, it'll take at least a couple hours to work our way through that many records. Yeah, but what about capturing the data that was inserted, updated and deleted within those two hours? We have to capture that as well. Better not have a bug in the migration process or there goes a few days of your life you'll never get back.
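The shape of that problem can be sketched as "backfill from a snapshot, then replay whatever changed in the meantime". In the sketch below plain Python dicts stand in for Postgres and Redis, and every name is hypothetical. It shows the general idea under those assumptions, not our actual migration scripts.

```python
postgres = {"user:1": ["2", "3"], "user:2": ["1"], "user:3": ["1"]}  # stand-in for the source data
redis_cache = {}                                                     # stand-in for Redis
change_log = []                                                      # writes captured during the backfill


def write(op, key, value=None):
    """Apply a live write to the source and record it for later replay."""
    if op == "delete":
        postgres.pop(key, None)
    else:
        postgres[key] = value
    change_log.append((op, key, value))


def backfill(snapshot, cache):
    """Copy a point-in-time snapshot into the cache (the slow, hours-long part)."""
    for key, value in snapshot.items():
        cache[key] = value


def replay(log, cache):
    """Re-apply, in order, every write that happened while the backfill was running."""
    for op, key, value in log:
        if op == "delete":
            cache.pop(key, None)
        else:
            cache[key] = value


snapshot = dict(postgres)  # what the backfill will see

# The site stays live while the backfill runs: inserts, updates and deletes keep coming.
write("upsert", "user:4", ["1"])
write("upsert", "user:2", ["1", "3"])
write("delete", "user:3")

backfill(snapshot, redis_cache)  # cache now matches the stale snapshot
replay(change_log, redis_cache)  # cache catches up with the live data

print(redis_cache == postgres)
```

Skip the replay step, or get the ordering wrong, and the cache quietly disagrees with the source of truth. That is the kind of bug that eats days.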

The sheer amount of practice alone for sharding can't be accounted for with a point-based estimate. Don't mess it up or you'll lose millions of people's data. No pressure. But say you've practiced enough to get comfortable with the process and you're confident it will go well. Things will always arise. We added more shards and realized our pgbouncer pool size was maxed out. Since the system was live and new data was being written to the new shards, we couldn't revert the changes or we'd lose data. We had to figure out on the fly that the non-intuitive errors meant we needed to increase the pool size. We didn't predict that disk fragmentation was going to be a huge problem, either, and it ended up becoming a top priority.

While trying to apply a unique index to the activities table, who would have thought there were so many duplicates? The initial strategy was to attempt to build the index, and when it failed, let the error message tell us where we had to remove duplicates. Building an index is slow, duh, that won't scale if we have to attempt to build the index thousands of times. Ok, so write a query to remove the duplicates first. But wait, you can't just execute a blanket query against a billion records, it will never finish and could hold heavy locks for hours at a time. Ok, so page through all the users, and scope the query so it only removes duplicates for a subset of users. That works, but unfortunately there were a ton of orphaned rows for users that no longer existed. So while paging through all the users that currently exist, the query is not deleting records for users who no longer exist and for some reason have orphaned records. Ok, so write a query to remove all activities for orphaned users. But wait, since the activities table doesn't live in the same database as the users table, you can't join against the users table to determine which activities records are orphaned. Not that that would scale anyway.
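The paged cleanup, and the orphan problem it runs into, can be reproduced on a toy scale. This sketch uses SQLite via Python's `sqlite3` rather than Postgres, and the schema, the batch size and the notion of what counts as a duplicate are assumptions made purely for illustration.

```python
import sqlite3

activities_db = sqlite3.connect(":memory:")
activities_db.execute(
    "CREATE TABLE activities ("
    " id INTEGER PRIMARY KEY, user_id INTEGER, subject_id INTEGER, kind TEXT)"
)
activities_db.executemany(
    "INSERT INTO activities (user_id, subject_id, kind) VALUES (?, ?, ?)",
    [
        (1, 10, "post"), (1, 10, "post"),                  # one duplicate
        (2, 11, "post"),
        (3, 12, "love"), (3, 12, "love"), (3, 12, "love"),  # two duplicates
        (99, 13, "post"), (99, 13, "post"),                # user 99 no longer exists
    ],
)

# Assumption: these IDs come from the users table, which lives in a different database.
existing_user_ids = [1, 2, 3, 4]


def batches(ids, size):
    """Yield consecutive slices of `ids`, at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def remove_duplicates(db, user_ids):
    """Delete all but the oldest copy of each activity for the given users."""
    marks = ",".join("?" * len(user_ids))
    cursor = db.execute(
        f"DELETE FROM activities WHERE user_id IN ({marks}) AND id NOT IN ("
        f" SELECT MIN(id) FROM activities WHERE user_id IN ({marks})"
        f" GROUP BY user_id, subject_id, kind)",
        user_ids + user_ids,
    )
    return cursor.rowcount


# Small scoped deletes instead of one blanket query over the whole table.
deleted = sum(remove_duplicates(activities_db, batch) for batch in batches(existing_user_ids, 2))
remaining = activities_db.execute("SELECT COUNT(*) FROM activities").fetchone()[0]
orphaned = activities_db.execute(
    "SELECT COUNT(*) FROM activities WHERE user_id = 99"
).fetchone()[0]

try:
    activities_db.execute(
        "CREATE UNIQUE INDEX idx_activities_unique ON activities (user_id, subject_id, kind)"
    )
    index_built = True
except sqlite3.DatabaseError:
    index_built = False  # orphaned duplicates still violate uniqueness

print(deleted, remaining, orphaned, index_built)
```

Paging through existing users cleans up everything that belongs to them, but the duplicate rows for the missing user survive and the unique index still refuses to build. That is exactly the wall we hit.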

Sorry for rambling, but you get the point. The lesson here is for your mental health during a time of rapid scale: plan on everything taking 10 times longer than you anticipate. It just will.

Closing thoughts

You may have noticed a recurring theme within this article: the quantity of data is a big offender. There are other scaling challenges including team size, DNS, bot prevention, responding to users, inappropriate content, and other forms of caching. All of these can be equally laborious, but without a stable and scalable infrastructure, the opportunity for solving them diminishes.