As an industry we, for the most part, know how to scale up our software. […]We also know how to scale up our organizations, putting in the necessary management structures to allow thousands of people to work together more or less efficiently.

On the other hand, I’d argue that we don’t really yet have a good handle on how to scale that area that exists at the intersection of engineering and human organization. […] And, worse, it often seems that we don’t even understand the importance of scaling it as we go from that first line of code to thousands of engineers working on millions of lines of code.

Peter’s piece consists of two main parts. The first part is a play-by-play history of Twitter’s code base and development methodologies, highlighting the key areas where a focus on “engineering effectiveness” would have helped.

The second part decomposes “engineering effectiveness” into three main areas:

Reduction of tech debt first where it tends to accumulate the most (tooling) and then elsewhere in the code base

Help in the dissemination of good practices (around code reviews, design docs, testing, etc.) and the reduction of bad practices

Building tools which help engineers do their job better

In that second part Peter also suggests a model to determine the optimal level of investment in “engineering effectiveness” (ee):

Where “E” is total effectiveness (which we’re trying to maximize), “eng” is the total engineering headcount, “ee” is the engineering effectiveness group headcount, “b” is the boost that the first engineering effectiveness hire brings to the rest of the engineering team, and “s” is the scaling impact that each additional engineering effectiveness hire contributes (0 < s < 1, since we should assume diminishing returns).

Assuming b=0.02 (2% effectiveness boost) and s=0.7, for a total engineering headcount of 10, 100, 1000 and 10000, he gets an optimal ee headcount of 0, 2, 255, and 3773 respectively. As the engineering org scales, a larger portion of the total headcount should be dedicated to making the rest of the engineering org more effective, with ~100 engineers being the inflection point of making the investment worthwhile (for these b and s values).
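The formula itself isn’t reproduced here, but a form consistent with the variables defined above (and with the headcounts quoted above) is E = (eng − ee) × (1 + b × ee^s): the non-EE engineers, each boosted by the EE group’s contribution. A brute-force sketch in Python, under that assumption:

```python
# Sketch of an engineering-effectiveness model consistent with the
# variables described in the text. Assumed form:
#   E = (eng - ee) * (1 + b * ee**s)
# The exact equation isn't reproduced in the article, so treat this as
# an illustration, not Peter's definitive formula.

def total_effectiveness(eng, ee, b=0.02, s=0.7):
    """Effective output of an org with `eng` total heads, `ee` of them on EE."""
    return (eng - ee) * (1 + b * ee**s)

def optimal_ee(eng, b=0.02, s=0.7):
    """Brute-force the EE headcount that maximizes total effectiveness."""
    return max(range(eng + 1), key=lambda ee: total_effectiveness(eng, ee, b, s))

for eng in (10, 100, 1000):
    print(eng, optimal_ee(eng))
```

Note how flat the curve is near the optimum: for eng = 1000, a few heads either way barely changes E, so the exact optimal ee matters far less than being in the right ballpark.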

Another important aspect here is in providing guidance on the type of initiatives that such organizations should take on: breadth very quickly trumps depth – making 1,000 engineers 2% more effective (a gain of ~20 engineer-equivalents) has a much greater overall impact than making 10 engineers 50% more effective (a gain of 5).

This model is particularly interesting since it can easily be generalized for any other group whose mission is to help a larger part of the org be more effective. These support groups, in companies that are wise enough to have them, tend to be staffed and funded based on a fixed headcount ratio to the total headcount of the org they support. Peter’s analysis suggests that when those organizations scale significantly, the traditional approach will lead to under-investment. Adopting this more refined methodology, and having a thoughtful conversation about the appropriate “s” and “b” values for the particular use case, will likely lead to a better outcome.

People tend to be defensive about the responsibilities they own in a company. It’s natural that they struggle with giving those responsibilities to new employees and trusting that they’ll do as good a job as they did. And yet, giving away responsibility is exactly what we need them to do in order to effectively scale the company. As Molly puts it: “giving away responsibility — giving away the part of the Lego tower you started building — is the only way to move on to building bigger and better things”. More people does not mean less work for the people already there; it means the company can do more as a whole.

Her advice to managers is to be proactive in communicating about this challenge. Acknowledge that this feeling of defensiveness around giving away responsibility is completely normal, but that getting beyond this initial, emotional reaction is exactly what the company needs them to do in order to be successful. Focusing on the bright, new, shiny Lego tower that you need them to build next is also a good idea.

Molly argues that the true scaling chaos happens approximately when the company has 30-750 people (every company is a bit different). Beyond that, the scaling challenges manifest themselves mostly on a departmental level, rather than a corporate level. She identified three distinct growth phases in which scaling presents different challenges:

30 – 50 people: communication, which has been almost effortless until that point, becomes exponentially more challenging. The best solution here is to start putting things down on paper: mission, values, philosophies, etc. and being particularly mindful about over-communicating them.

50-200 people: this is the most critical phase in the shaping of the company culture. Thought and focus must be directed to building the systems that’ll take the values off the paper and make them real. One of the hardest and most important aspects of this is pruning the talent pool – letting go of the people who are not a good fit for the culture we’re trying to create. It should only take a couple of months to assess whether someone is a good culture fit. And if the answer is “no” – action must be taken quickly.

200-750 people: At this point, the personality and habits of the organization are pretty much molded. The focus now shifts to scaling and preserving them as more people join. Onboarding, training and other business practices are key. Any desired cultural change at this point will be challenging, and must be undertaken deliberately, assuming a lot of work will have to be done by the CEO and leadership team in order to make it happen.

As readers of this blog can probably guess by now, going down research rabbit holes, in which one interesting read leads to another, which leads to another, which leads to another – is one of my favorite pastimes. I find the discovery process just as satisfying and rewarding as the content itself.

Clay addresses the common phenomenon of companies becoming “bad” as they become “big”, and proposes a set of principles that may chart a path for an alternative outcome.

He starts by identifying a set of seven Performance Criteria for organizations:

Purpose: The work we do here is important to us and to our customers; We gather according to a clear purpose

Fitness: Those who consume our product believe in its quality; We achieve impressive things together; We make sense in the world

Vitality: Not just fun, but vital, life-giving; Our lives improve as a result of our membership in the organization; We get energy from our work; Our culture is contagious

Fairness: We make decisions taking everyone’s needs and advice into account; There is a strong sense of justice; Everyone in the organization has the same basic rights

Power: More, and more forms of power for all; Power is spread throughout the organization, not just kept in the hands of a few

Connection: Boundaries between teams are permeable; We don’t see our users as outsiders; We offer signals generously so others can learn from us

Safety: People stay in the organization by choice; I won’t be let go for personal reasons; It’s easy to do good work; We have what we need to succeed

He then charts a path for achieving the desired change. I’ll give you a quick taste of the first one:

We must replace tyranny with the rule of law:

“Large firms mostly have good corporate governance… these tools actually stand in the way of a true application of the Rule of Law … “governance” typically comes in the form of a standing meeting where a group of subordinates recommend a decision to one or more senior officials, who either say yes, no, or go do more digging. This is tyranny, or at the very least, it’s a feudal approach to organizing human work. The alternative approach is one where we don’t let a single person or a small group make arbitrary decisions that impact a whole, but instead we vest authority in a system”

We must replace central planning with market forces

We must replace opacity with transparency

Each one of those is supported by a small set of “even over” statements – choosing to do one good thing, even over another good thing. If that construct sounds familiar, it’s because Holacracy’s approach to strategy uses it as well:

Elect even over Select

Process even over Decide

Describe even over Prescribe

Focus even over Help

Open even over Closed

Pull even over Receive

If I managed to pique your interest, you should definitely read Clay’s full piece.

Ben argues that the key scaling challenges are driven by three core components becoming much more difficult as the organization grows in size:

Communication

Common Knowledge

Decision Making

Avoiding their degradation altogether is impossible, so what we’re really trying to do is “give ground grudgingly” – slow the degradation down as much as possible using three key levers. Because each lever comes with a trade-off of increased complexity, “giving ground grudgingly” is the right strategy here, and the levers should be applied with their impact on the three core components in mind.

Specialization

It is typically necessary to apply this lever first, but it’s also the one with the most challenging side-effects: hand-offs, conflicting agendas, etc. The next two levers aim to mitigate these negative effects.

Org Structure

There is no perfect org design, since there is no way to completely eliminate the negative side-effects of specialization. Organizational design has a substantial impact on the company’s communication architecture, both internally and externally – and this is the key to utilizing it effectively, using the following steps:

1. Figure out what needs to be communicated – key pieces of knowledge and who needs to have it

2. Figure out what needs to be decided – try to minimize the number of decision makers that need to be involved in making the most frequent and critical decisions

3. Prioritize the most important communication and decision paths – every org design represents a trade-off…

4. Decide who’s going to run each group

5. Identify the paths that you did not optimize

6. Build plans for mitigating the issues identified in step 5 – typically by applying the next lever:

Process

The purpose of process is communication. It’s a formal, well-structured communication vehicle, meant to ensure that:

Communication happens

It happens with quality

The people who are already doing the work are the ones who are in the best position to design the necessary process, keeping a few best practices in mind:

Focus on the output first

Figure out how you’ll know if you are getting what you want in each step – usually via some form of measurement

Engineer accountability into the system – which organization/individual is responsible for each step. Make their performance visible.

He argues that complexity is the biggest enemy of scaling. And complexity, in turn, is driven by four different attributes of the system:

States. When there are many elements in the system and each can be in one of a large number of states, figuring out what is going on, and what you should do about it, becomes impossible.

Interdependencies. When each element in the system can affect each other element in unpredictable ways, it’s easy to induce harmonics and other non-linear responses, driving the system out of control.

Uncertainty. When outside stresses on the system are unpredictable, the system never settles down to an equilibrium.

Irreversibility. When the effects of decisions can’t be predicted and they can’t be easily undone, decisions grow prohibitively expensive.

A successful complexity-fighting strategy must focus on eliminating one of those attributes completely and figure out a way to manage the rest.

Since uncertainty is a product of outside forces that are, by definition, beyond our control, it is extremely hard to design a strategy focused on it. Strategies focused on reducing the number of states are effective in some cases (Henry Ford’s mass production is a notable example). But in complex systems, like software, managing states and predicting interdependencies is incredibly difficult. So instead, Facebook chose to focus on eliminating irreversibility in almost everything they do with their software: from introducing “feature switches”, through gradual deployments, to having backup internal communication channels in case the site crashes.
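As a concrete illustration of the “feature switch” idea (the names and rollout scheme here are my own, not Facebook’s actual tooling): a percentage-based flag makes a launch reversible by turning a dial rather than redeploying.

```python
# Minimal sketch of a percentage-based feature switch -- one common way
# to make a deployment reversible. Function names and the bucketing
# scheme are illustrative, not Facebook's actual tooling.
import hashlib

def is_enabled(feature: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into [0, 100) and compare to the dial."""
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# Turning the dial from 0 to 100 gradually exposes users to the feature;
# turning it back to 0 instantly "undoes" the launch -- no redeploy needed.
```

Because each user’s bucket is fixed, anyone enabled at 30% stays enabled at 60% – the rollout only ever grows (or shrinks) monotonically as the dial moves.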

A rather compelling case for auditing the full spectrum of decisions your organization makes, identifying the ones that are currently irreversible, and figuring out what it would take to make them reversible.

Where C is the organization’s capacity (max throughput), N is the number of people, and α and β are the contention and coherence coefficients, respectively.

Contention – measures the impact on capacity of waiting/queuing for resources, more commonly known as the “bottleneck factor”. The contention coefficient reflects your ability to effectively delegate work and not become a bottleneck: a smaller coefficient reflects better delegation.

Coherence – measures the cost of reaching agreement on the right thing to do. The coherence coefficient reflects the decision-making autonomy within the organization. The more people need to be involved in making a decision, the higher the coefficient, and the capacity return from adding more people will shrink and may even become negative.
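The equation referenced above isn’t reproduced, but the variable definitions match Gunther’s Universal Scalability Law, C(N) = N / (1 + α(N − 1) + βN(N − 1)). A small sketch with illustrative (made-up) coefficient values:

```python
# The variable definitions in the text match Gunther's Universal
# Scalability Law. Coefficient values below are illustrative, not
# taken from the article.

def capacity(n, alpha, beta):
    """USL: throughput of n people given contention (alpha) and coherence (beta)."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# With a nonzero coherence cost, capacity peaks and then *declines*:
# past the peak, adding people makes the whole org slower.
alpha, beta = 0.05, 0.001
peak_n = max(range(1, 201), key=lambda n: capacity(n, alpha, beta))
```

With these values the peak lands around 30 people: beyond that, the quadratic coherence term (everyone agreeing with everyone) dominates, which is exactly the “negative capacity return” scenario described above.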

When trying not to become a bottleneck, we have to fight our natural tendency to be helpful to anyone who asks us to help with or contribute to a project. Little’s Law shows how, by doing so, we’re not being helpful at all:

average_wait_time = work_in_progress / throughput

Since our throughput is fixed (there are only so many hours in a day), the only thing we control is our “work in progress” – the number of concurrent projects we take on. The more projects we take on, the longer our average wait time gets, and the more of a bottleneck we become on all of them…
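Plugging some made-up numbers into the formula makes the effect concrete:

```python
# Worked example of Little's Law with illustrative numbers: at a fixed
# throughput, every additional project you accept stretches the average
# wait for everything already in your queue.

def average_wait_weeks(work_in_progress, throughput_per_week):
    """average_wait_time = work_in_progress / throughput"""
    return work_in_progress / throughput_per_week

# At 2 projects/week, 4 projects in flight means a ~2-week wait;
# saying yes to 4 more "helpful" commitments doubles everyone's wait.
```

Doubling WIP doubles the wait for every project in the queue – being “helpful” to the eighth requester made the first seven wait twice as long.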

This gets compounded by the fact that requests don’t come into our queue at a uniform rate. We know from Kingman’s Formula:

That as our utilization (ρ) increases, wait time increases dramatically. Working too close to 100% utilization will grind things to a halt. Adrian suggests choosing a WIP limit that will result in spending about 60% of your time on operational activities – ones that will slow down the rest of the org if you don’t process them promptly. Operational activities beyond that limit should either be delegated or discarded. The remaining 40% should be used for more strategic activities that will not have a negative cascading impact on the organization if they are put on hold during short periods of high operational load.
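The formula itself isn’t reproduced above, but Kingman’s well-known G/G/1 approximation is E[W] ≈ ρ/(1 − ρ) · (Ca² + Cs²)/2 · τ, where τ is the mean service time and Ca², Cs² are the squared coefficients of variation of arrivals and service. A sketch with illustrative values:

```python
# Kingman's G/G/1 approximation for mean wait time:
#   E[W] ~= rho/(1 - rho) * (ca2 + cs2)/2 * service_time
# where rho is utilization and ca2/cs2 capture the variability of
# arrivals and service. Parameter values below are illustrative.

def kingman_wait(rho, ca2=1.0, cs2=1.0, service_time=1.0):
    """Approximate mean wait time at utilization rho (0 < rho < 1)."""
    return (rho / (1 - rho)) * ((ca2 + cs2) / 2) * service_time

# Wait time blows up as utilization approaches 100%:
for rho in (0.6, 0.8, 0.9, 0.95, 0.99):
    print(rho, kingman_wait(rho))
```

The ρ/(1 − ρ) term is what makes Adrian’s ~60% target sensible: the wait at 60% utilization is modest, while the same queue at 95% utilization keeps everyone downstream waiting an order of magnitude longer.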