What Second Life can teach your datacenter about scaling Web apps

Over the past decade, building large-scale online applications has become a pretty well-understood science with numerous books, papers, periodicals, forums, and conferences devoted to the subject. The Web overflows with advice and prescriptions for achieving high reliability at massive scale.

Trouble is, implementing the best scaling practices is not free, and they are often neglected early in a product's lifecycle. Small teams use modern frameworks to quickly develop useful applications, with little need to worry about scale: today you can run a successful application on very little infrastructure... at least, up to a point. Past this point lies an uncomfortable middle ground, where small teams face scaling challenges as their system becomes successful, often without the benefit of an ideal design or the resources to implement one. This article lays out some pragmatic advice for getting past this point in the real world of limited foresight and budgets.

Learning lessons from Second Life

Most of this information is based on my experience working on Second Life at Linden Lab from 2001 to 2009. SL is a highly complex virtual world, incorporating the features of Web services, online games, 3D modeling and programming tools, IM and VOIP, and so on. Between 2006 and 2007 the userbase grew dramatically; growth has since become more manageable, but it continues today. We ran into all manner of scaling challenges and had mixed success meeting them; ultimately SL did grow to meet the new levels of demand, but we certainly made some mistakes, and there were periods when the reliability of the system really suffered.

As I lay out my advice to teams facing scaling challenges, I'll be referring to these experiences; if I had known then what I know now, SL customers would have had a better experience—and I would have gotten a lot more sleep.

Second Life usage grew by a factor of ten between 2006 and 2007 (Courtesy Linden Lab)

So how do you get from here (a simple system on a commodity stack) to there (a robust system which can be confidently expanded to meet any level of demand)? Plenty of pixels have been spilled on the subject of where you should be headed: to single out one resource at random, Microsoft presented a good paper ("On Designing and Deploying Internet-Scale Services" [PDF]) with no fewer than 71 distinct recommendations. Most of them are good ("Use production data to find problems"); few are cheap ("Document all conceivable component failure modes and combinations thereof"). Some of the paper's key overarching principles: make sure all your code assumes that any component can be in any failure state at any time, version all interfaces so that they can safely communicate with both newer and older modules, practice a high degree of automated fault recovery, and auto-provision all resources. This is wonderful advice for very large projects, but herein lies a trap for smaller ones: the belief that you can "do it right the first time." (Or, in the young-but-growing scenario, "do it right the second time.") This is unlikely to be true in the real world, so successful scaling depends on adapting your technology as the system grows.

While developers should certainly try to make scaling-friendly design choices early on, there are many cases where taking the best advice on scalability can drastically increase development cost (assuming time = money). As a simple example, consider the common notion that a system should tolerate all failures in all its internal components. To accomplish this, all interface code everywhere in the system must check for a variety of failure conditions and (presumably) do something intelligent with them. Do they retry? What if the problem component is overloaded? Can the client detect that? Should the user be given an error, or simply queued up? What if there is a partial-failure condition, where responses take far longer than expected? Does all of this interface code need to be non-blocking? And so forth. Even attempting to answer all these questions can eat up a lot of engineering time—time that your team may not have. Developers do not always want to admit this (especially early on in a project), but implementing everything "correctly" can risk not finishing the project at all, or having to rush through the later stages in order to get something out the door. In these cases, it's better to ship, or retain, a design with known deficiencies.
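To make the cost concrete, here is a minimal sketch (in Python, with a hypothetical profile service and endpoint names invented for illustration) of what even one "tolerant" internal call starts to look like once hard failures, overload, and slow responses are all handled:

```python
import random
import time

import requests  # assumes the common third-party HTTP library


class ServiceUnavailable(Exception):
    """Raised when a downstream component stays unhealthy after retries."""


def fetch_profile(user_id, base_url="http://profile-service.internal",
                  retries=3, timeout=2.0):
    """Call a hypothetical profile service, handling the failure modes
    discussed above: hard failures, overload, and slow responses."""
    for attempt in range(retries):
        try:
            resp = requests.get(f"{base_url}/users/{user_id}", timeout=timeout)
            if resp.status_code == 503:
                # Overloaded: back off exponentially, with jitter, then retry.
                time.sleep((2 ** attempt) + random.random())
                continue
            resp.raise_for_status()
            return resp.json()
        except requests.Timeout:
            # Partial failure: a response that takes far longer than expected
            # is treated as retryable rather than blocking indefinitely.
            continue
    # Out of retries: the caller still has to decide whether to queue the
    # request, show the user an error, or serve stale data.
    raise ServiceUnavailable(f"profile service unreachable for user {user_id}")
```

Even this toy version ducks the hardest questions -- whether the call should be non-blocking, how to distinguish overload from outage -- and multiplying it across every interface in the system is exactly the engineering time the paragraph above warns about.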

In the rest of the article I'll survey some of the big issues likely to arise during scale-out, along with strategies for prioritization and mitigation.

Requirements

The first area most projects tackle (and hence the first place they get into trouble) is correctly identifying the business need. How large does the system have to become? This is generally a tough question to answer, but looking at basic constraints can be informative. If a recurring billing system needs to touch each user annually, and the product is only available to Internet users in the US and Europe, and by the biggest estimates will achieve no more than 10% penetration, then it needs to handle about 2-3 events per second (1bn * 75% * 10% / (365 * 86,400)). Conversely, a chat system with a similar userbase averaging 10 messages/day, concentrated during work hours, might need to handle 20,000 messages per second or more (1bn * 75% * 10% * 10 * 2 / 86,400). The difference is vast, and while it may seem obvious here, in more nuanced scenarios it is easy to make a bad assumption about volume, which can lead to inadequate designs or testing, followed by nasty surprises in production.
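These back-of-the-envelope numbers are worth scripting so they can be revisited as assumptions change; here's a sketch reproducing the arithmetic above (the population, penetration, and peak-factor figures are the assumptions from the text, not measurements):

```python
SECONDS_PER_DAY = 86_400

# Shared assumptions from the text: ~1bn Internet users in the US and
# Europe, of which 75% are addressable and 10% penetration is the ceiling.
users = 1_000_000_000 * 0.75 * 0.10  # 75 million users

# Billing: one event per user per year, spread evenly.
billing_eps = users / (365 * SECONDS_PER_DAY)
print(f"billing: {billing_eps:.1f} events/sec")  # ~2.4

# Chat: 10 messages/user/day, doubled to approximate the work-hours peak.
chat_mps = users * 10 * 2 / SECONDS_PER_DAY
print(f"chat: {chat_mps:,.0f} messages/sec")  # ~17,400
```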

Just as important are reliability targets: can the system be shut down at regular intervals? What are the consequences of failing the various types of requests (potentially severe for the billing system, minor for the chat system)? If some of these requirements are very stringent, it's even more important to compare them to the business reality: will reaching a midway growth milestone give you the luxury of additional time or resources to then produce a larger, more expensive design? This is a simple exercise, but many teams fail to do it thoroughly. If a team were building both the hypothetical billing and chat systems above, and put in the time to give the chat system a million-message-per-second capacity while making the biller rock-solid up to 10,000 transactions per day, it would have aimed for exactly the wrong goalposts: over-investing in chat while under-investing in billing, guaranteeing a scramble to catch up with billing load down the road while chat capacity sat idle.

Reader Comments

So, what DB are they running now? I assume MySQL was chosen because of cost, but I'm wondering what the impact of using PostgreSQL (or one of the other open source databases) would have been, since from what I understand it scales much better.

From the article:
The conventional wisdom of scaling best practice recommends, in an ideal world, documenting and fully handling all possible errors. In the real world, even discovering what errors are possible, or likely, or common, can be a major challenge.

So glad to see this acknowledged. I've worked with folks who've bought into the whole doctrine of handling every error, but who've had little patience for the challenge this represents. Ultimately, you're practically designing two things at once... the system, and the system's secret service detail, which always knows where everyone is and what they're doing at any given time. It's a lot of work, but it should be taken very, very seriously.

Great article. As an IT Architect in a non-gaming industry, I've always been fascinated by the challenges that must be faced when implementing systems to support the transaction volumes and growth patterns seen by applications such as Second Life.

I found the specific examples you provided to be particularly interesting.

Originally posted by PhazeZero:
Great article. As an IT Architect in a non-gaming industry, I've always been fascinated by the challenges that must be faced when implementing systems to support the transaction volumes and growth patterns seen by applications such as Second Life.

I found the specific examples you provided to be particularly interesting.

Thanks for taking the time to share your experience!

I completely agree, this was an excellent article. I'm working on a similar project myself, and we're almost at that "middle ground" stage. And it's pretty scary.

As a former user of SL I can say that the biggest problem they had was lack of focus. While all this back end crap was going on, they were constantly adding new features to the front end client. In some cases this adversely affected the back end (the Map change mentioned in the story), but in most cases it simply siphoned off resources into development, when they really needed to be working on infrastructure while letting the client remain relatively stable.

Sure, Linden's response to this was always "but the game devs are different resources than the infrastructure guys" - right, but they are all still paid in dollars, and they all compete for the relatively limited attention of management.

@joshv: over-constrained teams tend to underdocument, which means that breaking in the new guy can take months before the team actually "wins back" the training investment. It's very easy to fall into the mindset that it's better to power through the rough patch than to slow down even more while the new guy learns the ropes.

Great article. This is exactly why I read Ars. There's so much more to this article than a dozen articles on the iPad or Internet Explorer's falling market share. So much so that it inspired me to blog.

Excellent article! I'm a software architect with expertise in large telecom systems, and I've seen some of these situations arise.

For requirements, sometimes you need to think carefully about your most critical areas. While billing going down delays revenue, chat going down could cause cancellations, eliminating revenue. Which is really more critical? (In telecom it's generally network, ordering, then billing -- if a customer calls to order service and you can't fulfill that order, that might be a permanently lost customer; if bills go out late you're losing interest on working capital, which might not be nearly as bad.)

Beware of non-linear system response. As in the article's example, a batch process might slowly take longer without being a problem until it suddenly causes crashes. The same applies to 3rd party tools. I was part of a team doing DB performance testing for a European country a number of years ago. The DB vendor had just enhanced the system to allow tables with more than 2^32 rows. We discovered during testing that if an index had more than 2^32 pages, the system crashed. Fortunately our test cases found this before the system went into production, and the vendor gave us great turnaround support to fix it.

Upgrade paths for databases are a big concern, as noted. Archiving can be too. On a transaction system if you have to keep the data for a period of time, eventually you have to remove it as fast as you're adding it to maintain steady state table size. When building a system it's more natural to size for handling incoming data, but that might only be half the story.
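A minimal sketch of what the removal side can look like, assuming a hypothetical MySQL `transactions` table indexed on `created_at` (table name, column, and retention policy are invented for illustration); deleting in small batches keeps lock times and replication lag manageable:

```python
import time

RETENTION_DAYS = 365   # how long the business requires rows to be kept
BATCH_SIZE = 10_000    # small batches keep lock times and replica lag down

def purge_expired(conn):
    """Delete expired rows in batches so table size holds steady.
    `conn` is any DB-API connection to a MySQL-flavored database."""
    while True:
        cur = conn.cursor()
        cur.execute(
            "DELETE FROM transactions"
            " WHERE created_at < NOW() - INTERVAL %s DAY"
            " LIMIT %s",
            (RETENTION_DAYS, BATCH_SIZE),
        )
        conn.commit()
        if cur.rowcount < BATCH_SIZE:
            break        # caught up; nothing left to purge for now
        time.sleep(1)    # pause between batches to spare the database
```

The point being that the delete path needs the same capacity planning (and the same indexes) as the insert path, or the table quietly grows forever.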

Finally, when you're scaling your product, make sure you're thinking about scaling your organization too. You can add support staff one at a time, but at some point the person managing them is overloaded and you need to add a layer to the org chart. Make sure you plan on that before the crisis, so you can train people and define processes before it gets too crazy.

BeowulfSchaeffer - I believe SL still runs MySQL, made possible largely through horizontal data partitioning and some pretty well-developed tools. We picked it originally (in 2001) largely for ease of deployment and management, and later found that switching databases was not a simple proposition since we weren't using an abstraction layer. I can't really comment on the current situation, but when I left, the scaling effort was largely aimed at reducing database dependence in general.
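For readers unfamiliar with the approach: at its core, horizontal partitioning is just a deterministic user-to-database mapping, something like the toy sketch below (shard names are placeholders; the "pretty well-developed tools" are what handle the hard parts, like resharding and data migration):

```python
import hashlib

# Hypothetical shard map -- in practice this lives in configuration or a
# directory service so shards can be added or rebalanced over time.
SHARDS = [
    "mysql://db01.internal/users",
    "mysql://db02.internal/users",
    "mysql://db03.internal/users",
]

def shard_for(user_id: int) -> str:
    """Map a user to a shard with a stable hash (not Python's built-in
    hash(), which can vary between runs)."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every query about a given user goes to shard_for(user_id); queries that
# span users either fan out to all shards or move to a separate store.
```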

Good article, but a lot of these recommendations (particularly about hardware) aren't possible to implement until you're a certain size. I'm guessing most people reading ars and likely to encounter this situation are going to be somewhere between a single server and a cage full of them.

I come at this from a startup perspective. I've watched companies blow $250,000 on hardware before even shipping a product. In my opinion, in the year 2010, hardware (and technology in general) is a distraction for startups. You need to focus on developing your product and building an organization to sell and support it. Honestly, the environment is expensive, requires specialized expertise, and you're not likely to have enough of it to keep someone busy full-time (so you either pay someone full-time and have them sit on their ass, or hire consultants at 3x what it would cost to hire someone full-time). If you buy hardware too early, it may be obsolete before your product really takes off; buy it too late and it can tank your product.

If I were starting a software company today, there's no question I'd use the cloud. Let someone else worry about the hardware, storage and network. That way you can focus on the other problems (scalable software and scalable organizations) which are, frankly, more interesting than a problem that someone has already figured out how to package, sell, and market to companies exactly like you. Amazon EC2 is a beautiful system, and their billing makes sense to the point where you can script your usage (and they're not the only ones doing it).
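To illustrate what "scripting your usage" can mean in practice, here is a rough sketch using AWS's modern boto3 SDK (which postdates this discussion); the AMI ID, tag scheme, and sizing policy are all placeholders:

```python
import boto3  # AWS SDK for Python; used here purely as an illustration

ec2 = boto3.client("ec2", region_name="us-east-1")

def scale_workers(desired, ami="ami-12345678", itype="t3.micro"):
    """Grow or shrink a pool of tagged worker instances to `desired`.
    The AMI, tags, and instance type are illustrative placeholders."""
    resp = ec2.describe_instances(Filters=[
        {"Name": "tag:role", "Values": ["worker"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    running = [inst["InstanceId"]
               for r in resp["Reservations"] for inst in r["Instances"]]
    if len(running) < desired:
        count = desired - len(running)
        ec2.run_instances(
            ImageId=ami, InstanceType=itype, MinCount=count, MaxCount=count,
            TagSpecifications=[{"ResourceType": "instance",
                                "Tags": [{"Key": "role", "Value": "worker"}]}],
        )
    elif len(running) > desired:
        ec2.terminate_instances(InstanceIds=running[: len(running) - desired])
```

Because usage is metered, a scheduled job calling something like this against your traffic numbers amounts to a capacity plan you could never afford in physical hardware.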

Even still, there's a lot of good information about those other problems in this article. One of the most interesting on ars in a while.

Originally posted by joshv:
Sure, Linden's response to this was always "but the game devs are different resources than the infrastructure guys" - right, but they are all still paid in dollars, and they all compete for the relatively limited attention of management.

So you'd fire the front-end developers (who are presumably well versed in your product) and hire operations guys that are probably going to take 3 months to become effective?

Part of the problem I've seen in some companies is that operations gets overwhelmed and can't keep up with the pace of development. That leads developers to create systems that can't be run well, because they either intentionally do an end run around operations (which is given the secret nod by management because it "gets the job done"), or they just go ahead building product, and by the time the ops team gets involved, it's too late to go back.

My company will be launching our web service this spring, and these issues are what I'm working extremely hard to prepare for. I really appreciate the general advice from someone who's clearly seen things from the inside! This definitely makes me feel good about the things we're focusing on, and also has me thinking about some new things as well.

Originally posted by Exelius:
If I were starting a software company today, there's no question I'd use the cloud. Let someone else worry about the hardware, storage and network. That way you can focus on the other problems (scalable software and scalable organizations) which are, frankly, more interesting than a problem that someone has already figured out how to package, sell, and market to companies exactly like you. Amazon EC2 is a beautiful system, and their billing makes sense to the point where you can script your usage (and they're not the only ones doing it).

While I agree that having managed hosting or leased servers/space is probably a good idea for most startups, "the cloud" is not a magical scaling/operations panacea. It's a good solution for many, but EC2 comes with its own drawbacks.

This is a fantastic article! Not only does it nail the key challenges, it is an extremely well written and engaging piece.

I am a software architect of very large corporate systems, and I got a good chuckle reading through the characterization of these very difficult and familiar problems which are not limited to gaming systems.

One of the things I've found most interesting in my experience is that accurately predicting or simulating some of the behaviour of these large online systems is impossible. You can certainly predict many things, and you can run large, expensive simulations -- and you should run them -- to seek out specific predictable bottlenecks, but you can't conclusively rely on that. As a designer it's pretty easy to see the massive number of assumptions that necessarily go into a simulation. A beta test in real life is always your best friend.

However, when you go live with the real deal, you better have the tools in place so you can efficiently track down the unexpected problems when they occur, because they will. Those tools may include monitoring applications and graduated logging that can be turned on when required.
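A small sketch of what "graduated logging" might look like with Python's standard logging module (the signal-based toggle is one possible mechanism, and Unix-only):

```python
import logging
import signal

log = logging.getLogger("myservice")
logging.basicConfig(level=logging.WARNING)  # quiet by default in production

def enable_debug(signum, frame):
    """Flip to verbose logging on SIGUSR1 -- no restart, no redeploy."""
    log.setLevel(logging.DEBUG)
    log.warning("debug logging enabled by operator")

def disable_debug(signum, frame):
    log.setLevel(logging.WARNING)

signal.signal(signal.SIGUSR1, enable_debug)   # kill -USR1 <pid> to turn on
signal.signal(signal.SIGUSR2, disable_debug)  # kill -USR2 <pid> to turn off

# Hot paths can log freely: these calls cost almost nothing while the level
# is WARNING, and become a detailed trace the moment it drops to DEBUG.
log.debug("request %s routed to shard %s", "req-123", "db02")
```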

Originally posted by joshv:
Sure, Linden's response to this was always "but the game devs are different resources than the infrastructure guys" - right, but they are all still paid in dollars, and they all compete for the relatively limited attention of management.

So you'd fire the front-end developers (who are presumably well versed in your product) and hire operations guys that are probably going to take 3 months to become effective?

I took this to mean that the frontend developers were not versed in the backend architecture and vice versa, so unless everyone is cross-trained on the entire system, simply shuffling your development resources and dealing them all out to tackle a specific problem doesn't get it done any faster.

Great article. I dodged a horrific experience when management decided to scale our solution originally designed for a 10TB database in one location to one that is 150TB each in 15 locations. The director said money was not an issue... till they saw how much our Oracle vendor charged for a DB solution that could have feasibly worked.

As an ops guy myself I was particularly taken with the parts about operations - and the pages sent to the on-call team. This problem occurs often, and almost as often it's slow-tracked or simply ignored - until a major customer-visible outage appears and starts the finger-pointing exercise, which is damaging in its own ways.

I don't mean to slight the rest of the article, which is 10 other kinds of awesome, but that's the bit I felt personally.