Generic scaling

“But one of the reasons we’re very lucky is our engineering team has selected to use PHP as the primary development language. That allows us to use a fairly generic server type. So we, with a couple of exceptions, have three main server types and run a fairly homogeneous environment, which allows us to then consolidate our buying power.”
—Jonathan Heiliger, VP Site Operations, Facebook (in an interview with Dan Farber)

I think homogeneous horizontal scaling (when possible) is a great idea for operations.

Facebook continues to add 250,000 new members a day; the most Tagged ever registered in a day was 600,000. Both use PHP. I can’t speak for Facebook, but Tagged has never needed more than 200 servers on the web tier.


7 thoughts on “Generic scaling”

How great is that?! “Only” 500 servers for 75,000 users?! I just calculated the numbers for my employer’s site, and in May we averaged exactly this number (sure, the peaks on Sunday might be much higher). We run our business on around 50 servers (but only 20 are really serving web pages with Apache/PHP). If I were Terry, I would find some swear words now, but as I’m not, I’ll only say that these guys should really consider switching to PHP.

Just wanted to clear up some confusion. 500 mongrels does _not_ equal 500 servers. We had ~32 mongrels running on each box, so it was closer to 15 instances running on EC2 for our app stack. And we only needed that number of servers during our peak, which was 20k new users an _hour_ (read the RightScale article). Each worker, however, is a different EC2 instance because our media analysis/render requirements are incredibly heavy. We’ve since ported our engine to work on the large 64-bit EC2 instances, so right now we need ~1/6 as many worker instances to process the same number of videos.

@leo Actually, as I pointed out in the mouseover text, I’d worry more about the load on the back end than about the performance issues on the front end. It seems that’s a much bigger fish to fry.

They might run into issues with the choice of Rails if the site gets really large and they start partitioning, or if the utilization on the back end goes down by about four orders of magnitude.

I doubt they’ll have to partition for a while, because my guess is that the files are very large (but stored in S3) while the database size is actually pretty reasonable (it just stores users and links to the S3 files).

@Stevie: First of all, thanks for the reply.

Point taken, I’m sorry for the error. I forgot people run multiple mongrels on a box. Doh! I equated mongrels with servers because they’re threaded and because the legend on the graph said “Mongrel App Servers.”

As for the peak-load comment, I already computed based on the 20k “peak” in the mouseover comments (yes, I did read the article). I got 7 pages/server-minute, which is now 224 pages/machine-minute according to your estimate of ~32 mongrels per box.

Tagged does around 40 “pages”/user-session (we’ll assume the user doesn’t create more than one session a day). Every “page” is at least two dynamic pages (page + status), but sometimes more (page + status + ajax, and we don’t count pages if the ajax requests come in too fast). This works out to about 700 dynamic pages/machine-minute. I took the most conservative numbers, so this is a lower-bound estimate: we don’t do cloud computing; we have machines way below 100% utilization; and I’m looking at the daily average, not the peak. (BTW, in my book 700 pages/machine-minute is pretty darn inefficient; c’mon, that’s like 3 processes with a quarter-second time to complete! But we do have some stuff under the old architecture which takes a majority of the CPU/memory even though it accounts for only a fraction of the page count, and I did say I was taking a conservative estimate.)
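The back-of-envelope arithmetic above can be checked with the other figures in this thread (the ~80 million pages/day and 160 front-end machines cited further down, and the two-dynamic-requests-per-page floor); a minimal sketch, assuming those numbers:

```python
# Rough throughput of Tagged's PHP front end, using figures from this thread:
# ~80M "pages"/day, 160 front-end machines, and a conservative floor of
# 2 dynamic requests per "page" (page + status).
pages_per_day = 80_000_000
dynamic_per_page = 2            # conservative lower bound
machines = 160
minutes_per_day = 24 * 60       # 1440

dynamic_per_machine_minute = (
    pages_per_day * dynamic_per_page / (machines * minutes_per_day)
)
print(round(dynamic_per_machine_minute))  # ~694, i.e. the "about 700" figure
```

At a quarter second per request, ~700 requests/machine-minute is roughly what 3 busy processes could serve, which is the point of the parenthetical above.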

The 224 pages/machine-minute is in the same ballpark. It assumes each session uses 1 server-minute which, assuming a half second response time on the server (ick!), means that each user downloads 120 dynamic pages in a session. I bet your average pages/session is no higher than six since you have to proxy through Facebook which prevents pages from being chatty.
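The two numbers in that paragraph fall out of simple arithmetic; a minimal sketch, assuming the ~32 mongrels/box and 7 pages/server-minute figures from earlier in the thread:

```python
# Sanity check on the 224 pages/machine-minute figure.
mongrels_per_box = 32
pages_per_mongrel_minute = 7    # earlier per-mongrel estimate
pages_per_machine_minute = pages_per_mongrel_minute * mongrels_per_box
print(pages_per_machine_minute)  # 224

# If a session really consumed a full server-minute at 0.5 s per dynamic
# page, each user would have to pull down this many pages per session:
seconds_per_page = 0.5
pages_per_session = 60 / seconds_per_page
print(pages_per_session)         # 120.0 -- far above a plausible ~6/session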

A 32x gap in front end performance sounds bad. But Moore’s law puts that only 7.5 years off. If you have a large variation between peak and off-peak loads (say…8x), then cloud computing on a Rails front end will reach standard site operations on a PHP front end in only three years. So I guess it all depends on your perspective.
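The Moore’s-law extrapolation above works out as follows, assuming the conventional doubling period of 18 months:

```python
import math

# Years for commodity hardware to close a 32x per-box performance gap,
# assuming Moore's-law doubling every 1.5 years.
gap = 32
doubling_years = 1.5
print(math.log2(gap) * doubling_years)            # 7.5

# With an 8x peak/off-peak swing absorbed by elastic cloud capacity,
# only a 32/8 = 4x gap remains:
effective_gap = gap / 8
print(math.log2(effective_gap) * doubling_years)  # 3.0
```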

(Then again, that assumes that PHP developers won’t be using compute clouds.)

@anon: Tagged is a second- or third-tier social network; did I ever say otherwise? That still means it has over 70 million registered users and pushes out an average of 80 million web pages a day. We don’t actually need the 160 machines we have on our front-end tier; we have them as a vestigial remnant of really bad code. (By the way, 160 is not that many. Facebook has over 10,000.)

But you knew that, which is why you posted anonymously: you’re afraid I’d pull up your site stats and embarrass you. I don’t know if you use PHP or not, but the CTO of Animoto doesn’t. The difference is, he has the balls to own up and provide constructive criticism of my numbers.

You, on the other hand, just look like the idiot you are. And the reason there isn’t a rush to defend you is that some people do this stuff for a living and know that there are only a handful of websites that push 80 million pages a day, and, here’s a tip, none of them are using Rails…

@Stevie: Sorry! 15 instances is not too bad, at least not for Ruby.
@terry: Well, I know I should read the mouse-over text, but my finger always clicks too fast.
I don’t know what these guys do on their machines, so I won’t go and say they’re stupid. I think it might just be hard to find a business model that turns a profit when you already burn nearly $10,000 a day on your EC2 bill. On the other hand, why optimize an algorithm that nobody uses? Now that they’ve found out they’re burning a lot of money, they’ll find a way to optimize it. Somebody I know said that stability and scaling come before speed…