A lot has happened since my first article on the Stack Overflow Architecture. Contrary to the theme of that last article, which lavished attention on Stack Overflow's dedication to a scale-up strategy, Stack Overflow has both grown up and out in the last few years.

Stack Overflow has grown up by more than doubling in size to over 16 million users and nearly sextupling its traffic to 95 million page views a month.

Stack Overflow has grown out by expanding into the Stack Exchange Network, which includes Stack Overflow, Server Fault, and Super User for a grand total of 43 different sites. That's a lot of fruitful multiplying going on.

Just More. More users, more page views, more datacenters, more sites, more developers, more operating systems, more databases, more machines. Just a lot more of more.

Linux. Stack Overflow was known for its Windows stack; now it runs many more Linux machines for HAProxy, Redis, Bacula, Nagios, logging, and routing. All support functions seem to be handled by Linux, which has required developing parallel release processes.

Fault Tolerance. Stack Overflow is now served by two different switches on two different internet connections; they've added redundant machines, and some functions have moved to a second datacenter.

NoSQL. Redis is now used as a caching layer for the entire network. There wasn't a separate caching tier before, so this is a big change, as is using a NoSQL database on Linux.

Unfortunately, I couldn't find any coverage on some of the open questions I had last time, like how they were going to deal with multi-tenancy across so many different properties, but there's still plenty to learn from. Here's a roll up of a few different sources:

CDN: none. All static content is served off sstatic.net, a fast, cookieless domain intended for static content delivered to the Stack Exchange family of websites.

Developers and System Administrators

14 Developers

2 System Administrators

Content

License: Creative Commons Attribution-Share Alike 2.5 Generic

Standards: OpenSearch, Atom

Host: PEAK Internet

More Architecture and Lessons Learned

HAProxy is used instead of Windows NLB because HAProxy is cheap (free, in fact), easy, and works great as a 512MB VM "device" on the network via Hyper-V. It also sits in front of the boxes, so it's completely transparent to them, and it's easier to troubleshoot as a separate networking layer instead of being intermixed with all your Windows configuration.
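To make the "transparent device in front of the boxes" idea concrete, here is a minimal HAProxy configuration sketch. The backend names, IPs, and health-check path are invented for illustration; the actual Stack Overflow configuration is not public.

```
# Hypothetical minimal haproxy.cfg: balance HTTP across two web servers.
frontend http-in
    bind *:80
    default_backend web_servers

backend web_servers
    balance roundrobin
    option httpchk GET /ping        # assumed health-check endpoint
    server web1 10.0.0.11:80 check  # placeholder addresses
    server web2 10.0.0.12:80 check
```

The web servers behind the proxy need no changes at all, which is the transparency being described: the balancer is just another hop at the network layer.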

A CDN is not used because even "cheap" CDNs like Amazon's are very expensive relative to the bandwidth bundled into their existing host's plan. Based on Amazon's CDN rates and their bandwidth usage, the least they could pay is $1K/month.

Backup is to disk for fast retrieval and to tape for historical archiving.

Full-text search in SQL Server is very badly integrated and buggy, so they moved to Lucene.

They are mostly interested in peak HTTP request figures, since peak load is what they need to be sure they can handle.

All properties now run on the same Stack Exchange platform. That means Stack Overflow, Super User, Server Fault, Meta, WebApps, and Meta Web Apps are all running on the same software.

Redis is so fast that the slowest part of a cache lookup is the time spent reading and writing bytes to the network.

Values are compressed before sending them to Redis. They have plenty of CPU and most of their data are strings so they get a great compression ratio.
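The compress-before-caching trade described above (spend spare CPU to shrink bytes on the wire) can be sketched as a pair of helpers wrapped around the cache calls. This is an illustration in Python using zlib, not their actual C# implementation; the `pack`/`unpack` names and the sample page are invented.

```python
import zlib

def pack(value: str) -> bytes:
    """Compress a string value before caching: trades CPU for network bytes."""
    return zlib.compress(value.encode("utf-8"))

def unpack(blob: bytes) -> str:
    """Reverse of pack: decompress a cached blob back into a string."""
    return zlib.decompress(blob).decode("utf-8")

# With a Redis client the calls would be roughly:
#   r.set(key, pack(rendered_html))
#   html = unpack(r.get(key))

# Repetitive markup (the common case for rendered pages) compresses very well.
page = "<ul>" + "<li>question summary</li>" * 200 + "</ul>"
blob = pack(page)
ratio = len(blob) / len(page)
```

Because most cached values are strings full of repeated markup, the compression ratio is excellent, which is exactly the point being made about having CPU to spare.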

The CPU usage on their Redis machines is 0%.

Global cache: shared amongst all sites and servers.

Inboxes, API usage quotas, and a few other truly global things live here

This resides in Redis (in database 0, kept separate for easier debugging).

Most items in the cache expire after a timeout period (a few minutes usually) and are never explicitly removed. When a specific cache invalidation is required they use Redis messaging to publish removal notices to the "L1" caches.
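The pattern above (short TTLs plus published removal notices for the rare explicit invalidation) can be sketched in a few lines. This is an in-process simulation for illustration: the `InvalidationBus` class stands in for Redis pub/sub, and all names are invented.

```python
import time

class L1Cache:
    """Per-server in-memory cache; entries expire after a short TTL."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)

    def get(self, key):
        hit = self.store.get(key)
        if hit is None:
            return None
        value, expires_at = hit
        if time.monotonic() > expires_at:
            del self.store[key]  # lazy expiry on read; never swept explicitly
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

    def invalidate(self, key):
        self.store.pop(key, None)

class InvalidationBus:
    """Stand-in for Redis pub/sub: fan out removal notices to every L1."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, cache):
        self.subscribers.append(cache)

    def publish_removal(self, key):
        for cache in self.subscribers:
            cache.invalidate(key)

# Two web servers each hold their own L1 copy of the same key.
bus = InvalidationBus()
web1, web2 = L1Cache(), L1Cache()
bus.subscribe(web1)
bus.subscribe(web2)
web1.set("question:42", "cached html")
web2.set("question:42", "cached html")

# One removal notice evicts the key from every server's L1 at once.
bus.publish_removal("question:42")
```

The appeal of the design is that the common case (expiry) costs nothing, and the messaging machinery is only exercised when a write genuinely must be seen everywhere immediately.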

Joel Spolsky is not a Microsoft loyalist, he doesn't make the technical decisions for Stack Overflow, and he considers Microsoft licensing a rounding error. Consider yourself corrected, Hacker News commenter.

For their I/O system they selected a RAID 10 array of Intel X25 solid state drives. The RAID array eased any concerns about reliability, and the SSDs performed very well in comparison to FusionIO at a much lower price.

The full-boat cost for their Microsoft licenses would be approximately $242K. Since Stack Overflow is in BizSpark they are paying nowhere near the full sticker price, but that's the maximum they could pay.

Reader Comments (15)

Did they explain why they use Redis instead of memcached for caching? I've heard of quite a few people using Redis for cache; I just wondered what Redis does that memcached doesn't.

If I remember correctly Redis is not a distributed database, right? With memcached if I add new nodes the client will automatically redistribute the cache to take advantage of the additional capacity. Redis doesn't do that. So why Redis?

One of the advantages of using something like Redis or Membase instead of memcached is that the cache can be persisted to disk. This can avoid the cache-storm issue if the cache goes offline and is then brought back up.

I guess what we don't know is what configuration the Redis boxes are in e.g. are they sharding, doing master/slave replication etc.
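The automatic redistribution the earlier commenter describes is usually done client-side with a consistent-hash ring, so that adding a node remaps only a fraction of the keys rather than all of them. Here is a minimal sketch of that idea for illustration; it is not any particular memcached client's implementation, and the node names are invented.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring of the kind memcached clients use.

    Each node is placed on the ring many times ("virtual nodes") so that
    keys spread evenly and adding a node steals load from all the others.
    """
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def node_for(self, key):
        # The key belongs to the first virtual node clockwise from its hash.
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

# Adding a fourth node moves only a minority of keys to the new node;
# everything else keeps hitting its old cache server.
ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"question:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add("cache-d")
moved = sum(1 for k in keys if ring.node_for(k) != before[k])
```

Whether Stack Overflow needs this at all depends on the unanswered question above: a single big Redis box with 0% CPU has no redistribution problem to solve.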

James: backing up to tape means offline/archival backup. This is often worth the expense and hassle, especially for a large important dataset. After the issues a week or three ago, I can tell you that the Gmail guys are *very* glad they backed up to tape. If all your replicas are online, there's always the possibility that a single bug or slip of the fingers can wipe them simultaneously.

@Sosh - Please take it easy and don't elevate yourself in support of Microsoft products. There is no technical reason to run MS stuff among the best and latest of open-source companies and their communities. In fact to really drive this point, the StackOverflow team should be using more *paid/licensed* MS products everywhere to drive their point home. There is also the perspective of using the best combination of tools for the job, so points there. The answer is really simple: the StackOverflow team knows MS products, Visual Studio, C# and .NET, therefore it was cheapest and fastest (for this team) to deliver the StackExchange family of sites.