Stackdive: the evolution of Wayfair’s stack

Jack and I are both long-time software guys who now spend somewhat less of our time thinking about what to build, and somewhat more about how to keep a large number of systems running well. The emergent DevOps culture of the last few years has made it easy for people like us to move between these worlds. Are they really separate worlds anymore? In 2009 John Allspaw and Paul Hammond, heads at the time of ops and development, respectively, at Flickr, gave an influential talk at Velocity called “10+ deploys per day: Dev and Ops Cooperation at Flickr”. It’s about their deploy tool and IRC bots, sure, but it’s also about culture, and especially how to get dev and ops teams out of adversarial habits and into a productive state, with a combination of good manners and proper tooling. We have cribbed quite a bit from that family of techniques since 2010, but Wayfair has had an evolution of stack and culture that’s distinctive, and we’re going to try to give a close-up picture of it.

Let’s start with a brief overview of the customer-facing Wayfair stack over the years since our founding. We’re going to stick to the customer-facing systems, because although there is a lot more to Wayfair tech than what you see here, we’re afraid this will turn into a book if we don’t bound it somehow.

Early 2002 (founding, in a back room of Steve Conine’s house): Yahoo Shopping for about 5 minutes, then ASP + Microsoft Access.

Late 2002: ASP + SQL Server

The middle years: Forget about the tech stack for a while, add VSS (Windows source control) relatively early. Hosting goes from Yahoo Shopping, to Hostway shared, to Hostway dedicated, to Savvis managed, to a Savvis cage (now CenturyLink) in Waltham. Programmers focus on building a custom RDBMS-backed platform supporting the business up to ~$300M in sales…

2010: Add Solr for search because database-backed search was slow, small experiments with Hadoop, Netezza, Seattle data center. Yes, you read that right. 2010. 8. years. later. That’s what happens when you’re serious about the whole lean-startup/focus-on-the-business/minimum-viable-stack thing. But now we’ve broken the seal.

SiteSpect has been our A/B testing framework from very early on, and at this point we have in-line, on-premise SiteSpect boxes between the perimeter and the php-fpm servers in all three data centers.

Here’s a boxes-and-arrows diagram of what’s in our data centers, with a ‘Waltham only’ label on the things that aren’t duplicated, except for disaster recovery purposes. Strongview is an email system that we’re not going to discuss in depth here, but it’s an important part of our system. HP Tableau is a dashboard thing that is pointed at the OLAP-style SQL Servers and Vertica.

Why did we move from ASP to PHP, and why not to .NET? That’s one of the most fascinating things about the whole sequence. Classic ASP worked for us for years, enabling casual coders, as well as some hard-core programmers who weren’t very finicky about the elegance of the tools they were using, to be responsive to customer needs and competitor attacks. Of course there was a huge pile of spaghetti code: we’ll happily buy a round of drinks for the people with elegant architectures who went out of business in the meantime!

But by 2008 or so, classic ASP had started to look like an untenable platform, not least because we had to make special rules for certain sensitive files: we could only push them at low-traffic times of day, which was becoming a serious impediment to sound operations and developer productivity. Microsoft was pushing ASP.NET as an alternative, and on its face that is a natural progression. We gave it a try. We found it to be very unnatural for us, a near-total change in tech culture, in the opposite direction from where we wanted to go: expensive tools, laborious build process, no meaningful improvement in the complexity of calling library code from scripts, etc., etc. We eventually found our way to PHP, which, like ASP, allowed web application developers to do keep-it-simple-stupid problem solving, while also letting us rationalize caching and move our code deployment and configuration management into a good place.

In the early days of Wayfair, when there was not even version control, coders would ftp ASP scripts to production. That’s a simple model with a fire-and-forget feel that is very pleasant for an application developer. Something goes wrong? Redeploy the old version, work up a fix, and fire again, with no server lifecycle management to worry about. It is a lot easier to write a functional tool to deploy code when you don’t have to do a lot of server lifecycle management, as you do with Java, .NET, and in fact most platforms. So we got that working for PHP on FreeBSD, but soon applied it to ASP, Solr, Python and .NET services, and SQL Server stored procedures or ‘sprocs’. Obviously, by that point we had had to figure out server lifecycles after all, but it’s hard to overstate the importance of the ease of getting to a simple system that worked. The operational aspects of php-fpm were a great help in that area.

The core of Wayfair application development is what it has always been: pushing code and data structures independently, and pushing small pieces of code independently from the rest of the code base. It’s just that we’re now doing it on a gradually expanding farm of more than a thousand servers that spans three data centers in Massachusetts, Washington state, and Ireland.
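
To make the fire-and-forget idea concrete, here is a minimal sketch in Python of pushing a single file and rolling it back. It is not our deploy tool: the names `deploy_file` and `rollback_file` and the `.prev` backup suffix are hypothetical, and it assumes a plain file copy on one host rather than a push across a server farm.

```python
import shutil
from pathlib import Path
from typing import Optional


def deploy_file(src: Path, dest: Path) -> Optional[Path]:
    """Push one file to its live location, keeping the old copy for rollback.

    Returns the path of the saved previous version, or None on a first deploy.
    (Hypothetical helper for illustration only.)
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    backup = None
    if dest.exists():
        # Keep exactly one previous version next to the live file.
        backup = dest.with_name(dest.name + ".prev")
        shutil.copy2(dest, backup)
    # Copy to a temporary name first, then swap it in with a rename,
    # so readers never see a half-written file.
    tmp = dest.with_name(dest.name + ".tmp")
    shutil.copy2(src, tmp)
    tmp.replace(dest)
    return backup


def rollback_file(dest: Path) -> bool:
    """Put the saved previous version back in place, if there is one."""
    backup = dest.with_name(dest.name + ".prev")
    if not backup.exists():
        return False
    backup.replace(dest)
    return True
```

The point of the sketch is how little state there is: one file, one saved copy, and no server process to restart.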

It’s funny. Microservices are all the rage right now, and I was speaking with a microservices luminary a couple of weeks ago. I described how we deploy PHP files to our services layer, either by themselves, or in combination with complementary front-end code, data structures, etc., and theorized that as soon as I had a glorified ftp of all that working in 2011, I had a microservices architecture. He said something like, “Well, if you can deploy one service independently of the others, I guess that’s right.” Still, I wouldn’t actually call what we have a microservices architecture, or an SOA, even though we have quite a few services now. On the other hand, there’s too much going on in that diagram for it to be called a simple RDBMS-backed web site. So what is it? When I need a soundbite on the stack, I usually say, “PHP + databases and continuous deployment of the broadly Facebook/Etsy type. With couches.” So that’s a thing now.

Let’s dig in on continuous deployment, and our deploy tool. Here’s a chart of all the deploys for the last week, one bar per day, screenshot from our engineering blog’s chrome:

Between 110 and 210 per day, Monday to Friday, stepping down through the week, and then a handful of fixes on the weekend. What do those numbers really mean in the life of a developer? There’s actually some aggregation behind the numbers in this simple histogram. We group individual git changesets into batches, and then test them and deploy them, with zero downtime. The metric in the histogram is changesets, not ‘deploys’. Individual changesets can be deployed one by one, but there’s usually so much going on that the batching helps a lot. Database changes are deployed through the same tool, although never batched. The implementation of what ‘deploy’ means is very different for a php/css/js file on the one side, and a sproc on the other, but the developer’s interface to it is identical. Most deploys are pretty simple, but once in a while, to do the zero downtime thing for a big change, a developer might have to make a plan to do a few deploys, along the lines of: (1) change database adding new structures, (2) deploy backwards-compatible code, (3) make sure we really won’t have to go back, (4) clean up tech debt, (5) remove backwards-compatible database constructs. From the point of view of engineering management, the important thing is to allow the development team to go about their business with occasional check-ins and assistance from DBAs, rather than a gating function.
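
The batching rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool itself: the `Changeset` type, the `plan_batches` function, the `"code"`/`"db"` labels and the `max_batch` limit are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Changeset:
    sha: str   # hypothetical identifier for a git changeset
    kind: str  # "code" (php/css/js) or "db" (sproc or schema change)


def plan_batches(queue: List[Changeset], max_batch: int = 5) -> List[List[Changeset]]:
    """Group queued changesets into deploy batches, preserving queue order.

    Code changesets are grouped, up to max_batch per batch.
    A database changeset is never batched: it always deploys alone.
    """
    batches: List[List[Changeset]] = []
    current: List[Changeset] = []
    for cs in queue:
        if cs.kind == "db":
            # Flush any pending code batch so ordering is preserved,
            # then give the database change a batch of its own.
            if current:
                batches.append(current)
                current = []
            batches.append([cs])
        else:
            current.append(cs)
            if len(current) == max_batch:
                batches.append(current)
                current = []
    if current:
        batches.append(current)
    return batches
```

With a queue of two code changesets, one sproc change and one more code changeset, and `max_batch=2`, this yields three batches: the first two code changes together, the sproc alone, then the last code change. The developer submits every changeset the same way; only the planner treats the two kinds differently.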

Memcached and Redis are half-way house caches and storage for simple and complex data structures, respectively, but what about MongoDB and MySQL? Great question. In 2010 we launched a brand new home-goods flash-sale business called Jossandmain.com. We outsourced development at first, and the business was a big success. We went live with an in-house version a year later, in November, 2011. Working with key-value stores that have sophisticated blob-o-document storage-and-retrieval capabilities has been a fun thing for developers for a while now. It had the freshness of new things to us in 2011. There were no helpful DAOs in our pile of RDBMS-backed spaghetti code at the time, so the devs were in the classic mode of having to think about storage way too often. Working with medium-sized data structures (a complex product, an ‘event’, etc.) that we could quickly stash in a highly available data store felt like a big productivity gain for the 4-person team that built that site in a few months. So why didn’t we switch the whole kit and caboodle over to this productivity-enhancing stack? First of all, we’re not twitchy that way. But secondly, and most importantly, what sped up new feature development had an irritating tendency to slow down analysis. And those document/k-v databases definitely slow you down for analysis, unless you have a large number of analysts with exotic ETL and programming skills. We love how MongoDB has worked out for our flash sale site, but as we extrapolated the idea of adopting it across the sites that use our big catalogue, we foresaw quagmire. By 2011, we were a large enough business that a big hit to analyst productivity was much worse than a small cramp in the developers’ style.

Around the same time, we began to experiment with moving some data that had been in SQL Server into MySQL and MySQL Cluster. The idea was to cut licensing cost and remove some of the cruft that had accumulated in our sproc layer. We have since backed off that idea, because after a little experimentation it began to seem like a fool’s errand. We would have been moving to a database with worse join algorithm implementations and a more limited sproc language, which in practice would have meant porting all our sprocs to application code, a huge exercise of zero obvious benefit to our customers. Since the sprocs are already part of the deployment system, the only compensation besides licensing cost would have been increased uniformity of production operations, which would have been standardized on Linux, but in the end we did not like that trade-off.

Wow! Stored procedures along with application code, colo instead of cloud. We’re really checking all the boxes for Luddite internet companies, aren’t we!? I can’t tell you how many times I’ve gotten a gobsmacked look and a load of snark from punks at meetups who basically ask me how we can still be in business with those choices.

Let’s take these questions one at a time. First, the sprocs. When we say sproc, of course, we mean any kind of code that lives in the DBMS. In SQL Server, these can be stored procedures, triggers, or functions. We also have .NET assemblies, which you can call like a function from inside T-SQL. Who among us programmers does not have a visceral horror for these things? I know I do. The coding languages (T-SQL, PL/SQL and their ilk) are unpleasant to read and write, and in many shops, the deployment procedures can be usefully compared to throwing a Molotov cocktail over a barbed-wire fence, to where an angry platoon of DBAs will attempt to minimize the damage it might do. Not that we don’t have deployment problems with sprocs once in a while, but they’re deploy-tool-enabled pieces of code like anything else, and the programmers are empowered to push them.

Secondly, the cloud. If we were starting Wayfair today, would we start it on AWS or GCP? Probably. But Wayfair grew to be a large business before the cloud was a thing. Our traffic can be spiky, but not as bad as sports sites or the internet as a whole. We need to do some planning to turn servers on ahead of anticipated growth, particularly around the holiday season, but it’s typically an early month of the new year when our average traffic is above the peak for the holidays of the previous year, so we don’t think we’re missing a big cost-control opportunity there. Cloud operations certainly speed up one’s ability to turn new things on quickly, but the large-scale operations who make that economical typically have to assign a team to write code that turns things *off*. One way or the other, nobody avoids spending some mindshare to nanny the servers, unless they don’t care about the unit economics of their business. In early startup mode, that’s often fine. Where we are? Meh. It’s a problem, among many others, that we throw smart people at. We think our team is pretty efficient, and we know they’re good at what they do. What is the Wayfair ‘cloud’, which is to say that thing that allows our developers to have the servers they need, more or less when they need them? It looks something like this:

We’re afraid of vendor lock-in, of course, with some of the hardware, which we mostly buy and don’t rent:

But the gentleman on the right makes sure we get good deals.

That’s it for now. Another day, we’ll dig in on the back end for all this.