Running Thousands of Drupal Websites at Scale

Rok Zlender is the manager of Site Factory Engineering at Acquia. Previously he was a developer, then Director of Technology, at Examiner.com, a news site that at the time (2010-2013) was the 40th largest site in the US, publishing around 3000 stories a day. Today, Rok is still managing sites at scale, working with Acquia customers who have hundreds or thousands of websites. Rok has been working with Drupal for nine years.

What’s the trick to managing thousands of sites?

Consistency and automation. You can always handle a site or two manually: you get all your scripts ready, you kick them off on one site, then a second, maybe a third. But when you get into conversations about dozens or hundreds of sites, you really need to have robust tools that allow you to make sure that your hardware can handle it, and that the operation will not fail on even one site. Maybe 99.9 percent will be fine, but if even one site fails, that’s not acceptable.

Yes, Site Factory enables site builders, developers, and digital managers to create, deploy, and manage hundreds of websites quickly and efficiently. It offers multisite tools for developers, and a platform and user experience that lets site builders create sites quickly without depending on developers. It also makes it easy to manage a Web stack and it works really well with Acquia Cloud.

So it’s a platform that helps customers with management and automation. The tools we’ve built are very robust and can scale up to thousands of sites.

Most customers have to do security releases or module updates on their sites, but when the numbers get high, it’s really important that you do it consistently across all of them. Because as soon as you have snowflakes (exceptions), those exceptions will come back to haunt you. You’ll need to fix them eventually, and at that point it will take you much longer, and require more resources. So it’s really important that you have a consistent environment across all of your sites.

That’s what Site Factory does. You commit your code in one place, kick off the deployment, and Site Factory will go in and execute the same steps on all of the sites. It will make sure that all of your code is on the newest version, and that all of the sites get the latest security update, so you don’t have to worry about that. If a step fails, it will retry a set number of times; if it still fails, we make sure we put the site back into a good state.
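The retry-then-restore behavior described here could be sketched roughly as follows. This is a minimal illustration, not Site Factory's actual code: `run_with_retry`, `restore`, and `max_attempts` are hypothetical names, and the default of 3 attempts is an assumption, since the interview does not give the real limit.

```python
from typing import Callable


def run_with_retry(step: Callable[[], None],
                   restore: Callable[[], None],
                   max_attempts: int = 3) -> bool:
    """Run `step`, retrying on failure; call `restore` if every attempt fails.

    Returns True if the step eventually succeeded, and False if the
    site had to be put back into its previous good state.
    """
    for _ in range(max_attempts):
        try:
            step()
            return True
        except Exception:
            # Hypothetical policy: treat any error as retryable.
            continue
    # All attempts failed, so hand control to the restore callback.
    restore()
    return False
```

A caller would pass one callable that performs the step on a site and another that restores that site, so the same wrapper can guard every step of a deployment.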

What kinds of customers have so many sites?

Warner Music, an Acquia customer, runs hundreds. One of the biggest pharmaceutical companies in the world, an Acquia customer, also runs hundreds of sites. We have one financial customer that is in the process of migrating thousands of sites onto Site Factory, but we have others that are running 60-70 sites. A number of major universities use Site Factory to manage multiple sites. Heartland Dental, a dental chain, has moved 600 sites to the Site Factory platform in the last six months. They are really cranking out the sites.

What tools do you use to work on this scale?

We built our own execution engine, basically, that runs all of our processes. So if we’re running a site update, for example, we have a process that says, “I want to push this code, I want to put the site into maintenance mode, I want to update the code on it, I want to update the database, I want to clear the cache, and I want to take the site out of maintenance mode.”

And it does that on all of the thousands of sites that you have, and it does it consistently and reliably. That’s why we built our execution engine, and we execute thousands, if not millions, of tasks when we do an update. It goes through all the sites, one by one, and updates them.
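As a rough sketch of that idea, the same ordered list of steps can be applied to every site in turn. The step names mirror the sequence described in the interview; `UPDATE_STEPS`, `update_all_sites`, and the `run_step` callback are hypothetical and stand in for whatever the real engine calls.

```python
from typing import Callable, Dict, List

# Step names follow the order given in the interview. They are labels
# only, not real Site Factory, Acquia Cloud, or Drush commands.
UPDATE_STEPS: List[str] = [
    "push_code",
    "enable_maintenance_mode",
    "update_code",
    "update_database",
    "clear_cache",
    "disable_maintenance_mode",
]


def update_all_sites(sites: List[str],
                     run_step: Callable[[str, str], None]) -> Dict[str, List[str]]:
    """Run every step, in order, on every site, one site at a time.

    Returns a mapping of site -> steps that completed, so a caller can
    spot any site that stopped partway through.
    """
    completed: Dict[str, List[str]] = {}
    for site in sites:
        completed[site] = []
        for step in UPDATE_STEPS:
            try:
                run_step(site, step)
            except Exception:
                # Stop working on this site and move on to the next one.
                break
            completed[site].append(step)
    return completed
```

Keeping the steps in one shared list is the point of the design: every site goes through an identical sequence, so no site becomes an exception. This sketch only records where a site stopped; it does not retry or restore anything.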

We do the same thing with cron, a daemon that executes commands at specified intervals. You can run cron on one, or two, or five sites manually. But running it on thousands of sites every one or two hours is what our execution engine can do. So let’s say we have five servers, and you want to run cron every hour. We can say we’ll run it at 20 percent utilization: at this minute five sites will run, at the next minute another five sites will run, and so on. We spread it across the hour. The customer doesn’t have to worry about that level of detail. The customer just has to say they want to run it every hour, and we spread it out efficiently.
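One simple way to picture this spreading is to give each site a minute offset within the hour, a fixed number of sites per minute. The sketch below assumes that model; `spread_cron`, `per_minute`, and `interval_minutes` are hypothetical names, and the real scheduler may weigh server load quite differently.

```python
from typing import Dict, List


def spread_cron(sites: List[str],
                per_minute: int,
                interval_minutes: int = 60) -> Dict[int, List[str]]:
    """Assign each site to a minute offset within the interval.

    Sites fill minute 0 first, then minute 1, and so on. If there are
    more sites than per_minute * interval_minutes, the assignment wraps
    around and some minutes receive more than `per_minute` sites.
    """
    if per_minute < 1 or interval_minutes < 1:
        raise ValueError("per_minute and interval_minutes must be positive")
    schedule: Dict[int, List[str]] = {}
    for index, site in enumerate(sites):
        minute = (index // per_minute) % interval_minutes
        schedule.setdefault(minute, []).append(site)
    return schedule
```

For example, 300 sites at five per minute would fill every minute of the hour, which matches the "five sites this minute, five the next" picture in the answer above.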

We also have an advantage because we’re running Site Factory on the Acquia Cloud platform. Everything on Site Factory runs on the Acquia Cloud. We just provide the tools to manage these high numbers of sites. So we use the Cloud API to execute some commands, we use Drush to execute other commands, and so on. We build our tools around the Acquia Cloud.

What it really comes down to is centralized management. You really can’t deviate from that. You’ll just get yourself into trouble, because then you’ll have a lot of different snowflakes that you have to maintain individually. So where our customers see real benefit is from one centralized platform that allows them to update all of their sites. They can be sure that all the sites are running the same version, and all of the modules are secure and updated.

How big is the Acquia Cloud Site Factory team?

We have twelve engineers, plus product management, quality assurance, documentation, and supporting systems. Most of the engineers on the team have been here a long time, and they are always thinking about how to run Drupal not on one site, but on many, many sites.

The thinking is different when you have one task that’s going to run on thousands of sites. And not one of them can fail.
