Twitter: It's the end of the sysadmin as we know it

Grow some Ganglia

Speaking this morning at the annual Web 2.0 Expo in San Francisco, Twitter operations man John Adams warned sysadmins they won't succeed on today's intertubes unless they learn to do a bit more than system administration. Sysadmin 2.0, he said, must develop certain talents for analyzing data.

"This is a whole new world, " Adams said. "For the longest time, people ran large data systems on a kind of ad hoc basis. We're in a world now where so many people are depending on the real-time web. A system administrator is not just a system administrator anymore. You have to use analytics. You have to grab data. You have to look at where a site is trending and where things are going so you can scale...

"If you don't start doing this work early, if you don't start collecting this data early, you will fail."

That may be stating the obvious. But there you have it.

It's no secret that Twitter endured its own scaling problems in its earlier days, as the digerati embraced the Web2.0rhea service en masse. But in mid-2008, the company ported a portion of its core code from Ruby on Rails to Scala - a new-age programming language that combines functional and object-oriented techniques - and in 2009, according to net-research outfit comScore, the micro-blogger rode a 1,358 per cent traffic leap.

Adams believes the company has handled such growth in large part because it uses open source tools like Ganglia to track performance across the site's back-end infrastructure. Currently, Adams explained, he and his ops team track about 15,000 points of site performance.

"If you're collecting data, you want to look at the aggregate. It's about how many areas across the site where you see errors. A server isn't going to tell you much. It's about the overall application," Adams explained. "We do this in as near-real-time as possible."

Well, we would hope so.
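
For the curious, the aggregate view Adams describes might look something like the sketch below. It rolls per-server counters up into one site-wide error rate and counts how many hosts are seeing errors at all. The function name, host names, and figures are all invented for illustration; Ganglia's own data format is not shown here.

```python
def aggregate_errors(samples):
    """Roll per-host (requests, errors) counters into a site-wide view.

    `samples` is a hypothetical dict of host name -> (requests, errors).
    Returns (overall error rate, number of hosts reporting any errors).
    """
    total_requests = sum(requests for requests, _ in samples.values())
    total_errors = sum(errors for _, errors in samples.values())
    hosts_with_errors = sum(1 for _, errors in samples.values() if errors > 0)
    rate = total_errors / total_requests if total_requests else 0.0
    return rate, hosts_with_errors


# Made-up sample data: one host looks clean, but the site as a whole does not.
samples = {"web1": (1000, 5), "web2": (1000, 0), "web3": (2000, 15)}
rate, hosts = aggregate_errors(samples)
print(f"site error rate {rate:.2%} across {hosts} hosts with errors")
```

Looking at web2 alone would suggest all is well, which is roughly Adams's point about a single server not telling you much.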

Adams acknowledged that Twitter has endured its fair share of performance problems - and in the past, the company has freely admitted that its original infrastructure was ill-suited to rapid scaling - but he insists that careful preparation got the company to where it is today. "In the first year, the site was kind of unstable, but what we were doing was trying to ramp up to a position where we are planning the future."

Such data tracking, he said, helped sidestep the so-called Twitpocalypse, the Y2K of Web2.0rhea. Many suspected that third-party clients would start failing when the unique identifier attached to each Tweet reached first the signed integer limit (2^31) and then the unsigned limit (2^32). Mining site data, Adams said, his crew accurately predicted when it would hit these limits - which meant they knew just how soon they needed to make any infrastructure changes.
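
That sort of forecast comes down to extrapolation. Here is a rough Python sketch, assuming IDs grow at a steady rate between two observations. The function name and the sample figures are invented for illustration; they are not Twitter's numbers, and a real forecast would have to allow for accelerating growth. Note that the largest values that actually fit are 2^31 - 1 (signed) and 2^32 - 1 (unsigned).

```python
from datetime import datetime, timedelta

SIGNED_32_MAX = 2**31 - 1    # largest signed 32-bit integer
UNSIGNED_32_MAX = 2**32 - 1  # largest unsigned 32-bit integer


def predict_crossing(t0, id0, t1, id1, limit):
    """Linearly extrapolate when an ID counter will reach `limit`.

    (t0, id0) and (t1, id1) are two hypothetical observations of the
    highest ID issued so far, with t1 later than t0.
    """
    rate = (id1 - id0) / (t1 - t0).total_seconds()  # IDs per second
    if rate <= 0:
        raise ValueError("IDs must be increasing between observations")
    return t1 + timedelta(seconds=(limit - id1) / rate)


# Made-up observations: 100 million new IDs in ten days.
t0, id0 = datetime(2009, 1, 1), 1_000_000_000
t1, id1 = datetime(2009, 1, 11), 1_100_000_000
print("signed limit reached:  ", predict_crossing(t0, id0, t1, id1, SIGNED_32_MAX))
print("unsigned limit reached:", predict_crossing(t0, id0, t1, id1, UNSIGNED_32_MAX))
```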

Over the past six months, he said, the site's most important infrastructure upgrade was the adoption of the Ruby on Rails application server known as Unicorn. The company still uses Ruby on the front-end, and Unicorn has provided an estimated 30 per cent performance improvement over the previous setup, based on Apache and a server called Mongrel.

Adams compares Unicorn to a grocery store where a single line feeds a row of cashiers - as opposed to the traditional setup where each cashier has a separate line. With Mongrel, each worker had its own request queue. With Unicorn, there's a single request queue, and requests are fed to workers as they become available.

"Ordinarily, if you're standing in line at a grocery store, you don't have any idea how quickly each cashier is going to move people through. You have ten random lines. You stand in one line. And you have no idea how long you're going to wait," he said. "The other model - which pretty much describes Unicorn - is where you have one line, and when a cashier is done, it grabs the next person from that one line. This creates a very rapid response."

He added that the 30 per cent performance boost from Unicorn affords only a "few more days of scaling" on a site like Twitter - but, he says, "every small amount helps."
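
The grocery-store model is easy enough to try at home. The toy Python simulation below compares the two set-ups for a batch of requests that all arrive at once. It is a sketch under stated assumptions, not a model of Mongrel or Unicorn internals: the per-worker version hands out requests round-robin (the article does not say how they were assigned), and the function names and service times are made up.

```python
import heapq


def per_worker_queues(service_times, workers):
    """Average wait when each worker has its own queue.

    Requests are assigned round-robin (an assumption) and each worker
    drains its own queue in order.
    """
    free_at = [0.0] * workers  # time at which each worker's queue empties
    waits = []
    for i, service in enumerate(service_times):
        w = i % workers
        waits.append(free_at[w])  # wait behind everything already queued
        free_at[w] += service
    return sum(waits) / len(waits)


def single_queue(service_times, workers):
    """Average wait when one shared queue feeds whichever worker is free."""
    free_at = [0.0] * workers
    heapq.heapify(free_at)  # min-heap of times at which workers become free
    waits = []
    for service in service_times:
        start = heapq.heappop(free_at)  # earliest available worker
        waits.append(start)
        heapq.heappush(free_at, start + service)
    return sum(waits) / len(waits)


# One slow request followed by five quick ones, two workers.
times = [10, 1, 1, 1, 1, 1]
print("separate lines:", per_worker_queues(times, 2))
print("single line:   ", single_queue(times, 2))
```

With these figures, the quick requests stuck behind the slow one drag the separate-lines average well above the single-line average. Other workloads will give different gaps.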

Adams's other big piece of advice for aspiring Sysadmins 2.0 was to avoid - as much as possible - pulling data from disk. "Another discovery that we made when trying to increase the scale of Twitter was that disk is the new tape," he said. "With any sort of social networking operation - juggling followers, sending mail, etc. - disk is extremely slow."

Twitter has worked around the disk problem with, yes, memcached, the open source distributed-memory object-caching system. But he also warns that there's such a thing as too much memcached. "You can't rely too heavily on memcached," he said. "If you put too much data into memcached, you enter this problem where if the memcached server goes down, you have to take all the data out of your database and reinsert it into memcached. You want to cache in a balanced way."
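
The failure mode Adams describes can be shown with a minimal cache-aside sketch in Python. Plain dictionaries stand in for memcached and the database here, and the class and method names are hypothetical; no real memcached client is used, and this is not Twitter's code.

```python
class CacheAside:
    """Read from the cache first and fall back to the database on a miss."""

    def __init__(self, database):
        self.database = database  # dict standing in for the database
        self.cache = {}           # dict standing in for memcached
        self.db_reads = 0         # counts how often the slow path is taken

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1          # cache miss: go to the database
        value = self.database[key]
        self.cache[key] = value     # repopulate the cache for next time
        return value

    def lose_cache(self):
        """Simulate the cache server going down and coming back empty."""
        self.cache.clear()


store = CacheAside({"alice": 120, "bob": 45})
store.get("alice")
store.get("alice")   # served from the cache
store.lose_cache()
store.get("alice")   # cold cache: back to the database
print("database reads:", store.db_reads)
```

The more of the working set lives only in the cache, the more database reads pile up at once after a restart, which is one reading of Adams's call to "cache in a balanced way."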

In addition to memcached, MySQL, and the open-source distributed database Cassandra, Twitter leans on a message-queue server known as Kestrel and a kind of follower database known as FlockDB. Both were developed at the company, and both have now been open sourced.

In one sense, Adams and his team exhibit a certain Google-like quality. They believe in data. But their Googleness goes only so far. They're also committed to open sourcing their back end. ®