Server Fault and LOPSA have a lot in common. Both are communities of system administrators, and both are committed to advancing the state of the art in IT. Both are committed to system administration as a whole, not just “Linux admins”, “Windows admins”, “network admins”, etc.

I’ve only been a Server Fault member for a little while, but I have already gotten great value from the community there. I’ve learned some technical things (my Windows-fu really sucks), and most importantly, I’ve learned more about what I would call “new school” system administration and new ways to work with users and their community.

Kyle Brandt is one of the administrators who works behind the scenes to keep Server Fault up and running smoothly, and he also writes about his experiences at the Server Fault Blog.

Server Fault will be holding a one-day conference for system administrators and operations people this October called Scalability. Check out http://scalability.serverfault.com/ for details!

Kyle was kind enough to take some time from his busy schedule to answer some questions about what it is like to manage such a large and busy system, one that serves a community that can be rather demanding at times.

How did you become interested in Server Fault?

I discovered Server Fault when I was listening to the old Stack Overflow podcast. I have always valued participating online through blogging and forums. After using Server Fault, the system made a lot more sense to me than the traditional forum. I also really appreciated that Server Fault does not cater to any particular operating system or networking platform, but is for all system administrators. The most appealing aspect of Server Fault after using it for a little while was that people tended to have a more respectful, open attitude than I was used to seeing on IRC and mailing lists.

What’s your background? Especially in system administration? In programming? How did you get your start in the business?

I actually studied classical double bass at the New England Conservatory and the Cleveland Institute of Music, so system administration and programming are not my formal background. I have always enjoyed hacking on computers, though, since grade school. I started out on a Tandy 1000 computer and dial-in BBSs. I then started playing around with Linux a lot in the mid-90s. During college I also did a couple of help desk jobs. After graduating college I felt computing would be a better career choice for me, so I spent a summer reading networking books, scripting, and learning a couple of programming languages to fill in my knowledge. I started blogging to show that I was dedicated and to try to balance out a mostly empty resume. From there I was able to get an entry-level Linux support position and made sure I kept studying and improving my skill set.

What’s it like to admin such a busy site? Does the fact that it caters to system administrators make your job easier or harder? How many people (in what roles) does it take to keep everything running?

Stack Exchange has a lot of great programmers. Because of this, administering our sites isn’t as hard or scary as people might think (at least, so far). Being a system administrator for a system administration site is one of the best parts of this job for me. I get to meet a lot of great sysadmins and learn a lot from them. If anything it might make the job easier, because in a way Server Fault is like a big IT department with a lot of people to ask for advice and bounce ideas off of (http://blog.serverfault.com/post/why-participate-on-server-fault/). Since we are open about a lot of what we are doing, it also helps keep us honest.

My days are pretty varied. I work from my home in Boston, so I often start work pretty early in the morning. On the technical side I have to deal with networking, Windows, Linux, backups, monitoring: the usual, but as a generalist there is a lot of ground to cover. Every now and again things get crazy too, so some days are just spent putting out fires. I also write on the Server Fault blog when I can and try to keep up with Server Fault itself. This October, we are having a one-day conference for system administrators and operations people called Scalability (scalability.serverfault.com), so I am trying to find good speakers and figure out what the schedule should be. All of these things make for a pretty varied and interesting schedule on the whole.

Where did the original (and current) code come from, and what are the basic software technologies that are used?

What can you share about the hosting architecture? How many servers, and are they in the cloud, co-location, or a private data center? What kind of storage are you using and what kind of network bandwidth do you have?

We use a co-location facility in Manhattan provided by Peer 1 for our primary data center, and we have a backup co-location facility in Corvallis, OR, provided by Peak Internet, that hosts our chat service. We don’t have anything in the cloud that is part of our core business. The only exception is that we have switched to a CDN for our static content, if you consider that cloud computing. Our Internet bandwidth is now only about 30-40 Mbit/s after moving our static content. Internally, we recently discovered that our network traffic follows a microburst pattern, which makes traditional measurements like SNMP bandwidth counters underestimate how hard we push our L2 network. I did a blog post about this (http://blog.serverfault.com/post/per-second-measurements-dont-cut-it/).
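(Editor’s illustration: the microburst point can be shown with a small simulation. This is only a sketch under made-up assumptions: a hypothetical 1 Gbit/s link, and an invented traffic trace in which the link runs at line rate for 20 ms of every second and is idle otherwise. None of these numbers come from Stack Exchange. It compares the utilization a counter polled once per second would report with the utilization seen at 1 ms resolution.)

```python
# All figures below are illustrative assumptions, not measurements.
LINK_BITS_PER_SEC = 1_000_000_000          # assumed 1 Gbit/s link
SLOTS_PER_SECOND = 1000                    # 1 ms time slots
BYTES_PER_SLOT_AT_LINE_RATE = LINK_BITS_PER_SEC // 8 // SLOTS_PER_SECOND


def build_trace(seconds=5, burst_slots=20):
    """Bytes sent per 1 ms slot: a short line-rate burst each second, then idle."""
    trace = []
    for _ in range(seconds):
        trace.extend([BYTES_PER_SLOT_AT_LINE_RATE] * burst_slots)
        trace.extend([0] * (SLOTS_PER_SECOND - burst_slots))
    return trace


def utilization(trace, window_slots):
    """Link utilization per window, as seen by polling a byte counter every window."""
    capacity = BYTES_PER_SLOT_AT_LINE_RATE * window_slots
    return [
        sum(trace[i:i + window_slots]) / capacity
        for i in range(0, len(trace), window_slots)
    ]


trace = build_trace()
per_second = utilization(trace, SLOTS_PER_SECOND)   # coarse, counter-style view
per_millisecond = utilization(trace, 1)             # fine-grained view

print(f"peak utilization at 1 s resolution:  {max(per_second):.0%}")
print(f"peak utilization at 1 ms resolution: {max(per_millisecond):.0%}")
```

With these assumed numbers the one-second averages never exceed 2% utilization, while individual milliseconds are at 100%. Any measurement averaged over a longer interval reports the same low figure, so the link looks nearly idle even though it is saturated during each burst.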

For storage we are moving to SSD more and more. The IOPS we get from a 6-disk RAID 10 array will saturate the controller. SSD makes a lot of sense for us because our datasets are pretty small on the whole, since it is only text data. We are also in the process of deploying SSDs to our web tier because they hold our search indexes, which are built using Lucene.NET.

What does the server (hardware, network, OS) environment look like? How many servers, how much storage, etc.

We use a lot of Dell hardware to be consistent, and Cisco for our switches and routers. We have a lot of utility servers for monitoring, logging, etc., but our core server count is pretty small. In our New York facility we have one active load balancer at a time running HAProxy, 9 active production web servers, 2 active database servers, and one active Redis caching server. We also have a couple of DNS servers. For the database, load balancer, and Redis servers we have secondary failover servers as well.

Is there any automation in the administration of the site? I’m looking more for information on puppet, cfengine, kickstart, or any other system administration tools than on the automation within the application itself.

There are 51 (!) Stack Exchange sites now. How much technology and infrastructure do they share? How many are you involved with?

They all live on the same infrastructure. Stack Overflow gets its own database server since it is so big, and that makes it easy for us to see how a single site behaves. For the most part, though, Stack Exchange is a multitenant architecture.

What does the daily/weekly load look like, in terms of visitors, “hits” or any other metric that you use? In fact, what do you consider the key metrics for the site?

We peak from around morning to lunch (for the US) on weekdays. With the static hits moved to the CDN we now see about 500 HTTP requests/s at peak for our NY data center. It looks like we are at about 130 million page views a month now according to Quantcast, which is directly measured: http://www.quantcast.com/p-c1rF4kxgLUzNc

I have been doing a lot of thinking about more advanced metrics for growth on the system side, since from that perspective growth includes new features and design changes as well as new users. I’m in the middle of writing a blog.serverfault.com post about that, so that might have the best answer.

With the huge growth at Stack Exchange and Server Fault, how have you scaled to match?

There have been some bumps, but for the most part we only notice them internally. As I said before, great performance-conscious programmers make it so we can keep our server count pretty low. I also believe in erring on the side of being over-provisioned so hardware is never a bottleneck for growth or new features (http://blog.serverfault.com/post/the-limits-of-cost-benefit-analysis-in-it/).

How do you monitor the health of the site? Cacti, or Nagios or ???

We have been using Nagios, which sends its data to n2rrd for graphing. The whole system is a bit kludgy, so we recently purchased SolarWinds Orion NPM/APM. One of its biggest strengths for us is that it stores its data in SQL Server, which we are all comfortable with. I don’t find pulling the raw data out of RRD-based systems (Cacti, Munin, etc. are all RRD-based) to be that easy when it is all stored in a lot of different RRD files. Orion also makes it a little less time-consuming to add new checks/graphs. More and more I measure how much things cost using time as my currency.

What questions have I not asked? What things would you like your users (and the article readers) to know?

System administration is evolving. The field is going to become less about support and more about managing things in a more scalable and automated way. Knowledge of things like protocols, OS fundamentals, and data interpretation is becoming a given. I also believe coding abilities will be standard, not a plus. You can give the job a new name if you like, but what is important is the advancing skill set.

More than anything else, participation in the system administration community is driving this. By speaking at LOPSA meetings, participating on Server Fault, or writing blog posts, you contribute to our craft and move it forward. Making an effort to be part of the community, even just a little, can really boost your career and helps out your colleagues.

We try to write as much as we can on http://blog.serverfault.com so you can keep up with our ideas and our infrastructure there. Also, Server Fault is a great place for system administration questions, so I hope your readers consider checking it out next time they have one. It is also a low-barrier way to start participating in the community.