What Can System Administrators Learn from Programmers?

Introduction

July 21, 2006

By Ibrahim Haddad

Although we often hear about program bugs and techniques to get rid of them, we seldom see a similar focus in the field of system administration. LinuxPlanet asked Diomidis Spinellis, the author of the book Code Quality: The Open Source Perspective, for tips on what system administrators can learn from programmers.

LinuxPlanet: How would you judge the quality of a system's setup?

Surprisingly, I would use the same attributes as those I employ for describing the quality of a software artifact: functionality, reliability, usability, efficiency, maintainability, and portability. I would ask questions like the following. What services does the system offer its users? Has the system been set up so that it can run uninterrupted for months? How easy is it to manage its services, or to restore a file from a backup? Is there any waste in the utilization of the CPUs, the main memory, or the disk storage? How difficult would it be to upgrade the operating system or the installed applications? How difficult would it be to move the services to a different platform?

System administration is sadly a profession that doesn't get the type of attention given to software development. This is unfortunate, because increasingly the reliability of an IT system depends as much on the software comprising the system as on the support infrastructure hosting it. Nowadays, especially when dealing with open source software, you don't simply install an application; you often install with it a database server, an application server, a desktop environment, and libraries providing functionality like XML parsing and graphics drawing. Furthermore, for the application to work on the Internet you need network connectivity, routing, a working DNS server, and a mail transfer agent. And for reliable operation underneath you often deploy redundant servers, RAID disks, and ECC memory.

A well-set-up system, say a Linux installation, has many quality attributes in common with a well-written program. Both system administrators and programmers can use similar techniques for implementing quality systems.

LP: Give me an example. I have a system with serious time performance problems. Where do I start?

I suggest that your first step should be to characterize the system's load. Using the top command, available on Linux systems, you can see how your system spends its time. Near the top of the screen you will see a line like the following.

CPU states: 80.7% user, 17.1% system, 0.0% nice, 2.2% idle
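Those percentages are derived from cumulative counters that the Linux kernel exposes in /proc/stat. As a rough sketch, here is how the raw jiffy counts map to the user/nice/system/idle percentages that top displays; the counter values below are contrived so the result matches the line above, and in practice top computes the difference between two successive samples rather than using the cumulative totals.

```python
# Sketch: turn raw /proc/stat "cpu" counters (jiffies) into the
# percentage line that top displays. The counter values below are
# contrived so the percentages match the example above.
def cpu_percentages(stat_line):
    # First four fields after "cpu": user, nice, system, idle
    user, nice, system, idle = (int(x) for x in stat_line.split()[1:5])
    total = user + nice + system + idle
    pct = lambda v: round(100.0 * v / total, 1)
    return {"user": pct(user), "nice": pct(nice),
            "system": pct(system), "idle": pct(idle)}

sample = "cpu 807 0 171 22"  # hypothetical counters
print(cpu_percentages(sample))
# {'user': 80.7, 'nice': 0.0, 'system': 17.1, 'idle': 2.2}
```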

You can typically distinguish three separate cases.

Most of the time is spent in the user state. This means that your system's applications are primarily executing code directly in their own context. You will need to tune your applications, using profiling tools to locate the code hotspots and algorithmic improvements to optimize their operation.
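As an illustration of hotspot hunting (the profiler and functions here are just examples, not tools the interview names), Python's standard cProfile module can rank an application's functions by the time they consume:

```python
# Profiling sketch using Python's standard cProfile module; the
# functions here are contrived stand-ins for a real application.
import cProfile
import io
import pstats

def slow_sum(n):
    total = 0
    for i in range(n):      # deliberate hotspot: O(n) loop
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Rank functions by cumulative time to locate the hotspot; in a real
# tuning session you would then attack the top entries, e.g. replace
# this loop with the closed-form sum-of-squares formula.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue().splitlines()[0])
```

The algorithmic improvement then follows from the profile: once the loop shows up at the top of the listing, it becomes the obvious candidate for replacement.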

Most of the time is spent in the system state. This is a situation where most of the time the kernel is executing code on the application's behalf. In this case you will first monitor the applications using tools like strace to see how they interact with the operating system, and then use techniques like application-level caching or I/O multiplexing to minimize the operating system interactions.
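One common form of application-level caching is memoization: remember the answer to a question you have already asked the operating system. A minimal sketch, with a contrived cached stat lookup (not an example from the interview):

```python
# Application-level caching sketch: memoize a stat() lookup so that
# repeated queries avoid repeated system calls. The file and the
# cached function are contrived examples.
import os
import tempfile
from functools import lru_cache

calls = {"stat": 0}

@lru_cache(maxsize=1024)
def cached_size(path):
    calls["stat"] += 1            # counts actual stat() system calls
    return os.stat(path).st_size

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name

for _ in range(100):
    size = cached_size(path)      # 99 of these hit the cache

os.unlink(path)
print(size, calls["stat"])        # 5 1
```

The tradeoff is staleness: a cache like this will not notice that the file changed, which is why cache invalidation policy matters as much as the cache itself.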

Most of the time is spent in the idle state. Here you're dealing with an I/O-bound problem. You'll need to look at the way your application interacts with peripherals, such as the storage system and the network. You can again use strace to see which I/O system calls take a long time to complete, and the vmstat, iostat, and netstat commands to see whether there's more performance you can squeeze out of your peripherals. If your peripherals are operating well below their rated throughput, you can use larger buffers to optimize the interactions; otherwise you must minimize the interactions, again through algorithmic improvements or caching techniques.
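The effect of larger buffers is easy to demonstrate: reading the same data with a bigger buffer issues far fewer system calls. A small sketch, with contrived file sizes:

```python
# Buffer-size sketch: reading the same file with a larger buffer
# issues far fewer read() system calls. The file is contrived.
import os
import tempfile

data = os.urandom(1 << 20)                 # 1 MiB of test data
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    path = f.name

def count_reads(path, bufsize):
    reads = 0
    # buffering=0 makes every read() below a real system call
    with open(path, "rb", buffering=0) as fh:
        while fh.read(bufsize):
            reads += 1
    return reads

small = count_reads(path, 512)             # 2048 calls for 1 MiB
large = count_reads(path, 64 * 1024)       # 16 calls for 1 MiB
os.unlink(path)
print(small, large)
```

Each system call carries a fixed overhead, so cutting their number by two orders of magnitude is often the cheapest optimization available.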

LP: It looks like you can often trade space for time. Is this so?

Yes, I wrote that space and time form a yin/yang relationship, and there are many cases where you can escape a tight corner by trading one of the two to get more of the other. Consider a sluggish database system. You can often improve its performance by adding appropriate indexes, and even by precomputing the results of some queries and storing them in new tables. Both the indexes and the new tables take additional space, but this space buys you increased performance. Now consider the opposite case, where your data overflows your available storage. (Star Trek's "Space: the final frontier" opening line was true in more than one sense.) If you have sufficient CPU time at your disposal, you can devote that spare time to compressing and decompressing your data. Your MP3 player and digital camera have succeeded as products by adopting exactly this design tradeoff.
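The compression side of this tradeoff can be sketched in a few lines (the data here is contrived): spend CPU cycles to make the data smaller, then spend them again to get it back.

```python
# Space-time sketch: spend CPU cycles compressing repetitive data to
# save storage, then spend cycles again to recover it.
import zlib

data = b"timestamp,sensor,reading\n" * 10_000  # highly repetitive
packed = zlib.compress(data, 9)                # level 9: most CPU, least space

print(len(data), len(packed))                  # compressed size is far smaller
assert zlib.decompress(packed) == data         # time buys the space back
```

The compression level knob makes the tradeoff explicit: higher levels burn more CPU time for smaller output, which is exactly the exchange the MP3 player and the digital camera make.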