As programmers, we tend to take sysadmins for granted. The few times I've been without a good sysadmin have really made me appreciate what you do. When we're venturing into an environment without a sysadmin, what words of wisdom can you offer us?

This question exists because it has historical significance, but it is not considered a good, on-topic question for this site, so please do not use it as evidence that you can ask similar questions here. This question and its answers are frozen and cannot be changed. More info: help center.

18 Answers

Always have a backup system of some kind. Even better if it has a history.

Consider single points of failure and how to deal with them should they fail.

Depending on the number of computers involved, looking into a way to create and deploy a standard image across machines will make everyone's life easier - no "it works on mine" because one machine has such and such a program that isn't normally installed.

Document everything, if only because you will forget how you set something up.

Documenting all steps is something I've seen good sysadmins do, and I've started doing it myself. Very helpful, indeed.
– Nathan DeWitt, May 4 '09 at 19:37


Consider self-documenting systems. For example, why keep a list of host names in a text file or wiki somewhere when a well-commented zone file is the canonical source of information?
– Dave Cheney, May 5 '09 at 14:58


Dave, is that well-commented Zone file accessible by all? If I am a new person coming on board, isn't it easier to be told "go to this wiki for all your answers" rather than "everything is documented everywhere. DNS is documented in DNS settings. The whozit is documented in the whozit config file. The database is documented in the database config file." That seems very... unfriendly to me.
– Nathan DeWitt, May 5 '09 at 16:07


Nathan, Dave: The trick is of course to use a script to update the wiki from the canonical source. It's worked wonders for me, I'm really sorry I can't use it where I work now.
– Anders Eurenius, May 28 '09 at 14:24


I would add to this: build a test system. You need an environment where failure is an option. I have a server running VirtualBox for this, but I've used my personal workstation when servers aren't available.
– Mark Porter, May 29 '09 at 15:15

Document everything. If you don't have one, install an under-the-radar wiki, but make sure you back it up. Start off with collecting facts, and one day, a big picture will form.

Create diagrams for each logical chunk and keep them updated. I couldn't count the number of times an accurate network map or cluster diagram has saved me.

Keep build logs for each system, even if it's just copy and paste commands for how to build it.

When building your system, install and configure your apps, test it works and perform your benchmarking. Now, wipe the disks. Seriously. 'dd' the first megabyte off the front of the disks or otherwise render the box unbootable. The clock is ticking: prove your documentation can rebuild it from scratch (or, even better, prove your colleague can with nothing more than your documentation). This will form half of your Disaster Recovery plan.

Now you have the first half of your Disaster Recovery plan; document the rest: how to get your application's state back (restore files from tape, reload databases from dumps), vendor/support details, network requirements, how and where to get replacement hardware -- anything you can think of that will help get your system back up.
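If `dd`-ing a live disk makes you nervous, rehearse the wipe on a scratch disk image until the rebuild drill is routine. A sketch, with all paths illustrative and GNU dd assumed:

```shell
# Practice on a disk image, never a real device.
# /tmp/practice.img stands in for e.g. /dev/sda.
img=/tmp/practice.img
dd if=/dev/zero of="$img" bs=1M count=8 status=none          # fake 8 MiB "disk"
printf 'BOOTSECTOR' | dd of="$img" conv=notrunc status=none  # pretend boot code

# The actual "render it unbootable" step: zero the first megabyte.
dd if=/dev/zero of="$img" bs=1M count=1 conv=notrunc status=none

# Confirm the boot code is gone -- now the clock starts on your docs.
head -c 10 "$img" | grep -q BOOTSECTOR && echo "still bootable" || echo "wiped"
# prints: wiped
```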

Automation:

Automate as much as you can. If you have to do something three times, make sure the second time is spent developing your automation, so the third is fully automated. If you can't automate it, document it. There are automation suites out there - see if you can make them work for you.
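Automation doesn't have to mean a big suite; even a shell script counts, as long as it's idempotent (safe to re-run). A minimal sketch, with a hypothetical config path and settings:

```shell
# Idempotent setup: running it once or ten times gives the same result.
# The path and settings here are illustrative assumptions.
CONF=/tmp/demo-app.conf        # would live under /etc on a real box

ensure_line() {                # append a line only if it isn't there yet
    grep -qxF "$1" "$2" 2>/dev/null || echo "$1" >> "$2"
}

touch "$CONF"
ensure_line "log_level = info" "$CONF"
ensure_line "ntp_server = ntp.example.com" "$CONF"
```

Re-running the script changes nothing, which is exactly what makes it safe to hand to cron or a colleague.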

Monitoring:

Application instrumentation is pure gold. Being able to watch transactions passing through the system makes debugging and troubleshooting so much easier.

Create end-to-end tests that prove not only that the application is alive, but that it really does what it's supposed to. Points are yours if it can be jacked into the monitoring system for alerting purposes. This serves double duty; aside from proving the app works, it makes system upgrades significantly easier (monitoring system reports green, upgrade worked, time to go home).

Benchmark, monitor and collect metrics on everything it is sane to do so for. Benchmarks tell you when to expect something will let out the magic smoke. Monitoring tells you when it has. Metrics and statistics make it easier to get new kit (with fresh magic smoke) through management.

If you don't have a monitoring system, implement one. Bonus points if you actually do jack the above end-to-end tests into it.
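"Jacking in" an end-to-end test usually just means a script that speaks the monitoring system's exit-code convention. A sketch using Nagios-style codes (0 = OK, 2 = CRITICAL); the URL and the expected marker string are made-up assumptions:

```shell
# Probe the app the way a user would, and exit with Nagios-style
# codes so a monitoring system can alert on the result.
check_app() {
    url=$1; expect=$2
    if curl -fsS --max-time 10 "$url" 2>/dev/null | grep -q "$expect"; then
        echo "OK - application answered with the expected content"
        return 0
    fi
    echo "CRITICAL - no (or wrong) answer from $url"
    return 2
}

# e.g. check_app "http://app.example.com/health" "status=ok"
```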

Security:

"chmod 777" (aka grant all access/privileges) is never the solution.

Subscribe to the 'least bit' principle; if it's not installed, copied or otherwise living on the disk, it can't get compromised. "Kitchen sink" OS and software installs may make life easier during the build phase, but you end up paying for it down the track.

Know what every open port on a server is for. Audit them frequently to make sure no new ones appear.
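Auditing "frequently" is easiest when it's mechanical: snapshot the listeners once, then diff against that baseline. A sketch; on a real box the snapshots would come from `ss -tln` or `netstat`, but the files here are fabricated so the idea can be shown self-contained:

```shell
# Report any listening port that is not in the known-good baseline.
baseline=/tmp/ports.baseline
current=/tmp/ports.now

audit_ports() {
    sort -u "$baseline" > /tmp/ports.b.sorted
    sort -u "$current"  > /tmp/ports.c.sorted
    comm -13 /tmp/ports.b.sorted /tmp/ports.c.sorted  # only-in-current lines
}

# Example with fabricated snapshots:
printf '22\n80\n443\n' > "$baseline"
printf '22\n80\n443\n8080\n' > "$current"   # someone opened 8080...
audit_ports    # prints: 8080
```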

Don't try cleaning a compromised server; it needs to be rebuilt from scratch. Rebuild to a spare server with freshly downloaded media, restoring only the data from backups (as the binaries may be compromised) or clone the compromised host to somewhere isolated for analysis so you can rebuild on the same kit. There's a whole legal nightmare around this, so err on the side of preservation in case you need to pursue legal avenues. (Note: IANAL).

Hardware:

Never assume anything will do what it says on the box. Prove it does what you need, just in case it doesn't. You'll find yourself saying "it almost works" more frequently than you'd expect.

Do not skimp on remote hardware management. Serial consoles and lights out management should be considered mandatory. Bonus points for remotely-controlled power strips for those times when you're out of options.

(Aside: There are two ways to fix a problem at 3am, one involves being warm, working on a laptop over a VPN in your pyjamas, the other involves a thick jacket and a drive to the datacenter/office. I know which one I prefer.)

Project management:

Involve the people that will be maintaining the system from day one of the project lifecycle. The lead times on kit and brain time can and will surprise, and there's no doubt they will (should?) have standards or requirements that will become project dependencies.

Documentation is part of the project. You'll never get time to write the whole thing up after the project has been closed and the system has moved to maintenance, so make sure it's included as effort on the schedule at the start.

Implement planned obsolescence into the project from day one, and start the refresh cycle six months before the switch off day you specified in the project documentation.

Servers have a defined lifetime when they are suitable for use in production. The end of this lifetime is usually defined as whenever the vendor starts to charge more in annual maintenance than it would cost to refresh the kit, or around three years, whichever is shorter. After this time, they're great for development / test environments, but you should not rely on them to run the business. Revisiting the environment at 2 1/2 years gives you plenty of time to jump through the necessary management and finance hoops for new kit to be ordered and to implement a smooth migration before you send the old kit to the big vendor in the sky.

Data you're not backing up is data you don't want. This is an immutable law. Make sure your reality matches this.

Backups are harder than they look; some files will be open or locked, whereas others need to be quiesced to have any hope of recovery, and all of these issues need to be addressed. Some backup packages have agents or other methods to deal with open/locked files, other packages don't. Dumping databases to disk and backing those up counts as one form of "quiescing", but it's not the only method.

Backups are worthless unless they're tested. Every few months, pull a random tape out of the archives, make sure it actually has data on it, and the data is consistent.
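The "pull a random tape" drill boils down to two questions: can the archive be read at all, and does what comes out match what went in? A self-contained sketch, with a local tar file standing in for the tape:

```shell
# Build a tiny "backup" to verify, then prove it restores intact.
mkdir -p /tmp/bk/src && echo "payroll data" > /tmp/bk/src/data.txt
tar -C /tmp/bk -cf /tmp/bk/backup.tar src
md5sum /tmp/bk/src/data.txt | awk '{print $1}' > /tmp/bk/data.md5

verify_backup() {   # args: archive, file holding the expected checksum
    tar -tf "$1" > /dev/null || return 1                    # readable at all?
    tmp=$(mktemp -d)
    tar -C "$tmp" -xf "$1" || { rm -rf "$tmp"; return 1; }  # extracts?
    sum=$(md5sum "$tmp/src/data.txt" | awk '{print $1}')
    rm -rf "$tmp"
    [ "$sum" = "$(cat "$2")" ]                              # contents intact?
}

verify_backup /tmp/bk/backup.tar /tmp/bk/data.md5 && echo "backup verified"
```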

"Benchmark, monitor and collect metrics on everything it is sane to do so for. Benchmarks tell you when to expect something will let out the magic smoke. Monitoring tells you when it has. Metrics and statistics make it easier to get new kit (with fresh magic smoke) through management." Pure gold.
– T.J. Crowder, Oct 4 '09 at 16:59

Don't assume it's easy. I know many programmers who think that just because they can set up IIS or Apache on their dev box, they can run a web farm. Understand what the job involves and do your research and planning; don't just think sysadmin work is the easy thing you can do in 10 minutes to get your app deployed.

It goes the other way of course, I've found a few sysadmin types who really don't understand the difference between the sort of scripts and small utility programs we can all knock up and "real" programming.
– RobM, Jun 7 '09 at 16:19


+1 Robert: Or the sysadmin saying "it's a simple if statement" to work around a poorly designed network architecture. Mutual respect and understanding is key.
– SnOrfus, Oct 13 '09 at 18:54

Realize that, for better or worse, many of the servers and/or networking equipment they tend to are very much like children from a second family. These are their babies. They tend them, help them along when they are ill, and monitor them vigilantly for trouble. This shouldn't be this way, but after many years, it often is. Keep this in mind as you communicate to them your concerns about equipment not performing properly or to expectation. And if you get a reply you don't understand, try filtering it through this world-view.

Be on good working terms. Sounds cheesy, but it's worth its weight in gold. Some day, you'll need some special favor. And some day, that sysadmin will be happy to go out of their way to make life a little easier for you, just this one time.

That working relationship goes both ways. If the sysadmin is very busy, and you could make life a little easier by writing a small script or program, then do it! They'll appreciate it more than you know.

Be very clear. "This sucks" is not as clear as "having an intermittent network connection is a bit annoying, any chance you can look at it?"

If you think your app will scale, ask the admin before assuming it will. They might "see" something you don't, or know something about the performance limits of the equipment you're going to deploy on.

If your app needs tuning, but it doesn't appear to be a code issue, ask nicely about how the servers are performing. Sysadmins tend their machines with loving care and are not pleased when they are "ill" or "misbehaved". Asking nicely will turn an ailing machine around (or get it repaired/replaced).

(as mentioned elsewhere) document the settings you use, and why you use them. Just having "set checkbox X" or "uncomment config file line Y" doesn't help. You could be setting the option that erases all your data on the next reboot for all you know.

If you don't have the time to document the setting on paper, try to document it in the system if possible. With config files, this should almost be standard practice - every setting change should be datestamped, with initials, the expected effect of that setting, and the reason why it was changed (see prior bullet point). This little habit has saved my bacon more than once during crunch-time. "Why did we do that?" "Because we mandated policy X, and the setting Y gives us the behavior we need for policy X".
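In practice that in-file habit can be as simple as a comment block above the setting. A hypothetical fragment (the setting, initials, and policy are made up for illustration):

```
# 2009-05-28 ND: request_timeout raised 30 -> 120 per policy X;
# short timeouts were killing the long-running payroll reports.
request_timeout = 120
```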

Beer. Or Cola. Or even Water. Beverages are always welcomed. Being a sysadmin is thirsty work.

For the configuration file documentation/change issue, I recommend putting all configuration files in a version control system. This should be very easy for programmers to do, since they are hopefully already using such a system for their source code. If they also add a comment whenever they commit a change, it will be easy to go back in history and see what was changed when, and why.
– Anders Sandvig, May 5 '09 at 7:37

+1 for that, as it "closes the loop" on change management. Great suggestion.
– Avery Payne, May 5 '09 at 12:40
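A sketch of the version-control suggestion, assuming git is available and using a throwaway directory in place of a real /etc:

```shell
# Put a config directory under git so every change is dated,
# attributed, and explained in its commit message.
rm -rf /tmp/etc-demo && mkdir /tmp/etc-demo && cd /tmp/etc-demo
git init -q .
echo "request_timeout = 120" > app.conf
git add app.conf
git -c user.name="Ops" -c user.email="ops@example.com" \
    commit -qm "app.conf: raise timeout to 120s per policy X"

git log --oneline    # the change history doubles as documentation
```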


Excellent suggestion for giving clear error reports. Nothing frustrates me more than, after being told that there is a problem and knowing that it could potentially affect a lot of people, having to tease the details out of a disinterested programmer.
– Dave Cheney, May 5 '09 at 15:00

How to debug networking issues and watch your program run with sysadmin tools. As a programmer who got started in system administration, I'm amazed by how impotent many programmers become once networking "just stops."

Wireshark, to watch your code run in a black-box fashion, packet-by-packet

Tools to connect directly to network services:

Telnet, netcat, or socat for plain connections over TCP or UDP

OpenSSL for the same thing with encryption (hint: try openssl s_client -connect target-host:port sometime)

dig (in the BIND 9 package) for debugging name resolution

Being able to tell what part of the network stack failed based on the timing and other characteristics of a failed connection
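That last point - reading the failure mode from its timing - can even be scripted. A bash sketch using its built-in /dev/tcp device: a near-instant failure usually means "connection refused" (host up, port closed), while hanging until the timeout suggests a silently dropping firewall or a dead host.

```shell
# Classify a TCP connect attempt by how (and how fast) it fails.
probe() {
    host=$1; port=$2
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "open"
    elif [ $? -eq 124 ]; then            # timeout(1) exits 124 on expiry
        echo "filtered or down (timed out)"
    else
        echo "closed (refused)"
    fi
}

# e.g. probe www.example.com 80
```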

+1. Any developer writing an application dependent on solid network performance should read 'TCP/IP Illustrated v1', by the late great W. Richard Stevens, before ever starting to code.
– Murali Suriar, May 4 '09 at 20:34


Thanks for all the upvotes guys. It's bummed me out for years to see programmers at a helpless standstill once the underlying networking fails. And these days, almost all programming is network programming.
– jhs, May 5 '09 at 1:06

It's very easy to pass the buck (e.g., your network is hosing my communication with the database). It may be the network's fault, but you should have application logs with errors that, using Google or SO, may reveal a problem in an app's configuration.

Everyone likes to blame the hardware, OS, or network, so if you practice a little more due diligence, you'll make the sysadmin a happy person. Because, if nothing else, you might be able to point them in a specific direction as to what might be wrong (as opposed to saying "your network sucks" or something equally helpful).

Document everything you can. I can't tell you how many times the last sysadmin thought it would be cute not to document something for 'job security', or someone just wanted to get in and get out. Just as a programmer should leave good comments, sysadmins should document. A diagram of the topology would be nice, too.

Documentation: no need to go nuts, but how the application works, a diagram showing how the bits fit and ways to test each component when it all goes wrong. Sample data and output is nice.

Requirements: what modules does it rely on? Versions? OS?

Monitoring: ideally developers would include monitoring information and tests with the application.

Speaking of packaging, PACKAGING! Nothing worse than a "deployment" which means checking out a new revision of a file from VCS and copying it to a bunch of servers. Too often programmers don't appreciate the complexity of deploying software: there are reasons why versioned, packaged software forms the backbone of most OSes.

If a developer came to me with an RPM which installed first time with concise, comprehensive documentation and some Nagios tests they'd be my new best friend.
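Even without going straight to RPM/deb, the minimum bar is a single versioned, checksummed artifact rather than files copied out of version control. A sketch, with a made-up application name and version:

```shell
# A poor man's package: a versioned tarball with a checksum manifest.
# A native package (RPM/deb) is the real goal, so the OS can track it.
VER=1.2.3
stage=/tmp/myapp-$VER
mkdir -p "$stage/bin"
printf '#!/bin/sh\necho myapp\n' > "$stage/bin/myapp"
chmod +x "$stage/bin/myapp"

( cd "$stage" && find bin -type f -exec md5sum {} + > MANIFEST.md5 )
tar -C /tmp -czf /tmp/myapp-$VER.tar.gz "myapp-$VER"

# Deployment becomes "extract one named artifact, verify MANIFEST.md5",
# not "copy some files out of version control and hope".
```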

This may apply only to beginning programmers, but I deal with a few things on every project with some programmers.

"It works on my machine" is never a valid statement. It is the programmer's responsibility to create an install program for use on the server, or at least to document every connection, DLL, and add-in that will be required on the server.

(I've heard this multiple times, so please don't laugh.) "I run the exe on the server from my machine and it works. But when I run it on the server (Citrix, Terminal Server, etc.) it doesn't work." Please understand DLLs and OCXs and anything else your program requires, where and how they are registered, and how your program uses them.

Talk to your admin both formally and informally about what you are doing. They will usually be interested and can point out possible impacts on production early on. You don't have to agree, but it helps identify trouble spots.

No, you can't have the entire server to yourself. If you need to, that's a political decision, regardless of how technically sound it is. If you want to work the politics, go right ahead.

Production hardware often looks different from your development server, and even within farms, the specs on machines differ.

Learn how production is set up, because you likely cannot replicate it on your desktop; knowing it prevents you from making poor assumptions.

Just because you can cache stuff in memory doesn't mean you should; wait for the bottleneck first (in unit testing or pre-production performance testing).

If you are sticking data in a database, think about how you could split it into read-only data (which can be scaled horizontally) and read-write data (which usually only scales vertically).

If you are sticking data in a database, must it really be an RDBMS? There are key-value systems out there that scale better (netcache).

Don't think AJAX is the be-all and end-all solution. It looks cool, but it limits the monitoring and automation possibilities. I'm not saying don't use it; just think twice.

a) When coding, assume that underlying infrastructure could fail, and does not come from happy-happy always-on land. Or Google.

b) We probably don't have the resources to implement anything like the infrastructure you've read about, so take it easy on us when things go down. It's likely we know what needs to be done, but for whatever reason it just hasn't happened yet. We are your partners!

c) Like jhs said above, it would really help if you had a passing familiarity with tools to troubleshoot the infrastructure, such as ping, traceroute (or combining both - mtr), dig, etc. Massive bonus points for even knowing about Wireshark.

d) If you program a computer, you really should know how it connects to the network and the basics like being able to parse the output of ipconfig /all or ifconfig. You should be able to get your internet connection up and running with minimal help.
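Point (d) mostly means being comfortable pulling the addresses out of that output. A sketch with awk; the sample lines are fabricated so the parsing can be shown without touching a real interface:

```shell
# Extract IPv4 addresses from `ip -o -4 addr show` style output.
v4addrs() { awk '{ for (i = 1; i <= NF; i++) if ($i == "inet") print $(i+1) }'; }

sample='1: lo    inet 127.0.0.1/8 scope host lo
2: eth0  inet 192.168.1.23/24 brd 192.168.1.255 scope global eth0'

echo "$sample" | v4addrs    # prints 127.0.0.1/8 then 192.168.1.23/24
# On a real Linux box: ip -o -4 addr show | v4addrs
```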

Otherwise I think Avery pretty much nailed it. Devs who do a little sysadmin are worth their weight in gold! But equally, sysadmins who understand how devs go about things (including versioning, etc.) are pretty much essential in this day and age.

This seems to be in the air at the moment, I've noticed more discussion about the dev/ops relationship in blogs - check out

That no one group or function is 'better' than another, and that none requires 'bigger brains' than the others either. I've seen both sides get all prima-donna-ish in the other's company. You're all trying to achieve the same goals - focus on those similarities, and not on the fact that you use different tools.

Infrastructure architect turned programmer, might want to roll back that transaction in the future though :)

Talk to each other, early and often. Review designs with the guys who will manage the infrastructure your app will get deployed on (if you know who that will be).

Zero data loss is possible, but it is a responsibility shared by developers and sysadmins. Again, talking to each other may help here.

Your infrastructure staff should have been involved in determining the non-functional requirements.

Arrange beer (when the job is done) and pizza (while we're working). Somehow, the presence of that kind of food impacts our ability to make our nice little 32 cpu boxes do whatever you want them to do :)

As someone that has been a sys admin for developers, and a developer myself, the advice given here is not only gold, but should be part of the hiring documentation for new developers for companies all over.

Something that I haven't seen explained (yet) is that developers really should know the products they'll be using to create the programs they are paid for. The number of times I've had to explain and configure Apache servers, Eclipse and Visual Studio installs, and databases on developer machines is a bit worrisome.