All that is missing is a scripted WAV file to play "Wake Up Damnit!" every time it sprang into action. However, if the first server ever died completely, it might be a pretty sad scene to see the second server performing CPR on the first and begging it to wake up.

I used to build a circuit with a 555 timer, a 74LS00 and a couple of 74LS191's which would count very slowly to 16 (it took about 4 minutes) but would reset the counter to zero every time the hard disk light blinked. The "overflow" output pin on the 4-bit counters was connected to the RESET signal on the motherboard. No disk activity for 4 minutes and the machine gets reset. With an appropriate interface chip the serial port could be monitored too (I used to need this when I ran a BBS).

Isn't it possible with 555 timer, a resistor, a capacitor and just one 7400 series counter? I made a CAT5 cable tester (a very crude one) using just these components.

"the faulty server was taken offline and replaced with a new one working under a new IP address. During the swap, ITAPPMONROBOT was moved to a neglected corner of the server room, plugged back in, and promptly forgotten. It spent the last weeks of its life dutifully opening and closing its CD ROM drive every two minutes, reaching in vain for the restart button that it'd never touch again."

So the new server still had the same problem? The CD eject was by the ping script, remember?

Ummm, no it didn't have the same problem. As it says, the new server was commissioned with a new IP address, so pinging the old server's IP address would result in no response.....and trigger the eject sequence :)
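
For anyone who wants to see why the tray kept cycling, here is a rough sketch of the kind of loop the article describes. It is a guess at the logic, not the actual script: the two-minute interval comes from the story, but the function names, the Linux-style `ping` flags and the use of the `eject` command are all assumptions.

```python
import subprocess
import time


def host_is_up(address, run=subprocess.run):
    """Return True if one ping to `address` is answered (Linux-style flags assumed)."""
    result = run(
        ["ping", "-c", "1", "-W", "2", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def press_reset(run=subprocess.run):
    """Open and close the CD tray so it bumps the neighbouring reset button."""
    run(["eject"])        # tray out
    run(["eject", "-t"])  # tray back in


def monitor_once(address, is_up=host_is_up, reset=press_reset):
    """One check: poke the reset button only if the host does not answer."""
    if is_up(address):
        return False
    reset()
    return True


def monitor_forever(address, interval_seconds=120):
    """Repeat the check every two minutes (interval taken from the story)."""
    while True:
        monitor_once(address)
        time.sleep(interval_seconds)
```

Once the old address stops answering for good, every call to `monitor_once` fails the ping and fires the tray, which matches the behaviour in the story.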

The real WTF is that none of you supercomputer guys know anything about the RESETATOR piece of hardware. It's a simple, small piece of HW that sits on the serial port waiting for a ping. If none comes in within some period, it triggers a switch, which may be connected to the machine's reset button.
There are also enhanced GSM versions, which can reset a computer over the GSM network :)

Oh, you really need to learn... :))

Who cares, this story is much better.
Poor little robot. I want to squeeze and hug it.

Actually, I did the same thing back in 2001ish using one Linux server to keep another up. The positioning was the hardest thing.

Redundant Array of Inexpensive Servers?

I wonder if you could expand that - a circle of four servers, each one resets another one if it goes down. I guess you'd have to do some mechanical work to ensure they can press each other's reset button.

(Yes, I know that primitive circuitry would allow a computer to power-cycle another one. But where's the Rube Goldberg fun of that?)

Wait, so it'd potentially reset a perfectly working machine simply because it hadn't seen disk activity in 4 minutes?

Technically since the device was looking for edges in the LED signal, it would also reset the machine if a single access took more than 4 minutes (because for example your disk drive firmware has crashed)...but yes.

Modern server machines have watchdog devices like this built into the chipset or buried in the IPMI subsystem. Once you enable the device, the OS has to send a signal to it every N seconds for some configurable value of N, or the machine gets reset. Usually you have some program running on the machine that tests your machine's favorite service and then tells the OS to zero the watchdog timer. If your favorite service hangs, so does the test program, then the watchdog timer expires and the machine resets. Ditto if the OS crashes.
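
A minimal sketch of the "test program" side of this, assuming a Linux-style watchdog character device where any write restarts the timer. The function names, the 10-second interval and the health check are hypothetical placeholders. Unlike the hang-and-starve behaviour described above, this version simply stops feeding when the check fails, which has the same end result.

```python
import time


def service_is_healthy():
    """Hypothetical health check; replace with a real test of your service."""
    return True


def feed_watchdog(device, check=service_is_healthy, interval_seconds=10,
                  max_feeds=None, sleep=time.sleep):
    """Write to the watchdog device for as long as the health check passes.

    `device` is any writable binary file object. Returns the number of feeds.
    If the check fails (or hangs), the feeding stops and the hardware timer
    is left to expire and reset the machine.
    """
    fed = 0
    while max_feeds is None or fed < max_feeds:
        if not check():
            break
        device.write(b"\0")  # any byte restarts the countdown
        device.flush()
        fed += 1
        sleep(interval_seconds)
    return fed
```

On a real machine this would be called with something like `open("/dev/watchdog", "wb", buffering=0)` as the device. What happens when that file is closed depends on the driver and kernel configuration, so check before relying on it.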

My timer boards were in use a decade or so before anything "modern," and I used an existing signal from the motherboard to avoid having to build interface circuitry. Using the hard disk LED pin or RS-232 TXD pin was easy since there were standardized connectors for these signals. No OS-specific device drivers required either, just write something to the disks or send something through the serial port every few seconds--things healthy systems typically do already.

When the machines I fitted with these timers are running their normal workloads one can safely assume that if more than 15 seconds pass between disk accesses the machine has crashed, or is in a sufficiently degraded state to be equivalent to crashed for all practical purposes.

The 4 minute timeout was only as long as it was to accommodate the BIOS power-on routines. Most of the time the counter was held at "0" during normal operation (it takes 15 seconds to reach "1"). Later versions of the board used a LED to display the clock signal just to make the board look like it was doing something during normal operation.

There was also a bypass switch on the output of the timers so you could spend more than 4 minutes fiddling with the BIOS or running diagnostic software or something.

Isn't it possible with 555 timer, a resistor, a capacitor and just one 7400 series counter? I made a CAT5 cable tester (a very crude one) using just these components.

Almost. The 555 gets unstable if you make its period too long. At the extremes the clock period is unpredictable or it stops entirely. I found that I could get to about 15 seconds without much jitter, which means 16 clock pulses take 4 minutes. If it wasn't for this problem I wouldn't need the digital counter at all: I could just configure the 555 timer to cycle once every 4 minutes and reset it with the HDD LED input.
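
The arithmetic can be checked with the standard 555 astable formula, T = ln(2) * (R1 + 2*R2) * C. The component values below are made up for illustration (I don't have the original values); they just happen to land near a 15 second period.

```python
import math


def astable_period(r1_ohms, r2_ohms, c_farads):
    """Period in seconds of a 555 in the standard astable configuration."""
    return math.log(2) * (r1_ohms + 2 * r2_ohms) * c_farads


def watchdog_timeout(clock_period_s, counts=16):
    """Time in seconds for a counter to see `counts` clock pulses."""
    return clock_period_s * counts


# Hypothetical parts: R1 = 10k, R2 = 100k, C = 100 uF gives roughly 14.6 s.
period = astable_period(10e3, 100e3, 100e-6)
print(f"clock period: {period:.1f} s")
print(f"timeout with a 4-bit counter: {watchdog_timeout(period) / 60:.1f} min")
print(f"timeout at exactly 15 s per tick: {watchdog_timeout(15.0) / 60:.0f} min")
```

Values this large (big electrolytic, high resistance) are exactly where leakage starts to make the 555 misbehave, which is why the counter does the rest of the work.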

I'm pretty sure you need some kind of logic chip too--at least one inverter, because most interesting signals from the counter chip (overflow, MSB, etc) are active-high and RESET is active-low.

I needed to detect transitions in the HDD LED signal because the OS might crash during a disk access. If I detected only the LED-on state then such a crash would hold the counter in its reset state and the machine would not be rebooted.

When I designed the board I had a lot of 4-bit counters left over from a cancelled project. I used two counters on the board, one which counted when the LED was on and one when the LED was off. The two counter overflow outputs were ORed together and the (inverted) result went to the ~RESET line on the PC.
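
Here is a small simulation of that two-counter arrangement, as I understand it from the description. It assumes each counter is cleared whenever the LED is in the opposite state, and that one sample equals one clock tick; both are assumptions, and the function name is made up.

```python
def simulate_watchdog(led_samples, overflow_at=16):
    """Return the tick index at which RESET would fire, or None if it never does.

    led_samples: iterable of booleans, one per clock tick (True = LED on).
    """
    on_count = 0   # ticks spent with the LED on, cleared while it is off
    off_count = 0  # ticks spent with the LED off, cleared while it is on
    for tick, led_on in enumerate(led_samples):
        if led_on:
            on_count += 1
            off_count = 0
        else:
            off_count += 1
            on_count = 0
        # The two overflow outputs are ORed together to drive RESET.
        if on_count >= overflow_at or off_count >= overflow_at:
            return tick
    return None


# A blinking LED never trips the watchdog.
print(simulate_watchdog([True, False] * 40))
# A crash in the middle of a disk access (LED stuck on) still does.
print(simulate_watchdog([True, False] * 5 + [True] * 16))
```

The second case is the reason for detecting transitions: a design that only looked for "LED on" would treat the stuck-on LED as healthy forever.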

If I used only one counter I'd need some other chip to detect edges in the HDD LED signal. Since I had lots of counter chips for free, the economics of the alternatives didn't work out.

Some prototype versions of the board had lots of unnecessary parts from the spares'n'leftovers bin: 7-segment LED decoders (driving 7-segment LED displays of mismatched sizes and colors), or demultiplexers driving bar-graph LED arrays, sometimes even a third counter that just counts up to drive blinkenlights.

I am surprised no one commented on the WTF of an undocumented server. If Erik had documented its purpose, there is a chance someone might have read the documentation when they went to move the monitoring server and realized it was no longer needed. Note: I said "a chance"! :-)

Remember that the server has "DO NOT MOVE" written on it. Did you think that someone who didn't read the text printed on the machine would read documentation from some other place?

These days even sub-$100 UPSes have USB monitoring and control. At the very least you can command the UPS to power off. If AC power is present the UPS will power on again some seconds later (good UPSes will allow you to configure this, not-so-good ones do this after a fixed delay); otherwise the UPS stays off until power is restored (good UPSes will also wait until a minimum battery charge is present, not-so-good ones will power on anyway and might fail if AC power dies again soon after). This is a pretty basic capability that is required for controlled system shutdown and automatic restart--even the proprietary 3-wire RS-232 port interfaces provide a signalling mechanism for this.

Modern Unix systems (even Linux) have had journalled filesystems for years now. Database servers and commercial UNIX filesystems have implemented their own journalling or logging for years before that. Even NTFS is reasonably bulletproof against worse failure modes than a system lockup (although the shoddy quality of typical software for systems that actually use NTFS tends to negate this advantage). These systems are designed to cope with unexpected system halts and resets, but not necessarily power failures. In the ITAPPMONROBOT case it seems we are dealing with a non-power-failure case, since the machine has stopped responding to pings for up to two minutes before it is reset.

In cases where a server is strictly locking up (i.e., the system crashes before writing any corrupt data to disk, the disks are fundamentally healthy, and power is maintained to the disks throughout) there should be no data loss, or well-defined data loss (e.g., the last N transactions committed might not be replayed during startup recovery, depending on configuration). Many crashes fit within those criteria.

New high-performance system designs take into account the fact that system power fails slowly and not all at once--your RAM can lose power before your disk drives and disk controller, causing any data writes that were initiated before the power failure to store junk on the disk. This can do some serious damage to your data even if you are using all the standard software journalling/logging capabilities.

The real WTF is that none of you supercomputer guys know anything about the RESETATOR piece of hardware. It's a simple, small piece of HW that sits on the serial port waiting for a ping. If none comes in within some period, it triggers a switch, which may be connected to the machine's reset button.
There are also enhanced GSM versions, which can reset a computer over the GSM network :)

Oh, you really need to learn... :))

Since you will have noted the fact that there was no money budgeted to purchase hardware, and you wouldn't have mentioned it unless they could have gotten one anyway, where do they give these resetators away?

I agree this was brilliant. UPC costs money and unless the BIOS is a relatively new one, the server will still require manual intervention after the UPC kicks in.

I think you mean UPS, and you are incorrect because a really old server will have hard power (thus the circuit will remain complete as long as the power button is down) and a somewhat newer server will ALWAYS have power on settings in the BIOS. I haven't seen an ATX-based server without "full on" settings in its BIOS... ever.

This is easily the greatest Daily WTF of all time. It has the lowly staff triumphing in the face of management stupidity! It has the greatest hack known to man. It has a sad story of one brave soldier abandoned after he has done his duty.

I will now cycle my CD drive to salute this steadfast little machine.

Seriously, the only way this would be better is if it were a precisely aligned USB Foam Dart Missile Launcher ;)

The real wtf is that so few commenters read the actual article before commenting! Phrases like "budget freeze", "turn of the 21st century" and "worked for years" are passed over so they can demonstrate their knowledge of modern technology that should've been used to provide a more elegant solution. Never mind that this is mid-90s technology in a BUDGET FREEZE (even inexpensive new hardware would've been an infrastructure change and would need management clearance - chance!). And so many people have either missed "new IP address" or don't understand ping (although surely the old server's IP address would've been recycled and assigned to another machine?).

Of course, the really elegant solution would've been to strip down the CD drive and fashion a small lightweight extending arm (probably out of a pencil or a biro tube) which would've been easier to position; or could've driven a seesaw type rocker from on top of the faulty server. But that's just being hyper critical, and this is a lovely story...

At least one person here could use a common sense/reading comprehension/sense of humor update. Let me see if I can help that person out. (That person would be you, in case you have trouble figuring it out.)

No budget to replace existing hardware in the story.

None of what you rambled on and on about has any bearing.

Nothing in the story gave a date (even approximately) indicating when this happened, so it could have been prior to any of the stuff you ranted about being available.

Probably 99.9999% of the people here knew what you were spouting off about, so you wasted our time telling us things we knew already in an attempt to make yourself appear smarter than you obviously are (otherwise, you wouldn't have made this senseless posting).

It annoys people when you post something in the tone you used (that of a pompous know-it-all).