Scaling IT

Just about everyone can have a free web page. You get them free when you open cloud accounts or purchase internet service. This has lead to a proliferation of cat pictures on the Internet.

Back in the 90s, when it cost a little more to get on the Internet, the idea of personal web pages was just beginning. One very large ISP (Internet Service Provider) that used SGI systems wanted to sell personal websites. They felt SGI's Challenge S system was the perfect solution. They would line up hundreds of these systems, and each system could handle several sites. SGI did indeed set several website access records for handling the website for "Showgirls,” which, as you can imagine, had a racy website.

Fast forward a few months and there are 200 systems lined up in racks handling personal web pages. Then I start getting phone calls.

"Hey Steve. These guys are filing cases about two or three times a week to get memory replaced. We're getting parity errors that cause panics about two or three times a week."

I fly out and start looking carefully at the machines. The customer had decided to purchase third party memory (to save money) so they could max out the memory in each system. Each machine had 256MB of RAM, which was a lot at the time. This was parity memory, which means that each 8 bits has a parity bit that is used like a cheap "double check" to make sure the value stored is correct. The parity bit is flipped to a 1 or 0 so that each 8 bits always has an even number of 1s in it. If the system sees an odd number of 1s, it knows there's a memory error.

I looked at each slot. I looked at ambient temperature. I made sure the machines were ventilated properly (including making the customer cover all the floppy disk holes since they did not have floppies installed, but had neglected to install the dummy bezel). No change. Parity errors continued and clearly there was an issue.

Going back to the memory vendor and the specs on the chips, we started doing the math.
The vendor claimed that due to environmental issues (space radiation etc) one should expect a single bit parity error about once every 2000 hours of uptime for each 32MB of memory. Half of these errors should be "recoverable" (i.e., the data is being read and can be read again just to be sure), but the other half will lead to a panic. They do not mean the memory is broken, but the errors should be rare.

So let's do the math: 256MB/machine (so that's 8X 32MB).
Hours of uptime? (These machines are always up): 8760hours
How many total parity errors: 35 per system, per year, with half of them being "fatal." So, that’s 17 panics per system per year. They had 200 systems. That's 3400 panics a year in that group of systems or roughly 10 per week?!

Consider this when you start to scale up your IT systems. How many machines do you have to put in a room together before "once a year" activity becomes "once a day?”

I had one of those at home for a while. R3000 MIPS chip. If you could find "The Magic Garden" (a kernel internals book) it gives lots of nice assembly language and stack trace examples specifically for that chip.