The trouble with Dell

Note: I wrote this in mid-2010 and, for whatever reason, never posted it. I found it this week. Although the information in it is no longer fresh, it’s still useful, so I’m posting it now.

Dell is standing on some shaky ground right now. Bill Snyder has a good summary of the problem.

In recent years, Dell computers have, shall I say, made me nervous. Some of it’s been concrete. Some of it’s just been touchy-feely. Now one of those touchy-feely problems is more concrete.

I’ll preface this by saying I do own some Dell equipment. I own a Dell PC. A very old Dell PC. It’s about 15 years old or so. I bought it a decade ago, and it’s still functioning as a web server. It’s slow, it’s old, and it long ago outlived its usefulness, but it dutifully keeps trying its best. I also own a Dell Inspiron laptop that dates to 2006 or so. Though aged and nearing obsolescence, it’s reliable. The only trouble it’s ever given me was the battery dying, which is completely expected.

But my experience with Dell equipment overall is very uneven. That’s why Dell frustrates me. You can buy a Dell and it’ll be the best computer you ever owned. So you go buy another one, of course, and the replacement could be just as good. Or it could be so bad, you wonder if Steven the stoned-off-his-butt Dell Dude built it himself while using it as an ashtray.

Dell tends to inflame passions. Like Apple, Dell has fans who believe it can do no wrong. And those who have seen Dells fail tend to get a bit passionate when discussing Dell computers with those fans.

The current Dell scandal is pretty clear cut.

Mid-decade, Dell was using capacitors made by Nichicon. Dell determined that these particular capacitors had an absurdly high failure rate of 97%, and for whatever reason, it continued to use them.

A capacitor is an electronic component. Computers have a large number of them. Among other things, their job is to ensure a clean, steady flow of electricity.

Nichicon is a reputable company that’s been in business for 50+ years. Two problems appear to have happened in the mid-00s. Nichicon made a bad batch, and some counterfeit Nichicon products made it onto the market.

It’s not unusual for a quality capacitor to last a decade or more, often longer than the useful life expectancy of the machine. A poorly made capacitor can fail in a matter of months, usually well within the machine’s warranty period.

Symptoms include weird video, system crashes, and, in extreme cases, a machine that won’t power up at all.

A number of companies got bitten by the problem, not just Dell. HP, Apple, and Intel ran into it in 2004-2005. But HP was very forthcoming. It admitted the problem and gave details. Apple and Intel quietly started using different components.

Unlike HP, Dell wasn’t forthcoming, and unlike Apple and Intel, it was slow to shift to alternatives. And unlike the rest, Dell was a lot more reluctant to replace failed parts under warranty. It took a while for that to catch up with Dell, but it’s catching up now.

A similar but unrelated problem surfaced with capacitors from a variety of Taiwanese manufacturers in the 2000-2001 timeframe. This problem ultimately forced Abit, once a very popular maker of motherboards, out of the market entirely.

This explains why Dell has always made me uncomfortable.

Dell’s products usually cost a few dollars less than their equivalents from HP or IBM. My first experience with a Dell server was unpleasant. I unboxed the unit and noted that the entire system chassis flexed under its own weight. That’s not a huge deal, I suppose, but it puts undue stress on the motherboard. I’d rather spend $10 more and get a case that protects the motherboard instead of subjecting it to mechanical stress every time you move the system.

When I opened the system up to plug in a network card, I discovered something else I didn’t like: an Intel desktop chipset. Inside an HP equivalent at the time, I would have expected to see a ServerWorks chipset. Sure, you can usually get away with using desktop-grade hardware in the server room, but when the difference in cost is $100, I’d rather have the server-grade hardware. If the server crashes one additional time over the course of its life due to overstressing those desktop components, that downtime will cost you more than $100.

If you’re going to make a big deal about five-nines reliability, pay the money.
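
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It works out how much downtime per year a given number of nines allows, and what a single outage might cost. The function names and the $500-an-hour outage cost are made up for illustration; plug in your own shop’s figures.

```python
# Back-of-the-envelope availability math. All dollar figures are
# illustrative assumptions, not measurements from any real shop.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted at N nines of availability.

    Five nines means 99.999% uptime, so the allowed downtime fraction
    is 10**-5 of the year.
    """
    return MINUTES_PER_YEAR * 10 ** -nines


def outage_cost(minutes_down: float, cost_per_hour: float) -> float:
    """Cost of one outage, given an assumed hourly cost of downtime."""
    return minutes_down / 60 * cost_per_hour


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")

    # Hypothetical: one extra half-hour crash at $500/hour of lost work.
    print(f"One 30-minute outage: ${outage_cost(30, 500):.2f}")
```

Five nines leaves you a little over five minutes of downtime a year. And with those assumed numbers, one extra half-hour crash costs $250, which is already more than the $100 you saved by skipping the server-grade hardware.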

More recently, I worked in a shop that used mostly Dell hardware. Many of those systems were big, 5U systems with lots of CPUs and cost upwards of $50,000. I had to help a Dell technician swap a motherboard in one of those systems, so I know those were made with higher-grade components than those 1U systems that initiated me into the Dell world.

In everyday use, those systems were OK. We got the occasional weirdness, but in a network that has nearly 200 servers in it, you’re going to get occasional weirdness no matter whose hardware you’re using. Cosmic rays can cause occasional weirdness, and nobody’s systems are immune to that.

But the systems would fail predictably. Every year, we would do a huge project across the network that caused a tremendous number of database transactions–enough to keep the systems busy for the entirety of a 3-day weekend, even if everything went well. The systems would run full bore trying to process everything.

Every time we did this, some number of systems would fail. Not all of them. And usually not the same ones. But a handful of these systems, which were generally fine the other 50 or 51 weeks out of the year, would predictably develop random hardware faults under stress. And technicians–when I could manage to convince Dell to send one out–didn’t always find anything wrong. Often we would shut it down, pull it out of the rack, open it up, look around, move components to try to isolate the problem, and the problem would be gone by the time we powered it back up. Just shutting the system down long enough for it to cool off seemed to help as much as anything else we could do.

The systems had high-end Northwood-generation Xeon CPUs in them. Northwood chips are infamous for running hot and not aging particularly well. So I figured that was probably part of the problem.

But now I know that Dell was using questionable Nichicon capacitors in PowerEdge servers in 2004-2005. These servers were made during that timeframe. Capacitors are most critical when a system is drawing a lot of power.

Occasionally our diagnostics–when we were able to run them–would point to the CPUs, but not always. The unpredictable nature of capacitors failing–after all, there are a lot of them in a system–explains the random nature of some of the other failures we were seeing.

It seems to me that Dell has finally explained one of the most frustrating ongoing problems of my career.

It also seems to me that I can feel more confident of Dell equipment manufactured in 2010 (or now, for that matter) than equipment manufactured in 2005. After all, Intel learned a lot from the Northwood problems, and one of the results was an entirely new chip architecture. Nichicon presumably has learned too. If it hadn’t, there are a half-dozen or so other companies who make equivalent and completely interchangeable parts that work well.

The question is whether Dell has learned its lesson. Some companies learn from their mistakes–IBM learned from its antitrust problems and Micro Channel’s failure in the marketplace. It couldn’t save its PC business, but it still has a profitable server business. It took a while, but IBM did it. Even Microsoft has learned from its cavalier attitude towards security. It isn’t perfect, but to its credit, it has shown steady, year-over-year improvement for nearly a decade now.

But Gateway didn’t learn from its mistakes, and went from being a darling of the industry to being bought for pennies on the dollar by a Taiwanese conglomerate that was really only interested in the name and distribution channels. You can buy a Gateway computer today, but aside from the cow logo on the front and on the box, it’s no different from a plain old Acer. Which is actually a pretty big improvement.

Packard Bell didn’t learn from its mistakes either, and faded away more quickly than anyone thought possible. The brand still exists today in Europe, but it, too, is just an Acer with a different logo. In the United States, it’s not much more than a source of jokes for people who’ve been around long enough to remember it.