Mitigation and solving the problem for good

If this issue affected only consumer products, such as the faulty 6-series chipset for Intel 2nd Gen Core CPUs, the dozens of known issues with Apple MacBooks, Macs and iPhones (bad GPU solder, faulty FireWire ports, ‘bendgate’ on the iPhone 6 Plus) or common faults like bad capacitors across all brands, an end user would simply be inconvenienced while the device was repaired or replaced. Financial implications aside, a replacement is easily sourced.

For business/enterprise gear, especially equipment that runs services for third-party customers, things get more complicated: service level agreements (SLAs) make a provider liable for outages to customer services, as well as for any loss of reputation or damage to equipment.

SLAs are often defined as a high-90s uptime percentage. For example:

99.9% uptime allows roughly 43 minutes of outage per month.

99.99% uptime allows roughly 4.3 minutes of outage per month.

Some providers advertise 100% uptime, which implies zero minutes of outage per month. This is impossible regardless of high-availability technology or provider. Third-party services such as power fail or become unstable outside a datacentre’s control; international telecommunication lines go down in storms, when a ship damages an undersea cable, or, commonly enough, when contractors cut cables during construction or digging. In extreme cases, acts of god and natural disasters such as floods, hurricanes or cyclones can take out entire facilities.
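To put those percentages in perspective, the permitted downtime for a given SLA is simple arithmetic. A quick illustrative sketch (the 30-day month is an assumption; real SLAs define their own measurement window):

```python
# Convert an SLA uptime percentage into allowed downtime per month.
# Assumes a 30-day month (43,200 minutes).

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of outage permitted per month at the given uptime percentage."""
    return MINUTES_PER_MONTH * (1 - uptime_percent / 100)

for sla in (99.9, 99.99, 100.0):
    print(f"{sla}% uptime -> {allowed_downtime_minutes(sla):.1f} min/month")
# 99.9% uptime -> 43.2 min/month
# 99.99% uptime -> 4.3 min/month
# 100.0% uptime -> 0.0 min/month
```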

No cloud or network infrastructure provider can fully mitigate against all these issues. The best a provider can do is ensure redundancy at the service and geographic level, so that if a particular resource is lost in one location, such as a router, power feed or data line, resources in another geography can be switched in.

Intel’s ‘Rangeley’ Atom chips were, and still are, used in hardware installed at exactly these kinds of critical locations and sites, so a loss of functionality is a serious matter.

Users of such equipment need to plan for and accommodate not only the possible future failure of any of these devices, but also, if a device does fail, how it will be replaced and within what time frame.

The core of the relationship between a paying customer and a business of high repute should be the ease of obtaining support and service from that vendor, especially when there is a declared, known issue affecting the device.

I can understand that a tier 1 OEM like Dell needs to manage its inventory and production levels to balance existing orders against buffer stock for repair programs, but given the price of these devices (>$10K AUD before options), a user should be able to request repair or replacement of affected devices whenever they please or need.

For an ordinary fault, the customer’s paid-for high-priority on-site warranty would cover any spot failure within a few hours, but Dell is refusing to allow any not-yet-dead switches to be exchanged under this paid warranty program; again, we must wait for the bat signal from Dell.

By then it will be too late: the damage will have been done.

Dell’s intermediate fix for the issue is to add “early warning detection” of imminent failure via a patch to their switch operating system. Vendors typically do not add telemetry to their systems and devices indicating a future fault; devices usually detect a fault after the fact and offer a remedy, or rely on scheduled maintenance. If a company like Apple added a readout to the iPhone stating ‘your phone is dying’, the zombie horde would grab their torches and pitchforks in revolt!
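On the administrator’s side, consuming such an early warning amounts to watching the switch logs for the new message. A minimal sketch, assuming syslog lines are collected centrally; the warning text and hostnames here are hypothetical, since the exact wording of Dell’s patched alert isn’t given:

```python
import re

# Hypothetical alert text; Dell's actual patched-firmware message may differ.
WARNING_PATTERN = re.compile(r"clock signal degradation detected", re.IGNORECASE)

def scan_syslog(lines):
    """Return any log lines matching the hypothetical early-warning message."""
    return [line for line in lines if WARNING_PATTERN.search(line)]

# Illustrative sample of collected syslog lines.
sample_log = [
    "Mar  1 04:12:09 core-sw-01 %SYS-4-CLK: Clock signal degradation detected on SoC",
    "Mar  1 04:12:10 core-sw-01 %LINK-5-UP: Interface Te1/0/1 up",
]

for hit in scan_syslog(sample_log):
    print("early warning:", hit)
```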

For administrators, the only sure solution is to purchase a new switch at their own cost, schedule an outage, swap the old switch for the new one, and store the old, soon-to-fail switch for future replacement. This requires capital investment that some may not be able or willing to commit to.

In our case, as many of our switches have reached or exceeded the 18-month mark, they are in the danger zone. If there is an interruption in power, it is possible a switch will not come back after the outage, leaving the network in a degraded, suboptimal state. The fact that they are heavily operated 24x7 only further degrades the product. All of this could have been avoided if Dell had simply allowed blanket replacement of devices without any assigned priority.

It is unreasonable to expect the purchaser of a good to replace it at their own expense when the vendor has publicly admitted and disclosed that the item has known issues or faults.

Again, to use Apple as an analogue: when a repair program for an iPhone or Mac is announced, no devices are prioritised, and all devices falling within a specific series or time frame are equally eligible for repair regardless of age or mileage.

What should vendors do about the problem, then, or what should they do differently? All the manufacturers had to do was allow end users to RMA their devices on an ad-hoc basis. Since production was fixed in February, the number of potentially affected units in the field is limited, and users should not expect a fast turnaround.

All manufacturers concerned should also replace customer devices that are out of warranty. An active warranty plan should not be required to repair devices with declared issues.

Intel should also step in to help end users who cannot get service for their devices.