Posted
by
timothy
on Sunday October 11, 2009 @04:29AM
from the oh-well-enough-said dept.

Expanding on the T-Mobile data loss mentioned in an update to an earlier story, reader stigmato writes "T-Mobile's popular Sidekick brand of devices and their users are facing a data loss crisis. According to the T-Mobile community forums, Microsoft/Danger has suffered a catastrophic server failure that has resulted in the loss of all personal data not stored on the phones. They are advising users not to turn off their phones, reset them or let the batteries die in them for fear of losing what data remains on the devices. Microsoft/Danger has stated that they cannot recover the data but are still trying. Already people are clamoring for a lawsuit. Should we continue to trust cloud computing content providers with our personal information? Perhaps they should have used ZFS or btrfs for their servers."

So are we saying Microsoft didn't have a backup? What about an offsite backup? Who wants to bet they were using their own backup solution? If they had a decent storage array, they could have had snapshots and offsite replicas to restore from.

Either this is a really, really serious meltdown which completely killed not only the server but all their backups as well (and what're the chances of that?), or their IT guys have been really, really slack and just didn't make any backups...

Guess they should have used a better smartphone, like *anything* else on the market... Even the cloud-centric Pre will still work if you don't have access to the Cloud - even if Google and/or Palm dies, you'll still have all your information on your phone! Jesus... Doesn't inspire confidence...

Using ZFS alone would just create a different single point of failure, one which is also relatively complex and therefore does not provide satisfactory disaster recovery options. Redundancy should be provided by independent systems, ideally implemented differently even though they serve the same function. For example, it's pretty useless to have two fibers coming into a facility if those fibers take the same route or are even in the same bundle: a backhoe will get both of them. Likewise, an implementation bug in a filesystem will very likely affect both redundant stores; even two separate instances of the same filesystem share that flaw. Storage systems should keep redundant data on separate systems with different filesystems. Then the single point of failure is the splitter which sends the data to both storage systems. A failure at that point does not destroy the data, it only affects your ability to access it, and due to its low complexity it's also a component which can easily be replaced.
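The "splitter" idea can be sketched as a dual-write layer. Everything below (the class names, the two toy backends) is hypothetical illustration, not anyone's actual storage stack; the point is only that the two copies are produced by deliberately different implementations:

```python
# Hypothetical sketch: every write fans out to two independently
# implemented stores, so a bug in one implementation does not
# corrupt the other copy.
class DictStore:
    """First backend: a plain in-memory mapping."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data[key]

class ListStore:
    """Second backend: deliberately different (append-only log)."""
    def __init__(self):
        self._log = []
    def put(self, key, value):
        self._log.append((key, value))
    def get(self, key):
        # Last write wins.
        for k, v in reversed(self._log):
            if k == key:
                return v
        raise KeyError(key)

class Splitter:
    """Low-complexity fan-out: a failure here blocks access, not data."""
    def __init__(self, *stores):
        self.stores = stores
    def put(self, key, value):
        for s in self.stores:
            s.put(key, value)
    def get(self, key):
        # Read from the first store that still has the data.
        for s in self.stores:
            try:
                return s.get(key)
            except KeyError:
                pass
        raise KeyError(key)
```

Writing through the splitter leaves a recoverable copy even if one store is wiped, which is exactly the property a shared filesystem bug would take away.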

This is one reason why our corporate policy requires us to validate backups for every system on a regular basis (meaning a full restore from a tape recalled from off-site), where the frequency is directly proportional to the criticality of the system. The more critical, the more often we test. On our iSeries, they restore the weekly backup tape EVERY week on the QA server - both to refresh it AND to validate the backups. We also have a quarterly 'random' test where a system is chosen at random and must be recovered from bare metal using only our standard procedures plus the backup tape.
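The validate-by-restoring principle can be sketched in miniature: back a directory up, restore it somewhere else, and compare checksums. This is an illustration only (real tape restores are far more involved); the function names and the archive-based "tape" are made up:

```python
# Hypothetical restore-validation pass: a backup only counts once a
# restore of it has been verified against the source.
import hashlib
import os
import shutil
import tempfile

def tree_digest(root):
    """Hash every file under root (relative path + contents)."""
    h = hashlib.sha256()
    for dirpath, _, names in sorted(os.walk(root)):
        for name in sorted(names):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                h.update(f.read())
    return h.hexdigest()

def validate_backup(source, backup_fn, restore_fn):
    """Run backup_fn, restore the result elsewhere, compare digests."""
    with tempfile.TemporaryDirectory() as restored:
        archive = backup_fn(source)
        restore_fn(archive, restored)
        return tree_digest(source) == tree_digest(restored)
```

A scheduled job that runs something like this (against real restore tooling) and alerts on a mismatch is the automated version of pulling the off-site tape every week.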

We've discovered all kinds of strangeness with backup tapes through the years. Our Tier 1 systems have completely separate instances in geographically diverse areas, with data-replication.

Ever try to restore from a ZFS corruption? It IS easy and it can be done. However...

What if the data was on an EMC storage array and the tech told them it's all lost? What if you're dealing with a Tier 1 vendor (I am looking at you, Dell EqualLogic) that swears UP and DOWN that there is no way to recover the system after a second drive out of a RAID 5 has been pulled? Hell, try just a standard RAID 5 card from a Tier 1 vendor. (Not talking about calling the likes of 3ware support directly; they are honestly good, and I've recovered a few arrays with them.)

I "suspect" that they were running it off a storage array that failed big time, or lost the LUN, or the server just decided to die. There is just too much we don't know. Was Danger installed on multiple servers? Was it clustered? Is it a cloud system? Does it run its own storage system, or does it require additional hardware?

But you know what? ZFS, EMC, even Windows 2008: all moot. Why? WHERE ARE THE TAPE BACKUPS?!?! SERIOUSLY. The ONLY way they could have lost ALL that data is if they didn't have a backup solution. Otherwise their "press release" would say "...however, we will be restoring the data from last week's/month's tapes..."

I do like how they keep saying "Microsoft/Danger", as if they are at fault. A good admin would expect that even a new car might catch fire and run into a bus full of nuns.

Dubious backups? Depends. We had a system, a 6TB cluster, that was notoriously difficult to back up. This went on for years: it took too long, failures caused issues downstream, etc. Then someone took a moment to realise that the application was not capable of re-using that 6TB of data if it was restored - once the data came in, it was processed and archived. To recover the application, all they had to do was back up a few gigs of config and binaries and restart slurping data from upstream. Voilà - the backup was stripped down to nothing, 6TB a day less data to back up, and next to no failures since the backup was now so quick.

Then there is the case of an application which the vendor and application developer signed off on backing up via a daily BCV snapshot. What they failed to tell us was that the application held data not only in a database, but also in a 6GB binary blob file buried deep in the application filesystem. If the database and the binary were out of sync in any way, it could mean missed or replayed transactions, or a generally inconsistent application. As this was an order management platform, that was bad. You can guess the day we found out about this dependency... yup, data corruption. Bad vendor advice screwed the binary file, and all we had to go on was a backup some 23 hours old in which the database was backed up an hour after the application. Because of a corresponding database SNAFU, the recovery point was actually another day before that, with the database having to be rolled forward. It was at this point we found out that despite the signed-off backup solution, the vendor's documented recommendation (which was not supplied to us) was that the only good backup was a cold application backup - not possible on a core order platform. Thankfully, after some 56 hours of solid work, the application vendor managed to help sort the issue out, and the restore from backup was not actually needed. The backups were never really tested because the DR solution worked on SRDF - data corruption was never really part of the DR design (at a very high level, not just on this platform).

So there you have it. Two dubious Enterprise backups - one not needed, the other not usable.

MS was misleading T-Mobile about the state of Sidekick support, and apparently charging hundreds of millions every year for, and I quote "a handful of people in Palo Alto managing some contractors in Romania, Ukraine, etc". This is apparently because most of the Sidekick devs had either moved to Pink or quit out of disgust.

There are plausible reports as to how this happened here [hiptop3.com].

tl;dr - They tried upgrading their SAN without making a backup first, and the upgrade somehow hosed the entire SAN.

That's the thing that has always worried me most about SANs: you have all your eggs in one basket. No matter how redundant or reliable the hardware is, one bad update or trigger-happy admin can cause the instant loss of all your data. That's only slightly better than having your data center burn down. You still have your hardware, but a total restore like that can be a nightmare. I've heard somewhere that 80% of corporations couldn't recover from a scenario like that.

Here are some fun numbers: a typical tape restore runs at something like 70MB/sec per tape drive, if you're lucky. Some small low-end SANs that I see people buying these days are 10TB or bigger. At those speeds, it takes 40 hours to restore the complete system. What's worse is that it doesn't scale all that well either: you can add more drives, but the storage controllers and back-end FC loops become the limit. In a big cloud provider scenario, a complete restore could take days, or even weeks.
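The arithmetic behind that 40-hour figure, using the numbers from the comment above (70 MB/s per drive, a 10 TB array, decimal units):

```python
# Back-of-the-envelope restore time for a single tape drive.
TAPE_MB_PER_SEC = 70        # optimistic streaming rate per drive
ARRAY_TB = 10               # small low-end SAN

total_mb = ARRAY_TB * 1_000_000          # 1 TB = 10^6 MB (decimal)
hours = total_mb / TAPE_MB_PER_SEC / 3600
print(f"{hours:.0f} hours")              # prints "40 hours"
```

Doubling the drive count halves this only until the controllers and FC loops saturate, which is the scaling limit the comment is pointing at.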

What's scary is that mirroring or off-site replicas don't help. If your array starts writing bad blocks, those will get mirrored also.

A) The Sidekick apparently doesn't store anything, so customers can't make backups that easily, even if they wanted to, and

B) Danger designed this phone to store everything server-side. It is incomprehensibly foolish not to include a SUPER SOLID backup strategy as well. This problem has been ongoing for several days now; I don't know if the data was fine at the onset of the problem, but the infuriated customers have every right to demand everything AND the kitchen sink for losing practically everything they had.

That's why you have logical redundancies. I work for a Fortune 10 company, and this is standard practice for all mission-critical applications. The application has to be geographically redundant, with an install base at at least 3 data centers (ATL, SEA and DLS), and different SAN technology at each DC. All Oracle databases have two physical Data Guard configurations with 4-hour and 8-hour latency (to guard against user errors), and all J2EE apps are hard-configured to switch connections from one DB to the other almost on the fly, or with a reboot. Some really, really critical databases have all this plus transaction duplication via GoldenGate to remote databases to offload reporting queries. We have had issues where SAs screwed up allocating LUNs and ended up f*cking up the file systems, but we recovered in every scenario, even a 30 TB DB restore over 2 days.
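The app-side connection switching described there can be sketched generically. The connector callables below are stand-ins, not real Oracle or Data Guard APIs; the shape is just "try the primary, fall back to the standby":

```python
# Hypothetical failover wrapper: return the first endpoint that
# accepts a connection, in priority order.
def connect_with_failover(connectors):
    """connectors: list of (name, connect) pairs, primary first.
    Returns (name, connection) from the first success; raises if all fail."""
    errors = []
    for name, connect in connectors:
        try:
            return name, connect()
        except ConnectionError as exc:
            errors.append((name, exc))   # remember why each endpoint failed
    raise ConnectionError(f"all endpoints failed: {errors}")
```

Real deployments put this behind a driver or pool configuration rather than application code, but the priority-ordered retry is the same idea.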

It's amazing that a consumer-facing company like T-Mobile risked itself by hosting their application on a Microsoft platform. Furthermore, where is the DR in all this? Who in their right mind fiddles with a SAN without confirming a full backup of all applications/databases? It appears that Hitachi and Microsoft are at fault here (if SAN maintenance is the root cause of this failure), but T-Mobile is the fool for allowing these companies to ruin their data. Not only will there be no consequences for MS or Hitachi; T-Mobile will be pouring in more money to fly in the MS and Hitachi consultants.

In our environment, a large government shop, our data volumes are capped at around 1 TB of storage for that very reason. Between the SAN and the tape backups, they simply have to create a physical cutoff point for data storage because of those onerous recovery periods.

There is nothing wrong in our shop with having TWO 1 TB volumes, but you will never get approval for a single 2 TB volume. Problem solved... at least for file storage. Database backups are managed via other mechanisms, like replication.

I work in telecom at a different provider. SAN upgrades are performed by the SAN vendor and, IME, they always demand a complete backup prior to starting any work unless the customer demands otherwise. If the customer doesn't want the backup, we always had to get a Sr VP to sign off. There were about 10 Sr VPs in the company - not like at a bank where everyone is a VP.

Usually, we would perform firmware upgrades only when migrating from old SAN equipment to new. The old equipment would be upgraded and used to replace either a lower-performing SAN or directly attached disk arrays that had been neglected for 5+ years. We avoided running equipment out of warranty; most data is too important to risk that.

BTW, we measured storage in petabytes, and our storage team was **never** on the cutting edge. We were always 2+ years behind other BIG companies. Our labs might have this quarter's latest and greatest, but it would take years to get from the lab into production service. That drove some vendors nuts, but not the "names you know."

I saw where someone above said they randomly verified recovery quarterly. What a joke. On my systems (I'm a Sr. Tech Arch), we deployed redundant systems at least 500 miles apart. Many systems did have instant failover, but where instant failover was not possible due to the amount of data, we would **never** lose more than 24 hours' worth of data. Between RAID-10, near-disk backups, tape backups, remote replication and backups at the alternate location, we had the data. Further, to verify that the alternate system worked, we swapped primary production locations every week. I and my internal customer slept very well, thank you.

I have a good friend who works at T-Mobile on their architecture design team. It will be interesting to see whether this subcontractor had anything to do with the issues. I called T-Mobile about an unrelated personal item on Tuesday; they were already swamped with calls and said that a sub to Microsoft was working the issue. I'm thinking MS outsourced/bought the provider and the garage-shop team was still running things - but I don't know. I do know that Microsoft has excellent engineers for systems like this, and they are more cautious than Google with their upgrades and deployed systems. Over the years, I've had to deploy a few Windows-Server-based solutions, usually for voice response systems, and I was never really happy doing it. I don't trust backup systems much unless it's really a mirror from which I can easily get at one file from three weeks ago.

Ok, back to upgrading the company email servers. A system version upgrade will impact users for less than 10 minutes - probably under 3 minutes, but we like to under promise and over deliver.

The kind of filesystem can help - I'm familiar with ZFS concepts, so I'll stick to those:

In ZFS, when you write to a file you don't overwrite the pre-existing data; you write elsewhere, and the new location gets mapped in upon success. The old data is still there, and you can see the aged mapping (you know what was there). At this point the old space can be recycled, but you can switch that pruning off, giving you a complete record of everything that was ever done on the disk. To stop it ever running out of space, you can either add disks to the pool or prune only very old data (older than a given age - maybe 6 months?).
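The copy-on-write behaviour can be illustrated conceptually. This toy class is in no way ZFS's actual on-disk mechanism; it only shows the property that writes never destroy prior state until pruning is explicitly invoked:

```python
# Conceptual copy-on-write sketch: writes append a new version
# rather than overwriting, so every prior state stays readable
# until it is deliberately pruned.
class CowFile:
    def __init__(self):
        self._versions = []              # each write lands in a new slot

    def write(self, data):
        self._versions.append(data)      # old versions are untouched

    def read(self, version=-1):
        """Read the latest version by default, or any older one."""
        return self._versions[version]

    def prune(self, keep_last):
        """Recycle space: drop all but the newest keep_last versions."""
        self._versions = self._versions[-keep_last:]
```

With pruning switched off (never called), the structure is the "complete record" the comment describes; adding disks corresponds to letting the version list keep growing.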

According to this comment post [engadget.com] on Engadget, it was a contractor working for Danger/Microsoft who screwed up a SAN upgrade and caused the data loss. Obviously, take this with a grain of salt until it's substantiated:

"I've been getting the straight dope from the inside on this. Let me assure you, your data IS gone. Currently MS is trying to get the devices to sync the data they have back to the service as a form of recovery.

It's not a server failure. They were upgrading their SAN, and they outsourced it to a Hitachi consulting firm. There was room for a backup of the data on the SAN, but they didn't do it (some say they started it but didn't wait for it to complete). They upgraded the SAN, screwed it up and lost all the data.

All the apps in the developer store are gone too.

This is surely the end of Danger. I only hope it's the end of those involved who screwed this up and the MS folks who laid off and drove out anyone at Danger who knew what they were doing.

The second problem is believing the tech when he says the data cannot be reclaimed.

The third problem is using a simple raid 5 volume on a great deal of data. Multiple drives fail all the time! Hell, racks of servers fail in unison.

Even if the DCB data is corrupted, this can be corrected, even on a large SAN.

All or part of the data is generally recoverable.

Either this was an impossibly, horribly managed install, or something very complex happened. Generally, the more severe incidents are caused by multi-faceted failures, not something as simple as lost array data.

From the sound of it, Microsoft couldn't turn Danger into a WinMo platform, so they gutted it of employees instead of spinning it back off, since they'd rather have it dead than spreading more Java, but not dead before they had Pink out the door. So when you fire everyone from the top downward, you end up with people whose job is to turn the lights off when the doors get locked for good. They're not very motivated, nor are they skilled in everything that used to be required to run the shop. Auto-pilot mode comes to mind.

So maybe the backup system needed to be checked, or a cron job verified, or maybe the computer in Joe Fired's office was part of the backup process in some small way, but important enough that the whole job was failing every night.

As I said, Microsoft tried to replace the Danger stack with Microsoft software, but it wasn't going to work, or got too much backtalk (thinking of Softimage) and threats of everyone leaving if they had to port to the WinMo pile/stack. They moved anyone who'd go over to Pink and left the rest to keep the life support systems running. Oops, they failed.

With Ballmer publicly saying that WinMo has been a failure, hearing the press call WinMo 6.5 a yawn amid expectations that the Sony PS3 will eclipse the Xbox, and recently reading about how he's telling people that IBM doesn't know what they are doing... there's probably a new monkey-boy dance going on inside his office that we'd love to see. It might be too dangerous being close enough to record it.

Will Microsoft ever make any profit from anything outside of MS Windows and MS Office? Ballmer's 8-Ball still seems to be telling him something very different from what everyone else is seeing.

It's not a backup unless you can prove it will restore. Until then, it's just a waste of tape (or disk) and time.

True. There's a similar problem in biological research, where people think they have secured frozen samples but they haven't tested whether the samples are valuable after thawing. For example, frozen cells might not be viable, or RNA might be degraded. Too often the samples are just wasting freezer space. Anybody can freeze (or backup), the question is whether what you thaw (restore) is valuable.

I agree. Weekly? WEEKLY?!!! What is this... 1980? Hell, even in 1980, people with critical data in their Apple II spreadsheets kept more than one copy of their data on a daily basis.

I'm not sure why, but one of our customers had a backup daemon running with only incrementals being done: one full backup two years ago and an incremental every night. Well... they had a computer fry one weekend. It was a crappy Windows backup program with only a point-and-click interface, and no way in hell was I going to sit there for days clicking restore on 600+ individual backups. So I wrote a pretty cool little Windows script using AutoIt3. It was a real PITA to write, though, since every button click had to have a "wait-for-next-window" sequence. After five days of the restore script running, they were back in business.
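Stripped of the GUI clicking, a full-plus-incrementals restore is just an ordered replay. The sketch below is generic (the `apply` callable stands in for whatever the real restore tool does; filenames are made up), but it shows why order matters: the full backup first, then every incremental, oldest to newest:

```python
# Hypothetical batch-restore driver: replay the full backup, then
# each incremental in chronological order.
def restore_chain(full, incrementals, apply):
    """apply() is invoked once per backup set, full first.
    Returns how many sets were applied."""
    apply(full)
    for inc in sorted(incrementals):     # oldest first
        apply(inc)
    return 1 + len(incrementals)
```

With 600+ incrementals, a loop like this (driving the vendor's CLI, if it has one) replaces days of manual clicking; the AutoIt3 script in the story was effectively emulating it through the GUI.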

Since then I've gone through every customer's system and made sure they have full backups done weekly and incrementals done daily. And we also do routine backup testing.

A good quote: "A backup is not a backup until you try to restore from it."

At the very least they should have been segmenting customer data. How could a single failure short of a ten-mile-wide asteroid hit wipe out all customer data? Was everything stored in a single giant registry?
I see this as one of the single greatest failings in current system design. Top professionals trust tools more than data design and management processes; I would say the same thing if they were using ZFS or btrfs. Technology is NOT a solution. Technology is at most a tool that contributes to an overall solution. Without proper automated control systems and at least some form of manual verification, reliance on pure technology solutions is little more than blind faith.
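The segmentation argument can be sketched as simple hash-based sharding: each customer maps to one of several independent stores, so a single-store failure loses at most one shard's worth of data. The shard names below are made up for illustration:

```python
# Hypothetical customer sharding: a stable hash assigns each customer
# to one independent backing store, bounding the blast radius of any
# single-store failure.
import hashlib

SHARDS = ["store-a", "store-b", "store-c", "store-d"]

def shard_for(customer_id):
    """Stable mapping from customer to shard.
    Uses sha256 rather than Python's hash(), which varies per process."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return SHARDS[digest[0] % len(SHARDS)]
```

The mapping has to be deterministic (the same customer must always land on the same shard), which is why a seeded cryptographic hash is used instead of anything process-dependent.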

The problem is that with all the data that goes onto tape, the relatively small chance of errors ends up getting magnified. However, backup policies, tape rotations, RAIT (at the high end), and different backup methods minimize the damage errors can do.
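That magnification effect is easy to quantify. The 1% per-tape figure below is an assumed number purely for illustration, not a measured tape failure rate:

```python
# How a small per-tape error probability compounds across a tape set.
per_tape_failure = 0.01      # assumption: 1% chance any given tape is bad
tapes = 200                  # size of the backup set

# Probability that at least one tape in the set fails on restore.
p_at_least_one_bad = 1 - (1 - per_tape_failure) ** tapes
print(f"{p_at_least_one_bad:.0%}")   # prints "87%"
```

This is exactly why rotations, RAIT, and mixed backup methods exist: they keep any single bad tape from being fatal, rather than trying to drive the per-tape rate to zero.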

The current millennium has only been around for nine years and ten months. (Eight years and ten months, if you are a traditionalist and think the Nineties ended in 2001.)
Then again, good backup policy predates computers. If Microsoft/Danger had the same dedication to backing up valuable documents as monasteries did back in the 1000s, this sort of mess wouldn't have happened.