To start with

Disclaimer

The opinions expressed here are my personal opinions. Content published here is not read or approved in advance by EMC and does not necessarily reflect the views and opinions of EMC.
This is my blog, it is not an EMC blog.

April 2011

April 29, 2011

The technology behind FastClones is file-based snapshots (versioning) within the NFS datamover. We get to this functionality via DHSM API calls to the datamover. However, a new twist on the FastCloning method is to use it for protecting a virtual machine. This was traditionally done via snapshots of whole datastores, an approach that brought out real scaling challenges for any vendor. The versioning capability in EMC Unified arrays allows EMC customers to take snapshots at the file level, creating file-level snapshots of VMDK files that are immediately available for use (FastClone) or saved for future use (a traditional snapshot, but at the file level).
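To see why a file-level snapshot is "immediately available," here's a toy sketch of copy-on-write semantics in Python. This is purely illustrative, not the DHSM API: the class and names are invented, and a real array tracks blocks on disk, not in lists. The point is that creating the clone copies no data; blocks only diverge when one side writes.

```python
# Conceptual sketch (NOT the actual DHSM API): a FastClone is instant
# because the clone initially shares the source file's blocks and only
# copies a block when one side writes to it (copy-on-write).

class FileSnapshot:
    """Toy copy-on-write clone of a file's block list."""

    def __init__(self, blocks):
        # Creating the clone just references the existing blocks;
        # no data is moved, which is why it completes immediately.
        self.blocks = list(blocks)

    def write(self, index, data):
        # Only the written block diverges; everything else stays shared.
        self.blocks[index] = data

vmdk = ["blk0", "blk1", "blk2"]   # the source VMDK's blocks
clone = FileSnapshot(vmdk)        # FastClone: O(1), no data copied
clone.write(1, "blk1-modified")   # divergence happens lazily, on write

print(vmdk)          # source file is untouched
print(clone.blocks)  # clone sees its own modified block
```

The same mechanism serves both uses the post mentions: write to the clone and it's a usable FastClone; leave it read-only and it behaves like a traditional point-in-time snapshot of that one file.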

Clinton has started programming to the VNX DHSM API and is giving people deeper access to Fast/Full Clones. Which is cool, because someone had to do it first in order to make the case that we need to open more of these things up to end users, by showing that if we do, they'll use them. Turn his stuff inside out and send him your feedback.

April 23, 2011

Until this past week, there's been a mostly silent war raging out there between two dueling architectural models of cloud computing applications: "design for failure" and traditional. This battle is about how we ultimately handle availability in the context of cloud computing.

The Amazon model is the "design for failure" model. Under the "design for failure" model, combinations of your software and management tools take responsibility for application availability. The actual infrastructure availability is entirely irrelevant to your application availability. 100% uptime should be achievable even when your cloud provider has a massive, data-center-wide outage.

Most cloud providers follow some variant of the "design for failure" model. A handful of providers, however, follow the traditional model in which the underlying infrastructure takes ultimate responsibility for availability. It doesn't matter how dumb your application is, the infrastructure will provide the redundancy necessary to keep it running in the face of failure. The clouds that tend to follow this model are vCloud-based clouds that leverage the capabilities of VMware to provide this level of infrastructural support.

The advantage of the traditional model is that any application can be deployed into it and assigned the level of redundancy appropriate to its function. The downside is that the traditional model is heavily constrained by geography. It would not have helped you survive this level of cloud provider (public or private) outage.

The advantage of the "design for failure" model is that the application developer has total control of their availability with only their data model and volume imposing geographical limitations. The downside of the "design for failure" model is that you must "design for failure" up front.
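To make the "design for failure" model concrete, here's a minimal sketch of what it means for the application, not the infrastructure, to own availability. Everything here is illustrative: the region names are real AWS region identifiers but the `fetch_from` function is a stand-in for whatever service call your app makes, and the simulated outage is hypothetical.

```python
# "Design for failure" sketch: the application routes around a regional
# outage itself, instead of trusting the infrastructure to stay up.

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

def fetch_from(region, request):
    # Placeholder for a real service call; here we simulate the
    # April 21st scenario where one region is down.
    if region == "us-east-1":
        raise ConnectionError(f"{region} is down")
    return f"{request} served from {region}"

def resilient_fetch(request):
    # Try each region in turn; the app survives any single-region outage.
    last_error = None
    for region in REGIONS:
        try:
            return fetch_from(region, request)
        except ConnectionError as err:
            last_error = err
    raise RuntimeError("all regions failed") from last_error

print(resilient_fetch("GET /status"))  # answered by a surviving region
```

A traditional application deployed into one region has no equivalent of `resilient_fetch`; when that region goes away, so does the application. That asymmetry is the whole argument.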

The rest of George Reese's post is as succinct on the AWS matter as anything I've read, and I'd advise you to read it. To make it even simpler: if it used to be on a physical system and you're installing it or P2V-ing it into a VM, you're probably traditional. If you're developing and deploying it with Cloud Foundry, you're probably designing for failure.

I'd also ask that you gaze in horror at the following post on the AWS Forums where a Cardiac Monitoring System was deployed into AWS and was then blown offline when the outage occurred.

They've been unable to monitor the status of their at risk cardiac patients for the past two days.

This is clearly a case where the service in question mistook AWS for the traditional model at a cheaper price, or was told it was, and only found out otherwise on the 21st. That, or they were blinded by the race-to-the-bottom pricing they were getting, and figured that if it was good enough for the big guns going from industry show to industry show prognosticating about how great it is and how they've "reinvented IT," then they wanted in.

While the criticality of the app is one thing, the fact that they had to go begging on the support forums is terrifying. There needs to be some upfront education here: if you don't have the engineering resources, development time and funding to design for failure, you need to understand that with AWS you're on your own.

It is clear there are no individual customers, there is only the system. And with the system we're back to what I said about the nines in my last post: you can either do the extra work for them or you can't.

April 21, 2011

Not that I have any sympathy for Amazon Web Services (I don’t) but it appears to me that they’re going to keep making headlines with sites being blown offline until someone sits down and explains that just because you move your workload into Amazon’s cloud doesn’t mean you get to walk away from DR planning.

Maybe it has to do with cost, people seeing downtime as an acceptable trade-off for race-to-the-bottom pricing. Maybe it has to do with the fact that a lot of AWS users are software developers rather than infrastructure people, to whom these thoughts occur naturally and who dream in Visio diagrams, not code. Regardless, it wasn't as if AWS just collapsed. It didn't. Just one location out of a few did.

If you can’t answer the question “What happens to my workload when what it’s running on goes away, at any moment, for any reason?” then you’re doing a half-assed job, or you’ve decided that it doesn’t matter if it goes away so long as it comes back within a window of time you can live with.

The answer to the question should be either “As this, this and this detect the event, this, this and this mitigate it and we keep running” or “We’re offline until everything comes back.”

I’ve said this already, but when you hear about three (10.1 minutes a week), four (1.01 minutes a week), five (6.05 seconds a week) and six (0.605 seconds a week) nines of availability, those aren’t things that come free the moment your cloud provider swipes your credit card.
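The downtime budgets above fall straight out of the arithmetic: N nines of availability allows a downtime fraction of 10^-N of the period you're measuring. Computing it over a week:

```python
# Weekly downtime budget for N nines of availability:
# allowed downtime = week_length * 10**-N

WEEK_SECONDS = 7 * 24 * 60 * 60  # 604,800 seconds in a week

for nines in (3, 4, 5, 6):
    allowed = WEEK_SECONDS * 10 ** -nines  # seconds of downtime per week
    print(f"{nines} nines: {allowed / 60:.2f} min/week ({allowed:.3f} s)")
```

Run it and you get 10.08 minutes, 1.01 minutes, 6.048 seconds and 0.605 seconds respectively, matching the rounded figures quoted above. Six nines buys you barely half a second of weekly downtime, which is why each additional nine costs so much.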

Each nine costs money, and the more of them you want, the more you’ll pay in your time and your Disaster Recovery planning and engineering effort.

April 18, 2011

New quarter, new release. It was hardware in Q1; now it’s backup software with Avamar 6.0.

A bit of history. When EMC first acquired Avamar it was clear that delivery as an integrated appliance was the best route forward. Drop it on a floor tile, give it power and network connectivity, IP the system and then hit the button marked “GO!” This has been a successful strategy, so much so that it’s been copied (badly) by Avamar competitors. Avamar Virtual Edition has been popular for smaller environments where it’s possible to leverage an existing investment in VMware Virtual Infrastructure. As well as the new system software that we’re announcing today, we’re also introducing the fourth generation of Avamar Data Store, with up to 124TB of usable storage capacity for backup data deduplicated at the source.

Gen4 is the performance & scale improvement you’d expect from a Data Store upgrade, but in the era of Big Data and considering BRS also owns the fastest single controller deduplication storage systems on the market we could do better.

So we did.

Moving yet another step closer to the integration of key BRS technologies, Avamar can now use one or more Data Domain storage systems as backup and recovery storage for Avamar backups. It doesn’t matter if either the Avamar Data Store or the Data Domain system is already in use; with Avamar 6.0 you pretty much just point them at one another and away you go.

Avamar 6.0 with Data Domain increases the overall addressable storage capacity and facilitates the high speed deduplicated backup and recovery of large structured data sets (Email, DBs, VMware images, etc) and sets us up well for a future of requiring even more capacity and even greater throughput.

Sending someone a download link or giving them some optical media with binaries on them is no longer enough. EMC has been in the business of delivering integrated software and hardware for 12 years now and it’s where the market has moved.

Let’s talk about new software functionality.

Through its integration with VADP, Avamar already used Changed Block Tracking (CBT) on backup of VMware image files; now it supports CBT on restore, which can dramatically reduce the ‘time to first ping’. That is, the time when the restore has completed, the VM has booted, and it’s reachable on a network.
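Here's a toy sketch of why a CBT-based restore is so much faster than a full image restore. This is illustrative only, not the VADP API: in the real workflow CBT hands the backup software the changed-block list for free, whereas here we diff by hand, and the "blocks" are just list entries.

```python
# Sketch of CBT-on-restore: instead of rewriting every block of the VMDK,
# copy back only the blocks that diverged from the backup image.

def changed_blocks(current, backup):
    # Stand-in for the changed-block list CBT tracks for you.
    return [i for i, (c, b) in enumerate(zip(current, backup)) if c != b]

def cbt_restore(current, backup):
    dirty = changed_blocks(current, backup)
    for i in dirty:
        current[i] = backup[i]  # copy only what diverged
    return len(dirty)           # blocks actually transferred

disk   = ["a", "X", "c", "Y"]  # VM disk after corruption
backup = ["a", "b", "c", "d"]  # last known-good image backup
moved = cbt_restore(disk, backup)
print(disk, moved)  # full image recovered by copying just the dirty blocks
```

With most blocks unchanged since the last backup, the restore transfers a small fraction of the disk, which is exactly what shortens the time to first ping.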

Expanded application support brings things like support for Oracle RAC and granular object level backup and recovery of the latest versions of Exchange & SharePoint. Three gaps which needed closing.

There’s a lot more but for that I’d advise looking at the release notes on Powerlink.

All in all Backup Recovery Systems has a lot going on, what we’ve announced today is just part of it. There’s still time to win a free pass and come see us at EMC World where we’re running deep dive technology sessions, hands on labs and talking about our view of backup and recovery during the Blockbuster Backup Solutions keynote with EMC BRS Vice President BJ Jenkins and CTO Stephen Manley.

April 14, 2011

Just like the title says we have a free EMC World ticket to give away to a customer who enters the competition.

Go to the EMC Community Network here, follow the instructions and you're in with a chance of winning free entry to a show with thousands of your peers and hundreds of EMC Engineers. All that and new products, hands on labs, technical/birds of a feather sessions and the industry famous "hangover recovery bean bags".

Chad's World will be live but no confirmation yet if it'll feature Chuck Hollis & The Big Data Band. You'll just have to show up and see.

April 02, 2011

Let’s first make sure everyone is on the same page. The number of enterprises hit by APTs grows by the month; and the range of APT targets includes just about every industry. Unofficial tallies number dozens of mega corporations attacked; examples are in the press regularly, and some examples are here, and here.

These companies deploy any imaginable combination of state-of-the-art perimeter and end-point security controls, and use all imaginable combinations of security operations and security controls. Yet still the determined attackers find their way in. What does that tell you?

The first thing actors like those behind the APT do is seek publicly available information about specific employees – social media sites are always a favorite. With that in hand they then send that user a Spear Phishing email. Often the email uses target-relevant content; for instance, if you’re in the finance department, it may talk about some advice on regulatory controls.

The attacker in this case sent two different phishing emails over a two-day period. The two emails were sent to two small groups of employees; you wouldn’t consider these users particularly high profile or high value targets. The email subject line read “2011 Recruitment Plan.”

The email was crafted well enough to trick one of the employees into retrieving it from their Junk mail folder and opening the attached Excel file. It was a spreadsheet titled “2011 Recruitment plan.xls.”

The spreadsheet contained a zero-day exploit that installs a backdoor through an Adobe Flash vulnerability (CVE-2011-0609). As a side note, by now Adobe has released a patch for the zero-day, so it can no longer be used to inject malware onto patched machines.

OK, back to the attack. As you know, the next step in a typical APT is to install some sort of a remote administration tool that allows the attacker to control the machine. In our case the weapon of choice was a Poison Ivy variant set in a reverse-connect mode that makes it more difficult to detect, as the PC reaches out to the command and control rather than the other way around. Similar techniques were reported in many past APTs, including GhostNet.

Having established remote access, the attacker in a typical APT now starts digital shoulder surfing to establish the employee’s role and their level of access. If this isn’t sufficient for the attackers’ purpose, they will seek user accounts with better, more relevant privileges. I’ve pieced together a separate blog post as an appendix, talking about the attack end-to-end and providing more data.

When it comes to APTs, it is not about how good you are once inside, but about using a totally new approach for entering the organization. You don’t just hack the organization and its infrastructure; you focus much more of your attention on hacking the employees.

One cannot stress enough the point about APTs being, first and foremost, a new attack doctrine built to circumvent the existing perimeter and endpoint defenses. It’s a little like stealth fighters: for decades you’ve based your air defense on radar technology, but now you have those sneaky stealth fighters built with odd angles and strange composite materials. You can try building bigger and better radars, or, as someone I talked to said, you can try staring more closely at your existing radars.

Uri has more details on the RSA attack over on his blog. I had an idea we had a breach when the corporate infrastructure rapidly began locking down while I was connected to it, but I had no idea where or what the severity was.

Regardless, this criminal investigation continues and the scary people from scary places still stalk the halls.