Tuesday, December 31, 2013

Last year, I made some predictions on cloud computing. Here's my self-analysis:

===================
1. OpenStack continues to gain traction but many early adopters bypass Folsom in anticipation of Grizzly.>> Correct. This was a gimme.

2. Amazon's push to the enterprise means we will see more hosted, packaged apps from Microsoft, SAP and other large ISVs. Their IaaS/PaaS introductions will be lackluster compared to previous years.>> Correct. It's interesting that the press failed to notice the lack of interesting stuff coming out of AWS. Has the law of diminishing returns already hit Amazon?

3. BMC and CA will acquire their way into the cloud.>> Incorrect. CA picked up Nolio (and Layer 7), BMC acquired Partnerpedia. These acquisitions are pieces to the puzzle - but are not large enough to serve as anchors for a cloud portfolio.

4. SAP Hana will quickly determine that Teradata isn't their primary competitor as the rise of OSS solutions matures.>> Incorrect. SAP Hana continued to kick butt in 2013 and the buyers of it have probably never heard of the large open source databases. What was I thinking?

5. Data service layers (think Netflix/Cassandra) become common in large cloud deployments.>> Partially Correct. We're seeing the cloud-savvy companies implement cross-region data replication strategies - but the average enterprise is nowhere near this.

6. Rackspace, the "Open Cloud Company," continues to gain traction but users find more and more of their services 'not open'.>> Correct. Rackspace continues to push a 'partially open' agenda - but users seem to be more than happy with their strategy.

7. IBM goes another year without a cohesive cloud strategy.>> Correct. The acquisition of SoftLayer was a huge step forward in having a strategy - but from the outside looking in, they still look like a mess.

8. Puppet and Chef continue to grow presence but Cfengine gets a resurgence in mindshare.>> Partially Correct. Puppet and Chef did grow their presence, especially in the large enterprise. I could be wrong, but I personally didn't see Cfengine get traction. That said, Ansible and Salt came out strong.

9. Cloud Bees, Rightscale, Canonical, Inktank, Enstratus, Piston Cloud, PagerDuty, Nebula and Gigaspaces are all acquired.>> Incorrect. I was right about Enstratus but some of these predictions were stupid (like Canonical). The others remain strong candidates for acquisition.

13. PaaS solutions begin to look more and more like orchestration solutions with capabilities to leverage SDN, provisioned IOPS, IAM and autonomic features. Middleware vendors that don't offer open source solutions lose significant market share in cloud.>> Incorrect. I believe this is still coming but for the most part the vendors aren't there.

14. Microsoft's server-side OS refresh opens the door to more HyperV and private cloud.>> Unsure. This should have happened but I have no data.

15. Microsoft, Amazon and Google pull away from the pack in the public cloud while Dell, HP, AT&T and others grow their footprint but suffer growing pains (aka, outages).>> Correct. Well - at least the part where AWS, Azure and Google pull away from the pack. Dell continues to frustrate me; I need to have a sit-down with Michael Dell.

16. Netflix funds and spins out a cloud automation company.>> Incorrect. Perhaps this was wishful thinking. I'm a Netflix OSS fanboy - but think that they're starting to fall into the same trap as OpenStack (aka, open sourcing the kitchen sink without strong product/portfolio management).

17. Red Hat focuses on the basics, mainly integrating/extending existing product lines with a continued emphasis on OpenStack.>> Correct. Red Hat appears to be taking a risk averse strategy... slow but methodical movement.

18. Accenture remains largely absent from the cloud, leaving Capgemini and major off-shore companies to take the revenue lead.>> Unsure. I'm unaware of any large movements that Accenture made in the cloud. The big move in the SI space was CSC acquiring ServiceMesh.

19. EMC will continue to thrive: it's even easier to be sloppy with storage usage in the cloud and users realize it isn't 'all commodity hardware'.>> Correct. That said, we're starting to see companies implement multi-petabyte storage archival projects with cloud companies.

20. In 2013, we'll see another talent war. It won't be as bad as dot-com, but talent will be tight.>> Correct. And it will get worse in 2014.

Thursday, January 03, 2013

The 2012 Christmas Eve outage at Amazon has people talking. The fuss isn't about what broke; it's about what Amazon said they're going to do to fix it. If you aren't familiar with their report, it's worth a quick read. If it's tl;dr, I'll sum it up: a developer whacked some data in a production database that made the load balancing service go haywire, and it took longer than it should have to identify the problem and restore it. (Did you see how I avoided the technical jargon??)

If you're Amazon, you have to start thinking about how to make sure it never happens again. Restore confidence... and fast. Here's what they said:

We have made a number of changes to protect the ELB service from this sort of disruption in the future. First, we have modified the access controls on our production ELB state data to prevent inadvertent modification without specific Change Management (CM) approval. Normally, we protect our production service data with non-permissive access control policies that prevent all access to production data. The ELB service had authorized additional access for a small number of developers to allow them to execute operational processes that are currently being automated. This access was incorrectly set to be persistent rather than requiring a per access approval. We have reverted this incorrect configuration and all access to production ELB data will require a per-incident CM approval. This would have prevented the ELB state data from being deleted in this event. This is a protection that we use across all of our services that has prevented this sort of problem in the past, but was not appropriately enabled for this ELB state data. We have also modified our data recovery process to reflect the learning we went through in this event. We are confident that we could recover ELB state data in a similar event significantly faster (if necessary) for any future operational event. We will also incorporate our learning from this event into our service architecture. We believe that we can reprogram our ELB control plane workflows to more thoughtfully reconcile the central service data with the current load balancer state. This would allow the service to recover automatically from logical data loss or corruption without needing manual data restoration.

Here's my question: If ITIL Service Transition (thoughtful change management) and DevOps (agile processes with infrastructure-as-code) were to mate, what would the outcome be?
A) A child that wanted to run fast but couldn't because of too many manual/approval steps
B) A child that ran fast but only after the change board approved it
C) Mate multiple times; some children will run fast (with scissors) others will move carefully
D) No mating required; just fix the architecture (service recovery)

This is the discussion that I'm having with my colleagues. And to be clear, we aren't talking about what Amazon could/should do, we're talking about what WE should do with our own projects.

Although there's no unanimous agreement, there have been some common beliefs:

1. Fix the architecture. I like to say that "cloud providers make their architecture highly available so we don't have to." This is an exaggeration, but if the cloud provider does their job right, we will have to focus less on making our application components HA and more on correctly using the provider's HA components. There's little disagreement on this topic. AWS screwed up the MTTR on the ELB. We've all screwed up things before... just fix it.

2. Rescind dev-team access. So this is where it gets interesting. Remember all that Kumbaya between developers and operators? Gone. Oh shit - maybe we should have called the movement "DevTestOps"! One simple mistake and you pulled my access to production?? LOL - hell, yea. The fact is all services aren't created equal. I have no visibility into Amazon's internal target SLAs - but I'm going to guess that there are a few services that are five-9's (or 5.26 minutes of down-time per year). Certain BUSINESS CRITICAL services shouldn't be working in DevOps time. They should be thoughtfully planned out by Change Advisory Boards, with Change Records and Release Windows executed by pre-approved Change Roles. Yes - if it's BUSINESS CRITICAL - pull out your ITIL manuals and follow the !*@$ing steps!
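The control Amazon describes - no standing production access, only per-incident, time-boxed approvals - is simple enough to sketch. This is a hypothetical illustration of the idea, not any real AWS or ITIL tooling; the names (`ChangeApproval`, `can_touch_production`) are mine:

```python
# Hypothetical sketch of per-incident production access control.
# Nobody holds persistent access; every request needs a live,
# incident-scoped approval that expires on its own.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class ChangeApproval:
    incident_id: str      # e.g. the CM/incident ticket number
    approved_by: str      # whoever on the change board signed off
    expires_at: datetime  # the access window closes automatically

def can_touch_production(approval: Optional[ChangeApproval],
                         now: datetime) -> bool:
    """Deny by default; allow only inside an unexpired approval window."""
    if approval is None:
        return False
    return now < approval.expires_at

# Usage: no approval means no access; an approval buys a bounded window.
now = datetime(2013, 1, 3, 12, 0)
assert not can_touch_production(None, now)

ticket = ChangeApproval("INC-1234", "change-board",
                        expires_at=now + timedelta(hours=2))
assert can_touch_production(ticket, now)
assert not can_touch_production(ticket, now + timedelta(hours=3))
```

The point of the sketch is the default: access is a temporary exception that decays, not a standing grant someone has to remember to revoke - which is exactly the misconfiguration AWS says bit them.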

Again - there's little disagreement here. People who run highly available architectures know that re-releasing something critical requires special attention to detail. Run the playbook like you're launching a nuclear missile: focus on the details.

To be clear, I love infrastructure-as-code. I think everything can be automated and it kills me to think about putting manual steps into tasks that we all know should run human-free. If your application is two-9's (3.65 days of down-time per year), automate it! Hell, give the developers access to production data - - you can fix it later! What about 99.9% uptime (8.76 hours)? Hmm... not so sure. What about 99.99% up-time (52.56 minutes)? Well, that's not a lot of time to fix things if they go wrong. But wait - if I did DevOps automation correctly, shouldn't I be able to back out quickly? The answer is Yes - you SHOULD be able to run your SaveMyAss.py script and it MIGHT work.
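The "nines" arithmetic above is worth writing down once, since people constantly misquote it. Using a 365-day year:

```python
# Allowed downtime per year at a given availability level.
def downtime_per_year(availability: float) -> float:
    """Return the yearly downtime budget, in minutes (365-day year)."""
    return (1.0 - availability) * 365 * 24 * 60

# The numbers quoted in the paragraph above:
print(round(downtime_per_year(0.99) / 60 / 24, 2))   # two nines  -> 3.65 days
print(round(downtime_per_year(0.999) / 60, 2))       # three nines -> 8.76 hours
print(round(downtime_per_year(0.9999), 2))           # four nines -> 52.56 minutes
print(round(downtime_per_year(0.99999), 2))          # five nines -> 5.26 minutes
```

Notice how each extra nine cuts your repair budget by 10x - which is the whole argument: at four or five nines, the window is too small for anything but a rehearsed, pre-approved playbook.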

For me, it's not an either/or choice between ITIL Transition Management and DevOps. IMHO, both have a time and a place. That said, I don't think that the answer is to inbreed the two - DevOps will get fat and be the loser in that battle. Keep agile agile. Use structure when you need it.