Saturday, March 12, 2011

How not to build a Private Cloud

It's all $, FUD, and internal politics. An MBO Cloud is what you get when the CEO tells the CIO to "figure out that cloud thing" (Management By Objective - i.e. the CIO bonus depends on it).

There is no technical reason for private cloud to exist.

[update: to clarify, that doesn't mean that I'm against private clouds or don't think they exist, because $, FUD and internal politics are a fact of life that constrain what can be done. Change also takes time and you have to "go to war with the army you have". However, this post is about what happens if your organization reallocates the $, isn't afraid, and has effectively no internal politics getting in the way.

This post was written in the middle of a debate on Twitter between @adrianco @reillyusa @beaker and others, including key insights from @swardley.

You should also read Christian Reilly's follow-up post "The Hollywood Culture" http://bit.ly/ePsisJ and many thanks to @bernardgolden for pointing out the excellent Business Week cover story on Cloud Computing http://ow.ly/4dm07 - after reading it I was amazed how well it aligned with what I write here - then I saw that it was by Ashlee Vance, one of the most clueful journalists around.

Netflix ITops Security Architect Bill Burns also wrote a very interesting post on the security challenges of cloud. We've been working together, and he's on the interview team for the "Global Cloud Security Architect" position I mention below.]

Too big for public cloud? You should *be* a public cloud.

Organizations that run infrastructure at the scale of tens to hundreds of thousands of instances have created cloud-based models and opened them up to other organizations as public clouds. Amazon, Google and Microsoft are the clearest examples; they have expertise in software architecture, which is why they dominate the API definition. Telcos and hosting companies are adopting this model to provide additional public cloud capacity, using clones and variants of the API. Other organizations at this scale are already figuring out how to expose their capacity to their customers, partners and supply chain. The task you take on is to simultaneously hire the best people to run your cloud (competing with Amazon, Google etc.) and run it at low cost, which is why you need to be at huge scale and need to decide that running infrastructure is a core competency of your business. Netflix is too small, doesn't regard infrastructure as core, and doesn't want to hire a bunch of ITops people.

It costs too much to port our apps? Your $ are mis-allocated.

What does it cost to build a private cloud, how long does it take, and how many consultants and top tier ITops staff do you have to hire? Sounds like a nice empire building opportunity for the CIO. The alternative is to allocate that money to the development organization, hire more developers, rewrite your legacy apps to run on the public cloud, and give the development VP the budget to run public cloud directly. The payback is more incremental and manageable, but this is effectively a re-org of your business to move a large chunk of budget and headcount around. This is what happened at Netflix. It probably takes an act-of-CEO at most companies; the barriers are mostly political. Yes, it will take time, but so will bringing up a private cloud.

Replace your apps with SaaS offerings.

Many internal apps can be replaced by cloud services; we just outsourced our internal help desk and incident management software. No one I know does payroll in-house. This is uncontroversial and is happening.

We can't put confidential data in a public cloud? This is just FUD.

The enterprise vendors are desperate to sell private clouds, so they are sowing this fear, uncertainty and doubt in their customer base to slow down adoption of public clouds. The reality is that many companies are already putting confidential data in public clouds. I asked the question "when will someone using PCI level 1 be in production on AWS" at a Cloud Connect panel, and was told that it is already being done, and Terremark pointed out that they host H&R Block's tax business. There are several models of public cloud with different SLA, cost and operational models that can support confidential data securely. There is also an argument that datacenter security is not as strong as people would like to think, and that the large cloud vendors can do a better job than most enterprises at keeping the infrastructure secure. At Netflix, we are about to transition to a global cloud-based business, so we are currently hiring a "Cloud Security Architect" who understands compliance rules like PCI (the credit card standard) on a global basis (we didn't need global expertise before). Part of their job is going to be to implement this.

There is no way my execs will sign off on this! Do they care about being competitive?

The biggest champion at Netflix for doing public cloud and doing it "properly" with an optimized architecture was our CEO Reed Hastings. He personally argued that we should try to do NoSQL rather than MySQL to push the envelope. Why? Because the bigger risk for Netflix was that we wouldn't scale and have the agility to compete. He was right: we have grown faster than our ability to build datacenters, and we have the agility we need to outrun our competition. Netflix has never had a CIO in the first place. We do have an excellent VP of operations though, and there is plenty to do running the CDNs and SaaS vendors that support Enterprise IT.

Will private clouds be successful? I think there will be a few train wrecks.

The train wrecks will come as ITops discover that it's much harder and more expensive than they thought, and takes a lot longer than expected, to build a private cloud. Meanwhile their developer organization won't be waiting for them, and will increasingly turn to public clouds to get their jobs done. We could argue about definitions, but there are private clouds that are effectively the back ends for specialized large scale ecosystems, like engineering companies that have to interface to the things that build stuff, or operate in places where there is no effective connection into the public clouds. For example, on board a ship that has limited external bandwidth, or to support a third world construction project. My take is that these will be indistinguishable from specialized SaaS offerings within a supply chain ecosystem.

How not to build a private cloud - The Netflix Way

Re-org your company to give budget and headcount to the developers, and let them run the public cloud operations.

Ignore the FUD; best practices and patterns for compliance and security already exist and are auditable.

Get the CEO to give the CIO a different MBO: to shrink their datacenter.

25 comments:

Since software such as CouchDB, MongoDB, Riak, etc. is widely available so that people can run their own NoSQL instances rather than use the cloud, do you think the companies that back them will be successful in the long run?

NoSQL is an orthogonal issue. In Netflix's case, Reed encouraged us to use SimpleDB on AWS rather than MySQL to start with, and now we are transitioning to Cassandra on AWS. The choice of database doesn't affect the argument for where you should be running it.

Adrian, did you really mean to title this "How not to build a private cloud"? It seems more of a "Why not to build a private cloud" topic. "How not to do X" implies there's some sort of way in which you WOULD do X, whereas you're making the argument that private clouds are a non-starter.

There are reasons under the umbrella of "security" that don't have anything to do with risk of theft or destruction that give large organizations pause when considering putting valuable data in the cloud. Instead, they have to do with legal protections covering discovery and criminal procedure. Simply put, the protections covering data stored in the cloud are weaker than if you stored the data on your own premises.

@Roy, the action I'm advocating is to move $ and Cloud responsibility by shrinking ITops and growing the development organization. That is the "how to". The FUD and Politics are what might prevent that from happening.

@Michael where is the line drawn in terms of owning your own premises for data storage? Netflix doesn't own its datacenter premises (we host cages in places like Qwest). How is that different to hosting data in AWS US-East? There is still an auditable shared responsibility between Netflix and Qwest or Netflix and Amazon to secure data in a compliant manner.

@Nathan most Netflix developers have a simpler relationship with AWS than they used to have with ITops. They check code into Perforce, and a few minutes later an instance is running in our test account. Once it's tested they push a button and it's autoscaled in production. If there are issues, rollback is also trivial. Developers are responsible for their code; if it breaks in production they get called.

We have a handful of "devops" staff who manage AWS more directly and build the tools. Some of them used to work for ITops. We also have layers of Java code that hide the details of the AWS APIs etc. from developers. Key/value store get and put functions are trivial to code to. Overall it's less complex than dealing with Oracle SQL queries and schemas.
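As a rough illustration of how thin such a get/put layer can look to application code, here is a minimal sketch. It is not Netflix's actual code, it is written in Python rather than Java for brevity, and every name in it (KeyValueStore, InMemoryStore, remember_last_title, last_title) is hypothetical. The in-memory backend is only a stand-in for whatever cloud data store client a real wrapper would hide.

```python
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """Hypothetical minimal interface that application code programs against."""

    def get(self, key: str) -> Optional[str]: ...

    def put(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Stand-in backend for illustration only.

    A real implementation would wrap a cloud data store client and keep
    its API details out of application code.
    """

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        # Returns None when the key has never been written.
        return self._items.get(key)

    def put(self, key: str, value: str) -> None:
        self._items[key] = value


def remember_last_title(store: KeyValueStore, customer_id: str, title: str) -> None:
    """Application code sees only put(); no schema or query language is involved."""
    store.put(f"last_title:{customer_id}", title)


def last_title(store: KeyValueStore, customer_id: str) -> Optional[str]:
    """Application code sees only get()."""
    return store.get(f"last_title:{customer_id}")
```

Because the application functions depend only on get and put, the backend can be swapped without touching them, which is the point of hiding the API details behind a layer.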

VoIP latency issues don't make sense to me. The end users for VoIP probably aren't near a datacenter, so VoIP might need to be hosted in the office building machine room if it matters that much. I would also point at Skype as an example of VoIP in the cloud; it has excellent latency hiding and echo cancellation, and doesn't seem to mind latency.

@Nathan we do hire very good developers, but they don't all need to be ops experts. I think the mindset conversion from SQL to NoSQL is a bigger challenge than the change from deploying in datacenter to deploying in cloud.

Your argument seems to be, "I do not understand what traditional operations entail, so I will jump to the conclusion that they provide no value."

In the Netflix development model (which can be summed up as, "shoot first and ask questions later"), this may work. Most other businesses are a bit more cautious with how they write software, how they deploy it, where and how it's hosted, and how it's supported, though.

There are valid reasons for such caution in many cases, just as you have your reasons for throwing it to the wind (which, again, in the highly unique Netflix environment are probably also valid). Assuming that experience to be generally applicable, though, is quite a leap.

@Mike, I'm not saying that ITops has no value, I'm saying that cloud migration shouldn't be led by ITops. That's like putting the fox in charge of the hen house. Public cloud is a natural competitor to the datacenter management function (and budget) of ITops, and private cloud is a response to that threat.

In one sense I'm glad other businesses don't copy Netflix, especially if they are trying to compete with us; on the other hand, we are leveraging the AWS ecosystem, so the more people that join in, the better.

I guess if you see Ops' function as datacenter management, that largely explains where your viewpoint is coming from. In companies where they're much more than the people who put machines in racks, Ops' value is no different whether you're in a public cloud, private cloud, or traditional data center. That your dev and ops groups fight over budget and organizational posturing sounds like internal Netflix politics to me, not technical justification for the specific way you're doing things. In places where dev and ops work together, both teams bring valuable skill sets to whatever environment your code is running in.

There's a third piece, too, that you didn't mention: does Netflix not have a QA team?

@Mike most of the ITops budget is spent managing datacenters and the hardware and software that lives in them. Our Dev and Ops groups don't fight; we work together on the shared responsibilities. I've seen far more ingrained Dev-Ops mistrust and conflict at other companies. The right thing for Netflix was to stop trying to build more datacenters and move our apps out to cloud. The ITops dept manages the network including CDNs, and the internal apps that support employees, which are gradually moving to SaaS. We still have lots of storage and big Oracle servers to keep running. As the business grows, the proportion of traffic that goes to the datacenter shrinks, so the datacenter stays within a static footprint.
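The static-footprint arithmetic can be sketched as follows. This is an illustration only: it assumes the datacenter serves traffic up to a fixed capacity and everything beyond that goes to the cloud. The function name and the numbers in the usage note are hypothetical, not Netflix figures.

```python
def datacenter_share(total_traffic: float, datacenter_capacity: float) -> float:
    """Fraction of total traffic served by a datacenter of fixed capacity.

    Assumption for illustration: the datacenter handles traffic up to its
    capacity and the remainder is served from the cloud.
    """
    if total_traffic <= 0:
        raise ValueError("total_traffic must be positive")
    if datacenter_capacity < 0:
        raise ValueError("datacenter_capacity must not be negative")
    # The share can never exceed 100% of the traffic.
    return min(1.0, datacenter_capacity / total_traffic)
```

With a hypothetical fixed capacity of 100 units, the share is 1.0 at 100 units of total traffic, 0.25 at 400 units and 0.125 at 800 units: the footprint stays the same while its proportion of the business shrinks.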

@adamzero I proposed that the money which would have been spent building a private cloud is spent instead to port internal apps to cloud and migrate to SaaS alternatives. Your COTS vendors probably also have a cloud strategy or a competitor that does, e.g. let Microsoft run your Exchange servers etc. for you as a cloud service.

I was at a Varnish meetup at NYTimes, and their 'special products' guy was giving a talk.

He was referring to how 'the main folks' keep the main NYT stuff on their own private cloud, and 'his team' didn't have the time or the patience to work with them, and would spin up EC2 instances and get work done there. The frustrated developer shift to the public cloud is in full swing.

The real question is: with all the development going to "instant gratification" clouds, what happens when the app needs to run in production, assuming a larger org than yours? I think the sh** is about to hit the fan when all the rogue developers come back with their shiny new app that can't be supported internally, but also can't be run externally (for political, compliance, etc. reasons). The "enterprise" IT ship is a big one to turn and it won't go fast, for many reasons. Unlike Netflix, many orgs need to take baby steps toward efficiencies, like moving some apps to a public cloud and leaving some onsite. It's a process, not a destination.

Adrian - how many mainframes did you have when you decided to go to public cloud? How many COTS packages were you running that were only licensed for AIX? How many dedicated lines to a SWIFT transaction processor did you have? My list of 'legacy concerns' could go on and on... Most companies born in the last decade don't have the same issues as the traditional enterprise. And the idea of "just porting them" is not an option.

@Jeff, we have no mainframes, but we do have several very large Oracle servers on AIX. We've been moving our COTS packages to SaaS vendors. Our internal apps are all web based, and most of the company runs on Mac laptops nowadays.

I have many friends who work on Mainframe systems in the finance industry. I don't expect them to be running everything on cloud, but anyone who has a web based component to their product can copy what we have done, and for companies like Netflix, where our product is entirely delivered via the Internet, there is no reason to do IT the way banks do.