Tag Archives: infrastructure

I was planning to go to this meeting here in town about “Preparing for the post-IaaS phase of cloud adoption” and it brought home to me how backwards people are generally thinking about cloud. So now you get Ernest’s Cloud Rant of the Day.

What people are doing is moving in order of comfort, basically. “I’ll start with private cloud… Then maybe public IaaS… Eventually we’ll look at that other whizbang stuff.” But here’s what your decision path should be instead.

Cloud Procurement Flowchart

Is it available as a SaaS solution? If so, use that. You shouldn’t need to host servers or write code for many of your needs – everything from email to ERP is commoditized nowadays.

Can I do it in a public PaaS? Then use a public PaaS (Heroku/Beanstalk/Google App Engine/Azure), unless you have some real (not FUD) requirements to do it in house.

Can I do it in a private PaaS? Then use Cloudfoundry or similar. Or do you really (for non-FUD reasons) need access to the hardware?

Can I do it in public IaaS? Then use Amazon. Or do you really (for non-FUD reasons) need it “on premise” (probably not really on premise, but in some datacenter you’re leasing – which is different from being outsourced in the cloud why)?

Can I do it in a private cloud? This is your final recourse before doing it the old fashioned way – unless you have extremely unique hardware requirements, you probably can. Also, you can do hybrid cloud – basically private cloud plus public cloud (IaaS only really). This gets you some of the IaaS benefits while complicating your architecture.

What About The Cost?

This seems to be inverted from how people are inching into the cloud. But the lower on this list you are, the less additional value you are getting from the solution (assuming the same price point). You should instead be reluctantly dragged into these lower levels – which require more effort and often (though not always) more expense. “But what about the cost,” you say, “the cloud gets more expensive than me running a couple servers?”

You need to keep in mind the real costs of your infrastructure when you do this – I see a lot of people spending a lot of work on private cloud that really shouldn’t be. If you simply compare “buying servers” with “cost per month in Amazon” it can seem like you need to go hybrid after a couple hundred thousand dollars appear on your bill. But:

1. Make sure you are taking into account your fully loaded cost (includes data center, power cooling, etc.) of all assets (servers, storage, network…) you are using to do this private. Use the real numbers, not the “funny money” numbers – at a previous company we allocated network and other shared costs across the entire company, while “our IT budget” had to pay for servers – don’t be a goon, you should consider what it’s costing your entire company. Storage especially is way cheaper in the cloud versus enterprise SANs.

2. Make sure you are taking into account the cost of the manpower to run it. And that’s not just their salary (fully loaded with benefits/bonuses), and the proportion of each layer of management going up that has to deal with their concerns (Even if the director only has to spend 30% of his time messing with the data center team, and the VP 10%, and the CTO 5%, and the CEO 1% – that’s a lot of freaking money). It’s also the opportunity cost of having people (smart technical people) doing your plumbing instead of doing things to forward your company. I would argue that instead of putting in the employee’s salary, you’d do better to put in your revenue per employee! Why? Because for that same money you could be having someone improving product, making sales, etc. and making you additional revenue. If all you look at is “cost reduction” you are probably divorced enough from the business goals of your organization that you are not making good decisions. This isn’t to say you don’t need any of that manpower, but ideally with more plumbing being outsourced you can turn their technical skills to something of more productive use.

3. Make sure you are taking into account the additional lag time and the cost of that time to market delay from DIYing. Some people couch this as just for purposes of innovation – “well, if you’re a small, quick moving, innovative firm or startup, then this velocity matters to you – if you’re a larger enterprise, with yearly budget cycles, not so much.” That’s not true. Assuming you are implementing all this stuff with some end goal in mind, you are burning value along with time the longer it takes you to deliver it – we like to call that cost of delay. Heck, just plain cost of money over that period is significant – I’ve seen companies go through quite a set of gyrations to be able to bill 30 days earlier to get that additional benefit; if you can deliver projects a month earlier from leveraging reusable work (which is all SaaS/PaaS/IaaS are) then you accelerate your cashflow. If you have to wait 12 months for the IT group to get a private cloud working, you are effectively losing the benefit of your deliverable * 12 months.

4. Account for complexity. The problem with “hybrid cloud,” in most implementations, is that it’s not seamless from on prem to public, and therefore your app architecture has to be doubly complicated. In a previous position where I ran a large SaaS service, we were spread across AWS (virtual everything) and Rackspace (vserver, F5 LBs, etc.) and it was a total nightmare – we were trying to migrate all the way out to the cloud just so we could delete half of the cruft in all our code that touched the infrastructure – complexity that caused production issues (frequently) and slowed our rate of delivering new functionality. The KISS principle is wrathful when ignored.

I’m not saying hybrid cloud, private cloud, etc. are never the answer – but I would say that on average they are usually not the right answer, and if you are using them as your default approach then it’s better than even money you’re being inefficient. Furthermore, using SaaS and PaaS requires less expertise than IaaS which uses less than private cloud – people justify “starting with private” because you are “leveraging skill sets” or whatever – and then 6 months later you have a whole team still trying to bake off OpenStack vs Eucalyptus when you could have had your app already running in a public PaaS. I’m not sure why I need to say out loud “delivering the most amount of value with the least amount of effort, time, and expenditure is good” – but apparently I do. Just because you *can* do something does not mean you *should* do it. You need to carefully shepherd your time to delivery and your costs, and not just let things float in a morass of IT because “these things take time…”

I think that using “DevOp” as a job or role name isn’t a good idea, and that really what the term indicates is that there are two major classes of technical role,

Devs – people who work with code mostly

Ops – people who work with systems mostly

You could say a “DevOp” is someone who does some of both, but I think the preferred usage is that DevOps, like agile, is about a methodology of collaboration. A given person having both skill sets of fulfilling both roles doesn’t require a special term, it’s like anyone else in IT with multiple hats.

Of course, inside each of these two areas is a wide variety of skills and specialized roles. Many of the people talking about “DevOps” are in five-person Web shops, in which case “Ops” is an adequate descriptor of “all the infrastructure crap guy #5 does.”

But in larger shops, you start to realize how many different roles there are. In the dev world, you get specialization, from UI developers to service developers to embedded developers to algorithm developers. I’d tend to say that even Web design/development (HTML/CSS/JS) and QA are often considered part of the “dev side of the house.” It’s the same in Ops.

Now, traditionally many “systems” teams, also known as “infrastructure” teams, have been divided up by technology silo only. You have a list of teams of types UNIX, Windows, storage, network, database, security, etc. This approach has its strengths but also has critical weaknesses – which is why ITIL, for example, has been urging people to reorganize around “services you are delivering” lines.

In the dev world, you don’t usually see tech silos like that. “Here’s the C programmer department, the Java programmer department, the SQL programmer department… Hand your specs to all those departments and hope you get a working app out of it!” No, everyone knows intuitively that’s insane. But largely we still do the same thing in traditional systems teams (“Ops” umbrella).

So historically, the first solution that emerged was a separate kind of group. Here at NI, the first was a “Web Ops” group called the Web Admins, which was formed ten years ago when it became clear that running a successful Web site cannot be done by bringing together fractional effort from various tech silos. The Web Admins work with the developers and the other systems teams – the systems teams do OS builds, networking, rack-and-jack, storage/data center, etc. and the Web Admins do the software (app servers, collab systems, search, content management, etc.), SaaS, load balancing, operational support, release management, etc. Our Web Admin team ended up expanding very strongly into the application performance management and Web security areas because no one else was filling them.

In more dotcommey companies, you see the split between their “IT group” and their “Engineering” or “Operations” group that is “support for their products,” as two entirely different beasts.

Anyway, the success of this team spawned others, so now there are several teams we call “App Admins” here at NI, that perform this same role with respect to sitting between the developers and the “system admins.” To make it more complicated, even some of the apps (“Dev”) teams are also spawning “App Ops” teams that handle CI work and production issue escalation, freeing up the core dev teams for more large-scale projects. Our dev teams are organized around line of business (ecommerce, community, support, etc.) so they find that helpful. (I’ll note that the interface between line of business organization and technology silo organization is not an easy one.)

Which of these teams are the “DevOps?” None of them. Naturally, the teams that are more in the middle feel the need for it more, which is why I as a previous manager of the Web Admins am the primary evangelist for DevOps in our organization. The “App Admins” and the new “App Ops” teams work a lot more closely together on “operational” issues.

But this is where the term “Ops” has bad connotations – in my mind, “operations”, as closely related to “support”, is about the recurring activities around the runtime operation of our systems and apps. In fact, we split the Web Admin team into two sub-teams – an “operations” team handling requests, monitoring, releases, and other interrupt driven activity, and a “systems” team that does systems engineering. The interface between systems engineering and core dev teams is just as important as the interface around runtime, even more so I would say, and is where a lot of the agile development/agile infrastructure methodology bears the most fruit. Our system engineering team is involved in projects alongside the developers from their initiation, and influence the overall design of the app/system (side note, I wish there was a word that captured “both app and system” well; when you say system people sometimes take that to mean both and sometimes to just mean the infrastructure). And *that’s* DevOps.

Heck, our DBA team is split up even more – at one point they had a “production support” team, a “release” team, an “architecture” team, and a “projects” team.

But even on the back end systems teams, there are those that have more of a culture of collaboration – “DevOps” you might call it – and they are more of a pleasure to interface with, and then there’s those who are not, who focus on process over people, you might say. I am down with the “DevOps” term just because it has the branding buzz around it, but I think it really is just a sexier way to say “Agile systems administration.”

On a related note, I’ve started to see job postings go by for “DevOps Engineers” and other such. I think that’s OK to some degree, because it does differentiate the likely kind of operating environment of those jobs from all the noise posted as “UNIX Engineer III”, but if you are using “DevOps” as a job description you need to be pretty clear in your posting what you mean in terms of exact skills because of this confusion. Do you mean you just want a jack of all trades who can write Java/C# code as well as do your sysadmin work because you’re cheap? Or do you want a sysadmin who can script and automate stuff? Or do you want someone who will be embedded on project teams and understand their business requirements and help them to accomplish them? Those are all different things that have different skill sets behind them.

What do you think? It seems to me we don’t really have a good understanding of the taxonomy of the different kinds of roles within Ops, and thus that confuses the discussion of DevOps significantly. Is it a name for, or a description of, or a prescription for, some specific sub-team? Which is it for – production support, systems engineering, does IT count or is it just “product” support orgs for SaaS?

From the “sad but true” files comes an extremely insightful point apparently discussed over beer by the UK devops crew recently – that we are talking about dev and ops collaboration but the current state of collaboration among ops teams is pretty crappy.

This resonates deeply with me. I’ve seen that problem in spades. I think in general that a lot of the discussion about the agile ops space is too simplistic in that it seems tuned to organizations of “five guys, three of whom are coders and two of whom are operations” and there’s no differentiation. In real life, there’s often larger orgs and a lot of differentiation that causes various collaboration challenges. Some people refer to this as Web vs Enterprise, but I don’t think that’s strictly true; once your Web shop grows from 5 guys to 200 it runs afoul of this too – it’s a simple scalability and organizational engineering problem.

As an aside, I don’t even like the “Ops” term – a sysadmin team can split into subgroups that do systems engineering, release management, and operational support… Just saying “Ops” seems to me to create implications of not being a partner in the initial design and development of the overall system/app/service/site/whatever you want to call it.

Ops Verticals

Here, we have a large Infrastructure department. Originally, it was completely siloed by technology verticals, and there’s a lot of subgroups. Network, UNIX, Windows, DBA, Lotus Notes, Telecom, Storage, Data Center… Some ten plus years ago when the company launched their Web site in earnest, they quickly realized that wasn’t going to work out. You had the buck-passing behavior described in the blog posts above that made issues impossible to solve in a timely fashion, plus it made collaboration with devs/business nearly impossible. Not only did you need like 8 admins to come involve themselves in your project, but they did not speak similar enough languages – you’d have some crusty UNIX admin yelling “WHAT ABOUT THE INODES” until the business analyst started to cry.

Dev Silos

But are our developers here better off? They are siloed by business unit. Just among the Web developers there’s the eCommerce developers, eCRM, Product Advisors, Community, Support, Content Management… On the one hand, they are able to be very agile in creating solutions inside their specific niche. On the other hand, they are all working within the same system environment, and they don’t always stay on the same page in terms of what technologies they are using. “Well, I’m sure THAT team bought a lovely million dollar CMS, but we’re going to buy our own different million dollar CMS. No, you don’t get more admin resource.” Over time, they tried to produce architecture groups and other cross-team initiatives to try to rein in the craziness, with mixed but overall positive results.

Plugging the Dike

What we did was create a Web Administration group (Web Ops, whatever you want to call it) that was holistically responsible for Web site uptime, performance, and security. Running that team was my previous gig, did it for five years. That group was more horizontally focused and would serve as an interface to the various technology verticals; it worked closely with developers in system design during development, coordinated the release process, and involved devs in troubleshooting during the production phase.

BizOps?

In fact, we didn’t just partner with the developers – we partnered with the business owners of our Web site too, instead of tolerating the old model of “Business collaborates with the developers, who then come and tell ops what to do.” This was a remarkably easy sell really. The company lost money every minute the Web site was down, and it was clear that the dev silos weren’t going to be able to fix that any more than the ops silos were. So we quickly got a seat at the same table.

Results

This was a huge success. To this day, our director of Web Marketing is one of the biggest advocates of the Web operations team. Since then, other application administration (our word for this cross-disciplinary ops) teams have formed along the same model. The DevOps collaboration has been good overall – with certain stresses coming from the Web Ops team’s role as gatekeeper and process enforcement. Ironically, the biggest issues and worst relationships were within Infrastructure between the ops teams!

OpsOps – The Fly In The Ointment

The ops team silos haven’t gone down quietly. To this day the head DBA still says “I don’t see a good reason for you guys [WebOps] to exist.” I think there’s a common “a thing is just the sum of its parts” mindset among admins for whatever reason. There are also turf wars arising from the technology silo division and the blurring of technology lines by modern tech. I tried again and again to pitch “collaborative system administration.” But the default sysadmin behavior is to say “these systems are mine and I have root on them. Those are your systems and you have root on them. Stay on your side of the line and I’ll stay on mine.”

Fun specific Catch-22 situations we found ourselves in:

Buying a monitoring tool that correlates events across all the different tiers to help root-cause production problems – but the DBAs refusing to allow it on “their” databases.

Buying a hardware load balancer – we were going to manage it, not the network team, and it wasn’t a UNIX or Windows server, so we couldn’t get anyone to rack and jack it (and of course we weren’t allowed to because “Why would a webops person need server room access, that’s what the other teams are for”).

Some of the problem is just attitude, pure and simple. We had problems even with collaboration inside the various ops teams! We’d work with one DBA to design a system and then later need to get support from another DBA, who would gripe that “no one told/consulted them!” Part of the value of the agile principles that “DevOps” tries to distill is just a generic “get it into your damn head you need to be communicating and working together and that needs to be your default mode of operation.” I think it’s great to harp on that message because it’s little understood among ops. For every dev group that deliberately ostracizes their ops team, there’s two ops teams who don’t think they need to talk to the devs – in the end, it’s mostly our fault.

Part of the problem is organizational. I also believe (and ITIL, I think, agrees with me) that the technology-silo model has outlived its usefulness. I’d like to see admin teams organized by service area with integral DBAs, OS admins, etc. But people are scared of this for a couple reasons. One is that those admins might do things differently from area to area (the same problem we have with our devs) – this could be mitigated by “same tech” cross-org standards/discussions. The other is that this model is not the cheapest. You can squeeze every last penny out if you only have 4 Windows admins and they’re shared by 8 functional areas. Of course, you are cutting off your nose to spite your face because you lose lots more in abandoned agility, but frankly corporate finance rules (minimize G&A spending) are a powerful driver here.

If nothing else, there’s not “one right organization” – I’d be tempted to reorg everyone from verticals into horizontals, let that run for 5 years, and then reorg back the other way, just to keep the stratification from setting in.

Specialist vs Generalist

One other issue. The Web Ops team we created required us to hire generalists – but generalists that knew their stuff in a lot of different areas. It became very hard to hire for that position and training took months before someone was at all effective. Being a generalist doesn’t scale well. Specialization is inevitable and, indeed, desirable (as I think pretty much anything in the history of anything demonstrates). You can mitigate that with some cross-training and having people be generalists in some areas, but in the end, once you get past that “three devs, two ops, that’s the company” model, specialization is needed.

That’s why I think one of the common definitions of DevOps – all ops folks learning to be developers or vice versa – is fundamentally flawed. It’s not sustainable. You either need to hire all expensive superstars that can be good at both, or you hire people that suck at both.

What you do is have people with varying mixes. In my current team we have a continuum of pure ops people, ops folks doing light dev, devs doing light ops, and pure devs. It’s good to have some folks who are generalizing and some who are specializing. It’s not specializing that is bad, it’s specialists who don’t collaborate that are bad.

Conclusion

So I’ve shared a lot of experiences and opinions above but I’m not sure I have a brilliant solution to the problem. I do think we need to recognize that Ops/Ops collaboration is an issue that arises with scale and one potentially even harder to overcome than Dev/Ops collaboration. I do think stressing collaboration as a value and trying to break down organizational silos may help. I’d be happy to hear other folks’ experiences and thoughts!