Category Archives: CloudOps


I love great conversations about technology – especially ones where the answer is not very neatly settled into winners and losers (which is ALL of them in IT). I’m excited that RackN has (re)launched the L8ist Sh9y (aka Latest Shiny) podcast around this exact theme.

Please check out the deep and thoughtful discussion I just had with Mark Thiele (notes) of Apcera, where we covered Mark’s thoughts on why public cloud will be under 20% of IT and took on culture issues head on.

While the RackN team and I have been heads down radically simplifying physical data center automation, I’ve still been tracking some key cloud infrastructure areas. One of the more interesting ones to me is Edge Infrastructure.

This once obscure topic has come front and center because of the coming computing stress from home video, retail machine and distributed IoT. It’s clear that these problems cannot be solved from centralized data centers.

While I’m posting primarily on the RackN.com blog, I like to take time to bring critical items back to my personal blog as a collection. WARNING: Some of these statements run counter to other industry views. Please let me know what you think!

By far the largest issue of the Edge discussion was actually agreeing about what “edge” meant. It seemed as if every session had a 50% mandatory overhead in definitioning. Putting my usual operations spin on the problem, I choose to define edge infrastructure in data center management terms. Edge infrastructure has very distinct challenges compared to hyperscale data centers. Read article for the list...

Running each site as a mini-cloud is clearly not the right answer. There are multiple challenges here. First, any infrastructure problem at scale must be solved at the physical layer first. Second, we must have tooling that brings repeatable, automated processes to that layer. It’s not sufficient to have deep control of a single site: we must be able to reliably distribute automation over thousands of sites with limited operational support and bandwidth. These requirements are outside the scope of cloud-focused tools.

If “cloudification” is not the solution then where should we look for management patterns? We believe that software development CI/CD and immutable infrastructure patterns are well suited to edge infrastructure use cases. We discussed this at a session at the OpenStack OpenDev Edge summit.
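To make the CI/CD and immutable infrastructure idea a bit more concrete, here is a minimal sketch of how a versioned automation bundle might be promoted across many sites in small waves. Everything in it (`Bundle`, `Site`, `rollout`, the `healthy` flag standing in for a post-install check) is a hypothetical illustration, not how any particular tool does it.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bundle:
    """An immutable automation bundle, identified by the hash of its content."""
    content: bytes

    @property
    def digest(self) -> str:
        return hashlib.sha256(self.content).hexdigest()


@dataclass
class Site:
    name: str
    healthy: bool = True              # stand-in for a real post-install check
    installed: Optional[str] = None   # digest of the bundle currently running

    def converge(self, bundle: Bundle) -> bool:
        """Swap in the whole bundle; never patch the running one in place."""
        if self.installed == bundle.digest:
            return True               # already at the desired state
        if not self.healthy:
            return False              # install failed; site keeps its old bundle
        self.installed = bundle.digest
        return True


def rollout(sites, bundle, wave_size=2):
    """Promote the bundle wave by wave, stopping after the first failed wave."""
    done = []
    for start in range(0, len(sites), wave_size):
        wave = sites[start:start + wave_size]
        results = [site.converge(bundle) for site in wave]
        done.extend(s.name for s, ok in zip(wave, results) if ok)
        if not all(results):
            break                     # halt: later waves are left untouched
    return done
```

Because each site either runs the exact bundle or does not, a failed wave leaves the remaining sites on a known version instead of in a half-patched state, which matters when there is little operational support on site.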

What do YOU think? This is an evolving topic and it’s time to engage in a healthy discussion.

We believe Cloud Native development disciplines are essential regardless of the infrastructure.

Today, RackN announced very low entry-level support for Digital Rebar Provision – the RESTful Cobbler PXE/DHCP replacement. Having a company actually standing behind this core data center function with support is a big deal; however…

We’re making two BIG claims with Provision: breaking DevOps bottlenecks and cloud native physical provisioning. We think both points are critical to SRE and Ops success because our current approaches are not keeping pace with developer productivity and hardware complexity.

I’m going to post more about how Provision can help address the political struggles of SRE and DevOps that I’ve been watching in our industry. A hint is in the release, but the Cloud Native comment needs to be addressed.

First, Cloud Native is an architecture, not an infrastructure statement.

There is no requirement that we use VMs or AWS in Cloud Native. From that perspective, “Cloud” is a useful but deceptive adjective. Cloud Native is born from applications that had to succeed in hands-off, lower SLA infrastructure with fast delivery cycles on untrusted systems. These are very hostile environments compared to “legacy” IT.

What makes Digital Rebar Provision Cloud Native? A lot!

The following is a list of key attributes I consider essential for Cloud Native design.

Micro-services Enabled: The larger Digital Rebar project is a micro-services design. Provision reflects a stand-alone bundling of two services: DHCP and Provision. The new Provision service is designed to both stand alone (with embedded UX) and be part of a larger system.

Swagger RESTful API: We designed the APIs first based on years of experience. We spent a lot of time making sure that the API conformed to spec and that includes maintaining the Swagger spec so integration is easy.

Remote CLI: We build and test our CLI extensively. In fact, we expect that to be the primary user interface.

Security Designed In: We are serious about security even in challenging environments like PXE where options are limited by 20-year-old protocols. HTTPS is required, as is user or bearer token authentication. That means that even API calls from machines can be secured.

12 Factor & API Config: There is no file configuration for Provision. The system starts with command line flags or environment variables. Deeper configuration is done via API/CLI. That ensures that the system can be fully managed remotely and configured securely because credentials are required for configuration.

Fast Start / Golang: Provision is a totally self-contained golang app including the UX. Even so, it’s very small. You can run it on a laptop from nothing in about 2 minutes including download.

CI/CD Coverage: We committed to deep test coverage for Provision and have consistently increased coverage with every commit. It ensures quality and prevents regressions.

Documentation In-project Auto-generated: On-boarding is important since we’re talking about small, API-driven units. A lot of Provision documentation is generated directly from the code into the actual project documentation. Also, the written documentation is in reStructuredText in the project with good indexes and cross-references. We regenerate the documentation with every commit.
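As a rough illustration of the "12 Factor & API Config" attribute above, the sketch below resolves startup settings from command line flags, falling back to environment variables and then to defaults, with no config file involved. It is written in Python for brevity (Provision itself is Go), and every name in it (`load_config`, the `--api-port` and `--tls-*` flags, the `SKETCH_*` variables, the default values) is made up for the example rather than taken from Provision.

```python
import argparse
import os


def load_config(argv=None, environ=None):
    """Resolve startup settings: flags first, then environment, then defaults."""
    env = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="startup config sketch")
    # Hypothetical settings; each flag falls back to an environment variable.
    parser.add_argument("--api-port", type=int,
                        default=int(env.get("SKETCH_API_PORT", "8443")))
    parser.add_argument("--tls-cert",
                        default=env.get("SKETCH_TLS_CERT", "server.crt"))
    parser.add_argument("--tls-key",
                        default=env.get("SKETCH_TLS_KEY", "server.key"))
    return parser.parse_args(argv)


if __name__ == "__main__":
    cfg = load_config()
    print(f"would serve HTTPS on port {cfg.api_port} using {cfg.tls_cert}")
```

Anything beyond these few startup values would then be set through the authenticated API or CLI, which is what keeps deeper configuration behind credentials.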

We believe these development disciplines are essential regardless of the infrastructure. That’s why we made sure the v3 Provision (and ultimately every component of Digital Rebar as we iterate to v3) was built to these standards.

I’ve been digging into what it means to be a site reliability engineer (SRE) and thinking about my experience trying to automate infrastructure in a way that scales dramatically better. I’m not thinking about scale in number of nodes, but in operator efficiency. The primary way to create that efficiency is to limit site customization and to improve reuse. Those changes need to start before the first install.

As an industry, we must address the “day 2” problem in collaboratively developed open software before users’ first install.

Happily, platforms like Kubernetes are designed to hide these infrastructure variations for developers. That means we can expect a productivity explosion for the huge number of applications that can narrowly target platforms. Unfortunately, that does nothing for the platforms or infrastructure bound applications. For this lower level software, we need to accept that operations environments are heterogeneous.

It’s multidimensional because we are building the operations practice simultaneously with the software itself. To make things even harder, the infrastructure and dependencies are also constantly changing. Since this degree of rapid multi-factor innovation is the new normal, we have to plan for our operations automation itself to be just as upgradable.

If we cannot upgrade both the software AND the related deployment automation, then each deployment will become a cul-de-sac after day 1.

For open communities, that cul-de-sac challenge limits projects’ ability to feed operational improvements back into the user base and makes it harder for early users to stay current. These challenges limit the virtuous feedback cycles that help communities grow.

The solution is to approach shared project deployment automation as also being continuously deployed.

This is a deceptively hard problem.

This is a hard problem because each deployment is unique and those differences make it hard to absorb community advances without being constantly broken. That is one of the reasons why companies opt out of the community and into vendor distributions. While vendors are critical to the ecosystem, the practice ultimately limits the growth and health of the community.

Our approach at RackN, as reflected in open Digital Rebar, is to create management abstractions that isolate deployment variables based on system level concerns. Unlike project generated templates, this approach absorbs heterogeneity and brings in the external information that often complicates project deployment automation.
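One way to picture isolating deployment variables is layered parameter resolution: shared automation reads resolved values, and each site supplies only what is different about it. The sketch below is an illustration of that general idea under my own assumptions, not Digital Rebar's implementation; `PROJECT_DEFAULTS`, `resolve` and `render_resolv_conf` are hypothetical names and the values are placeholders.

```python
from collections import ChainMap

# Shared defaults shipped with the community automation (placeholder values).
PROJECT_DEFAULTS = {
    "dns_servers": ["8.8.8.8"],
    "ntp_server": "pool.ntp.org",
    "proxy": None,
}


def resolve(site_overrides, node_overrides=None):
    """Most specific layer wins: node, then site, then project defaults."""
    return dict(ChainMap(node_overrides or {}, site_overrides, PROJECT_DEFAULTS))


def render_resolv_conf(params):
    """Shared automation consumes resolved parameters, never site literals."""
    return "".join(f"nameserver {ip}\n" for ip in params["dns_servers"])
```

With this split, the community can ship a new `render_resolv_conf` without touching any site's overrides, and a site can change its DNS servers without forking the shared code.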

We believe that this is a general way to solve the broader problem and invite you to participate in helping us solve the Day 2 problems that limit our open communities.

I’ve been posting about the unique composable operations approach the RackN team has taken with Digital Rebar to enable hybrid infrastructure and mix-and-match underlay tooling. The orchestration design (what we call annealing) allows us to dynamically add roles to the environment and execute them as single role/node interactions in operational chains.
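For readers who have not seen this style of orchestration, here is a toy sketch of the idea of running single role/node units in dependency order and accepting new roles while the system is live. It is my own simplified model with hypothetical names (`NodeRole`, `Orchestrator`, `converge`), not the Digital Rebar annealer, and "running" a role is reduced to appending to a log.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRole:
    """One role bound to one node: the unit of work in this sketch."""
    node: str
    role: str
    requires: list = field(default_factory=list)  # roles needed first on this node
    done: bool = False


class Orchestrator:
    def __init__(self):
        self.units = []
        self.log = []

    def add(self, node, role, requires=()):
        """Roles can be added at any time, including after earlier ones ran."""
        self.units.append(NodeRole(node, role, list(requires)))

    def _ready(self, unit):
        finished = {u.role for u in self.units if u.node == unit.node and u.done}
        return not unit.done and all(r in finished for r in unit.requires)

    def converge(self):
        """Run ready units one at a time until nothing more can run."""
        progressed = True
        while progressed:
            progressed = False
            for unit in self.units:
                if self._ready(unit):
                    self.log.append((unit.node, unit.role))  # stand-in for a script run
                    unit.done = True
                    progressed = True
        return self.log
```

Adding a role later and calling `converge()` again only runs the new unit, since completed role/node pairs are skipped. That is the property that makes small, surgical changes cheap in this model.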

With our latest patches (short demo videos below), you can now create single role Ansible or Bash scripts dynamically and then incorporate them into the node execution.

That makes it very easy to extend an existing deployment on-the-fly for quick changes or as part of a development process.

You can also run an ad hoc bash script against one machine or groups of machines. If that script is something unique to your environment, you can manage it without having to push it back upstream because Digital Rebar workloads are composable and designed to be safely integrated from multiple sources.

Beyond tweaking running systems, this is the fastest script development workflow that I’ve ever seen. I can make fast, surgical iterative changes to my scripts without having to rerun whole playbooks or runlists. Even better, I can build multiple operating system environments side-by-side and test changes in parallel.

For secure environments, I don’t have to hand out user SSH access to systems because the actions run in Digital Rebar context. Digital Rebar can limit control per user or tenant.

I’m very excited about how this capability can be used for dev, test and production systems. Check it out and let me know what you think.

Software development technology is so frothy that we’re developing collective immunity to constant churn and hype cycles. Lately, every time someone tells me that they have picked hot technology “Foo,” they also explain how they are planning contingencies for when Foo fails. Not if, when.

Required contingency? That’s why I believe 2017 is the year of the IT Escape Clause, or, more colorfully, the IT Crawfish.

When I lived in New Orleans, I learned that crawfish are anxious creatures (basically tiny lobsters) with powerful (and delicious) tails that propel them backward at any hint of any danger. Their ability to instantly back out of any situation has turned their name into a common use verb: crawfish means to back out or quickly retreat.

In IT terms, it means that your go-forward plans always include a quick escape hatch if there’s some problem. I like Subbu Allamaraju’s description of this as Change Agility. I’ve also seen this called lock-in prevention or contingency planning. Both are important; however, we’re reaching new levels for 2017 because we can’t predict which technology stacks are robust and complete.

The fact is that none of them are robust or complete compared to historical platforms. So we go forward with an eye on alternatives.

How did we get to this state? I blame the 2016 Infrastructure Revolt.

Way, way, way back in 2010 (that’s about the bronze age in the Cloud era), we started talking about developers helping automate infrastructure as part of deploying their code. We created some great tools for this and co-opted the term DevOps to describe provisioning automation. Compared to the past, it was glorious with glittering self-service rebellions and API-driven enlightenment.

In reality, DevOps was really painful because most developers felt that time fixing infrastructure was a distraction from coding features.

In 2016, we finally reached a sufficient platform capability set in tools like CI/CD pipelines, Docker Containers, Kubernetes, Serverless/Lambda and others that Developers had real alternatives to dealing with infrastructure directly. Once we reached this tipping point, the idea of coding against infrastructure directly became unattractive. In fact, the world’s largest infrastructure company, Amazon, is actively repositioning as a platform services company. Their re:Invent message was very clear: if you want to get the most from AWS, use our services instead of the servers.

For most users, using platform services instead of infrastructure is excellent advice to save cost and time.

The dilemma is that platforms are still evolving rapidly. So rapidly that adopters cannot count on the services to exist in their current form for multiple generations. However, the real benefits drive aggressive adoption. They also drive the rise of Crawfish IT.

Complexity has always been part of IT and it’s increasing as we embrace microservices and highly abstracted platforms. Making everyone cope with this challenge is unsustainable.

We’re just more aware of infrastructure complexity now that DevOps is exposing this cluster configuration to developers and automation tooling. We are also building platforms from more loosely connected open components. The benefit of customization and rapid development has the unfortunate side-effect of adding integration points. Even worse, those integrations generally require operations in a specific sequence.

The result is a developer rebellion against DevOps on low-level (IaaS) platforms in favor of ones with higher-level abstractions (PaaS) like Kubernetes. This rebellion is taking the form of “cloud native” being in opposition to “devops” processes. I discussed exactly that point with John Furrier on theCUBE at Kubecon and again in my Messy Underlay presentation at Defrag Conf.

It is very clear that the DevOps mission to share ownership of messy production operations requirements is not widely welcomed. Unfortunately, there is no magic cure for production complexity because systems are inherently complex.

There is a (re)growing expectation that operations will remain operations instead of becoming a shared team responsibility. While this thinking apparently rolls back core principles of the DevOps movement, we must respect the perceived productivity impact of making operations responsibility overly broad.

What is the right way to share production responsibility between teams? We can start to leverage platforms like Kubernetes to hide underlay complexity and allow DevOps shared ownership in the right places. That means that operations still owns the complex underlay and platform jobs. Overall, I think that’s a workable division.