Physical servers (aka bare metal) are the core building block for any data center; however, they are often abstracted out of sight by a virtualization layer such as VMware, KVM, HyperV or many others. These platforms are useful for many reasons. In this post, we’re focused on the fact that they provide a control API for infrastructure that makes it possible to manage compute, storage and network requests. Yet the abstraction comes at a price in cost, complexity and performance.

The historical lack of good API control has made bare metal less attractive, but that is changing quickly due to two forces.

These two forces are Container Platforms and Bare Metal as a Service or BMaaS (disclosure:RackN offers a private BMaaS platform calledDigital Rebar). Container Platforms such as Kubernetes provide an application service abstraction level for data center consumers that eliminates the need for users to worry about traditional infrastructure concerns. That means that most users no longer rely on APIs for compute, network or storage allowing the platform to handle those issues. On the other side, BMaaS VM infrastructure level APIs for the actual physical layer of the data center allow users who care about compute, network or storage the ability to work without VMs.

The IBM bare metal Kubernetes announcement illustrates both of these forces working together. Users of the managed Kubernetes service are working through the container abstraction interface and really don’t worry about the infrastructure; however, IBM is able to leverage their internal bare metal APIs to offer enhanced features to those users without changing the service offering. These benefits include security (IBM White Paper on Security), isolation, performance and (eventually) access to metal features like GPUs. While the IBM offering still includes VMs as an option, it is easy to anticipate that becoming less attractive for all but smaller clusters.

The impact for DC2020 is that operators need to rethink how they rely on virtualization as a ubiquitous abstraction. As more applications rely on container service abstractions the platforms will grow in size and virtualization will provide less value. With the advent of better control of the bare metal infrastructure, operators have real options to get deep control without adding virtualization as a requirement.

Shifting to new platforms creates opportunities to streamline operations in DC2020.

Even with virtualization and containers, having better control of the bare metal is a critical addition to data center operations. The ideal data center has automation and control APIs for every possible component from the metal up.

Background: This post was inspired by a mult-cloud session session at IBM Think2018 where I am attending as a guest of IBM. Providing hybrid solutions is a priority for IBM and it’s customers are clearly looking for multi-cloud options. In this way, IBM has made a choice to support competitive platforms. This post explores why they would do that.

There is considerable angst and hype over the terms multi-cloud and hybrid-cloud. While it would be much simpler if companies could silo into a single platform, innovation and economics requires a multi-party approach. The problem is NOT that we want to have choice and multiple suppliers. The problem is that we are moving so quickly that there is minimal interoperability and minimal efforts to create interoperability.

To drive interoperability, we need a strong commercial incentive to create an neutral ecosystem.

Even something with aclear ANSI standard like SQL has interoperability challenges. It also seems like the software industry has given up on standards in favor of APIs and rapid innovation. The reality on the ground is that technology is fundamentally heterogeneous and changing. For this reason, mono-anything is a myth and hybrid is really status quo.

If we accept multi-* that as a starting point, then we need to invest in portability and avoid platform assumptions when we build automation. Good design is toassume change at multiple points in your stack. Automation itself is a key requirement because it enables rapid iterative build, test and deploy cycles. It is not enough to automate for day 1, the key to working multi-* infrastructure is a continuous deployment pipeline.

That means the utility of tools like Terraform, Ansible or Docker is limited to how often you exercise them. Ideally, we’d build abstraction automation layers above these primitives; however, this has proven very difficult in practice. The degrees of variation between environments and pace of innovation make it impossible to standardize without becoming very restrictive. This may be possible for a single company but is not practical for a vendor trying to support many customers with a single platform.

This means that hybrid, while required in the market, carries an integration tax that needs to be considered.

My objective for discussing Data Center 2020 topics is to find ways to lower that tax and improve the outcome. I’m interested in hearing your opinion about this challenge and if you’ve found ways to solve it.

Counterpoint Addendum: if you are in a position to avoid multi-* deployments (e.g. a start-up) then you should consider that option. There is measurable overhead of heterogeneous automation; however, I’ve found the tipping point away from a mono-stack can be surprising low and committing to a vertical stack does make applications less innovation resilient.

Coming direct from Cambodia is a rare podcast with Jim Plamondon, the creator of how software platforms were built at Microsoft via APIs and developer evangelism. In this podcast, he talks about the early history of developer evangelism at Apple and Microsoft, the current state of open source, and the upcoming competitive industry coming from China and its roots in the third world.

Highlights

Soviet Agriculture and Technology Market Comparison

Why NeXT and Apple Failed with Software Industry but iPhone Succeeded

China Industry Takeover is Coming: Product Price Points

Books referenced in the podcast (links to Amazon, we have no agreement with them based on your click/purchase):

Digital Rebar is the open, fast and simple data center provisioning and control scaffolding designed with a cloud native architecture. Sponsored by RackN, this community is building an extensible stand-alone DCHP/PXE/IPXE service with minimal overhead offering a quick 5 minutes to provisioning solution.

Community Mission: Embrace the Heterogeneous Nature of Data Center Operations while Eliminating Complexity and Manual Steps.

We believe Cloud Native development disciplines are essential regardless of the infrastructure.

Today, RackN announce very low entry level support for Digital Rebar Provisioning – the RESTful Cobbler PXE/DHCP replacement. Having a company actually standing behind this core data center function with support is a big deal; however…

We’re making two BIG claims with Provision: breaking DevOps bottlenecks and cloud native physical provisioning. We think both points are critical to SRE and Ops success because our current approaches are not keeping pace with developer productivity and hardware complexity.

I’m going to post more about Provision can help address the political struggles of SRE and DevOps that I’ve been watching in our industry. A hint is in the release, but the Cloud Native comment needs to be addressed.

First, Cloud Native is an architecture, not an infrastructure statement.

There is no requirement that we use VMs or AWS in Cloud Native. From that perspective, “Cloud” is a useful but deceptive adjective. Cloud Native is born from applications that had to succeed in hands-off, lower SLA infrastructure with fast delivery cycles on untrusted systems. These are very hostile environments compared to “legacy” IT.

What makes Digital Rebar Provision Cloud Native? A lot!

The following is a list of key attributes I consider essential for Cloud Native design.

Micro-services Enabled: The larger Digital Rebar project is a micro-services design. Provision reflects a stand-alone bundling of two services: DHCP and Provision. The new Provision service is designed to both stand alone (with embedded UX) and be part of a larger system.

Swagger RESTful API: We designed the APIs first based on years of experience. We spent a lot of time making sure that the API conformed to spec and that includes maintaining the Swagger spec so integration is easy.

Remote CLI: We build and test our CLI extensively. In fact, we expect that to be the primary user interface.

Security Designed In: We are serious about security even in challenging environments like PXE where options are limited by 20 year old protocols. HTTPS is required and user or bearer token authentication is required. That means that even API calls from machines can be secured.

12 Factor & API Config: There is no file configuration for Provision. The system starts with command line flags or environment variables. Deeper configuration is done via API/CLI. That ensures that the system can be fully managed by remote and configured securely becausee credentials are required for configuration.

Fast Start / Golang: Provision is a totally self-contained golang app including the UX. Even so, it’s very small. You can run it on a laptop from nothing in about 2 minutes including download.

CI/CD Coverage: We committed to deep test coverage for Provision and have consistently increased coverage with every commit. It ensures quality and prevents regressions.

Documentation In-project Auto-generated: On-boarding is important since we’re talking about small, API-driven units. A lot of Provisioning documentation is generated directly from the code into the actual project documentation. Also, the written documentation is in Restructured Text in the project with good indexes and cross-references. We regenerate the documentation with every commit.

We believe these development disciplines are essential regardless of the infrastructure. That’s why we made sure the v3 Provision (and ultimately every component of Digital Rebar as we iterate to v3) was built to these standards.

I strive to stay neutral as OpenStack DefCore co-chair; however, as someone asking for another Board term, it’s important to review my thinking so that you can make an informed voting decision.

DefCore, while always on the edge of controversy, recently became ground zero for the “what is OpenStack” debate [discussion write up]. My preferred small core “it’s an IaaS product” answer is only one side. Others favor “it’s an open cloud community” while another faction champions an “open cloud platform.” I’m struggling to find a way that it can be all together at the same time.

The TL;DR is that, today, OpenStack vendors are required to implement a system that can run Linux guests. This is an example of an implementation over API bias because there’s nothing in the API that drives that specific requirement.

From a pragmatic “get it done” perspective, OpenStack needs to remain implementation driven for now. That means that we care that “OpenStack” clouds run VMs.

While there are pragmatic reasons for this, I think that long term success will require OpenStack to become an API specification. So today’s “right answer” actually undermines the long term community value. This has been a long standing paradox in OpenStack.

Breaking the API to implementation link allows an ecosystem to grow with truly alternate implementations (not just plug-ins). This is a threat to the community “upstream first” mentality. OpenStack needs to be confident enough in the quality and utility of the shared code base that it can allow competitive implementations. Open communities should not need walls to win but they do need clear API definition.

What is my posture for this specific issue? It’s complicated.

First, I think that the user and ecosystem expectations are being largely ignored in these discussions. Many of the controversial items here are vendor initiatives, not user needs. Right now, I’ve heard clearly that those expectations are for OpenStack to be an IaaS the runs VMs. OpenStack really needs to focus on delivering a reliably operable VM based IaaS experience. Until that’s solid, the other efforts are vendor noise.

Second, I think that there are serious test gaps that jeopardize the standard. The fundamental premise of DefCore is that we can use the development tests for API and behavior validation. We chose this path instead of creating an independent test suite. We either need to address tests for interop within the current body of tests or discuss splitting the efforts. Both require more investment than we’ve been willing to make.

We have mechanisms in place to collects data from test results and expand the test base. Instead of creating new rules or guidelines, I think we can work within the current framework.

The simple answer would be to block non-VM implementations; however, I trust that cloud consumers will make good decisions when given sufficient information. I think we need to fix the tests and accept non-VM clouds if they pass the corrected tests.

For this and other reasons, I want OpenStack vendors to be specific about the configurations that they test and support. We took steps to address this in DefCore last year but pulled back from being specific about requirements. In this particular case, I believe we should require the official OpenStack vendor to state clear details about their supported implementation. Customers will continue vote with their wallet about which configuration details are important.

This is a complex issue and we need community input. That means that we need to hear from you! Here’s the TC Position and the DefCore Patch.

Designated sections provides the “you must include this” part of the core definition. Having common code as part of core is a central part of how DefCore is driving OpenStack operability.

So, why do we need this?

From our very formation, OpenStack has valued implementation over specification; consequently, there is a fairly strong community bias to ensure contributions are upstreamed. This bias is codified into the very structure of the GNU General Public License (GPL) but intentionally missing in the Apache Public License (APL v2) that OpenStack follows. The choice of Apache2 was important for OpenStack to attract commercial interests, who often consider GPL a “poison pill” because of the upstream requirements.

Nothing in the Apache license requires consumers of the code to share their changes; however, the OpenStack foundation does have control of how the OpenStack™ brand is used. Thus it’s possible for someone to fork and reuse OpenStack code without permission, but they cannot called it “OpenStack” code. This restriction only has strength if the OpenStack brand has value (protecting that value is the primary duty of the Foundation).

This intersection between License and Brand is the essence of why the Board has created the DefCore process.

Ok, how are we going to pick the designated code?

Figuring out which code should be designated is highly project specific and ultimately subjective; however, it’s also important to the community that we have a consistent and predictable strategy. While the work falls to the project technical leads (with ratification by the Technical Committee), the DefCore and Technical committees worked together to define a set of principles to guide the selection.

This Technical Committee resolution formally approves the general selection principles for “designated sections” of code, as part of the DefCore effort. We’ve taken the liberty to create a graphical representation (above) that visualizes this table using white for designated and black for non-designated sections. We’ve also included the DefCore principle of having an official “reference implementation.”

Here is the text from the resolution presented as a table:

Should be DESIGNATED:

Should NOT be DESIGNATED:

code provides the project external REST API, or

code is shared and provides common functionality for all options, or

code implements logic that is critical for cross-platform operation

code interfaces to vendor-specific functions, or

project design explicitly intended this section to be replaceable, or

code extends the project external REST API in a new or different way, or

code is being deprecated

The resolution includes the expectation that “code that is not clearly designated is assumed to be designated unless determined otherwise. The default assumption will be to consider code designated.”

This definition is a starting point. Our next step is to apply these rules to projects and make sure that they provide meaningful results.

Wow, isn’t that a lot of code?

Not really. Its important to remember that designated sections alone do not define core: the must-pass tests are also a critical component. Consequently, designated code in projects that do not have must-pass tests is not actually required for OpenStack licensed implementation.