Kubernetes For The Microsoft Data Platform Professional 101

With the announcement of SQL Server 2019 big data clusters at Ignite, Kubernetes (often abbreviated to K8s) now stands front and center as part of Microsoft’s data platform vision. The obvious inference is that the Microsoft data platform community is going to show an increased interest in it. This post aims to provide some context around:

why container orchestration is required

how Kubernetes is architected

the basics of working with Kubernetes

and why embracing open source software should be approached in an eyes wide open manner

The Need For Orchestration

If you take Docker in isolation and consider what a data center looks like, logging on to individual machines in order to stop and start containers and check on their health is not a scalable practice. In fact, I would humbly suggest that Docker on its own should mostly be used for sandpit-type environments running locally on developers’ machines. For:

Horizontal scaling

Scheduling

Resilience

Password and config management

Storage orchestration

Connection load balancing

Service discovery

something is required that performs container orchestration. Thus the “container orchestration wars” began, with the following protagonists:

Docker Swarm

Mesosphere

Kubernetes

These are the protagonists of note, but there are also other options such as Nomad from HashiCorp, VMware’s Photon OS and Netflix’s Titus. Two years or so ago it became clear that Kubernetes was pulling away from the rest of the pack. All of the public cloud providers offer Kubernetes as a service and, having initially taken the stance of not shipping a Kubernetes-based platform, Docker announced at DockerCon Europe 2017 that it would fully embrace Kubernetes.

Google Trends illustrates the traction that Kubernetes is enjoying right now:

The Core Design Principles Behind Kubernetes

Kubernetes is a platform for developers

First and foremost Kubernetes is intended to be a platform for developers, as stated by Kubernetes founding engineer Brendan Burns:

From the beginning of the container revolution two things have become clear: First, the decoupling of the layers of the technology stack are producing a clean, principled layering of concepts with clear contracts, ownership and responsibility. Second, the introduction of these layers has enabled developers to focus their attention exclusively on the thing that matters to them – the application.

Kubernetes is declarative

Secondly, Kubernetes works in a declarative manner. Simply put, you specify the desired state you wish your application to execute in, and Kubernetes works to ensure that it always runs in that state. In the YAML for the deployment of a SQL Server instance to Kubernetes, note the line that states replicas: 1
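As a rough illustration of what such a deployment specification contains, the sketch below builds a comparable manifest as a Python dictionary and prints it as JSON, which kubectl accepts as well as YAML. The deployment name, the app: mssql label, the image tag and the secret name and key are assumptions made for the example, not details taken from the original post.

```python
import json


def mssql_deployment(replicas: int = 1) -> dict:
    """Build a Deployment manifest for a SQL Server pod (all names are illustrative)."""
    labels = {"app": "mssql"}  # assumed label; the selector and the pod template must agree
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "mssql-deployment"},  # assumed name
        "spec": {
            "replicas": replicas,  # the desired state: how many pods should exist
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "mssql",
                            "image": "mcr.microsoft.com/mssql/server:2019-latest",
                            "ports": [{"containerPort": 1433}],
                            "env": [
                                {"name": "ACCEPT_EULA", "value": "Y"},
                                {
                                    # password read from a Secret assumed to exist already
                                    "name": "MSSQL_SA_PASSWORD",
                                    "valueFrom": {
                                        "secretKeyRef": {"name": "mssql", "key": "SA_PASSWORD"}
                                    },
                                },
                            ],
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(mssql_deployment(), indent=2))
```

The printed JSON could be piped to kubectl apply -f - to create the deployment; changing the desired state is then a matter of editing the manifest and applying it again.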

Kubernetes relies on a number of control loops, one of which is known as a “replication controller”. A replication controller ensures that the number of replicas (pods) specified in the deployment specification is always up and running (for deployments, this work is carried out by the closely related ReplicaSet controller). In short, this drives the actual state towards the desired state. Originally, this was intended for stateless applications. Say, for example, that the aim of a container is to take an image and return the likelihood that it belongs to a particular class of image. To scale this out, it makes sense to set replicas to a value greater than one. For stateful applications in which containers cluster together, Kubernetes makes special provision in the form of an object known as a StatefulSet. StatefulSets are an advanced topic in their own right. For the purposes of this blog post, the focus will be on simple stateful applications. Because the desired state is for the pod containing the SQL Server instance to always be up, replicas is set to 1.
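The idea of driving actual state towards desired state can be sketched in a few lines of Python. This is a toy model of a single reconciliation pass, not how Kubernetes itself is implemented; the function names and the pod naming scheme are invented for the illustration.

```python
import itertools


def reconcile(desired_replicas, running_pods, new_pod_name):
    """One pass of a control loop: return the pod list after converging on the desired count."""
    pods = list(running_pods)
    while len(pods) < desired_replicas:  # too few pods: start replacements
        pods.append(new_pod_name())
    while len(pods) > desired_replicas:  # too many pods: stop the surplus
        pods.pop()
    return pods


_counter = itertools.count(1)


def next_pod_name():
    """Hypothetical naming scheme for newly created pods."""
    return f"mssql-{next(_counter)}"


if __name__ == "__main__":
    pods = reconcile(1, [], next_pod_name)    # initial rollout creates one pod
    print(pods)
    pods = []                                 # simulate the pod crashing
    pods = reconcile(1, pods, next_pod_name)  # the loop notices and replaces it
    print(pods)
```

With replicas set to 1, the loop never tries to scale SQL Server out; it only guarantees that a replacement pod is started whenever the existing one disappears.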

Kubernetes is (underlying) platform agnostic

Kubernetes runs on most of the popular public clouds, on bare metal, on virtualized infrastructure and on OpenStack. To quote Kubernetes co-founder Joe Beda, from an interview he gave to theCUBE:

What everyone has been struggling with cloud with, is not how I get a vm up, but how do I run my code. And as Google got more serious about the cloud, every big company wants to dogfood their products, so how do we make the experience developers inside Google have match the experience that cloud customers have.

Google achieved this by taking the engineering know-how that went into its own in-house orchestration platform, ‘Borg’, and using it to create Kubernetes.

Kubernetes Architecture

A High Level Overview

To distill K8s down as much as possible, it consists of:

a control plane, known as the master node

a data store for storing the state of the cluster based on the etcd key value store database

agents (kubelets) by which the master node talks to its worker nodes

For the purposes of high availability, two master nodes and three etcd instances are recommended. Graphically the high level architecture of Kubernetes looks something like this:

Pods

Containers run inside pods. A pod is the atomic unit of scheduling for a Kubernetes cluster. Containers within a pod are always co-scheduled on the same host and they share the same IP address, which remains stable for the lifetime of the pod. For stateful applications, a pod is not just one or more containers, but containers plus volumes. In short, pods embody the logic of actual applications.

Services

In order to compose the pods into an application that can be made available to the outside world, a service object is required:

Labels are key value pairs and can be associated with each object in Kubernetes. The ‘Selector’ in the service yaml says:

“I want the mssql-deployment service to consist of all pods labelled mssql”

The service will perform load balancing across all pods associated with it. A control loop checks to see if any new pod replicas are spun up, and adjusts the load balancing accordingly.
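The way a selector picks out pods by label can be sketched as follows. This is only an illustration of equality-based label matching; the pod names, IP addresses and labels are made up, and the real service machinery does considerably more.

```python
def matches(selector: dict, labels: dict) -> bool:
    """A pod matches when it carries every key/value pair named in the selector."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints(selector: dict, pods: list) -> list:
    """Return the IP addresses of the pods a service with this selector would route to."""
    return [pod["ip"] for pod in pods if matches(selector, pod["labels"])]


# Hypothetical cluster contents for the illustration.
pods = [
    {"name": "mssql-1", "ip": "10.0.0.4", "labels": {"app": "mssql"}},
    {"name": "web-1", "ip": "10.0.0.5", "labels": {"app": "web"}},
    {"name": "mssql-2", "ip": "10.0.0.6", "labels": {"app": "mssql", "tier": "data"}},
]

if __name__ == "__main__":
    print(endpoints({"app": "mssql"}, pods))
```

Extra labels on a pod do not stop it matching, so a new replica carrying app: mssql joins the service's set of endpoints the next time the control loop re-evaluates the selector.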

Typically, when you are working with Kubernetes and a public cloud provider, the service endpoint also acts as a load balancer. When working with Kubernetes on premises, you will need to create an ingress controller to load balance connections for you.

Scheduling

And now I have to explain scheduling to you guys, and the fastest way to do this because I don’t have much time, is to play Tetris.

The easiest games of Tetris are when the blocks that fall from the top of the screen are based on a multiple of a standard block size. Ergo, the Kubernetes documentation makes the following recommendation:

By configuring the CPU requests and limits of the Containers that run in your cluster, you can make efficient use of the CPU resources available on your cluster Nodes. By keeping a Pod CPU request low, you give the Pod a good chance of being scheduled. By having a CPU limit that is greater than the CPU request, you accomplish two things:

The Pod can have bursts of activity where it makes use of CPU resources that happen to be available.

The amount of CPU resources a Pod can use during a burst is limited to some reasonable amount.
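To make the Tetris analogy concrete, here is a minimal sketch of the "does this pod fit?" check, using CPU requests expressed in millicores. It is a deliberately simplified first-fit placement with invented node names and sizes; the real scheduler filters and scores nodes on many more criteria.

```python
from typing import Optional


def pick_node(request_m: int, nodes: dict) -> Optional[str]:
    """Place a pod on the first node whose unreserved CPU covers the pod's request."""
    for name, node in nodes.items():
        free_m = node["allocatable_m"] - node["requested_m"]
        if request_m <= free_m:
            node["requested_m"] += request_m  # reserve the CPU for this pod
            return name
    return None  # no node has room: the pod stays pending


if __name__ == "__main__":
    # Hypothetical two-node cluster, 2 CPUs (2000m) each.
    nodes = {
        "node-a": {"allocatable_m": 2000, "requested_m": 1500},
        "node-b": {"allocatable_m": 2000, "requested_m": 0},
    }
    print(pick_node(250, nodes))   # a small request fits in the gap left on node-a
    print(pick_node(1000, nodes))  # a larger request only fits on node-b
    print(pick_node(4000, nodes))  # larger than any node can offer
```

Placement is decided on the request, not the limit, which is why a low request gives a pod a good chance of being scheduled while the limit caps how far it can burst afterwards.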

Tooling

The fact that a SQL Server instance runs on Kubernetes is opaque to any client tool that accesses it. The PowerShell-based scripts most Microsoft data platform professionals know and love can still be executed outside the cluster against SQL Server instances running inside the cluster. All Kubernetes tooling communicates with a cluster via the API server (a REST endpoint). The most fundamental tool used when administering Kubernetes and deploying applications to it is kubectl, which is essentially a command line for talking to the API server via YAML.

When Google open sourced Kubernetes, it teamed up with the Linux Foundation to form the Cloud Native Computing Foundation (CNCF). The significance of this is that the CNCF is where a lot of the popular Kubernetes tools come from:

Most people will eventually find that working with Kubernetes is a very YAML-intensive experience. The solution to this is to become acquainted with what is essentially the Kubernetes package manager: Helm. In keeping with the nautical theme of Kubernetes, Helm packages are referred to as charts, and these can (and should) be parameterised. Helm is not perfect and there are alternatives to it, such as Skaffold; however, it does a job and it has been widely adopted. Helm in its current incarnation is on version 2.0 at the time of writing this post. Many of the criticisms Helm has come in for are addressed in version 3.0 of the tool.

The touch point for storage in a pod is the volume. Volumes in turn consume storage via persistent volume claims (PVCs), and the relationship between a volume and a PVC is 1:1. Because each SQL Server instance therefore requires its own persistent volume claims, you may want to install the chart for SQL Server using a different values file for each environment, say test, user acceptance, integration testing and so on. Each instance’s port 1433 will also need to map to a unique external port.
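The per-environment values files can be thought of as small overrides layered over a chart's defaults. The sketch below roughly mimics that layering with a recursive dictionary merge; the keys (service.externalPort, persistence.claimName), port numbers and claim names are hypothetical and do not come from any particular SQL Server chart.

```python
import copy


def merge_values(base: dict, override: dict) -> dict:
    """Overlay one set of values on another, merging nested maps key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical chart defaults.
base = {
    "service": {"port": 1433, "externalPort": 1433},
    "persistence": {"claimName": "mssql-data", "size": "8Gi"},
}

# Hypothetical per-environment overrides: one PVC and one external port each.
environments = {
    "test": {"service": {"externalPort": 31433}, "persistence": {"claimName": "mssql-data-test"}},
    "uat": {"service": {"externalPort": 32433}, "persistence": {"claimName": "mssql-data-uat"}},
}

rendered = {env: merge_values(base, override) for env, override in environments.items()}

# Guard against two environments being mapped to the same external port.
external_ports = [values["service"]["externalPort"] for values in rendered.values()]
assert len(set(external_ports)) == len(external_ports), "external ports must be unique"
```

Each override only needs to state what differs from the defaults, which keeps the per-environment files short and makes a clash over ports or claims easy to spot.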

Can I Run Stateful Apps On Kubernetes?

A lot of people say you cannot run stateful things (on K8s), you can totally run stateful things

Another factor in the widespread adoption of Kubernetes is the outstanding developer community engagement work carried out by this man:

There has been some debate in the community over whether Kubernetes is a suitable platform on which to run SQL Server. What the Microsoft data platform community should really be cognizant of is what using open source projects in production really means. If you head over to GitHub, you will notice that Kubernetes is available under the Apache 2.0 license:

Simply put, the Apache 2.0 license means that, provided the correct notices are observed, the software can be freely reused. The Apache 2.0 license provides zero guarantees as to bug fixes being made available or things appearing on the road map. If we go back to GitHub and look at the open issues sorted in ascending date order, we see the following. Note that the issue at the top has been open since 2014:

Kubernetes Based Platform-As-A-Service

The 64 million dollar question is: if I hit a bug, or worse still an edge case, where do I go for bug fixes? The answer is to use Kubernetes as a service in the public cloud, or a PaaS based on Kubernetes such as Red Hat OpenShift (from version 3.0 onward).

You can run this on premises or in the public cloud. In fact, if you run it on IaaS in the public cloud, you avoid cloud vendor lock-in, you have commercial support, and you have 100% control over what fixes and updates are applied and when.

Addendum 9/11/2016

Vanilla Kubernetes is 100% portable, as are most vendors’ implementations of Kubernetes as a service. What makes OpenShift special is that it is Kubernetes plus a tool chain. Because the tool chain is also portable, you do not get locked into using tools that are bespoke to specific cloud providers.

There are other Kubernetes based PaaS offerings out there, but right now at the time of writing this blog post, OpenShift is the PaaS offering that Microsoft is putting its weight behind.

The developers that maintain Kubernetes are not just any old developers: thockin, or Tim Hockin to give him his full name, is a software engineer at Google. Aside from this, the question should be asked: is maintaining an open source based platform core to what my organisation does, and is it something that contributes to the organisation’s bottom line?

Consider the fact that developers love open source projects:

In the words of Kubernetes founding engineer Brendan Burns, releasing tools to your cluster without the appropriate prior due diligence can create problems:

However, these tools can also make your cluster more unstable, less secure, and more prone to failures. They can expose your users to immature, poorly supported software that feels like an “official” part of the cluster but actually serves to make the users’ life more difficult. Part of managing a Kubernetes cluster is knowing how and when to add these tools, platforms, and projects into the cluster.

Managing Kubernetes, O’Reilly Press

I hope that this blog post has provided some (much needed, IMHO) context behind Kubernetes. Future blog posts will be more practical in nature and will cover deploying and managing applications on a cluster.