Printing Money with your Idle Hardware

{GUEST POST} About the Author: Samuel Cozannet is a Strategic Cloud Expert with a strong technical background (OpenStack, Kubernetes, Big Data, public cloud, etc.) built over 10+ years of experience in product management, operations, architecture and DevOps.

Let’s assume you operate a large bare metal cluster which you rent to your customers to run their workloads.

You notice that the cluster has in fact only been used at 60% on average over the last few months. You keep 15% of capacity free in case of surges, and you need 5% of the cluster to operate the infra itself.

There are peaks at certain times of the day when the load reaches 90%, but these are really hard to predict because they depend on your customers’ businesses and do not follow patterns in any data you own.

Overall, you always run with about 20% of untapped capacity.

This may happen to you as a service provider or when you have a bunch of servers in your office.

Either way, it's a waste.

What if we could leverage that pool of resources and put it to good use? Let's see how. Yes, Tesla Motors, this is for you. :)

"Opportunistic AutoScaling"

Kubernetes and Autoscaling

Since v1.2 of K8s, you can use the Horizontal Pod Autoscaler to autoscale an app based on CPU consumption. Cool, but a very limited use case.

Since version 1.6, you may use custom metrics to autoscale an application within a cluster. Much better. It means you can expose metrics such as the number of hits on an API or its latency, then scale the serving pods according to that metric instead of the default CPU load.
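To make this concrete, here is a sketch of what a custom-metric HPA looks like on a 1.8 cluster. This is not a manifest from this post; every name and value below is illustrative:

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: sample-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: sample-api            # the serving pods to scale
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods                  # a custom per-pod metric instead of CPU
    pods:
      metricName: http_requests # e.g. hits on the API
      targetAverageValue: 500m  # target average per pod
```

The HPA compares the observed average of `http_requests` across pods to the target and scales the Deployment accordingly.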

Now, doesn't that mean we could expose the remaining capacity of the cluster (20%) and leverage it to autoscale a money-printing application, so that the cluster permanently runs at its optimal capacity?

Opportunism

The above behaviour is what I call "Opportunistic Autoscaling": the ability for an app to leverage otherwise unused capacity of the infrastructure, be it CPU, memory or GPUs.

The business-critical app is measured on how it performs: your API must always answer with low latency, for example. The non-business app, on the other hand, can only consume what is left: if your paid load goes UP, the remaining capacity goes DOWN, and the number of opportunistic containers should go DOWN.

If, on the contrary, your paid load goes DOWN, the remaining capacity goes UP, and the number of opportunistic containers should go UP.

The target is a load that stays as constant as possible around a threshold you define (80% in our example), thus harvesting the average 20% of unused power and monetizing it.

Printing Money

It may look like a silly application, but I can definitely tell you that mining pools seem to see a load increase when office hours finish, showing that (some) business resources are definitely being used for mining at night!

In a real-life scenario, crypto-mining effectively adds resources to a compute grid, so it also makes sense beyond the hype and fun. There are also a lot of other interesting use cases, including:

Lambda on the edges using a serverless framework (Telco / Cloud Operator);

Elastic transcoding (Media Lab / Cloud): Think of what Ikea is doing on workstations but in a compute cluster;

AI on the edges (Media Lab / Cloud);

Caching (CDN);

The cool use case you’ll share in comments.

OK, enough talking! Let's get this done and increase revenues.

DISCLAIMER: Deploying this is fairly complex and involves multiple steps. As a consequence, this post is a lot longer and more technical than usual. If you are here because you like the use case but do not wish to dig into the technical details, you can skip straight to the conclusion.

Using Reversed Custom Metrics

In this post, we will create a K8s cluster with a custom metrics API on bare metal.

We then create an app that exposes (among others) a metric as follows:

This metric decreases when the load of the cluster grows, and grows when the load shrinks; it is effectively the inverse of the load requested on the cluster.
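The original snippet did not survive editing, but the idea can be sketched in a few lines of Python (the function and variable names are mine, not the actual code from the repo):

```python
def cpu_capacity_remaining(total_cores, reserved_cores, used_cores):
    """Reversed metric: shrinks as cluster load grows, grows as it shrinks.

    Clamped at zero so the autoscaler never sees a negative value.
    """
    return max(total_cores - reserved_cores - used_cores, 0)

# 184-core cluster, 30 cores held back for surges, 100 cores busy:
print(cpu_capacity_remaining(184, 30, 100))  # 54 cores left to harvest
```

In the real deployment, a value like this is exported as a Prometheus gauge and scraped by the pipeline we set up below.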

Then we will use this metric to configure a Horizontal Pod Autoscaler (HPA) in Kubernetes. This will result in keeping the load in the cluster as high as possible.

Full disclosure

This blog post has been sponsored by Kontron, who graciously allowed me to play with a 6-node cluster of their latest SymKloud Converged Platform, an awesome piece of hardware in which you can mix and match node modules to create a cluster. There are modules with GPUs, modules with CPUs, some dedicated to storage or caching… Each 2U server can contain up to 9 "sleds": up to 288 cores with dual-CPU sleds, or up to 144 cores with 9 single-CPU sleds, each attached to a corresponding Nvidia P4 module. My cluster had 6 workers, 2 of them with GPUs.

In addition, my friend Ronan Delacroix helped me with the code and wrote all of the Python needed for this experiment.

This work was presented at MWC 2018 in Barcelona at the Kontron booth.

If you need to replicate this in one way or another, you will need a K8s cluster running version 1.8, with RBAC active and an admin role.

Important Note: The APIs we will be using here are very unstable and subject to big changes. I really recommend reading the K8s changelog to check on them.

For example, there was a change in 1.8 on the name of the APIs. If you run a 1.7 cluster, this will impact you.

There are also changes in 1.9: custom.metrics.k8s.io moves from v1alpha1 to v1beta1.

Some details of the configuration we will see today are done a certain way on CDK and may be slightly different on self-hosted clusters. I will try to mention them whenever possible. In any case, feel free to ask questions in the Q&A.

RBAC Configuration

NOTE: this applies to CDK, and will apply to GKE when the API Aggregation is GA with K8s 1.9.

On CDK and GKE the default user is not a real admin from an RBAC perspective, so you need to update it before you can create other Cluster Role Bindings that extend your own role.

Note: By default, Helm deploys without resource constraints. When trying to saturate a cluster and maximize its usage, this means that Tiller will be among the pods that may be evicted when resources are exhausted. If you do not want that to happen, you can edit the manifest and reapply it:
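For example, a sketch of the kind of change involved: giving the tiller container in the tiller-deploy Deployment (kube-system namespace) explicit resource requests so the scheduler protects it under pressure. The values below are illustrative; size them to your cluster:

```yaml
# Fragment of the tiller-deploy Deployment spec (kube-system namespace)
spec:
  template:
    spec:
      containers:
      - name: tiller
        resources:
          requests:
            cpu: 100m      # illustrative values, not a recommendation
            memory: 128Mi
```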

This is because the proxy in CDK uses a Kubeconfig and not a client certificate.

However, we do enable the aggregator routing because the control plane of Kubernetes is not self hosted, and we fall under the case “If you are not running kube-proxy on a host running the API server, then you must make sure that the system is enabled with the enable-aggregator-routing flag”.

We also added the client-ca-file flag to export the CA of the API server into the cluster.

Now for the Controller Manager, we must tell it to use the HPA, which we do with:
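On CDK this goes through the kubernetes-master charm options. The flags below are what I would expect for a 1.8 cluster; they are reconstructed, so verify them against your K8s version before applying:

```shell
# Enable the HPA against the custom metrics API, and shorten the scaling
# delays so demo effects show up quickly (defaults: 3m up, 5m down).
juju config kubernetes-master controller-manager-extra-args="\
horizontal-pod-autoscaler-use-rest-clients=true \
horizontal-pod-autoscaler-upscale-delay=1m \
horizontal-pod-autoscaler-downscale-delay=1m"
```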

Note that the last 2 options here are really for demos, to make the results of actions quick to observe. You may not need to change them for your use case (they default to 3m and 5m).

Just to make sure the settings are applied restart the 2 services with:

$ for service in apiserver controller-manager; do
    juju run --application kubernetes-master "sudo systemctl restart snap.kube-${service}.daemon.service"
  done

This will make Kubernetes create a configmap called extension-apiserver-authentication in the kube-system namespace, which contains all the additional flags we generated and their configuration. You can have a look at it with kubectl.
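The exact command got cut from the original; assuming standard kubectl access, it is presumably along the lines of:

```shell
kubectl -n kube-system get configmap extension-apiserver-authentication -o yaml
```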

At the end of the next sections, you will have 3 more APIs in this list:

monitoring.coreos.com/v1, for the Prometheus Operator

metrics.k8s.io, for the Metrics Server that collects metrics for CPU and Memory

custom.metrics.k8s.io, for the custom metrics you want to expose

Adding the Metrics Server API

There are 2 implementations of the Metrics API (metrics.k8s.io) at this stage: Heapster and the Metrics Server. At the time of this writing, the Metrics Server has a simple deployment method, while Heapster required some work on my end, and I was too lazy to write the code.

We can simply deploy it with:

kubectl create -f src/manifest-metrics-server.yaml

This manifest contains:

the Service Account for the Metrics Server

a RoleBinding so that the Metrics Server can read the configmap above

a ClusterRoleBinding so that the Metrics Server inherits the system:auth-delegator ClusterRole (you can find documentation about that here);

a Deployment and ClusterIP Service for the Metrics Server

an APIService object, which is a registration of the new API into the API Server.

No big surprise here, you can access the CPU and memory consumption in real time. Refer to the docs for more details about how to query the API.

Installing the Custom Metrics Pipeline

Right before, we took a shortcut: we had a metrics pipeline that is directly exposable as an aggregated API. Unfortunately, in the case of custom metrics, we must do this in 2 distinct steps.

First of all, we must deploy the custom metrics pipeline, which will give us the ability to collect metrics. We use Prometheus for that part, as the canonical example of a metrics collection system on K8s.

Then we will expose these metrics via a specific API Server. We will use the work of Sully (@DirectXMan12) that can be found here for that.

Prometheus has many installation methods. My personal favorite is the Prometheus Operator. It takes a lot of effort to architect a piece of software using traditional approaches; crafting a software model that ties beautifully to the underlying distributed infrastructure is closer to art than to anything else.

That is essentially what the operator is: it models how Prometheus should behave given a set of conditions, then realizes that in Kubernetes. Wow, good job @CoreOS.

Note that you can create an Operator for anything, and something similar is coming for Tensorflow, as far as I can tell from the APIs coming up… Anyway, let’s not get distracted.

Install the Prometheus Operator with:

$ kubectl create -f src/manifest-prometheus-operator.yaml

This contains:

a Service Account for the operator

a ClusterRole and ClusterRoleBinding that are fairly extensive, so that the Operator can deploy Custom Resource Definitions for Prometheus instances, Alert Managers and Service Monitors.

The RBAC manifest will allow Prometheus to read the metrics it needs in the cluster and /metrics endpoints of any object (pod or service). The Prometheus manifest defines an instance and a service to expose it as a nodePort (so we can have a look at the UI).

What is important in this second file is the section:

serviceMonitorSelector:
  matchLabels:
    demo: autoscaling

This essentially dedicates the Prometheus instance to Service Monitors with this label (or set of labels). When we define the applications we want to monitor, and how, we will need that information.

Note that this is a trivial example of deployment, with no persistent storage or any fancy things. If you are contemplating a more production-grade usage, you will need to spend some time on this.

OK, now you can connect to the UI and check that everything deployed correctly. It is pretty empty for now…

Installing the Custom Metrics Adapter

Now that we have the ability to collect metrics via our Prometheus pipeline, we want to expose them under the aggregated API.

First of all, you will need some certificates. Joy. This is all documented here. Run the following commands to generate your precious:
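The exact commands were lost in editing; a minimal openssl equivalent would look like the following. The CN and file names are illustrative, so adapt them to the service name you actually deploy the adapter under:

```shell
# Generate a self-signed CA, then a serving certificate signed by it.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout ca.key -out ca.crt -subj "/CN=custom-metrics-ca"
openssl req -newkey rsa:2048 -nodes \
  -keyout serving.key -out serving.csr \
  -subj "/CN=custom-metrics-apiserver.monitoring.svc"
openssl x509 -req -in serving.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -days 365 -out serving.crt
```

The serving pair then gets packaged into a Secret that the adapter Deployment mounts.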

In order to authenticate our extended API server against the Kubernetes API Server, we have several options:

Using a client certificate

Using a Kubeconfig file

Using BasicAuth or Token authentication

Adding users with certificates in CDK is a project in itself and would deserve its own blog post. If interested, ping me in the questions and we can discuss it in DMs. BasicAuth and Tokens are easy, but they also require editing /root/cdk/known_tokens.csv or /root/cdk/basic_auth.csv on all masters and restarting the API server daemon everywhere.

So the solution with the least complexity is actually the Kubeconfig file. Thanks to RBAC, the only thing we need to create a new user is a service account, which gives us access to an authentication token that we can then put into our kubeconfig.
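A sketch of those steps, with placeholder names and values. The token would come from the service account's secret on a live cluster; here we just assemble the kubeconfig around it:

```shell
# On a live cluster, the token is retrieved along these lines:
#   kubectl -n monitoring create serviceaccount custom-metrics-reader
#   SECRET=$(kubectl -n monitoring get sa custom-metrics-reader -o jsonpath='{.secrets[0].name}')
#   TOKEN=$(kubectl -n monitoring get secret "$SECRET" -o jsonpath='{.data.token}' | base64 -d)
TOKEN="<service-account-token>"
cat > adapter-kubeconfig <<EOF
apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    insecure-skip-tls-verify: true
    server: https://kubernetes.default.svc:443
users:
- name: adapter
  user:
    token: ${TOKEN}
contexts:
- name: adapter@local
  context:
    cluster: local
    user: adapter
current-context: adapter@local
EOF
```

The resulting file is mounted into the extended API server pod so it can authenticate against the main API server.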

a metric of type Pods which tries to make sure that pods receive an average of 500m queries (which is slightly above the standard background load from Kubernetes + Prometheus)

So this means that an application does not need to rely on its own metrics: you could target any application's metrics and use them to manage another application. This is a very powerful principle.

Let us say for example that you manage an application based on the principle of decoupled invocation, such as a chat or an order management solution. Some day, you start getting a peak of requests on the front end, and the backend does not follow. The queue fills up, and you start experiencing delays in the processing of requests. Well, now you can scale the workers that process the queue based on the requests made on the front end: you create a target object that monitors the number of http_requests on the frontend, while the scale target is your backend. It is as simple as that.
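Sketched as a 1.8-style HPA (every name here is hypothetical), the queue workers scale on a metric observed on the frontend service:

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: queue-workers
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: queue-workers       # what we scale: the backend workers
  minReplicas: 2
  maxReplicas: 30
  metrics:
  - type: Object              # a metric on another object drives the scaling
    object:
      target:
        kind: Service
        name: frontend        # whose metric we watch
      metricName: http_requests
      targetValue: 500m
```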

Now look at how the custom metrics API reacts to this (it may take a couple of minutes before this works):

What is of interest to us in this example is the “cpu_capacity_remaining” metric. As mentioned in the intro, thanks to Kontron I had access to a 184-core cluster. I decided to “reserve” 30 cores, about 15% of my capacity, to give room for load peaks. This gave me an autoscaler looking like this:
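The actual definition lives in src/manifest-etn.yaml; from memory it was shaped roughly like the following, so treat every name and value as illustrative:

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: etn-miner
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: etn-miner               # the opportunistic mining pods
  minReplicas: 1
  maxReplicas: 40
  metrics:
  - type: Object
    object:
      target:
        kind: Service
        name: sample-metrics-app  # the app exporting the reversed metric
      metricName: cpu_capacity_remaining
      targetValue: 24             # free cores to leave available
```

When more cores than the target are free, the observed/target ratio exceeds 1 and the HPA adds miners, which pulls the free capacity back down toward the target.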

You will note I am using Electroneum as my crypto. The reason for this is practical: it is a very new cryptocurrency, with limited mining resources allocated to it right now, which means you can directly measure your impact and see daily returns, which is cool for demos. In case you are wondering, since this currency requires a Monero miner, this setup can easily be converted into something more lucrative by pointing it at a real Monero pool.

To replicate this blog post with your own machines, edit the src/manifest-etn.yaml file according to your own cluster, then deploy with:

$ kubectl apply -f src/manifest-etn.yaml

This manifest contains:

a Deployment of the miner

a Horizontal Pod Autoscaler as seen above

a service to expose the UI on port 30500 of nodes.

Now let us check on our HPAs with:

$ kubectl get hpa -w

Alright, we are all set! Now we can finally check how our application reacts to load.

Opportunistic Load Balancer in Motion

In order to supercharge our cluster, we reuse our shell-demo application and generate 10 hits per second on the API for 5 minutes. Because we are expecting only 0.5 hits per second, this will quickly trigger the scale-out:

@shell-demo:/# for i in $(seq 1 1 3000); do curl -sL http://sample-metrics-app.default.svc; sleep 0.1 ; done
Hello! My name is sample-metrics-app-85b4c48ff-pgmm5. I have served 23528 requests so far.
Hello! My name is sample-metrics-app-85b4c48ff-pgmm5. I have served 23530 requests so far.
Hello! My name is sample-metrics-app-85b4c48ff-pgmm5. I have served 23532 requests so far.
Hello! My name is sample-metrics-app-85b4c48ff-gskxw. I have served 4 requests so far.
Hello! My name is sample-metrics-app-85b4c48ff-gskxw. I have served 5 requests so far.
Hello! My name is sample-metrics-app-85b4c48ff-8lpt5. I have served 23524 requests so far.
Hello! My name is sample-metrics-app-85b4c48ff-gskxw. I have served 6 requests so far.
Hello! My name is sample-metrics-app-85b4c48ff-pgmm5. I have served 7 requests so far.
Hello! My name is sample-metrics-app-85b4c48ff-pgmm5. I have served 23534 requests so far.
...

There you go, we can see the new pods coming in. Each new pod requests 4 CPU cores from the cluster. This unbalances the miner's HPA, which tries to compensate by releasing miners. Over 5 minutes, our app scales up to 17 replicas, claiming 68 cores from the cluster, which are freed by the mining app. After 5 minutes, the load is back to normal and we see the simple app scale down from 17 pods to its stable 2 replicas. The HPA for the miner then reacts and starts harvesting the capacity again.

This can be seen in the UI on the CPU capacity graph.

We now have an application that is opportunistically adjusting to the load created by other applications in the cluster all by itself. To see a little better how the HPA behaves, we can look directly at Grafana:

Here you can clearly identify the peaks of load on the second graph in green, and how the HPA reacts by scaling the number of API replicas. On the top graph, we can see the blue area (business load) going up and, shortly after, the yellow line (the opportunistic app scaling down) going down, with the red "remaining CPU cores" line fluctuating, while the total (yellow + blue + red) stays roughly constant, representing the total number of cores in the system (184).

I should share a potato, as this post was really, really long.

Some thoughts about the HPA

Keep the non-business load low

While creating this blog post, I had a really hard time configuring the HPA to be stable and convergent rather than completely erratic. One must understand that the HPA in K8s is, so far, pretty dumb. It does not learn from the past; it systematically repeats the same reaction patterns regardless of whether they failed or succeeded before.

Let’s say a custom metric is at 150% of its target value; the cluster will then perform a 150% capacity increase. This means that if your application moves the HPA metric by 2% for every 1% change in scale, you enter a turbulence zone, with the HPA incapable of converging because it is always overreacting to its environment.

Because of that behaviour, if the opportunistic load represents the majority of your total cluster load, you risk an ever-fluctuating, sub-optimal HPA. Below is an example where the mining rig varies between 15 and 120 cores (60% of the cluster), while the business load is only ~20%. Under these conditions, the cluster takes too long to converge and, effectively, sometimes never does.
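To see why this diverges, here is a toy Python simulation of the HPA's ratio rule, with a hypothetical metric that overreacts (moving ~2% for every 1% change in scale):

```python
import math

def hpa_desired(replicas, metric, target):
    """The core HPA rule: scale replicas by the observed/target ratio."""
    return max(1, math.ceil(replicas * metric / target))

def overreactive_metric(replicas, target):
    """A hypothetical metric with gain 2: doubling the replicas quarters
    the metric instead of halving it."""
    return target * (4.0 / replicas) ** 2

def simulate(steps, replicas=1, target=100.0):
    """Replay the control loop, recording the replica count at each step."""
    history = [replicas]
    for _ in range(steps):
        metric = overreactive_metric(replicas, target)
        replicas = hpa_desired(replicas, metric, target)
        history.append(replicas)
    return history

print(simulate(6))  # oscillates forever: [1, 16, 1, 16, 1, 16, 1]
```

With gain 1 (the metric moving 1% per 1% of scale) the same loop settles on a fixed point; with gain 2 it bounces between extremes indefinitely, which is exactly the turbulence described above.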

Long story short: NEVER, EVER use an HPA that can diverge! Experiment and learn, and keep the HPA's influence on the cluster reasonable.

Total Failure

So this is something I was not able to debug completely.

In the last Grafana screenshot above, you can see that there is a longer stretch of high load during the second load surge. In effect, the HPA got stuck and, for some reason, would never come back down until forced to.

From my experience, this only happens when an HPA fluctuates greatly and then reaches its max. If that condition lasts too long, the HPA then fails to downscale afterwards, effectively crashing.

Again, when building an HPA, do some experiments. Test your metrics, make sure they work well together.

Conclusion

I always dreamt of building the “Opportunistic Autoscaler”. For the first time in my life, thanks to Ronan, Kontron, and the awesome work done by the community on Kubernetes and Canonical on CDK, I was able to put it together. And it "just works"!

At the beginning of the post, we wanted to add value by either reducing costs or increasing revenues. Depending on your opportunistic app, you may be in either or both of these cases.

For sure, over the course of the month this was up and running, we managed to:

average ~25 free CPU cores, against a target of 24 in the HPA, while creating random load every 30 minutes;

opportunistically consume an average of 30 cores, which would otherwise have been lost.

Does this make money when mining cryptos? Not much. We mined about 1,000 ETN while testing this, for a total value of about $100. More than nothing, but not a lot.

But now.... think Serverless and do the math. 30 cores is about 1 sled of the system, free at any time. Assuming this also translates into the same amount of RAM being free:

A sled can have up to 256GB RAM;

For Lambda, AWS charges $0.00001667 per GB-s, plus a bit for the invocations;

There are 86,400 s/day × 365.25 days/yr = 31,557,600 s/yr;

soooo… 256 × 0.00001667 × 31,557,600 = $134,672.69

If it were permanently running Lambda at 100% of the time, $134,672.69 is the business value of the very sled we just used. Not bad for an "unused" resource.
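Sanity-checking the arithmetic:

```python
ram_gb = 256                       # RAM in one fully populated sled
price_per_gb_s = 0.00001667        # AWS Lambda price per GB-second
seconds_per_year = 86400 * 365.25  # 31,557,600 s

annual_value = ram_gb * price_per_gb_s * seconds_per_year
print(f"${annual_value:,.2f}")     # $134,672.69
```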

Does that give you some ideas?

References

I would like to give a special thanks to @Luxas and @DirectXMan12 for inspiring this work and for their fantastic walkthroughs.