
How to write a custom Kubernetes scheduler using your monitoring metrics

This article covers the use case of creating a custom Kubernetes scheduler and implements an example using monitoring metrics coming from Sysdig: system, network, services, statsd, JMX or Prometheus metrics.

The default Kubernetes scheduler does a fantastic job for most typical workloads. Starting with Kubernetes 1.6, advanced scheduling features such as node and pod affinity, taints and tolerations allow you to configure several pod scheduling policies: on a specific set of nodes (node affinity/anti-affinity), close to or far away from other running pods (pod affinity/anti-affinity), or based on node taints that pods do or do not tolerate (taints and tolerations).

But maybe you have some more specific requirements or would like to use higher level and dynamic application information to map your new pods to the physical nodes. Always striving for extensibility and flexibility, Kubernetes 1.6 introduced multiple scheduler/custom scheduler support as a beta feature.

What if you could use any of the metrics already present in your Kubernetes monitoring system to configure the behaviour of your pod scheduler?

The following is an example of a custom scheduler using metrics from our Kubernetes monitoring tool: Sysdig Monitor. In Sysdig, all metrics are automatically tagged with Kubernetes metadata, so you can easily do advanced monitoring, alerting, troubleshooting and now, advanced scheduling too.

Coding this scheduler may be a lot simpler than you imagine. Let’s start with a simple example to give you some context and ideas. Say, for example, that you want to optimize the responsiveness that your users perceive, so you decide to place new web server pods on the physical host that is scoring the best HTTP response times at that specific point in time.

Normally, as a prerequisite, you would have to instrument your application, but Sysdig collects request, error and response time metrics for any application or service without any kind of code instrumentation. And if you wanted to write the scheduler based on the behavior of an internal application metric, Sysdig will get any statsd, JMX or Prometheus metrics for you automagically. Awesome, isn’t it?

Configure your pods to use a custom Kubernetes scheduler

First, you need to configure your pods to use a custom scheduler:

This is a very simple vanilla Nginx ReplicationController. Note that we added schedulerName: sysdigsched to the pod definition. Remember that this is a Kubernetes 1.6+ feature, so this config parameter will throw an error on older versions.
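As a minimal sketch of what such a definition could look like, the snippet below builds the manifest as a Python dictionary and prints it as JSON, which kubectl accepts as well as YAML. Only schedulerName: sysdigsched comes from this article; the object name, labels, image and replica count are assumptions chosen for illustration.

```python
import json

# Hypothetical values: the "nginx" name and label, the image and the replica
# count are illustrative. Only schedulerName: sysdigsched is from the article.
manifest = {
    "apiVersion": "v1",
    "kind": "ReplicationController",
    "metadata": {"name": "nginx"},
    "spec": {
        "replicas": 3,
        "selector": {"app": "nginx"},
        "template": {
            "metadata": {"labels": {"app": "nginx"}},
            "spec": {
                # Pods from this template are left to the scheduler with this name.
                "schedulerName": "sysdigsched",
                "containers": [
                    {
                        "name": "nginx",
                        "image": "nginx",
                        "ports": [{"containerPort": 80}],
                    }
                ],
            },
        },
    },
}

print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and created with kubectl create -f.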

For this example, we are only going to use the net.http.request.time metric, but the metrics variable is actually an array: you can easily load the metrics you want to use from an external file, or combine several metrics into your own custom “node score” function.
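One way such a “node score” could be sketched is shown below. Fetching the values from Sysdig is left out on purpose; the node names and sample values are made up, node_score and best_node are hypothetical helper names, and summing the metrics is just one possible scoring choice.

```python
# Metric names to score on; only one is used here, as in this example.
metrics = ["net.http.request.time"]


def node_score(values):
    """Combine one node's metric values into a single score (lower is better).

    `values` maps metric name -> latest value for that node. A plain sum is an
    assumption for illustration; weights or normalisation may fit better.
    """
    return sum(values[m] for m in metrics)


def best_node(values_by_node):
    """Return the node name with the lowest score, or None if no nodes are given."""
    if not values_by_node:
        return None
    return min(values_by_node, key=lambda node: node_score(values_by_node[node]))


# Hypothetical sample data standing in for values returned by the monitoring API.
samples = {
    "node-1": {"net.http.request.time": 120.0},
    "node-2": {"net.http.request.time": 45.5},
    "node-3": {"net.http.request.time": 80.2},
}

print(best_node(samples))  # node-2
```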

Next, you define the scheduler name:

scheduler_name = "sysdigsched"

This is the name that will be registered on the Kubernetes API; it has to match the schedulerName set in the pod spec.

And now, the main loop of the scheduler: it waits for a new event containing an object in Pending state whose spec requests our scheduler_name.
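The filtering step of that loop might look roughly like the sketch below. It takes any iterable of watch events, so it can be tried without a cluster; the fake events and the pods_to_schedule helper name are assumptions. With the official Kubernetes Python client, the events could come from watch.Watch().stream(v1.list_namespaced_pod, "default"), where each event is a dict whose "object" entry is the pod. The extra check on node_name is an addition here, meant to skip pods that are already bound but have not started yet.

```python
from types import SimpleNamespace

scheduler_name = "sysdigsched"


def pods_to_schedule(events, name=scheduler_name):
    """Yield names of pods that are Pending, unbound, and request our scheduler."""
    for event in events:
        pod = event["object"]
        if (
            pod.status.phase == "Pending"
            and pod.spec.scheduler_name == name
            and not pod.spec.node_name  # already bound pods stay Pending for a while
        ):
            yield pod.metadata.name


def fake_event(pod_name, phase, sched, node_name=None):
    """Build a stand-in for a watch event (hypothetical test helper)."""
    pod = SimpleNamespace(
        metadata=SimpleNamespace(name=pod_name),
        status=SimpleNamespace(phase=phase),
        spec=SimpleNamespace(scheduler_name=sched, node_name=node_name),
    )
    return {"type": "ADDED", "object": pod}


events = [
    fake_event("nginx-1", "Pending", "sysdigsched"),
    fake_event("nginx-2", "Running", "sysdigsched", node_name="node-1"),
    fake_event("other-1", "Pending", "default-scheduler"),
    fake_event("nginx-3", "Pending", "sysdigsched", node_name="node-2"),
]

for pod_name in pods_to_schedule(events):
    print("Scheduling " + pod_name)
```

Each yielded pod would then be scored and bound to the selected node, which this sketch leaves out.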

Custom Kubernetes scheduler – Golang implementation

During KubeCon EU 2018, we presented a newer and more complete Golang version of the Python code above. You will find the source code and usage instructions here.

This implementation still cannot be considered production ready. However, it has some relevant improvements over the Python version:
* Metrics cache & metrics reuse
* Failover and failover recovery
* Async event handling and scheduling

Further thoughts

This is a relatively simple PoC example. If you really plan to code your own production-level scheduler:

* Declare and properly manage all the possible exception conditions.
* A scheduler has to be fast: benchmark the time it takes to pick a node, both the average and the outliers. Maybe you want to use Sysdig Tracers for that?
* If your code returns an error or is taking too long, you can always code a fallback to the default Kubernetes scheduler, which is much better than having orphaned pending pods.
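The timing and fallback points above could be combined along these lines. This is only a sketch: pick_with_fallback, the timeout value and both callables are hypothetical, and how the fallback actually hands the pod over to the default scheduler is left open.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def pick_with_fallback(pick_node, fallback, timeout_s=1.0):
    """Run pick_node(); on any error or timeout, use fallback() instead.

    Returns (node, elapsed_seconds, used_fallback) so the decision time can be
    recorded and benchmarked.
    """
    start = time.perf_counter()
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(pick_node)
    try:
        node = future.result(timeout=timeout_s)
        used_fallback = False
    except Exception:  # covers errors raised by pick_node and the timeout
        node = fallback()
        used_fallback = True
    finally:
        # Do not block on a slow picker; its thread finishes in the background.
        pool.shutdown(wait=False)
    return node, time.perf_counter() - start, used_fallback


def broken_picker():
    """Stand-in for a picker that fails, e.g. because metrics are unavailable."""
    raise RuntimeError("metrics backend unavailable")


node, elapsed, used_fallback = pick_with_fallback(broken_picker, lambda: "fallback-node")
print(node, used_fallback, f"{elapsed:.4f}s")
```

Collecting the elapsed values over many scheduling decisions gives the average and the outliers to watch.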

A few more use case examples for writing a custom Kubernetes scheduler, we are sure you can come up with your own:

We hope you found this example useful for diving deep into customizing your Kubernetes cluster behavior. For more deep dives and clear visibility into what your Kubernetes cluster and your containers are doing, check out our Sysdig Monitor and Sysdig Secure products and start a free trial yourself.