Kubeflow 0.3 released with simpler setup and improved machine learning development

Early this week, the Kubeflow project launched its latest version, Kubeflow 0.3, just three months after version 0.2. This release brings easier deployment and customization of components along with better multi-framework support.

Kubeflow is the machine learning toolkit for Kubernetes. It is an open source project dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. It provides an easy-to-use ML stack anywhere Kubernetes is already running, and this stack can self-configure based on the cluster it deploys into.

Features of Kubeflow 0.3

1. Declarative and Extensible Deployment

Kubeflow 0.3 comes with a command-line deployment script, kfctl.sh. This tool allows consistent configuration and deployment of both Kubernetes resources and non-K8s resources (e.g., clusters, filesystems, etc.). A Minikube deployment is available via a single-command shell script, and users can also use MicroK8s to easily run Kubeflow on their laptop.
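As a sketch, a typical kfctl.sh workflow looks like the following; the app directory name is illustrative, and exact flags may vary by platform and release:

```shell
# Create an app directory configured for a target platform (here: Minikube)
kfctl.sh init my-kubeflow --platform minikube
cd my-kubeflow

# Generate the platform and Kubernetes manifests, then deploy them
kfctl.sh generate all
kfctl.sh apply all
```

The init/generate/apply split is what makes the deployment declarative: the generated configs can be inspected and customized before anything is applied to the cluster.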

2. Better Inference Capabilities

Version 0.3 makes it possible to do non-distributed batch inference on GPUs for TensorFlow using Apache Beam. Apache Beam makes it easy to write batch and streaming data processing jobs that run on a variety of execution engines.

Running TF Serving in production is now easier thanks to a newly added liveness probe and the use of fluentd to log requests and responses, which enables model retraining.
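For illustration, a liveness probe on a TF Serving container might look like the fragment below; the container name, image tag, port, and timings are hypothetical and not taken from the Kubeflow manifests:

```yaml
# Fragment of a Deployment pod spec for a TF Serving container
containers:
  - name: tf-serving
    image: tensorflow/serving:latest   # hypothetical image tag
    ports:
      - containerPort: 9000
    livenessProbe:
      tcpSocket:
        port: 9000                     # restart the container if the port stops answering
      initialDelaySeconds: 30
      periodSeconds: 30
```

With a probe like this, Kubernetes restarts a serving container that has hung, rather than leaving it to silently drop prediction requests.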

It also takes advantage of the NVIDIA TensorRT Inference Server to offer more options for online prediction using both CPUs and GPUs. The server is a containerized, production-ready AI inference server that maximizes utilization of GPU servers by running multiple models concurrently on the GPU, and it supports all the top AI frameworks.

3. Hyperparameter tuning

Kubeflow 0.3 introduces a new K8s custom controller, StudyJob, which allows a hyperparameter search to be defined in YAML, making it possible to run hyperparameter tuning without writing any code.
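A StudyJob definition might look roughly like this sketch, modeled on the v1alpha1 Katib examples; the metric names, parameter range, and suggestion settings are illustrative:

```yaml
apiVersion: "kubeflow.org/v1alpha1"
kind: StudyJob
metadata:
  name: random-example
  namespace: kubeflow
spec:
  studyName: random-example
  owner: crd
  optimizationtype: maximize          # maximize the objective metric
  objectivevaluename: Validation-accuracy
  metricsnames:
    - accuracy
  parameterconfigs:                   # the search space, declared in YAML
    - name: --lr
      parametertype: double
      feasible:
        min: "0.01"
        max: "0.03"
  suggestionSpec:
    suggestionAlgorithm: "random"     # random search over the space above
    requestNumber: 3                  # trials requested per suggestion round
```

The controller reads this spec, launches trial workers with the suggested parameter values, and records each trial's objective metric, so the whole search is driven declaratively.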

4. Miscellaneous updates

The upgrade includes the release of a K8s custom controller for Chainer (docs).

Cisco has created a v1alpha2 API for PyTorch that brings parity and consistency with the TFJob operator.

New features added to the PyTorch and TFJob operators make it easier to handle production workloads. There is also support for gang-scheduling using Kube Arbitrator to avoid stranding resources and deadlocking clusters under heavy load.
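Mirroring the TFJob structure, a v1alpha2 PyTorchJob might be declared along these lines; the job name, training image, and replica counts are illustrative:

```yaml
apiVersion: "kubeflow.org/v1alpha2"
kind: PyTorchJob
metadata:
  name: pytorch-dist-example
spec:
  pytorchReplicaSpecs:                # same replica-spec shape as TFJob's tfReplicaSpecs
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/pytorch-dist:latest  # hypothetical training image
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/pytorch-dist:latest
```

The parity with TFJob means the same mental model (master/worker replica specs, restart policies, pod templates) applies across frameworks.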

The 0.3 Kubeflow Jupyter images ship with TensorFlow Data Validation, a library for exploring and validating machine learning data.

You can check the examples added by the team to understand how to leverage Kubeflow.

The XGBoost example demonstrates how to use non-DL frameworks with Kubeflow.

The team has said that the next major release, 0.4, will arrive by the end of this year. It will focus on making common ML tasks easy to perform without having to learn Kubernetes. The team also plans to make models easier to track by providing a simple API and database for tracking them.

Finally, they intend to upgrade the PyTorch and TFJob operators to beta.