“Kubernetifying” the Analytics Stack

Kubernetes is already proving its value for cloud-centric apps. Now there is a move to bring such benefits to analytics workloads. Alluxio CEO Steven Mih shares how ‘Kubernetifying’ an analytics stack can solve data sharing and elasticity issues.

While containers and Kubernetes have proven their value to stateless apps and self-contained databases, advanced analytic workloads have yet to widely embrace Kubernetes.

That may be about to change. It is becoming clear that “Kubernetifying” the analytics stack can deliver many untapped advantages. How?

It turns out Kubernetes itself provides an ideal environment for these types of workloads. By making components of the advanced analytics stack (technologies such as Apache Spark, Presto, and storage) more container-friendly, one gets the operational benefits of K8s. This benefit is in addition to solving emerging data sharing and orchestration challenges.

So, “Kubernetifying the analytics stack,” as we’re calling it, can solve both data sharing and elasticity challenges because it supports moving data from remote data silos into Kubernetes clusters for tighter data locality.

If we look back, we can see that the analytics stack itself has over the years already made a Kubernetes-friendly shift.

Note the move in analytics from (a) tightly coupled data warehouses to (b) analytics on Hadoop to (c) analytics run in the cloud, because data is now stored in many locations.

So, today, the analytics stack itself has become more disaggregated. This means, in turn, that each of the original core database elements can be its own standalone system or layer.

Conveniently, Kubernetes can allow for these different pieces to be put together in a way that simplifies running applications in any environment.

As a result, the stage is set to transform the way software and applications are deployed and scaled, agnostic of the underlying infrastructure.

The timing for such innovation through “Kubernetifying” the analytics stack would appear perfect.

Looking at where today’s data trends are taking us (especially in the advanced analytics and AI space), there’s a greater need for orchestrating data into and out of your Kubernetes deployment due to the demands of distributed model training and processing.

We already see more disaggregation of compute and storage - the two can no longer be tied together, given today’s realities of disparate data stores and the on-demand benefits of cloud computing.

Further, today’s modern analytics stack is split apart across data lakes (S3, HDFS, GCS, etc.), compute frameworks (Presto, Apache Spark, Hive, TensorFlow, etc.), and other technologies like catalog services (Hive Metastore, AWS Glue, KMS, etc.). Despite the appeal of this disaggregation, the downside is that moving data into and out of K8s can be hard.

This is where “Kubernetifying” of your analytics stack becomes an intriguing option.

To get practical, let me share several reasons.

At a high level, Kubernetes simplifies the complexity of deploying many distributed systems together. And, as we continue to see this trend of the disaggregation of compute and storage becoming more common, most likely there will be more advanced and operational AI workloads running on K8s clusters.

As I mentioned, because the types of workloads we typically see in the analytics space are elastic and require the ability to scale horizontally, Kubernetes is an ideal environment for these workloads. It’s also much easier to manage the costs of these workloads with the flexibility K8s provides.

One last point: most analytic workloads are varied - some short-lived, others long-lived. This mix is well-suited to a containerized environment. Kubernetes manages containers across any environment, so companies get the flexibility to migrate to the cloud or employ a multi-cloud approach.

So, moving from design, let’s explore how to deploy a Kubernetified analytics stack.

The first thing you’ll want to do is get containerized versions of each framework (these are widely available).

Finally, you’ll need to determine the security model and the resource model. As an aside, you’ll also need to choose an approach to run your workloads - either as-a-service or on your own clusters.
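To make the resource model a little more concrete, here is a minimal sketch of a Kubernetes Deployment for one analytics component, with explicit CPU and memory requests and limits. It is built as a plain Python dictionary and printed as JSON, which Kubernetes accepts alongside YAML. The component name, the image path, and the sizing values are hypothetical placeholders, not recommendations.

```python
import json


def analytics_worker_deployment(name, image, replicas, cpu, memory):
    """Build a Kubernetes Deployment manifest as a plain dict.

    Requests and limits are set to the same values here purely to keep
    the sketch short; a real resource model may choose them separately.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "resources": {
                                "requests": {"cpu": cpu, "memory": memory},
                                "limits": {"cpu": cpu, "memory": memory},
                            },
                        }
                    ]
                },
            },
        },
    }


# Hypothetical values for illustration only.
manifest = analytics_worker_deployment(
    name="presto-worker",
    image="example.registry.local/presto-worker:latest",
    replicas=3,
    cpu="2",
    memory="8Gi",
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Saved to a file, output like this could be handed to your usual deployment tooling. Scaling the component then comes down to changing the replica count rather than re-provisioning machines.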

After the design and deployment steps for Kubernetifying your analytics stack, the last step to be successful is to bring data locality back into the environment.

Today’s AI workloads require this piece. For data-driven workloads in disaggregated stacks, there’s no native data access layer within a Kubernetes cluster. For query engines and machine learning frameworks that are deployed within a Kubernetes cluster, any critical data sitting outside the cluster breaks locality.

Data orchestration technologies solve for these issues and bring three critical things to your Kubernetes environment:

Data locality on demand for caching data close to compute for big data analytics or machine learning workloads

High-speed data sharing across compute jobs

Data abstraction across data silos

So, with a data orchestration platform, you can cache data close to compute, get a closer storage layer for compute jobs, and unify persistent storage, enabling seamless data movement in and out of your Kubernetes cluster.
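The idea can be sketched in a few lines of Python. This is a toy illustration, not the API of any real data orchestration product: the class name, the mount points, and the in-memory dictionaries standing in for remote silos are all assumptions made for the example. It shows one namespace over several silos and a read-through cache that keeps data close to compute after the first access.

```python
class DataOrchestrator:
    """Toy read-through cache over several 'remote' data silos."""

    def __init__(self):
        self._mounts = {}      # mount point -> backing store (a dict here)
        self._cache = {}       # full path -> bytes cached near compute
        self.remote_reads = 0  # how many reads had to go to a silo

    def mount(self, mount_point, store):
        """Expose a silo under a path prefix in the unified namespace."""
        self._mounts[mount_point.rstrip("/")] = store

    def read(self, path):
        """Serve from the local cache if possible, else fetch and cache."""
        if path in self._cache:
            return self._cache[path]
        for mount_point, store in self._mounts.items():
            prefix = mount_point + "/"
            if path.startswith(prefix):
                data = store[path[len(prefix):]]  # simulated remote fetch
                self.remote_reads += 1
                self._cache[path] = data
                return data
        raise FileNotFoundError(path)


# Two hypothetical silos, simulated with in-memory dictionaries.
s3_silo = {"sales/2020.csv": b"id,amount\n1,10\n"}
hdfs_silo = {"logs/app.log": b"started\n"}

orch = DataOrchestrator()
orch.mount("/s3", s3_silo)
orch.mount("/hdfs", hdfs_silo)

first = orch.read("/s3/sales/2020.csv")   # fetched from the silo
second = orch.read("/s3/sales/2020.csv")  # served from the cache
logs = orch.read("/hdfs/logs/app.log")    # different silo, same namespace
print(orch.remote_reads)
```

Two compute jobs reading the same path share the cached copy, which is the data sharing benefit in miniature. A real platform would also handle eviction, consistency, and distribution across nodes, all of which this sketch leaves out.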

Let me conclude by sharing that this Kubernetifying of the analytics stack is not theoretical. It is happening today.

Many types of companies are Kubernetifying their analytics stack. We see it with a wide range of companies -- from leading financial companies to small analytics-as-a-service startups.

And, as to the importance of data orchestration, these companies are using this technology to pull data into their Kubernetes deployment with excellent results. They mount external object stores as well as remote data sources. They get data locality on demand by caching data close to compute for their workloads, high-speed data sharing across compute jobs, and data abstraction across data silos.

Further, by containerizing their stacks and separating compute from storage, these companies have seen dramatically higher performance, greater elasticity, and less engineering overhead.

Because of the separation, or disaggregation, of compute and storage, Kubernetes provides much more flexibility, scalability, and performance than Hadoop. (In all honesty, Hadoop can be a pain when it comes to analytic workloads.) Kubernetifying your analytics stack coupled with data orchestration technologies is the next phase of the advanced analytics and AI infrastructure evolution.

Steven Mih is CEO of Alluxio, developer of open source data orchestration software for the cloud that focuses on moving data closer to AI and machine learning compute frameworks. Steven has 20+ years in enterprise technologies. Prior to Alluxio, he held posts at Aviatrix, Couchbase, and AMD, among others.