3 Reasons Behind T-Mobile’s Success with Kubernetes

Many companies are experimenting with Kubernetes. Only some are achieving business outcomes with the technology. What can the experimenters learn from the success stories? Let’s look at T-Mobile.

Founding members of T-Mobile’s Platform Engineering Team, James Webb and Brendan Aye, have a lot to say about creating an on-premises container orchestration service using Kubernetes.

They shared some initial outcomes at SpringOne Platform and KubeCon, most notably that T-Mobile is successfully running mission-critical applications with Kubernetes, powered by Pivotal Container Service (PKS). These workloads include third-party software applications that are vital to T-Mobile’s order management and customer support, as well as other consumer-facing apps like maps.t-mobile.com.

All of this was accomplished in a relatively short amount of time. So how did T-Mobile’s platform team realize value from its Kubernetes service so quickly?

When listening to Webb and Aye discuss their work, what bubbles up is that they’ve learned a lot from managing their application platform, Pivotal Application Service (PAS). Their experience with the app platform informed their approach to Kubernetes, with excellent results. Here are some of the things that have contributed to T-Mobile’s success with Kubernetes in production:

“We are huge fans of BOSH,” said Aye. “The whole day-two operations piece of the upgrades, the OS patching...it all makes the process so seamless for us and consistent across both environments.”

T-Mobile’s story underscores the value of employing abstractions for applications and containers in a single platform. When you have both abstractions running in the same control plane, more workloads will benefit from the automation and security features of Pivotal Cloud Foundry (PCF). A single platform means application teams don’t have to switch back and forth between completely different systems. Platform teams also benefit from one set of technology tools to learn and manage. Using one platform enables teams to be more efficient and focused on delivering great software for the business.

Built on the common operational foundation of BOSH, PKS provides an excellent environment for containerized workloads to run alongside PAS applications. The reality is you’re going to have Kubernetes and an application platform working together. It’s not an either/or decision. Industry pundits have been saying this for a while, and more enterprises are following this advice.

According to Webb and Aye, T-Mobile has been “wildly successful” with PAS (which Aye and Webb sometimes refer to as their Platform-as-a-Service or PaaS). During an interview at SpringOne Platform, Webb explained that T-Mobile’s PAS “is our first choice for apps, especially for code that's written in-house.”

Take a look at the outcomes they shared at KubeCon:

T-Mobile uses PKS to run workloads that don’t quite fit in the 12-factor app world, as well as commercial off-the-shelf (COTS) software packages. “A lot of vendors come with pre-supplied containers, or we have applications that require persistent storage,” Webb said.

For example, T-Mobile’s order management system is a third-party application. This application uses a local cache, which exposes an RMI port to receive updates to that cache. It uses TCP routing (instead of HTTP routing), and the cache should have persistent storage underneath but does not. These requirements are “non-standard” in the “PaaS world,” noted Aye.

Now, these types of containerized applications run on PKS. “They [the application teams] could run a much more generic container in Kubernetes, get best in class orchestration, and be able to really adapt to the needs that they have: be able to deliver services from vendors or from Docker Hub, whatever they want to do,” explained Aye.
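To make the "non-standard" requirements concrete, here is a minimal sketch of how a workload like the order-management cache described above might be declared for Kubernetes: a Service that exposes a raw TCP port and a StatefulSet that claims persistent storage. It is an illustration, not T-Mobile's configuration. The workload name, port 1099, image path, mount path, and storage size are all hypothetical.

```python
# Illustrative only: names, port, image, mount path, and size are hypothetical.
import json


def tcp_service(name: str, port: int) -> dict:
    """Service exposing a raw TCP port (for example, an RMI listener)."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "LoadBalancer",
            "selector": {"app": name},
            "ports": [
                {"name": "rmi", "protocol": "TCP", "port": port, "targetPort": port}
            ],
        },
    }


def cached_workload(name: str, image: str, port: int, cache_size: str) -> dict:
    """StatefulSet whose cache directory is backed by a PersistentVolumeClaim."""
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port, "protocol": "TCP"}],
                            "volumeMounts": [
                                {"name": "cache", "mountPath": "/var/cache/app"}
                            ],
                        }
                    ]
                },
            },
            # One claim per replica, so the cache survives pod restarts.
            "volumeClaimTemplates": [
                {
                    "metadata": {"name": "cache"},
                    "spec": {
                        "accessModes": ["ReadWriteOnce"],
                        "resources": {"requests": {"storage": cache_size}},
                    },
                }
            ],
        },
    }


manifests = [
    tcp_service("order-cache", 1099),
    cached_workload(
        "order-cache", "registry.example.com/vendor/order-cache:1.0", 1099, "10Gi"
    ),
]

if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so this output could be piped
    # to `kubectl apply -f -`.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

Neither the TCP port nor the volume claim maps cleanly onto an HTTP-routed, stateless app platform, which is the gap a Kubernetes service is meant to fill.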

“The first app actually went live [on PKS] in August [2018] and it's maps.t-mobile.com,” said Webb. “It's a coverage map where you put in a location and it shows you T-Mobile coverage in that area. This is also a very cool example because they are running in three or four different places and load balancing across them. So they are running in our on-prem Kubernetes, they're running in our on-prem PaaS and they are running in the public cloud.”

“We set a very high bar for how we supported [developers] on the PaaS side,” Webb said at KubeCon.

In the years since Webb and Aye began managing PAS, they’ve learned a lot about running a platform and what it can (and should) offer. As it turns out, already running a distributed system like PAS at scale is helpful when you want to add Kubernetes into the mix.

In fact, the success with PAS helped Webb and Aye determine their requirements for a Kubernetes service (or “CaaS” for container-as-a-service). They wanted the same high availability, resilience, scaling and automation that they have for PAS in their Kubernetes service. PKS met those requirements. And, because PKS is part of PCF, it shares an operational toolchain with PAS.

In addition to a native Kubernetes experience, they wanted a lot of built-in services and support. “And most importantly,” said Webb, “centralized logging and metrics. That's a big, big deal.” Here was the full set of requirements they gathered, as shared at KubeCon:

A valuable nugget of advice from Webb and Aye is the importance of setting up automated deployment for a Kubernetes service right from the start. “A huge lesson learned on the Cloud Foundry side was everything we installed to start with, we installed by hand,” said Webb. “And now… they are automating the mound of tech debt that we left behind. On the CaaS side, we wanted to start with automating everything we possibly can. Not just control points but cluster builds.”

Using a continuous integration and delivery (CI/CD) tool such as Concourse provides the automation needed to upgrade and patch efficiently. The CI/CD pipelines that Webb and Aye’s team already implemented for PAS serve as a model, making it easier to build the pipelines for PKS. This automation allowed Webb and Aye to address the December 2018 Kubernetes vulnerability very quickly across all of their clusters, with no downtime.

“The same day that the 1.11.5 patch was released for the API CVE, we were patched within 36 hours. And we've seen this back on the PAS side as well, where usually our systems are patched far ahead of anyone else's systems because it’s a single action to initiate a patch,” said Webb.
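A pipeline that patches "in a single action" first needs to know which clusters are behind. The sketch below shows one way such a check could look, assuming a simple name-to-version inventory. The inventory and cluster names are hypothetical; only the 1.11.5 patch level comes from the quote above. It is not T-Mobile's or Concourse's implementation.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, int, int]

# Patch level named in the quote above.
PATCHED: Version = (1, 11, 5)


def parse_version(text: str) -> Version:
    """Turn 'v1.11.3' or '1.11.3' into (1, 11, 3)."""
    major, minor, patch = text.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def clusters_needing_patch(
    inventory: Dict[str, str], patched: Version = PATCHED
) -> List[str]:
    """Return clusters on the same minor line as `patched` but below it."""
    stale = []
    for name, version in sorted(inventory.items()):
        current = parse_version(version)
        same_minor_line = current[:2] == patched[:2]
        if same_minor_line and current < patched:
            stale.append(name)
    return stale


# Hypothetical inventory; a real pipeline would query each cluster's API server.
inventory = {"prod-east": "v1.11.3", "prod-west": "1.11.5", "dev": "1.11.2"}

if __name__ == "__main__":
    for name in clusters_needing_patch(inventory):
        print(f"{name}: upgrade required")
```

A scheduled pipeline job could feed the resulting list straight into the upgrade step, so the decision to patch stays a single trigger rather than a per-cluster investigation.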

3. Apply platform-as-a-product principles.

Perhaps one of the messiest parts of establishing Kubernetes comes down to deciding how this technology will fit into the organization’s existing roles and responsibilities.

Kubernetes is relatively immature; it’s easy to forget that the tech has only been around since 2014. There are many different ways to configure Kubernetes, but most admins will use kubectl both to apply configurations to a cluster (infrastructure tasks) and to deploy, scale, and cycle apps (application tasks). Kubernetes is not known for its ease of use. As Webb put it, “with Kubernetes there’s a steep learning curve.”

To help with that learning curve, the T-Mobile team wanted to automate reference designs and best practice configurations to help developers get to production. PKS helps with that. As Webb described it, “You have a set of plans and then you choose which plan you want to use, choose how many nodes you want, press a button, deploy the cluster.”
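The plan-based workflow Webb describes could be wrapped in a pipeline step along these lines. This is a sketch that only builds the command line and does not run it. The `pks create-cluster` flags shown are given as recalled from the PKS CLI and should be checked against `pks create-cluster --help` for your version; the cluster name, hostname, and the plan name "small" are hypothetical.

```python
import shlex
from typing import List


def create_cluster_command(
    name: str, external_hostname: str, plan: str, num_nodes: int
) -> List[str]:
    """Build (but do not execute) a PKS cluster-creation command.

    Flag names are an assumption based on the PKS CLI; verify them
    against the CLI help for the installed version.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return [
        "pks", "create-cluster", name,
        "--external-hostname", external_hostname,
        "--plan", plan,
        "--num-nodes", str(num_nodes),
    ]


# Hypothetical request from an application team.
cmd = create_cluster_command("team-a", "team-a.k8s.example.com", "small", 3)

# Quote each argument so the printed command is safe to paste into a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the inputs down to a plan and a node count is the point: the platform team encodes its reference design in the plan, and the requester supplies only what varies.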

The “menu” of configurations in PKS allows Webb and Aye’s platform team to configure clusters in whatever way is most relevant to T-Mobile’s application teams. Webb and Aye take responsibility for building Kubernetes clusters that deliver superior uptime and security standards for the enterprise. They’re not limited to a single cluster configuration from a managed service offering. Nor are the app teams burdened by having to build and configure their own clusters. PKS provides functionalities that cater to the divergent needs of platform teams and application teams.

At T-Mobile, the teams are structured to take full advantage of the platform-as-a-product practice model. Aye and Webb are part of the Platform Team, which manages and delivers the platform (“product”) to application teams (“customers”). Delivering a platform involves more interactions than just maintaining and provisioning an environment. “We're being advocates for the platform… We are actively helping users develop good patterns. It helps us understand their workloads. It helps them understand our concerns as well,” Webb explained.

“We're looking to provide, at least for the initial go around, a curated environment,” Webb noted at KubeCon. “We're not handing over clusters with cluster administrator access and then they [developers] go to town. We are providing a resource.”
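One common way to offer a curated environment without cluster administrator access is namespace-scoped RBAC: developers get a Role inside their own namespace instead of a cluster-wide binding. The source does not say this is how T-Mobile does it, so treat the following as a sketch under that assumption. The role name, namespace, group name, and the chosen resources and verbs are hypothetical.

```python
def namespace_role(namespace: str, role_name: str = "app-developer") -> dict:
    """Role that grants day-to-day app operations inside one namespace only."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {
                # "" is the core API group (pods, services, configmaps).
                "apiGroups": ["", "apps"],
                "resources": ["pods", "services", "configmaps", "deployments"],
                "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
            }
        ],
    }


def bind_group(namespace: str, group: str, role_name: str = "app-developer") -> dict:
    """RoleBinding that attaches the Role to an identity-provider group."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-{group}", "namespace": namespace},
        "subjects": [
            {"kind": "Group", "name": group, "apiGroup": "rbac.authorization.k8s.io"}
        ],
        "roleRef": {
            "kind": "Role",
            "name": role_name,
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }


# Hypothetical team namespace and group.
role = namespace_role("team-a")
binding = bind_group("team-a", "team-a-devs")
```

Because both objects are of kind Role and RoleBinding rather than ClusterRole and ClusterRoleBinding, nothing here reaches outside the team's namespace; node-level and cluster-wide settings stay with the platform team.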

“Once a cluster is deployed, we have some more pipelines that kick in and basically ‘T-Mobilize’ the cluster,” said Webb. “We install monitoring and persistent storage, ingress, logging. It's still some manual steps to get the internal balancers configured... [and then] we consider the cluster production ready.”

As the cluster owners, Webb and Aye are also collaborating with T-Mobile’s public cloud team to determine a common Kubernetes offering.

“It's very important from our standpoint that when teams move between on-premise and Cloud providers, they don't have to learn a whole new set of workflows or API pulls,” said Aye. “Using Kubernetes as that abstraction, kubectl or the Kubernetes API is what you have to learn. You don't have to learn AWS versus Azure versus GCP. You can focus on that abstraction and move much more quickly between on-prem and Cloud providers.”

Webb and Aye’s Platform Team does all the heavy lifting and delivers production clusters, allowing their customers to focus on coding.

Kubernetes Supports a Business Need

Sticking to all the best practices won’t make a difference if your technology stack isn’t aligned with your business. Kubernetes (or any new tech) will merely be a shiny object to tinker with unless it is implemented to address a clearly defined business problem.

Aye and Webb needed a CaaS to run third-party software that is critical to T-Mobile’s business. They established their requirements for Kubernetes based on their experience and success with PAS. And they can now track business outcomes to make sure the technology is delivering value.