DevOps for AI – The AI Layer

Diego Oppenheimer

1 year ago

When Google’s Gradient Ventures invested in us, they did so with an understanding that it is incredibly hard to deploy AI/ML infrastructure, and that every dev team is going to need to solve this problem.

Our solution, the AI Layer, is the best-in-class architecture.

As our co-founder, Kenny Daniel, says: TensorFlow is open-source, but scaling it is not.

DevOps for AI presents massive challenges to organizations of all sizes. If your team is only a couple of developers, you don’t want to distract them from their primary mission by requiring them to put a ton of effort into supporting infrastructure.

For large organizations, AI/ML models require a completely customized DevOps stack. Let us show you how…

From zero to deploying at scale

Step 1: Big Data - Collect and Clean

For the last ten years, every conference has had someone warning us all that data is the new oil. The companies that can collect and clean the most data are best positioned to win, but only if they put that data to good use solving real business problems.

In the early days of train transportation, rail companies first made sure that they had the coal extraction infrastructure in place, to guarantee their viability in the market. Similarly, in AI, data must be collected and cleaned before models can be trained with the data. This is often the first step in a company’s path to market dominance.

Step 2: Machine Learning and AI Engineering - Build the Models

After companies begin to amass data, the hiring bonanza for AI/ML engineers begins. Data is being used for advanced analytics and for training AI/ML models. Today there are thousands of engineers turning out models. The very largest tech companies are deploying some of their models, but the vast majority of Data Science and Machine Learning investments are merely producing internal Business Intelligence reports and nothing more. The aim of AI isn’t just PowerPoint slides… it’s to drive value for your customers. The last mile of getting trained models to production at scale is daunting and has most corporate dev teams frustrated.

In our train metaphor, the AI/ML models are the engines and trains. They can take the data and put it to work, but until recently these engines didn’t have the track infrastructure to put their power to use. Today, only the biggest tech companies have built the infrastructure, and they’ve done it at enormous cost both in initial outlay and in maintenance.

Step 3: DevOps - Make the Models Work in Production

We’ve talked with multiple large organizations that refer to their AI/ML deployment infrastructure as “Frankenstein”. The infrastructure that has evolved for deploying traditional applications is not optimized for the organizational challenges of deploying AI/ML at scale.

Initially, trains and tracks were built with custom sizing. Each rail system had different sized tracks built for different trains. Until there was standardization, trains were one-trick ponies going, for example, between a single coal mine and a single dock. Once infrastructure became standardized, technically capable, and cost-efficient, it set off a network effect that transformed everything.

We believe that the AI infrastructure, which brings models from development to production, is the last mile: it converts everyone’s AI/ML knowledge into true ROI.

Step 4: Integration and Iteration - Continuous Improvement

Once the infrastructure and practices are in place, data scientists and ML engineers can push models and algorithms in their language of choice into elastically scalable production. They can then iterate on their models rapidly with the live data they get from users.

Once networks of tracks were standardized and could connect, business took off, investment poured in, and trains changed everything. AI will be exactly the same. We just have to make those connections first.

The AI Layer

The goal of the AI layer is to separate the AI/ML models from the business logic of your applications, automate DevOps for easy deployment, and decrease administrative overhead by making models, algorithms, and functions discoverable, setting permissions, and versioning automatically.

The AI layer is the set of software, systems, and processes that sit between an organization’s AI/ML models and its business logic layer. Any company that plans to deploy AI/ML models will need to decide whether to build its own AI layer or to buy one. Based on the complexities that we’ve worked through over the last 5 years, we believe that the best way forward is to show what we’re doing under the hood and let you make the decision about which path to take.

We’re always happy to get on a call to help you figure out the best path.

Our AI Layer has these features:

Serverless Microservices

Serving AI/ML models is the killer app for a serverless architecture. Models are often compute-heavy but only in spikes, they are stateless, and the data scientists and ML engineers who build them often work asynchronously from the platform or product engineering teams.
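To make the "stateless" point concrete, here is a minimal Python sketch. It is an illustration only, not how our platform is implemented: the `sentiment_model` function, the `MODELS` table, and `handle_request` are all hypothetical names. The idea is that everything a call needs arrives in the request, so any number of copies can be started during a spike and shut down afterwards.

```python
import json
from typing import Callable, Dict

# Hypothetical stand-in for a trained model: a pure function from input to output.
def sentiment_model(payload: dict) -> dict:
    positive = {"good", "great", "happy"}
    words = payload.get("text", "").lower().split()
    return {"positive_words": sum(1 for w in words if w in positive)}

# Hypothetical table mapping a model name to the function that serves it.
MODELS: Dict[str, Callable[[dict], dict]] = {"sentiment": sentiment_model}

def handle_request(model_name: str, body: str) -> str:
    """Stateless entry point: nothing is remembered between calls."""
    model = MODELS.get(model_name)
    if model is None:
        return json.dumps({"error": f"unknown model: {model_name}"})
    return json.dumps({"result": model(json.loads(body))})

if __name__ == "__main__":
    print(handle_request("sentiment", '{"text": "A great and happy day"}'))
```

Because the handler keeps no state, the platform is free to decide when and where to run it, which is what lets data scientists ship a model without coordinating with the product team's release schedule.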

Continuous Delivery

The closer you can get to having Data Scientists and ML Engineers push their code into production with zero intermediate steps, the better. When friction is removed, it allows for fast iteration, and that leads to progress.

Multi-cloud

The architecture of your systems is likely shaped by the specifics of your company. For instance, maybe regulations require that you keep some data in a private cloud and other data in AWS S3. Your AI Layer should be flexible enough to run where you want. This flexibility is made possible by a well-architected AI Layer: a microservices architecture with advanced container orchestration. With how fast this world is changing, it’s essential that your deployment options are as flexible as possible.

Language Agnostic

A huge advantage of microservices is that they can be written in any language you like, because each one runs on its own. Architecting your AI layer to use microservices allows your data scientists and ML engineers to use R or Python without any impact on the rest of your architecture. This type of flexibility lowers the friction for collaboration and deployment of AI.

Container Orchestration for AI

In order to abstract away the specifics of the servers and cloud services you’ll use, you’ll need an advanced system like Kubernetes or Docker Swarm. On top of that, you’ll need to custom-build your scheduling to increase utilization of your high-end compute resources like GPUs, TPUs, or the new generation of neural-network-optimized chipsets. Remember that over the next 3-5 years there are going to be massive investments and innovations in this area, so you’ll need to ensure flexibility is designed into your system. You’ll also want to build memory management services that allow smart routing of work between CPUs and GPUs based on system usage.
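As a rough illustration of usage-based routing, the Python sketch below sends a job to the GPU pool only while that pool is under a utilization threshold, and otherwise falls back to CPU. Everything here is an assumption made for the example: the `Pool` class, the `route` function, and the 0.8 threshold are hypothetical and far simpler than a real scheduler, which would also account for memory, queueing, and preemption.

```python
from dataclasses import dataclass

@dataclass
class Pool:
    """Hypothetical pool of identical workers (for example, GPU slots)."""
    name: str
    capacity: int
    in_use: int = 0

    @property
    def utilization(self) -> float:
        return self.in_use / self.capacity

def route(prefers_gpu: bool, cpu: Pool, gpu: Pool, gpu_threshold: float = 0.8) -> str:
    """Pick a pool for one job and reserve a slot in it.

    Jobs that prefer a GPU get one while GPU utilization is below the
    threshold; otherwise they fall back to the CPU pool.
    """
    chosen = gpu if prefers_gpu and gpu.utilization < gpu_threshold else cpu
    if chosen.in_use >= chosen.capacity:
        raise RuntimeError(f"{chosen.name} pool is full")
    chosen.in_use += 1
    return chosen.name

if __name__ == "__main__":
    cpu, gpu = Pool("cpu", capacity=10), Pool("gpu", capacity=2)
    print([route(True, cpu, gpu) for _ in range(3)])
```

With two GPU slots, the first two GPU-preferring jobs land on the GPU pool and the third spills over to CPU, which keeps the expensive hardware busy without blocking work behind it.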

Discoverability

As organizations invest in Data Science and Machine Learning, they find that work is being duplicated all over the organization. For most, there is no centralized system for logging all models, making them searchable, and allowing engineers to deploy them as microservices. This means that many teams may be duplicating efforts, and other teams may need a model but lack the resources to develop it. A proper AI layer allows you to make use of your entire AI/ML portfolio.
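A centralized, searchable catalog can start out very small. The Python sketch below is a hypothetical in-memory version (the `ModelEntry` and `ModelRegistry` names and fields are assumptions for illustration; a real registry would sit on a database and a search index), but it shows the core idea: every model is registered once with an owner and a description, and anyone can search before building something new.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ModelEntry:
    name: str
    owner: str
    description: str
    tags: List[str] = field(default_factory=list)

class ModelRegistry:
    """Hypothetical in-memory catalog of an organization's models."""

    def __init__(self) -> None:
        self._entries: Dict[str, ModelEntry] = {}

    def register(self, entry: ModelEntry) -> None:
        if entry.name in self._entries:
            raise ValueError(f"model already registered: {entry.name}")
        self._entries[entry.name] = entry

    def search(self, term: str) -> List[str]:
        """Return names of models whose name, description, or tags match."""
        term = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term == t.lower() for t in e.tags)
        )

if __name__ == "__main__":
    registry = ModelRegistry()
    registry.register(ModelEntry("churn-predictor", "growth", "Predicts customer churn", ["tabular"]))
    registry.register(ModelEntry("invoice-ocr", "finance", "Extracts text from invoices", ["vision"]))
    print(registry.search("churn"))
```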

Version Control and API Pipelining

A good ML practice iterates on models regularly, based on new data from the model being used. Without good version control and automatic API pipelining, teams are incentivized to update models infrequently. The administration system for your AI layer should make versioning and API pipelining simple. It should empower individual engineers and data scientists to deploy their models as often as is useful, and it should have all of the versioning logic in place to keep legacy API versions supported.
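One common way to keep legacy API versions working while models ship often is to let callers pin a version prefix and resolve it to the newest matching release. The Python sketch below assumes three-part `major.minor.patch` version strings; the `resolve` function and that scheme are assumptions for illustration, not a description of any particular product.

```python
from typing import List, Tuple

def parse(version: str) -> Tuple[int, ...]:
    """Turn "1.10.1" into (1, 10, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def resolve(published: List[str], requested: str) -> str:
    """Return the newest published version that matches the requested prefix.

    A caller pinned to "1" keeps getting 1.x.y releases even after 2.0.0
    ships; a caller pinned to "1.2" only gets 1.2.y releases.
    """
    prefix = parse(requested)
    candidates = [parse(v) for v in published if parse(v)[: len(prefix)] == prefix]
    if not candidates:
        raise LookupError(f"no published version matches {requested!r}")
    return ".".join(str(part) for part in max(candidates))

if __name__ == "__main__":
    releases = ["1.0.0", "1.2.0", "1.10.1", "2.0.0"]
    print(resolve(releases, "1"), resolve(releases, "2"))
```

Comparing parsed tuples rather than raw strings matters here: as strings, "1.2.0" would sort after "1.10.1".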

Permissioning

The idea of making all of an organization’s most valuable assets available to anyone in the company may send shivers down the spine of some in IT governance. For this type of organization, it’s essential that strong permissioning is available. An AI layer should support your corporate Auth provider.

Security

Governing all of your models can present a number of challenges: you’ll need to enforce permissioning and access controls and provide auditability.
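Access control and auditability fit naturally into a single checkpoint: every call is checked against a grant table, and every decision, allowed or denied, is written to a log. The Python sketch below is a hypothetical illustration (the `AccessControl` class and its fields are assumptions; in practice the user's groups would come from your corporate Auth provider and the log would go to durable storage).

```python
from datetime import datetime, timezone
from typing import Dict, List, Set

class AccessControl:
    """Hypothetical group-based access check that records every decision."""

    def __init__(self, grants: Dict[str, Set[str]]) -> None:
        self._grants = grants            # model name -> groups allowed to call it
        self.audit_log: List[dict] = []  # one record per authorization decision

    def authorize(self, user: str, groups: Set[str], model: str) -> bool:
        allowed = bool(groups & self._grants.get(model, set()))
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "model": model,
            "allowed": allowed,
        })
        return allowed

if __name__ == "__main__":
    acl = AccessControl({"fraud-score": {"risk-team"}})
    print(acl.authorize("ana", {"risk-team"}, "fraud-score"))
    print(acl.authorize("bo", {"marketing"}, "fraud-score"))
    print(len(acl.audit_log))
```

Denied requests are logged as well, and models with no grants are closed by default.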

Monitoring and Analytics

You’ll want to make sure that various stakeholders have insight into the performance of your AI layer. You’ll want to see which models are most used, the performance of the overall system, performance on a per-model basis, and usage by user or team (often important for accounting).
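The views listed above all fall out of one simple record per call: which model, which team, and how long it took. The Python sketch below is a hypothetical in-memory version (the `UsageLog` class and its method names are assumptions for illustration; a production system would stream these records to a metrics store).

```python
from collections import Counter, defaultdict
from statistics import mean
from typing import Dict, List

class UsageLog:
    """Hypothetical per-call log supporting per-model and per-team views."""

    def __init__(self) -> None:
        self._latencies: Dict[str, List[float]] = defaultdict(list)
        self._by_team: Counter = Counter()

    def record(self, model: str, team: str, latency_ms: float) -> None:
        self._latencies[model].append(latency_ms)
        self._by_team[team] += 1

    def calls_per_model(self) -> Dict[str, int]:
        return {model: len(values) for model, values in self._latencies.items()}

    def mean_latency_ms(self, model: str) -> float:
        values = self._latencies.get(model)
        if not values:
            raise KeyError(model)
        return mean(values)

    def calls_per_team(self) -> Dict[str, int]:
        return dict(self._by_team)

if __name__ == "__main__":
    log = UsageLog()
    log.record("sentiment", "search", 10.0)
    log.record("sentiment", "ads", 30.0)
    log.record("ocr", "search", 50.0)
    print(log.calls_per_model(), log.mean_latency_ms("sentiment"), log.calls_per_team())
```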

In Closing

The AI Layer is the missing last mile to ROI on AI investments. Giving your ML engineers and Data Scientists the ability to deploy scalable models with a git push will increase speed to market, increase the number of iterations you can make, and make your talent happy. Making your models discoverable and scalable with API-driven serverless microservices will empower your engineers to build intelligence into all of your applications. When you get to the point of designing your own AI layer or deciding whether to purchase one, we’d be happy to talk with your team and help you understand what we’ve learned over the last 5 years. Just shoot us a message.