Machine Learning: The Third Pillar of the Operational AI Platform

Announcing the Beta Program for Splice Machine ML Manager

Today we are excited to announce the beta program for ML Manager, a new product from Splice Machine. ML Manager is a critical component of our vision to operationalize artificial intelligence and machine learning in mission-critical applications. We call this Operational AI, as depicted in the diagram below. We believe that for organizations to make intelligent decisions in real time and at scale, they need to bring together three types of intelligence in a unified platform: operational intelligence, business intelligence, and artificial intelligence or machine learning.

Those familiar with Splice Machine know that it is an open-source data platform that combines the scalability of Apache HBase with the in-memory performance of Apache Spark. Splice Machine’s cost-based optimizer uses advanced statistics to choose the best compute engine, index access, join order, and join algorithm for each task. By leveraging this dual-engine architecture, Splice Machine processes transactional and analytical workloads concurrently and at scale. HBase and Spark form the SQL operational database and the SQL data warehouse components of the Splice Machine platform. Today we are completing the picture by adding the third capability, comprehensive machine learning, to our platform.
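As a rough illustration of what statistics-driven engine selection can look like, here is a minimal Python sketch. It is not Splice Machine's optimizer: the row threshold, the TableStats class, and the estimate_rows and choose_engine functions are hypothetical names invented for this example, and the real cost model is not described in this post.

```python
from dataclasses import dataclass

# Hypothetical cutoff: plans expected to touch fewer rows than this are
# treated as transactional. This number is an assumption for illustration.
OLTP_ROW_THRESHOLD = 10_000


@dataclass
class TableStats:
    """Minimal statistics for one table (illustrative only)."""
    row_count: int


def estimate_rows(stats: TableStats, selectivity: float) -> int:
    """Estimate how many rows a predicate will touch."""
    return int(stats.row_count * selectivity)


def choose_engine(stats: TableStats, selectivity: float) -> str:
    """Route small lookups to HBase and large scans to Spark."""
    if estimate_rows(stats, selectivity) < OLTP_ROW_THRESHOLD:
        return "hbase"
    return "spark"


orders = TableStats(row_count=50_000_000)
point_lookup = choose_engine(orders, selectivity=1e-7)  # a handful of rows
full_report = choose_engine(orders, selectivity=0.4)    # millions of rows
```

In this toy version the point lookup is routed to the operational engine and the large report to the analytical engine. A real optimizer would weigh many more factors, such as indexes and join order.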

Introducing ML Manager

ML Manager is a data science workbench with one significant difference: the ML capabilities are native to the platform. This means that ML models are executed inside the database, an architectural approach that offers a number of significant advantages over other workbench offerings in the market.

First and foremost, traditional data science workbenches tend to sit in a part of the IT infrastructure that is separate from the OLTP database that powers transactional applications and the data warehouse that powers analytical applications. This separation hurts both the model building and training process and the model deployment process. To build and train a model, data engineers need to ETL data from various systems. This is a cumbersome and inflexible process that introduces latency into experimentation and breaks whenever a new data source is introduced or the data structure changes. Once the data scientist has built and trained the model and is ready to deploy it into production, complex data transformations are again required to operationalize the feature vectors. A data science process built on traditional infrastructure therefore prolongs the time to learn and limits the number of models that can be deployed into production in a fixed period of time.

The limitations of the traditional data science architecture do not end there. If the model operates on rapidly changing data and must be retrained frequently, the latency that ETL and feature operationalization build into this process forces the model to operate at a suboptimal level and, in extreme cases, to make incorrect predictions. Furthermore, building a robust model requires data scientists to experiment continuously with different data sets, transformations, and parameters. Only through continuous experimentation can a data scientist identify the most effective model to put into production.

To overcome the issues inherent in the traditional data science process, Splice Machine has built an in-database data science workbench that supports the complete machine learning lifecycle, from transactional updates to data wrangling to experimentation and finally to deployment, all delivered as part of a real-time integrated platform. With an RDBMS at its core and integrated with MLflow, ML Manager supports industry-leading notebooks and libraries and provides a seamless path to model deployment.

ML Manager’s architecture is represented in the diagram below:

MLflow Integration

A key element of ML Manager is its integration with MLflow. MLflow provides an API and UI for tracking and managing ML pipelines and experiment runs. Because every run is tracked, the data scientist can choose any version to deploy into production.

Benefits:

Tracking API to manage all the experiments and runs a data scientist might perform

Reports experiment and run metrics in a useful UI

Includes tools to visually compare experiment runs

Tracks the pipeline artifacts for later deployment across all runs
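The idea behind run tracking can be sketched in a few lines of plain Python. The following is a hypothetical in-memory stand-in, not the MLflow API: the Run and Tracker classes, the parameter names, the artifact paths, and the metric values are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One experiment run: its parameters, metrics, and artifact path."""
    run_id: int
    params: dict
    metrics: dict = field(default_factory=dict)
    artifact: str = ""


class Tracker:
    """Hypothetical in-memory run tracker (not the MLflow API)."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics, artifact):
        """Record a run and return it; run ids start at 1."""
        run = Run(len(self.runs) + 1, params, metrics, artifact)
        self.runs.append(run)
        return run

    def best_run(self, metric):
        """Return the run with the highest value of the given metric."""
        return max(self.runs, key=lambda r: r.metrics[metric])


# Illustrative values only; these are not real experiment results.
tracker = Tracker()
tracker.log_run({"max_depth": 3}, {"auc": 0.81}, "models/run1.pkl")
tracker.log_run({"max_depth": 5}, {"auc": 0.86}, "models/run2.pkl")
tracker.log_run({"max_depth": 8}, {"auc": 0.84}, "models/run3.pkl")

to_deploy = tracker.best_run("auc")
```

Because each run keeps its parameters, metrics, and artifact together, picking a version to deploy reduces to a lookup over the recorded runs. MLflow adds persistence and a UI for comparing runs on top of this basic idea.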

SageMaker Deployment Integration

SageMaker enables data scientists to get their models into an auto-scaled production environment faster. Through SageMaker’s integration with ML Manager, data scientists can deploy their models seamlessly via a Docker-based build and deploy process.

Benefits:

Streamlined deployment of the selected model to an application that can invoke the model’s API for decision making at any scale

H2O Integration (Coming Soon)

H2O’s libraries are among the most highly regarded ML libraries in the community. We have pre-packaged these libraries with ML Manager so that data scientists can start using them without any integration effort or delay.

Benefits:

Sparkling Water provides access to H2O algorithms from Spark

Deep Learning framework (including TensorFlow integration)

GLM, GBM, XGBoost

AutoML – automatic training and tuning framework for models

Apache Zeppelin Notebooks

Web-based notebooks such as Apache Zeppelin are the industry-standard mechanism for rapidly developing and collaborating on data science solutions. With Zeppelin included in ML Manager, data science teams can share their code and analytics with other team members in real time.

Benefits:

Plug any language or data-processing backend, such as Apache Spark, Python, JDBC, or Shell, into Zeppelin

Built-in Apache Spark integration

Visualize data from any language used as a backend

Share results as an embedded iframe

Built-in Angular UI components

Native Spark Data Source

ML Manager leverages the power of the Splice Machine Native Spark DataSource to provide dramatic performance improvements for large-scale data operations. The Native Spark DataSource works directly on Spark-optimized DataFrames and RDDs, eliminating the need to serialize data over the wire using the JDBC protocol.

ML Manager Beta Program

We will provide you with information on how you can test drive our new product. The feedback you provide will help us improve ML Manager functionality and fix any issues. We look forward to hearing from you.