Category: Data Science

Developing and bringing machine learning models into production is a task with a lot of challenges. These include model and attribute selection, dealing with missing values, normalization and others.

Finding a workflow that puts all the gears, from data preprocessing and analysis over building models and selecting the best performing one to serving the model in a real time API, into motion is the one I want to discuss here.

Life cycle of a machine learning model

The life cycle of machine learning is basically described by the iteration of the following four steps.

Machine learning steps

Each of these steps is under constant evaluation. Especially in case model performance can be enhanced by adding different data attributes or preprocessing methods.

For the presented approach we split the process of modeling into two other parts. Part one contains the above mentioned four steps and we call it Manual Run Modeling. Step two is automating the steps of part one.

Manual Run Modeling

In the manual part we start by analysing our new task. After that we come up with a hypothesis we want to prove and test.

Development and Prototyping Environment

First we set up a development environment for working on the new task. For this we spin up a Jupyter notebook server. We deploy it on Google Cloud AI Platform. It provides ready to run containers for Jupyter. The notebook approach enables us to develop fast and share results with the team using a browser. With the ability to easily visualize data inline in a notebook, this approach is especially useful in the data extraction and preprocessing process.

Data preparation and visualization

Python provides some nice packages for generating graphics on data for faster insights. This speeds up our prototyping process in the notebook. We are especially fond of using Seaborn.

We load the data identified for this model into a dataframe in the notebook. After that we start looking at each attribute and its values, often in combination with the other attributes. For a first overview we use a pairplot provided by Seaborn.

We use a combination of visualizations, such as e.g. a correlation matrix. Then we decide which attributes to use and how to handle outliers and missing values. Finally we use one hot encoding for categorical attributes and normalize the continuous attributes to create the input into our models.

Model Selection and Evaluation

When the data is ready we choose several models to find a solution for our problem. These models can range from a multilinear regression model over random forests to deep neural networks with Tensorflow.

After splitting the data for training, evaluation and test we decide on a measure each model has to optimize for, e.g. mean squared error or precision. This depends on the kind of problem. Once we identified the best performing model we start by transforming the code for Google Cloud AI Platform.

RT Prediction Deployment

After manual evaluation of preprocessing and modeling, we start the task of automating training and deployment for bringing machine learning models into production. This can be split into three tasks:

Training on Google Cloud AI Platform

After deciding on a model to go forward into production with, we optimize our code for data extraction and preprocessing. The reason for this is to make it reusable and compliant with Google Cloud AI Platform rules. This means basically we have to create a Python package out of the first three steps.

This Python package is then deployed to Google Cloud platform and executed there. If you have custom packages there is an option to supply those too. An example call for training on the cloud would then look like the following example.

One advantage of using Google Cloud AI platform is the possibility of using automated hyperparameter tuning for models. This enables us to train a model automatically with different configurations. Then we select the one performing best for the defined measure in hptuning_config.yaml.

In the AI platform dashboard you can then see, which hyperparameter combination of your defined values in params had the best results for the defined hyperparameterMetricTag and goal.

The identified model is then ready to be deployed to the platform, where Google provides an URL to access the model in real time.

Deploying the model on GCP

Deploying to production is done with a Jenkins job. We use Jenkinsfile to define our jobs as part of our code. A model deployment consists of the following steps:

Copying the model to the correct GCP bucket (differs for our three development systems development, staging and production)

Deploy model to AI platform using a gcloud command

Test model with a prepared test dataset

If all of these steps are successful the model is ready for usage in the specified environment via an URL endpoint.

Deploying the Real Time API

Since the model is deployed and accessible using an URL endpoint, we now have to build a transformation API that takes the input data and transforms it into the needed format for the model endpoint and then calls the model.

To make using the model easier for other services, our data entry format is JSON. This makes the data human readable and changes to any steps concerning the model, except changing the number of attributes, can be done without dependencies on our client services.

REST Service

As framework for our REST API we chose Flask, since it is lightweight, flexible, easy to use and also written in Python. Since API and model are written in the same language we can make use of the preprocessing from the training package we needed for training above. The main work here lies in adapting the code to only run one single event, instead of the batch prediction, used to validate the result during training.

We also created an extra package containing all transformation functions, we use in several of our models. This package contains, e.g. min-max-normalization and distance calculation functions.

Speed is important in this component, so we store all data for enriching and transforming the incoming data inside a cache.

After receiving the prediction from the model, we qualify the results for regression models, by adding a confidence value. This helps our clients to better understand the results, especially if they are meant to be shown to end users.

Each of our responses has its own error code and message that is supplied in the result. The result is again in JSON format with the following fields:

success: true or false, indicating result of request

message: (error) message for response

prediction object

Deployment of API

Deployment to our production system is then handled by a Jenkins job with the following steps:

By using Cloud Run we do not need to worry about hardware configuration and can focus on optimizing the API and the model.

Conclusion

By following this process we make sure that the time spent on the necessary things, beside building a model, for bringing machine learning models into production is kept to a minimum and does not include managing any hardware resources or availability concerns.

Especially the part after the manual data and model selection process is usable as a template to fasten the deployment process. This is thanks to the tools provided by Google and our extracting reusable functions into their own Python package.

R Project and Production

Running R Project in production is a controversially discussed topic, as is everything concerning R vs Python. Lately there have been some additions to the R Project, that made me look into this again. Researching R and its usage in production environments I came across several packages / project, that can be used as a solution for this:

Plumber

For reasons of ease of use and because it was not a hosted version, I took a deeper look into Plumber. This felt quite natural as it uses function decorators for defining endpoints and parameters. This is similar to Spring Boot, which I normally use for programming REST APIs.
So using Plumber is really straight forward, as the example below shows:

The #’ @get defines the endpoint for this request. In this case /hello, so the full url on localhost is http://127.0.0.1:8001/hello. To pass in one or more parameters you can use the decorator #’ @param parameter_name parameter_description. A more complicated example using Plumber is hosted on our Gitlab. This example was built with the help of Tidy Textmining.

Production ready?

Plumber comes with Swagger, so the webserver is automatically available. As the R instance is already running, processing the R code does not take long. If your model is complicated, then, of course, this is reflected in the processing time. But as R is a single thread programming language, Plumber can only process one request at a time.
There are ways to tweak this of course. You can run several instances of the service, using a Docker image. This is decribed here. There is also the option of using a webserver to fork the request to serveral instances of R. Depending on the need of the API, single thread processing can be fast enough. If the service has to be highly available the Docker solutions seems like a good choice, as it comes with a load balancer.

Conclussion

After testing Plumber I am surprised by the easy of use. This package makes deploying an REST API in R really easy. Depending on your business needs, it might even be enough for a productive scenario, especially when used in combination with Docker.

Pivotal ported their massively parallel processing (MPP) database Greenplum to Hadoop and made it open source as an incubating project at Apache, called Apache HAWQ. This bring together full ANSI SQL with MPP capabilities and Hadoop integration.

The integration in an existing Hadoop installation is easy, as you can integrate all existing data via external tables. This is done using the pxf API to query external data. This API is customizable, but already brings the most used formats ready made. These include:

To access and store small amounts of data Apache HAWQ has an interface called gpfdist. This enables you to store data outside of your HDFS and still access it within HAWQ to join with the data stored in HDFS. This is especially handy, when you need small tables for dimension or mapping data in Apache HAWQ. This data will then not use a whole block of your HDFS, that is mostly empty.

Apache HAWQ even come integrated with MADlib, also an Apache incubating product, developed by Pivotal. MADlib is a Machine Learning framework, based on SQL. So moving data between different tools for analysing it, is not need anymore. If you have stored your data in Apache HAWQ, you can mine it in the database directly and don’t have to export it, e.g. to a Spark client or tools like Knime or RapidMiner.

MADlib algorithms

MADLib comes with algorithms in the following categories:

Classification

Regression

Clustering

Topic Modelling

Assocition Rule Mining

Descriptive Statistics

By using HAWQ you even can leverage tools like Tableau with real time database connections, which was not satisfactory so far when you used Hive.

But there is also the possibility to add your own interpreter to Zeppelin. This makes this tool really flexible.
Another feature it has, is the built in integration of Apache Spark. It ships with the following features and more:

Automatic SparkContext and SQLContext injection

Runtime jar dependency loading from local filesystem or maven repository.

Canceling job and displaying its progress

It also has built in visualization, which is an improvemnt over using ipython notebooks I think. The visualization covers the most basic graphs, like:

Tables

BarCharts

Pies

Scatterplot

Lines

These visualizations can be used with all interpreters and are always the same. So you can show data from Postgres and Spark in the same notebook with the same functions used. There is no need to handle different data sources differently.
You can also use dynamic forms in your notebooks, e.g. to provide filter options to the user. This comes in handy, if you embedd a notebook in your own website.

Apache Spark has release version 2.0, which is a major step forward in usability for Spark users and mostly for people, who refrained from using it, due to the costs of learning a new programming language or tool. This is in the past now, as Spark 2.0 supports improved SQL functionalities with SQL2003 support. It can now run all 99 TPC-DS. The new SQL parser supports ANSI-SQL and HiveQL and sub queries.
Another new features is native csv data source support, based on the already existing Databricksspark csv module. I personally used this module as well as the spark avro module before and they make working with data in those formats really easy.
Also there were some new features added to MLlib:

Spark increased its performance with the release of 2.0. The goal was to make Spark 2.0 10x faster and Databricks shows this performance tuning in a notebook.

All of these improvements make Spark a more complete tool for data processing and analysing. The added SQL2003 support even makes it available for a larger user base and more importantly makes it easier to migrate existing applications from databases to Spark.

In Data Science there are two languages that compete for users. On one side there is R, on the other Python. Both have a huge userbase, but there is some discussion, which is better to use in a Data Science context. Lets explore both a bit:

R
R is a language and programming environment especially developed for statistical computing and grahics. It has been around some time and several thousand packages to tackle statistical problems. With RStudio it also provides an interactive programming environment, that makes analysing data pretty easy.

Python
Python is a full range programming language, that makes it easy to integrate into a company wide system. With the packages Numpy, Pandas and Scikit-learn, Mathplotlib in combination with IPython, it also provides a full range suite for statistical computing and interactive programming environment.

R was developed solely for the purpose of statistical computing, so it has some advantages there, since it is specialized and has been around some years. Python is coming from a programming language and moves now into the data analysis field. In combination with all the other stuff it can do, websites and easy integrations into Hadoop Streaming or Apache Spark.
And for people who want to use the best of both sides can always use the R Python integration Rpy2.

I personally am recently working with Python for my ETL processes, including MapReduce, and anlysing data, which works awesome in combination with IPython as interactive development tool.

Since Apache Spark became a Top Level Project at Apache almost a year ago, it has seen some wide coverage and adoption in the industry. Due to its promise of being faster than Hadoop MapReduce, about 100x in memory and 10x on disk, it seems like a real alternative to doing pure MapReduce.
Written in Scala, it provides the ability to write applications fast in Java, Python and Scala, and the syntax isn’t that hard to learn. There are even tools available for using SQL (Spark SQL), Machine Learning (MLib) interoperating with Pythons Numpy, graphics and streaming. This makes Spark to a real good alternative for big data processing.
Another feature of Apache Spark is, that it runs everywhere. On top of Hadoop, standalone, in the cloud and can easily access diverse data stores, such as HDFS, Amazon S3, Cassandra, HBase.

The easy integration into Amazon Web Services is what makes it attractive to me, since I am using this already. I also like the Python integration, because latelly, that became my favourite language for data manipulation and machine learning.

Besides the official parts of Spark mentioned above, there are also some really nice external packages, that for example integrate Spark with tools such as PIG, Amazon Redshift, some machine learning algorithms, and so on.

Given the promised speed gain, the ease of use and the full range of tools available, and the integration in third party programms, such as Tableau or MicroStrategy, Spark seems to look into a bright future.

The inventors of Apache Spark also founded a company called databricks, which offers professional services around Spark.

With Hadoop 2.0 and the new additions of Stinger and Impala I did a (not representive) test of the performance on a Virtual Box running on my desktop computer. It was using the following setup:

4 GB RAM

Intel Core i5 2500 3.3 GHz

The datasets were the following:

Dataset 1: 71.386.291 rows and 5 columns

Dataset 2: 132.430.086 rows and 4 columns

Dataset 3: partitioned data of 2.153.924 rows and 32 columns

Dataset 4: unpartitioned data of 2.153.924 rows and 32 columns

The results were the following:

Query

Hive (0.10.0)

Impala

Stinger (Hive 0.12.0)

Join tables

167.61 sec

31.46 sec

122.58 sec

Partitioned tables Dataset 3

42.45 sec

0.29 sec

20.97 sec

Unpartitioned tables Dataset 4

47.92 sec

1.20 sec

36.46 sec

Grouped Select Dataset 1

533.83 sec

81.11 sec

444.634 sec

Grouped Select Dataset 2

323.56 sec

49.72 sec

313.98 sec

Count Dataset 1

252.56 sec

66.48 sec

243.91 sec

Count Dataset 2

158.93 sec

41.64 sec

174.46 sec

Compare Impala vs. Stinger

This shows that Stinger provides a faster SQL interface on Hive, but since it is still using Map / Reduce when calculating data it is no match for Impala that doesn’t use Map / Reduce. So using Impala makes sense when you want to analyse data in Hadoop using SQL even on a small installation. This should give you easy and fast access to all data stored in your Hadoop cluster, that was before not possible.
Facebook’s Presto should achieve nearly the same results, since the underlying technique is similar. These latest additions and changes to the Hadoop framework really seem like a big boost in making this project more accessible for many people.

Earlier this month Facebook open sourced its own product for using SQL on Hadoop. It is called Presto and is something like Facebook’s answer to Cloudera’s Impala or Hortonwork’s Stinger already presented in an earlier post called SQL and Hadoop on this site.
Presto is unlike Hive and more like Impala, since it doesn’t rely on MapReduce for its queries. This makes it about 10 times faster than Hive on large datasets, or so Facebook claims in a blog post.
This product may have a huge impact on the further development of SQL on Hadoop tools, if it’s taken up by enough companies. But since there is no commercial goal linked to it right now, it seems more like Facebook will develop it as their needs increase. So they will not be hurried along.
Like Impala it does support a huge subset of ANSI SQL contrary to Hive’s SQL like HiveQL. So it again aims on making Hadoop more accessible for a broader audience of analyst, that already are familiar with SQL.
Analysis on Big Data sets have been strengthened by this release even more and the entry level investments for more companies to use Hadoop as data storage system are decreasing with every development in this direction.

Bringing SQL to Hadoop has been one of the major trends in Big Data these last twelve months. Reason enough for me to take a closer look at that scene right now. One reason to build an interface based on SQL for Hadoop is to make the technology available for more people. Companies that have used SQL for decades won’t just stop and use something different for analysing and accessing their data.
Another reason lies in the nature of Hadoop, as it’s build as a batch processing system, which can be slow in answering queries. These new products emerging are trying to speed up the already existing SQL product Apache released named Hive.
There are two approaches to bringing SQL to Hadoop:

SQL natively on Hadoop

DBMS on Hadoop

SQL natively on Hadoop

Some example products in this category are:

Stinger from HortonWorks, which claims to make SQL on Hadoop 100x faster than Hive. This product is based on Hadoop 2.0 and the new YARN framework.

Impala from Coudera, which also claims speed up SQL queries compared to Hive. It is also design to co-exist with MapReduce and can be cleanly integrated into the Hadoop stack.

DBMS on Hadoop

Some example products in this category are:

Hadapt, which includes a PostgreSQL instance on each node and takes advantage of the distirubted filesystem for speed and supports advanced SQL functions. They recently introduced a feature called “Schemaless SQL” for their product. This integrates data such as JSON, Documents, etc. into their system and lets you access them by SQL. This stores the data in the original form on the HDFS and emerges columns in a Multistructured table as needed. They posted a detailed explanation here.

CitusDB, which also includes a PostgreSQL instance on each node. This means advanced SQL functions are supported here too.

Tajo founded in South Korea is still in incubator mode with Apache, but will bear watching too.

The two different approaches have their benefits each, and to decide which fits you better, I would test both of them. The main issue with all the products is, that this is all relatively new and there is little experience with the technology yet. Some of the products even are still in development, only offering Beta access.
But here is where the future of Big Data will take us. Making the benefits of Hadoop available for more analysts by building an interface they already can use.

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.