Prediction.IO — MVC for machine learning

Some time ago, when I needed to do a simple proof-of-concept spam recognition and was looking for a simple framework in Scala or Python, I came across prediction.io by chance. A quick glimpse at the page made me interested, and I decided it was worth a try. My impression of this tool was really great: simple, easy to set up, minimal effort to get a machine learning engine running, and despite its simplicity of use it provides a really complete machine learning solution. It is based on cutting-edge data mining tools: Elasticsearch, HBase and Spark (together with MLlib), with Scala as the domain language. Below I will show how to run a server for simple spam detection using this tool with minimum effort.

A few steps to set up a spam detection server in 5 minutes

So let’s start step by step…

First of all, install it with one simple command:

$ bash -c "$(curl -s https://install.prediction.io/install.sh)"

It will be installed in your $HOME/PredictionIO directory together with all the required tools.

Now that you have prediction.io installed, go to its directory. Next you need to download one of prediction.io's templates. A template is a complete prediction engine. On the official page, https://templates.prediction.io/, you can find a really nice set of ready-to-use templates. For mail classification I picked "template-scala-parallel-textclassification". So let's run:

$ pio template get template-scala-parallel-textclassification spam-detection

and go to the newly created "spam-detection" directory where the template was downloaded. Look inside src to see how few source files and how little code are needed for a complete engine.

Now that we have the template ready, we need to build it, train it and deploy it. But before doing any of that, we need to launch all the required third-party tools mentioned earlier. Don't worry, a single predefined command does it:

$ pio-start-all

The command above also launches the event server. The event server is prediction.io's way of feeding data to our engine via HTTP requests (there are SDKs for the most popular programming languages, but of course you can use plain curl). Normally, to add a single new data entry you make an HTTP request passing a JSON object with parameters matching your data structure. But to provide the initial batch of data for the first training of our engine, we can use a command-line tool to import data from a .json file. The template even provides example data files for the first training session in the "data" directory (emails.json and stopwords.json). Before sending any data you need to create an application to obtain an application ID (this creates a table in HBase for the app, and the ID is then required when sending events so the data goes to the right table). So to do all of this (create the app ID and import the two data files), type:

$ pio app new MyTextApp
$ pio import --appid *** --input data/stopwords.json
$ pio import --appid *** --input data/emails.json

replacing *** with your application ID.
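As a sketch, this is roughly what building a single event for the event server could look like in Python. The event name, entity fields and property names below are illustrative assumptions, not the template's confirmed schema; check emails.json in the template's data directory for the exact format your engine expects, and note that the access key is a placeholder printed by pio app new.

```python
import json

# Placeholder; `pio app new MyTextApp` prints the real access key.
ACCESS_KEY = "YOUR_ACCESS_KEY"
EVENT_SERVER_URL = "http://localhost:7070/events.json?accessKey=" + ACCESS_KEY

def make_event(text, label):
    """Build one event payload; field names here are assumptions to be
    checked against the template's emails.json."""
    return {
        "event": "documents",      # assumed event name
        "entityType": "content",   # assumed entity type
        "entityId": "1",
        "properties": {"text": text, "label": label},
    }

payload = json.dumps(make_event("Earn extra cash!", "spam"))
# Sending it would be e.g.:
#   requests.post(EVENT_SERVER_URL, data=payload,
#                 headers={"Content-Type": "application/json"})
print(payload)
```

The command-line import above does the same thing in bulk, so for the first training session you never need to write this by hand.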

Now it's time to run three core, magic commands:

$ pio build
$ pio train
$ pio deploy

The first command builds the project (template), the second one trains a model using the data we imported earlier, and the last one deploys the engine. From then on you can happily open http://localhost:8000 and see that your server is running and ready to make predictions. To test it, let's run some first queries:

$ curl -H "Content-Type: application/json" -d '{ "text": "I like speed and fast motorcycles." }' http://localhost:8000/queries.json
$ curl -H "Content-Type: application/json" -d '{ "text": "Earn extra cash!" }' http://localhost:8000/queries.json

and you will see results like the following:

{"category":"not spam","confidence":0.852619510921587}
{"category":"spam","confidence":0.5268770133242983}
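If you prefer querying from code rather than curl, here is a minimal Python sketch. It only builds the query body and parses a response of the shape shown above; actually sending the request would use urllib or the requests library, as hinted in the comment.

```python
import json

QUERY_URL = "http://localhost:8000/queries.json"

def build_query(text):
    """Build the JSON body that the deployed engine's /queries.json expects."""
    return json.dumps({"text": text})

# Sending would be e.g.:
#   requests.post(QUERY_URL, data=build_query("Earn extra cash!"),
#                 headers={"Content-Type": "application/json"})

# Parsing a response of the shape the engine returns:
sample_response = '{"category":"spam","confidence":0.5268770133242983}'
result = json.loads(sample_response)
is_spam = result["category"] == "spam" and result["confidence"] >= 0.5
```

The 0.5 threshold here is just an illustration; the engine already picks the category, and the confidence tells you how sure it is.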

What else do we need?

At the beginning of the post I mentioned that this solution is complete, so it couldn't be without the ability to evaluate our machine learning engine, allowing us to tune and choose the best parameters of an algorithm and also to choose the best algorithm. PredictionIO comes with an Evaluation component, and all templates provide a simple implementation of it. By convention the evaluators use cross-validation (we can of course configure the number of folds, k). To see how it works, type:

$ pio eval org.template.textclassification.AccuracyEvaluation org.template.textclassification.EngineParamsList

and you will see, one by one, test results for all parameter sets of all algorithms. The best configuration is saved in best.json by default. This config can be used later by a training session (pio train -v best.json).
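To illustrate the idea behind the evaluation, here is a toy k-fold split in Python. This is not PredictionIO's actual implementation, just a sketch of what cross-validation does: each of the k folds serves as the test set once while the rest is used for training.

```python
def k_fold_splits(data, k):
    """Yield k (train, test) pairs; each item lands in exactly one test fold."""
    folds = [data[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# With 6 items and k=3, we get 3 splits of 4 training and 2 test items.
data = list(range(6))
splits = list(k_fold_splits(data, 3))
```

The evaluator then scores each candidate parameter set (with the chosen metric, here accuracy) averaged over the k test folds, which is why the winning configuration in best.json is a reasonably robust choice.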

We can do more and more…

We can of course do much more than described above. We can add more engines: existing ones (from other templates, for instance) or easily write our own. We can add more metrics (in our example there is a single, simple one: accuracy), and of course we can write our own. We can edit or extend every part of the template: the data processing before it goes to the training session, the engine behaviour, and so on. We are not limited to using one best engine: we can easily configure our template to use more than one engine in parallel, so that the prediction result is a combination of several engines' results. MLlib is a big help for all of the mentioned tasks.
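In PredictionIO itself, combining engine results is handled by the framework's engine configuration, but as an illustration of the idea, a naive combiner over several engines' (category, confidence) outputs might look like this sketch; the averaging strategy is my own simplification, not the framework's algorithm.

```python
def combine(predictions):
    """Combine several engines' (category, confidence) results by averaging
    confidence per category and picking the category with the best average."""
    totals, counts = {}, {}
    for category, confidence in predictions:
        totals[category] = totals.get(category, 0.0) + confidence
        counts[category] = counts.get(category, 0) + 1
    best = max(totals, key=lambda c: totals[c] / counts[c])
    return best, totals[best] / counts[best]

# Two engines say "spam", one says "not spam"; the average favours "spam".
category, confidence = combine([
    ("spam", 0.9), ("spam", 0.6), ("not spam", 0.7),
])
```

A real setup would weight engines by their evaluation scores rather than treating them equally, which is exactly where the Evaluation component from the previous section pays off.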