Machine Learning Throwdown, Part 6 – Summary

All right, let’s wrap this thing up! In the fifth post of the series, I discussed a few miscellaneous topics for each service to help you make a more educated decision about which one might be the best for you. This post will be my last. I will review what we’ve learned so far and provide some final thoughts and recommendations.

The Posts

I’ve been at this for a while. Let’s do a recap:

Part 1 – Introduction

In the first post, I introduced myself and the machine learning throwdown. BigML hired me to spend some time this summer comparing their service to the competition. I am very pleased that I was able to write my honest opinions even if they were not always in BigML’s favor.

That said, are my results completely unbiased? Probably not. I tried to remain objective, but BigML did pay me to do this comparison. I spent some time with them in their office, I ate their snacks, and I drank their Kool-Aid (er, coffee). Use my advice as a starting point, but play around with these services and make your own decision about which one is best for you.

As a reminder, I compared three cloud-based machine learning services: BigML, Google Prediction API, and Prior Knowledge. BigML and Prior Knowledge are both in beta, while Google Prediction API has been out of beta for nearly a year. I also included Weka, a time-tested application and suite of machine learning algorithms, to compare the cloud-based services to a traditional desktop application.

Part 2 – Data Preparation

My second post looked at getting started with each service and importing your data. Some important considerations include the amount of setup and configuration required to get started, the availability of libraries for your favorite programming language, how strict they are about the format of your data, and the amount of data they can handle.

I am incredibly impressed with BigML in this category. Machine learning is not easy, but they have done more than any of the other services to help make this technology accessible to non-experts. I’m not the only person impressed by how easy it is to use BigML. In a recent article on GigaOM, Derrick Harris talks about how he was able to analyze data with BigML from his couch with company at his house and two toddlers running around.

Part 3 – Models

My third post talked about the process of turning your data into a predictive model. Models range from black box, where all the details are hidden from you, to completely white box where you can see and understand the model and use it to gain insights about your data. Some other important considerations include how easy it is to create and optimize a model, the type of data a model can learn from, and the types of operations supported by the model.

Part 4 – Predictions

My fourth post was a fun one. I presented the results of computing cross-validation scores indicating how well each service is able to make accurate predictions. Google Prediction API came in first most of the time, but a closer look revealed that the runners-up were usually not far behind. It turns out that the quality of your data is often the limiting factor rather than your choice of service/model. It might be wise to try your own data on multiple services to see which one makes the best predictions, but this is quite time-consuming and nontrivial because they don’t all report cross-validation scores using the same metrics. If you really need to squeeze every last bit of predictive performance out of your data, it’s probably time to look into hiring a data scientist ($$$).
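To make the "same metric" point concrete, here is a minimal sketch of k-fold cross-validation that scores any predictor with plain accuracy. It is not the code used in the throwdown and it does not call any of the services' APIs. The names `cross_val_accuracy`, `k_fold_indices`, and `majority_baseline` are hypothetical, and the majority-class baseline is only a stand-in for whatever training call a given service provides.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[float], str]            # (features, label)
Predictor = Callable[[Sequence[float]], str]
Trainer = Callable[[List[Row]], Predictor]   # hypothetical: wraps one service


def k_fold_indices(n: int, k: int) -> List[List[int]]:
    """Split range(n) into k interleaved folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_val_accuracy(rows: List[Row], train_fn: Trainer, k: int = 5) -> float:
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train = [row for i, row in enumerate(rows) if i not in held]
        predict = train_fn(train)
        correct += sum(predict(rows[i][0]) == rows[i][1] for i in held_out)
    return correct / len(rows)


def majority_baseline(train: List[Row]) -> Predictor:
    """Stand-in model: always predict the most common training label."""
    most_common = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: most_common


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    rows = [([0.0], "a")] * 8 + [([1.0], "b")] * 2
    print(cross_val_accuracy(rows, majority_baseline, k=5))
```

Wrapping each service in its own `train_fn` and passing them all through the same function would yield scores that can be compared directly. A baseline like this one is also a useful sanity check: if a service barely beats it, the data is probably the limiting factor.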

Part 5 – Miscellaneous

My fifth post covered a few miscellaneous topics including stability, cost, support, and documentation. The big surprise here was that all of the services suffered from multiple random failures while I was evaluating them. They all have some work to do in this area. For now, consider using BigML or Weka if you need completely reliable predictions. Both of these options allow you to make predictions offline without worrying about occasional API failures.

Final Thoughts

If you have been following along with this series of blog posts and haven’t tried any of these services yet, what are you waiting for? Find some data that interests you and see what you can do with it! You can use your own data or find some from a source such as the UC Irvine Machine Learning Repository.

Which service should you use? I strongly recommend starting with BigML since it is the easiest to use and everything can be done on the website without writing code. Their interactive decision trees let you visually explore models in ways the other services don’t even come close to. Check out other posts on BigML’s blog for examples and browse the public model gallery for inspiration. Contact BigML if you need help or if there are new features you would like to see.

If BigML isn’t your cup of tea, please do try the other services. Each has its own unique features, so you should be able to find something that works for you. We’re at the beginning of an important era where anyone can use data to help them make decisions. The more people who use any of these services, the better it is for everyone!

I hope you have enjoyed reading these posts as much as I have enjoyed writing them. Goodbye and good luck!

(Note: As of Dec 5, 2012, Prior Knowledge no longer supports its public API.)

Postscript from Charles Parker, Nick’s first level of supervision at BigML: Let’s give an Internets round of applause for Nick’s fine work this summer. He picked up on this machine learning stuff awfully quickly and did some nice development on top of some shoddy platforms (see post numero four). He’s going to work on his Master’s for the next year or two, but he’ll be looking for work again soon enough and when he is you should hire him (unless we beat you to the punch).