KeystoneML is a software framework, written in Scala, from the UC Berkeley AMPLab. It is designed to simplify the construction of large-scale, end-to-end machine learning pipelines with Apache Spark.

We contributed to the design of spark.ml during the development of KeystoneML, so if you’re familiar with spark.ml you’ll recognize some shared concepts. There are a few important differences, however, particularly around type safety and chaining, which lead to pipelines that are easier to construct and more robust.

KeystoneML also offers a richer set of operators than spark.ml, including featurizers for images, text, and speech, and provides several example pipelines that reproduce state-of-the-art academic results on public data sets.

News

2017-04-18 The KeystoneML paper will be presented at ICDE 2017. See you in San Diego!

2017-03-02 KeystoneML version 0.4.0 has been released and pushed to Maven central. See the release notes for more information.

2016-03-24 KeystoneML version 0.3.0 has been released and pushed to Maven central. See the release notes for more information.

2015-10-08 We’ve put together a minimal example application for you to use as a basis for starting your own projects that use KeystoneML.

2015-09-18 KeystoneML version 0.2.0 has been pushed to Maven central. See the release notes for more information.

2015-09-17 KeystoneML is on Maven Central. We have added a new “linking” section below.

What is KeystoneML for?

KeystoneML makes constructing even complicated machine learning pipelines easy. A typical example is a text categorization pipeline that builds bigram features and trains a Naive Bayes model on the 100,000 most common features.
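
To make the shape of such a pipeline concrete, here is a minimal sketch in plain Scala. It does not use the KeystoneML API: every name in it (`TextPipelineSketch`, `featurize`, `commonFeatures`, `naiveBayes`, `fit`) is a hypothetical stand-in, the stages are ordinary functions chained with the standard library's `Function1.andThen`, and everything runs on in-memory collections rather than Spark. The Naive Bayes step is assumed to be a multinomial model with add-one smoothing over binary features, which may differ from what KeystoneML does internally.

```scala
// Illustrative sketch only: plain Scala, not the KeystoneML API.
// All names below are hypothetical stand-ins for pipeline stages.
object TextPipelineSketch {
  // Each stage is a typed function; andThen only compiles when the output
  // type of one stage matches the input type of the next.
  val trim: String => String = _.trim
  val lowerCase: String => String = _.toLowerCase
  val tokenize: String => Seq[String] =
    _.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  // All n-grams of order 1..maxOrder (unigrams and bigrams for maxOrder = 2).
  def nGrams(maxOrder: Int): Seq[String] => Seq[String] = tokens =>
    (1 to maxOrder).flatMap { n =>
      tokens.sliding(n).filter(_.size == n).map(_.mkString(" "))
    }

  // Binary term frequency: a feature is either present in a document or not.
  val binaryTerms: Seq[String] => Set[String] = _.toSet

  val featurize: String => Set[String] =
    trim andThen lowerCase andThen tokenize andThen nGrams(2) andThen binaryTerms

  // "Estimator" stage: keep the k features found in the most training documents.
  def commonFeatures(k: Int, docs: Seq[Set[String]]): Set[String] => Set[String] = {
    val vocab = docs.flatten.groupBy(identity).toSeq
      .sortBy { case (term, hits) => (-hits.size, term) }
      .take(k).map(_._1).toSet
    features => features.intersect(vocab)
  }

  // "Estimator" stage: multinomial Naive Bayes with add-one smoothing,
  // followed by an argmax over the per-class log scores.
  def naiveBayes(docs: Seq[Set[String]], labels: Seq[Int]): Set[String] => Int = {
    val vocabSize = docs.flatten.distinct.size
    val perClass = docs.zip(labels).groupBy(_._2).toSeq.sortBy(_._1).map {
      case (label, pairs) =>
        val counts = pairs.flatMap(_._1).groupBy(identity)
          .map { case (term, hits) => term -> hits.size }
        (label, math.log(pairs.size.toDouble / docs.size), counts, counts.values.sum)
    }
    features => perClass.map { case (label, logPrior, counts, total) =>
      val logLik = features.toSeq.map { f =>
        math.log((counts.getOrElse(f, 0) + 1.0) / (total + vocabSize))
      }.sum
      (logPrior + logLik, label)
    }.reduce((a, b) => if (b._1 > a._1) b else a)._2
  }

  // Fit the data-dependent stages, then chain everything into one predictor.
  def fit(texts: Seq[String], labels: Seq[Int], k: Int): String => Int = {
    val featurized = texts.map(featurize)
    val select = commonFeatures(k, featurized)
    val classify = naiveBayes(featurized.map(select), labels)
    featurize andThen select andThen classify
  }

  def main(args: Array[String]): Unit = {
    // Toy data, invented for this sketch: 0 = sports, 1 = politics.
    val texts = Seq(
      "The match ended in a late goal", "Our team won the cup final",
      "The senate passed the budget bill", "Voters backed the new tax bill")
    val labels = Seq(0, 0, 1, 1)
    val predictor = fit(texts, labels, k = 100000)
    println(predictor("  The team scored a goal  "))
  }
}
```

The point of the sketch is the chaining: the fitted predictor is a single `String => Int` function, and a stage with the wrong input type (for example, placing `tokenize` after `nGrams(2)`) is rejected by the compiler rather than failing at run time.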

Getting Help and Contributing

KeystoneML is an Apache Licensed open-source project and we welcome contributions.
Have a look at our GitHub Issues page if you’d like to contribute, and feel free to fork the repo and submit a pull request!

Citing

If you use KeystoneML in academic work, please cite the following paper: