Microsoft Machine Learning for Apache Spark

MMLSpark is an ecosystem of tools that expands the distributed computing framework
Apache Spark in several new directions.
MMLSpark adds many deep learning and data science tools to the Spark ecosystem,
including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit
(CNTK), LightGBM, and OpenCV. These tools enable powerful and highly scalable predictive and
analytical models for a variety of data sources.

MMLSpark also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users
can embed any web service into their SparkML models. In this vein, MMLSpark provides easy-to-use
SparkML transformers for a wide variety of Microsoft Cognitive Services. For production-grade deployment,
the Spark Serving project enables high-throughput, sub-millisecond-latency web services backed by your Spark cluster.

MMLSpark requires Scala 2.11, Spark 2.3+, and either Python 2.7 or Python 3.5+.
See the API documentation for Scala and for PySpark.

If you're using the Azure Portal to run the script action, go to Script actions → Submit new in the Overview section of your cluster blade. In
the Bash script URI field, input the script action URL provided above. Mark
the rest of the options as shown on the screenshot to the right.

Submit, and the cluster should finish configuring within 10 minutes or so.

SBT

If you are building a Spark application in Scala, add the following lines to
your build.sbt:
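
The dependency lines themselves are not reproduced in this text, so the following is only a minimal sketch of what such a build.sbt fragment could look like. The resolver URL, group ID, and artifact name are assumptions that should be checked against the MMLSpark release notes, and `mmlsparkVersion` is a hypothetical placeholder that must be replaced with a real release number before the build will resolve.

```scala
// build.sbt fragment (sketch only; verify every coordinate against the MMLSpark release notes)

// MMLSpark requires Scala 2.11; the exact patch version here is an assumption.
scalaVersion := "2.11.12"

// Hypothetical placeholder: replace with the MMLSpark release you want to use.
val mmlsparkVersion = "x.y.z"

// Assumed Maven repository that hosts the MMLSpark artifacts.
resolvers += "MMLSpark Repo" at "https://mmlspark.azureedge.net/maven"

// Assumed group ID and artifact name; %% appends the Scala binary version (_2.11).
libraryDependencies += "com.microsoft.ml.spark" %% "mmlspark" % mmlsparkVersion
```

Keeping the version in a single `val` means an upgrade only touches one line, and the `%%` operator ties the artifact to the Scala 2.11 binary version stated in the requirements.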

Building from source

You can also create your own build by cloning this repo and using the main
build script: ./runme. Run it once to install the needed dependencies, and
again to do a build. See this guide for more
information.

R (Beta)

To try out MMLSpark using the autogenerated R wrappers, see our
instructions. Note: This feature is still under development,
and some necessary custom wrappers may be missing.