Converting Apache Spark ML pipeline models to PMML documents

The project has been around for more than two years by now.
The first iteration defined public API entry point in the form of a org.jpmml.sparkml.ConverterUtil utility class. Subsequent iterations have been mostly about adding support for more transformer and model types, and expanding Apache Spark version coverage.
However, recent iterations have been gradually introducing new public API building blocks, and the last iteration (26 June, 2018) made them official.

The current blog post details this breaking API change, and all the new features and functionality that it entails.

API overview

The old API was designed after Apache Spark MLlib’s org.apache.spark.mllib.pmml.PMMLExportable trait.
The ConverterUtil#toPMML(StructType, PipelineModel) utility method was simply doing its best to emulates the non-existing org.apache.spark.ml.PipelineModel#toPMML(StructType) method.

The new API is designed after the builder pattern.
The primary (ie. essential) state of the org.jpmml.sparkml.PMMLBuilder class includes the data schema and the fitted pipeline. The initial values are supplied via the the two-argument PMMLBuilder(StructType, PipelineModel) constructor, and can be updated any time via the #setSchema(StructType) and #setPipelineModel(PipelineModel) mutator methods.

Data schema mutation may involve renaming columns or clarifying their data types.
Apache Spark ML pays little attention to the data type of categorical features, because after string indexing, vector indexing, vector slicing and dicing, they all end up as double arrays anyway.
In contrast, JPMML-SparkML carefully collects and maintains (meta-)information about each and every feature, with the aim of using it to generate more precise and nuanced PMML documents. For example, JPMML-SparkML (just like all other JPMML conversion libraries) eagerly takes note if the data type of a column is indicated as boolean, and generates simplified, binary logic PMML language constructs wherever possible.

Pipeline mutation may involve inserting new transformers and models, or removing existing ones. For example, generating a “complete” PMML document, then removing all transformers and generating a “model-only” PMML document.

The secondary (ie. non-essential) state of the PMMLBuilder class includes conversion options and verification data.

Application code should treat PMMLBuilder objects as local throwaway objects.
Due to the tight coupling to the Apache Spark runtime environment, they are not suitable for persistence, or exchanging between applications and runtime environments.

JPMML-SparkML is versioned after Apache Spark major versions. There is an active JPMML-SparkML development branch for every supported Apache Spark major version. The “conversion logic” for some transformer or model is implemented in the earliest development branch, and is then merged forward to later development branches (with appropriate modifications).

At the time of writing this (July 2018; updated in January 2019), JPMML-SparkML supports all current Apache Spark 2.X versions:

JPMML-SparkML checks the version of Apache Spark runtime environment before doing any conversion work.
For example, the following exception is thrown when JPMML-SparkML version 1.4(.7) discovers that it has been improperly paired with Apache Spark version 2.2:

Library JAR

The library JAR file can be “imported” into Apache Spark version 2.3.0 (and newer) using the --packages command-line option. Package coordinates must follow Apache Maven conventions ${groupId}:${artifactId}:${version}, where the groupId and artifactId are fixed as org.jpmml and jpmml-sparkml, respectively.

Important: this library JAR file is not directly usable with Apache Spark versions 2.0 through 2.2 due to the SPARK-15526 classpath conflict.

This classpath conflict typically manifests itself during the conversion work, in the form of some obscure java.lang.NoSuchFieldError or java.lang.NoSuchMethodError:

java.lang.NoSuchMethodError: org.dmg.pmml.Field.setOpType(Lorg/dmg/pmml/OpType;)Lorg/dmg/pmml/PMMLObject;
at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:215)
at org.jpmml.sparkml.feature.StringIndexerModelConverter.encodeFeatures(StringIndexerModelConverter.java:74)
at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:47)
at org.jpmml.sparkml.feature.RFormulaModelConverter.registerFeatures(RFormulaModelConverter.java:65)
at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:114)
at org.jpmml.sparkml.PMMLBuilder.buildByteArray(PMMLBuilder.java:278)
at org.jpmml.sparkml.PMMLBuilder.buildByteArray(PMMLBuilder.java:274)
... 48 elided

Executable uber-JAR

The executable uber-JAR file can be “imported” into any Apache Spark version using the --jars command-line option.

For example, starting PySpark with the JPMML-SparkML executable uber-JAR:

Updating application code

The org.jpmml.sparkml.ConverterUtil utility class is still part of JPMML-SparkML, but it has been marked as deprecated, and rendered dysfunctional - both #toPMML(StructType, PipelineModel) and #toPMMLByteArray(StructType, PipelineModel) utility methods throw an java.lang.UnsupportedOperationException when invoked:

The org.jpmml.sparkml.PMMLBuilder builder class currently exposes three builder methods:

#build() - Returns the PMML document as a live org.dmg.pmml.PMML object.

#buildByteArray() - Returns the PMML document as a byte array.

#buildFile(java.io.File) - Writes the PMML document to the specified file in local filesystem. Upon success, returns the argument java.io.File object unchanged.

The first option is aimed at PMML-savvy applications that wish to perform extra processing on the PMML document (eg. adding or removing feature transformations).
However, most applications should be content with the JPMML-SparkML generated PMML document, and will be processing it as a generic blob.
The choice between the last two options depends on the approximate size/complexity of the PMML document (eg. elementary models vs. ensemble models) and overall application architecture.

New API features and functionality

Conversion options

The “business logic” of some Apache Spark ML transformer or model can often be translated to PMML in more than one way. Some representations are easier to approach for humans (eg. interpretation and manual modification), whereas some other representations are more compact or faster to execute for machines.

The purpose of conversion options is to activate the most optimal representation for the intended application scenario. Granted, the content of PMML documents is well structured and is fairly easy to manipulate using JPMML-Model and JPMML-Converter libraries at any later time. However, achieving the desired outcome by toggling high-level controls is much more productive than writing low-level application code.

There is no easy recipe for deciding which conversion options to tweak, and in which way. It could very well be the case that the defaults work fine for everything except for one specific feature/operation/algorithmic detail.
The recommended way of going about this problematics is generating a “grid” of PMML documents by systematically varying the values of small number of key conversion options, and capturing and analyzing relevant metrics during scoring.

Conversion options are systematized as Java marker interfaces, which inherit from the org.jpmml.sparkml.HasOptions base marker interface.
A similar convention is being enforced across all JPMML conversion libraries. For example, in the JPMML-SkLearn library, which deals with the conversion of Scikit-Learn pipelines to PMML, the class hierarchy is rooted at the org.jpmml.sklearn.HasOptions base marker interface.

Unfortunately, the documentation is severely lacking in this area.
To discover and learn which conversion options are available (in a particular JPMML-SparkML version), simply order the Java IDE to display the class hierarchy starting from the HasOptions base marker interface, and browse through it.

org.jpmml.sparkml.model.HasTreeOptions#OPTION_COMPACT. Controls the encoding of the tree data structure in decision tree models and their ensembles - Apache Spark-style binary splits vs. PMML-style multi-splits. Can reduce the size of PMML documents anywhere between 25 to 50%.

For maximum future-proofness, all conversion option names and values should be given as Java class constants. For example, the name of the decision tree compaction option should be given as org.jpmml.sparkml.model.HasTreeOptions#OPTION_COMPACT (instead of a Java string literal "compact"). If this conversion options should be renamed, relocated, or removed in some future JPMML-SparkML version, then the Java IDE/compiler would automatically issue a notification about it.

The PMMLBuilder builder class currently exposes the following mutator methods:

#putOption(String, Object). Sets the conversion option for all pipeline stages.

Verification

Every conversion operation raises concern, whether the JPMML-SparkML library was doing a good job or not. Say, the conversion operation appears to have succeeded (ie. there were no exceptions thrown, and no warning- or error-level log messages emitted), but how to be sure that the PMML representation of the fitted pipeline shall be making exactly the same predictions as the Apache Spark representation did?

The PMML specification addresses this condundrum with the model verification mechanism.
In brief, it is possible to embed a verification data into models. The verification dataset has two column groups - the inputs, and the predictions that the original ML framework made when those inputs were fed into it.
PMML engines are expected to perform self-checks using the verification data before commissioning the model. If the live predictions made by the PMML engine agree with the stored predictions of the original ML framework (within the defined acceptability criteria), then that is a strong indication that everything is going favourably.
Next to most common cases, the verification dataset should aim to include all sorts of fringe cases (eg. missing, outlier, invalid values) in order to increase the confidence.

The JPMML-Evaluator library can be ordered to perform self-checks on the org.jpmml.evaluator.Evaluator object by invoking its #verify() method.
JPMML-SparkML integration tests indicate that the JPMML family of software (ie. JPMML-SparkML converter plus JPMML-Evaluator scorer) is consistently able to reproduce Apache Spark predictions (eg. regression targets, classification probabilities) with an abolute/relative error of 1e-14 or less.

JPMML-SparkML wrappers

The JPMML-SparkML library is written in Java. It is very easy to integrate into any Java or Scala application to give it Apache Spark ingestion capabilities.

However, there is an even more important audience of data scientists that would like to access this functionality from within their Python (PySpark) and R (SparkR and Sparklyr) scripts.

The JPMML family of software now includes Python and R wrapper libraries for the JPMML-SparkML library. The wrappers are kept as minimal and shallow as possible. Essentially, they provide a language-specific builder class that communicates with the underlying org.jpmml.sparkml.PMMLBuilder Java builder class, and handle the conversion of objects between the two environments (eg. converting Python and R strings to Java strings, and vice versa).

The Apache Spark connection is typically available in PySpark session as the sc variable. The SparkContext class has an _jvm attribute, which gives Python (power-)users direct access to JPMML-SparkML functionality via the Py4J gateway.

Sparklyr

Contrary to Java, Scala and Python, object-oriented design and programming is a bit challenging in R.

The sparklyr2pmml::PMMLBuilder S4 class can only capture the state. It defines two slots sc and java_pmml_builder, which can be initialized via the two-argument constructor. However, most R users are advised to regard this constructor as private, and heed the three-argument sparklyr2pmml::PMMLBuilder S4 helper function instead. This function takes care of initializing a proper java_pmml_builder object based on the sparklyr::tbl_spark training dataset and sparklyr::ml_pipeline_model fitted pipeline objects.

All mutator and builder methods have been “outsourced” to standalone S4 generic functions, which take the sparklyr2pmml::PMMLBuilder object as the first argument: