Integrating SPSS Model Scoring in InfoSphere Streams, Part
2: Using a generic operator

Part 2 of this "Integrating SPSS Model Scoring in InfoSphere Streams" series shows
how to create a generic operator to execute IBM® SPSS Modeler
predictive models in an
InfoSphere® Streams application. It builds off the work of the non-generic
operator produced in Part 1, where we wrote and used an InfoSphere Streams
operator to execute a predictive model in an InfoSphere
Streams application using the IBM SPSS Modeler Solution Publisher Runtime
library API.

Mike Koranda is a senior technical staff member in IBM's Software Group and has been working at IBM for more than 30 years. He has been working in the development of the InfoSphere Streams product for the past six years.

Before you start

This tutorial describes how to create a generic operator that can be
used from InfoSphere Streams applications to execute SPSS predictive
models. It also provides a sample operator that can be used directly
with any appropriate SPSS model and a sample Streams application that
demonstrates its use.

About this series

InfoSphere Streams is a platform that enables real-time analytics of
data in motion. The IBM SPSS family of products provides the ability to
build predictive analytic models. This "Integrating SPSS Model Scoring in InfoSphere Streams" series is for Streams developers who need to
leverage the powerful predictive models in a real-time scoring
environment.

About this tutorial

This tutorial extends the non-generic operator produced in Part
1, which presented a technique that is quite flexible
but requires some C++ programming skill to customize.

Objectives

In this tutorial, you learn how the non-generic operator is extended to
use the predictive model's XML metadata, allowing use of an SPSS
predictive model in Streams without requiring any C++ skills.

Prerequisites

This tutorial is written for Streams component developers and
application programmers who have Streams programming language skills
and C++ skills. Use the tutorial as a reference, or examine and
execute the samples in it to see the techniques described in action.
To execute the samples, you should have a
general familiarity with using a UNIX® command-line shell and working
knowledge of Streams programming.

System requirements

To run the examples, you need a Red Hat Enterprise Linux® box
with InfoSphere Streams V2.0 or later and IBM SPSS Modeler Solution
Publisher 14.2 fixpack 1, plus the Solution Publisher hot
fix, which is scheduled to be available 14 Oct 2011.

Overview

Introduction

This tutorial shows how to create a generic operator to execute SPSS
predictive models in an InfoSphere Streams application. It builds off
the work of the non-generic operator produced in Part
1, where we
wrote and used an InfoSphere Streams operator to execute an IBM SPSS
Modeler predictive model in an InfoSphere Streams application using
the IBM SPSS Modeler Solution Publisher Runtime library API.

Recap
from Part 1

In Part 1, we developed an operator that wrapped a specific
predictive model and would only work with the exact schema as shown in
the example developed. Recall that we showed how that non-generic
operator could be modified to accommodate different models and
schemas, but that required at least some C++ programming skill to
accomplish the necessary adjustments. Creating a generic operator that
can automatically adjust itself to different inputs and different
models will allow integration to be totally accomplished by a Streams
application programmer, eliminating the work and skill needed to
create individual operators for each different model to be used.

Roles and
terminology

The roles and terminology are provided in Part 1 and are not repeated
here. Our focus here is on the Streams component developer
role and how to write a generic operator that can be tailored by a
Streams application developer to execute SPSS
predictive models. For information about the other roles to understand
the work and interaction necessary for the overall solution, refer to
Part 1.

The contract
between data analyst and Streams component developer

Recall from Part 1 that in order to write the Streams operator, the
Streams component developer needs to know certain information about
the inputs and outputs of the predictive model produced by the data
analyst. Specifically, the operator developer will require:

Install location of Solution Publisher

The .pim and .par files produced during the publish

The input source node key name. This can be found in the XML
fragment:

<inputDataSources>
<inputDataSource name="file0" type="Delimited">

NOTE: While there is no technical limitation, our example is
limited to supporting a single input source for simplicity.

The input field names and storage types, and their order, as found inside
the <inputDataSource> tag.
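As a rough illustration of what the component developer reads out of the metadata, the sketch below pulls the input source key name and the input field names and storage types from a fragment like the one above. The element layout is an assumption based on that fragment, not the full SPSS metadata schema.

```python
# Hypothetical sketch: extract the input source key name and the input
# field names/storage types from the model's XML metadata fragment.
import xml.etree.ElementTree as ET

METADATA = """
<inputDataSources>
  <inputDataSource name="file0" type="Delimited">
    <field name="sex" storage="string"/>
    <field name="income" storage="integer"/>
  </inputDataSource>
</inputDataSources>
"""

root = ET.fromstring(METADATA)
source = root.find("inputDataSource")   # single input source only
key_name = source.get("name")
fields = [(f.get("name"), f.get("storage")) for f in source.findall("field")]
print(key_name)   # file0
print(fields)     # [('sex', 'string'), ('income', 'integer')]
```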

Also recall that to adjust the sample operator to work for a different input
tuple/model combination, you needed to adjust it in the following
spots:

_h.cgt file
Adjust the output structure

_cpp.cgt file

next_record_back —

Adjust the output structure

constructor

Adjust the .pim and .par filename and locations

Adjust the input and output file tags

Adjust the input field pointer array size

process

Adjust load input structure code

Adjust load output tuple code

Next, we describe the design of our generic operator, which will use
the metadata and the operator's parameters to automatically generate the
code that, in the non-generic operator, needed manual
adjustment.

Designing the operator

Designing the Streams generic scoring operator

At a conceptual level, what the generic operator needs to do is provide
parameters that a Streams application developer can use to tailor the
operator's use with different predictive models that have differing input and
output tuple formats. In this section, we will describe the
parameters our generic operator provides and the design decisions that
led to this set of parameters.

Our generic operator design starts with the decision to continue to only
support a single input port and a single output port.

Recall that the .pim and .par filenames in our non-generic operator were
hard coded in the .cgt file. To make our generic operator able to
accept any user-specified model files, these need to be passed
into the operator. Parameters will be used to allow their
specification similar to how the SPInstall location was handled in the
non-generic operator.

Recall that much of the information a Streams component developer
needed to build the non-generic operator was obtained by examining the
predictive model's XML metadata file. This file will be
programmatically read and used to dynamically generate the necessary
operator code when an application using the generic operator is
compiled. We will use a parameter similar to those used for the .pim
and .par files to allow the XML metadata file specification.

To allow a Streams application developer to use the generic
operator, a key requirement is the ability to specify which input
tuple data is used to feed the input fields of the model. In our
simple non-generic operator, we hard-coded this and required that
the tuple's attributes were compatible with the format expected by the
model's input fields. For our generic operator, we need to allow the
application developer to indicate how to populate these input fields
from the appropriate input data. To accomplish this, we define
two parameters, each a list. The first list will contain SPL expressions
that indicate the input values derived from the input tuple data to be
used to populate the model's input fields. The second list indicates
which model input field is populated from the corresponding expression
in the first list.
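The positional pairing of the two lists can be sketched in a few lines of Python. The names here are illustrative, not the operator's real parameter names; the expressions are shown as strings the way an application developer would write them in SPL.

```python
# Toy sketch of the two-list contract: expressions paired positionally
# with the model input field names they populate.
def map_inputs(expressions, field_names):
    if len(expressions) != len(field_names):
        raise ValueError("each expression needs a matching model input field")
    return dict(zip(field_names, expressions))

mapping = map_inputs(["sex", "baseSalary + bonusSalary"],  # SPL-style expressions
                     ["sex", "income"])                    # model field names
print(mapping["income"])   # baseSalary + bonusSalary
```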

By allowing SPL expressions, we allow greater flexibility in specifying
the data to be used. In the simplest case, it would be a tuple
attribute in the proper format — the same as was used in the
non-generic operator. But by allowing an SPL expression, it can now be
a complex expression made from several attributes to provide the
correct input data needed. It could be a simple type conversion from
the input tuple data available to the appropriate data type needed by
the model, a complex expression that uses several attributes to derive
the needed input field value, or it could be a literal value for cases
where the model requires data that is not in the input tuple and a
default value will suffice. While we could have gotten by with a
single list and required the application developer to ensure that the
expressions were listed in the exact order needed for the model, we
thought that to be too error-prone and chose to require the
application developer to provide the second parameter, which is an
explicit list of the mapping to the existing field names. The second
list parameter serves this purpose.

A similar set of information is needed to map from the model's output
fields to the output tuple. We felt the most natural Streams
implementation would be to specify this through the SPL output clause
and the use of a custom output function in the output port of the
operator. By doing this, we allow the application developer to choose
which output fields are used and which can be ignored. We chose to
allow both a basic assign function to populate the data in the output
tuple and one that allows for the specification of a default value
in cases where the model doesn't produce an output (missing data) for
a given set of inputs. This output value can take any SPL expression
so it could be built from existing attributes or be a literal value.

Now that we have our design planned out, the next section shows how
this design is implemented in the generic operator.

Writing the operator

Generic
scoring operator implementation

Now that we have an idea of the overall design for our operator, we
will look at the tasks necessary to implement it. First are
the mechanics of specifying the interfaces Streams application
developers will use. Then we describe the implementation code
that uses that information at compile time to produce the necessary
code.

Specifying the operator's public interface

To write the generic operator, we will describe the interface to
Streams application developers, then describe how
the Streams component developer uses this information in the generic
operator implementation. A Streams application developer using
this operator will need to have the ability to specify the following:

Model files

Model XML metadata

Input mapping

Output mapping

Specifying the model files

To allow the Streams application developer to use this operator with any
published predictive model, we need to allow the specification of the
published .pim and .par files. We do this by adding parameters and
providing the implementation that uses the parameters in the code
template. The parameters to specify the .pim and .par files are added to
the parameter definition section in the operator XML file,
SPSimple.xml (see Download). Listing 3 shows the definition for the .pim file. The
.par file is similar.

Specifying the model XML metadata

Many of the characteristics needed by the code (field names, number,
order, type) as well as the field tags used to indicate the input and
output fields are described in the predictive model's XML metadata.
For our generic operator, we use a parameter to pass the XML file.

Specifying the output mappings

The mapping of the output fields to output tuple attributes will be
done in the output clause of the SPL operator invocation. To support
this, we need to define the custom output functions. We define three
functions: a default that does not use the model outputs and
two variations of extracting the model output and populating the output
tuple attribute. The first assumes that if the model's
execution did not produce a value for the field, the output tuple will contain the
default value for that attribute type (for an integer, a value of
0). The second allows both the specification of the field and an
additional value representing an SPL expression to produce a default
value. Listing 6 shows the output function specifications.
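The behavior of the two populating output functions can be sketched as follows. `MISSING` and the function names are illustrative stand-ins, not the operator's real output-function API.

```python
# Hedged sketch of the two behaviors: a basic assign that falls back to
# the attribute type's default when the model produced no value, and a
# variant taking an explicit caller-supplied default.
MISSING = None
TYPE_DEFAULTS = {"integer": 0, "real": 0.0, "string": ""}

def from_model(value, storage):
    # Basic assign: type default (0 for an integer) on missing data.
    return TYPE_DEFAULTS[storage] if value is MISSING else value

def from_model_with_default(value, default):
    # Variant with an explicit default expression value.
    return default if value is MISSING else value

print(from_model(MISSING, "integer"))        # 0
print(from_model_with_default(MISSING, -1))  # -1
```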

Implementing

Now that the interfaces have been specified, we will describe how the
Streams component developer uses this information in the generic
operator implementation. The generic operator implementation will be
described for the following areas:

Common — Perl code used in the header code-generation
template and the CPP code-generation template

Header file — The header code-generation template implementation
area

Functions — The CPP code-generation template implementation of the
functions passed on the Solution Publisher interface

Process — The CPP code-generation template implementation of the
code called in the process method when an input tuple arrives on
this operator's input port

Common
use of the model XML metadata

Since the information we obtain from the XML metadata will be used in
both the H and CPP code-generation templates, we chose to implement a
common routine in SPCommon.pm to validate and parse the XML
metadata document. The SPCommon.pm file contains a function with the signature:

($infilename, $outfilename, @infields, @infieldtypes, $numinfields, @outfields, @outfieldtypes, $numoutfields) = processModelFile($model)
This function will verify the existence and format of the model file
parameter value, and parse the model file to find the input and
output filenames needed on the Solution Publisher API calls. It also returns
information for the input and output fields defined in the model —
specifically, the number of each and lists of field names and types.
This common routine gets called during the header and CPP
template processing steps. Below is an excerpt from the Perl code in
this common function.

You can see that first, the parameter's filename value is adjusted
and validated. Then we parse the XML contents to extract the necessary
information in a usable form for the Perl processing in the
code-generation templates. In the example code above, we only show the
extraction of the first input data source filename that the model
expects. Similar code for extracting the other information necessary
is omitted, but can be found in the ZIP file.
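As a rough cross-language illustration only (not the actual Perl from SPCommon.pm), the sketch below shows the kind of information a processModelFile-style routine returns: the input and output filenames for the Solution Publisher API calls plus field names, types, and counts. The element names are assumptions based on the metadata fragments shown earlier, and the filename-existence check is omitted.

```python
# Illustrative Python equivalent of the common parsing routine.
import xml.etree.ElementTree as ET

def process_model_file(xml_text):
    root = ET.fromstring(xml_text)
    insrc = root.find(".//inputDataSource")
    outsrc = root.find(".//outputDataSource")
    infields = [f.get("name") for f in insrc.findall(".//field")]
    intypes = [f.get("storage") for f in insrc.findall(".//field")]
    outfields = [f.get("name") for f in outsrc.findall(".//field")]
    outtypes = [f.get("storage") for f in outsrc.findall(".//field")]
    return (insrc.get("name"), outsrc.get("name"),
            infields, intypes, len(infields),
            outfields, outtypes, len(outfields))

SAMPLE = """
<metadata>
  <inputDataSources>
    <inputDataSource name="file0" type="Delimited">
      <field name="sex" storage="string"/>
      <field name="income" storage="integer"/>
    </inputDataSource>
  </inputDataSources>
  <outputDataSources>
    <outputDataSource name="file1" type="Delimited">
      <field name="prediction" storage="string"/>
      <field name="confidence" storage="real"/>
    </outputDataSource>
  </outputDataSources>
</metadata>
"""

result = process_model_file(SAMPLE)
print(result[0], result[1], result[4], result[7])  # file0 file1 2 2
```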

The header template code SPSimple_h.cgt calls this function in the
SPCommon.pm file to validate and populate the model variables as
shown below.

Function implementation

In the SPSimple_cpp.cgt template, we use the output field
information in the next_record_back function called from
the Solution Publisher runtime to provide the addresses of the
returned data to the operator so it can copy the data into the output
tuple.

Constructor implementation

The CPP template code SPSimple_cpp.cgt constructor directly uses
some of the parameter values supplied (pimfile, for example), as well as the
values obtained from the common routine's processing of the XML
metadata. Listing 11 shows how the parameter information is used directly
in the CPP's constructor.

You can see some validation of the parameter values taking
place, comparing the size of the lists for tuple attribute
values to input model fields and then the number of ports expected by
the operator. Then the actual mapping of stream attribute to model
input field is done, checking for valid type matching. Finally, the C++
code is generated to load, from the input tuple, the storage locations
accessible during model execution.
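The validation steps just described can be sketched as below. The type table and all names are assumptions for illustration; the real checks run in Perl at code-generation time.

```python
# Illustrative sketch of the generated constructor's checks: the
# expression list and field list must be the same length, and each
# expression's SPL type must be compatible with the storage of the
# model input field it feeds.
SPL_TO_STORAGE = {"rstring": "string", "int64": "integer", "float64": "real"}

def check_mapping(expr_types, field_names, field_storage):
    if len(expr_types) != len(field_names):
        raise ValueError("one expression required per model input field")
    for spl_type, field in zip(expr_types, field_names):
        if SPL_TO_STORAGE.get(spl_type) != field_storage[field]:
            raise TypeError("type mismatch for model input field " + field)

check_mapping(["rstring", "int64"], ["sex", "income"],
              {"sex": "string", "income": "integer"})
print("mapping validated")
```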

Processing the output fields

Listing 15 shows the code used to retrieve the output fields from the model
and populate the output tuple.

You can see that the output tuple attribute assignments are
evaluated and, for those that used the custom output function to
indicate they are populated from the model's output, the C++
assignment statement to the tuple attribute is generated.

The implementation of the
code-generation templates needed in the generic operator is complete. Next, we will show
how the operator is used in an SPL application.

Using the operator

Using the
scoring model operator in an InfoSphere Streams application

Again as in Part
1, we will use a simple SPL application to demonstrate
integrating a predictive model into a Streams application. We
use a file containing rows to be scored and use the InfoSphere Streams
FileSource operator to read in the information and produce a stream of
these tuples. We also write the scored tuples to an output file
one tuple at a time using the InfoSphere Streams FileSink operator.

We use the same basket rule model from Part 1, which takes two inputs, gender (a
string value of M or F) and income (an integer value), and produces a
string output indicating whether this input signals a preference
for purchasing the combination of beer, beans, and pizza. It also produces
a floating-point number representing the confidence of that
prediction. There is one slight difference this time in the input
data file: Rather than having a single value for income, we have a
file that contains two values — a base salary and a bonus salary that
must be added together to produce the desired income input value
required by the model.
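This difference is exactly what the SPL expression parameter accommodates: the model wants a single income value, but each input row carries a base salary and a bonus salary that must be added. A toy Python illustration (attribute names assumed):

```python
# Each input row has baseSalary and bonusSalary; the mapping expression
# adds them to produce the single income value the model expects.
rows = [
    {"sex": "M", "baseSalary": 10000, "bonusSalary": 2000},
    {"sex": "F", "baseSalary": 20000, "bonusSalary": 0},
]
model_inputs = [{"sex": r["sex"],
                 "income": r["baseSalary"] + r["bonusSalary"]} for r in rows]
print(model_inputs[0]["income"])   # 12000
```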

Running
the sample SPL application

Requirements and setup

In order to build and run the sample application, you need a
functioning InfoSphere Streams environment.

You also need to ensure that LD_LIBRARY_PATH is set on all
systems to which the Streams operator will be deployed, so that it contains
the necessary Solution Publisher libraries.

LD_LIBRARY_PATH requirement

Assuming Solution Publisher is installed in $INSTALL_PATH, the
LD_LIBRARY_PATH needs to include:

$INSTALL_PATH

$INSTALL_PATH/ext/bin/*

$INSTALL_PATH/jre/bin/classic

$INSTALL_PATH/jre/bin
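The four entries above can be composed into a single colon-separated LD_LIBRARY_PATH value. A small Python sketch of that composition (the function name is illustrative; `ext/bin/*` is a glob, so the real directories under ext/bin get expanded):

```python
# Compose the LD_LIBRARY_PATH entries listed above for a given
# Solution Publisher install path.
import glob
import os

def sp_library_path(install_path):
    entries = [install_path]
    entries += sorted(glob.glob(os.path.join(install_path, "ext/bin/*")))
    entries += [os.path.join(install_path, "jre/bin/classic"),
                os.path.join(install_path, "jre/bin")]
    return ":".join(entries)

# With no ext/bin directories present, only the fixed entries appear.
print(sp_library_path("/opt/IBM/SPSS/ModelerSolutionPublisher/14.2"))
```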

A script named ldlibrarypath.sh, included in the ZIP file, is provided to
set the path up correctly. If Solution Publisher is not installed in
the default location, change the first line of the
script to point to your Solution Publisher install directory before
using the script. For example, if Solution Publisher is installed in
/homes/hny1/koranda/IBM/SPSS/ModelerSolutionPublisher64P/14.2, then
set the first line of the script to that path.

Sample contents

The sample ZIP file contains the same .pim, .par, and XML files from the
Market Basket Analysis sample with a scoring branch added, a sample
input and expected output files, a complete Streams Programming
Language application Main.spl, and the generic operator SPSimple that
scores the Market Basket Analysis model.

We provide a simple SPL application in com.ibm.mpk/SPGeneric/Main.spl
that looks like Figure 1.

Figure 1. SPL
application

Adjusting and compiling the sample

To run the sample SPL application, unzip the SPGeneric.zip (see Download) file to your Linux system that has InfoSphere Streams and
Solution Publisher installed. If the Solution Publisher install
location is different from the default value of
/opt/IBM/SPSS/ModelerSolutionPublisher/14.2,
modify the operator XML file (SPSimple.xml) in the
com.ibm.mpk/SPSimple directory. You need to change the libPath and includePath entries to match your Solution Publisher install
location:

To see all the results produced, look in the mpkoutput.csv file
created by the FileSink. Note that the output file contains the input
values for base salary and bonus salary, as well as the combined income
value output from the model.


Summary

Results

This tutorial has shown how you can use a generic operator to wrap the
execution of any SPSS Modeler predictive model that is compatible
with the Solution Publisher API restrictions and makes sense to
execute one tuple at a time. This generic operator can be easily used
by a Streams application developer to execute the predictive model
against the streaming data.

Note that there are other ways to execute scoring models in InfoSphere
Streams, through PMML and the Streams Mining Toolkit. The direct
wrapping technique and integration with SPSS models through the
Solution Publisher interface provided here opens scoring up to a
much larger set of models than are supported through the PMML
integrations of the Mining Toolkit.
