This tutorial provides a quick introduction to using SystemML by
running existing SystemML algorithms in standalone mode.

What is SystemML

SystemML enables large-scale machine learning (ML) via a high-level declarative
language with R-like syntax called DML and
Python-like syntax called PyDML. DML and PyDML allow data scientists to
express their ML algorithms with full flexibility but without the need to fine-tune
distributed runtime execution plans and system configurations.
These ML programs are dynamically compiled and optimized based on data
and cluster characteristics using rule-based and cost-based optimization techniques.
The compiler automatically generates hybrid runtime execution plans ranging
from in-memory, single node execution to distributed computation for Hadoop
or Spark Batch execution.
SystemML features a suite of algorithms for Descriptive Statistics, Classification,
Clustering, Regression, Matrix Factorization, and Survival Analysis. Detailed descriptions of these
algorithms can be found in the Algorithms Reference.

Standalone vs Distributed Execution Mode

SystemML’s standalone mode is designed to allow data scientists to rapidly prototype algorithms
on a single machine. In standalone mode, all operations occur on a single node in a non-Hadoop
environment. Standalone mode is not appropriate for large datasets.

For large-scale production environments, SystemML algorithm execution can be
distributed across multi-node clusters using Apache Hadoop
or Apache Spark.
We will make use of standalone mode throughout this tutorial.

The Haberman Data Set
has 306 instances and 4 attributes (including the class attribute):

Age of patient at time of operation (numerical)

Patient’s year of operation (year - 1900, numerical)

Number of positive axillary nodes detected (numerical)

Survival status (class attribute)

1 = the patient survived 5 years or longer

2 = the patient died within 5 years

We will need to create a metadata (MTD) file, which stores information
about the content of the data file. The MTD file associated with the
data file <filename> must be named <filename>.mtd.
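For the Haberman data file above, a minimal MTD file (haberman.data.mtd) might look like the following. This is a sketch assuming a plain comma-separated layout; the rows, cols, and format fields must match your actual data file:

```json
{
    "rows": 306,
    "cols": 4,
    "format": "csv",
    "sep": ","
}
```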

The following table lists the number and name of each univariate statistic. The row
numbers below correspond to the elements of the first column in the output
matrix above. A “+” indicates that the statistic applies to scale and/or
categorical features.

| Row | Name of Statistic          | Scale | Categ. |
|-----|----------------------------|-------|--------|
| 1   | Minimum                    | +     |        |
| 2   | Maximum                    | +     |        |
| 3   | Range                      | +     |        |
| 4   | Mean                       | +     |        |
| 5   | Variance                   | +     |        |
| 6   | Standard deviation         | +     |        |
| 7   | Standard error of mean     | +     |        |
| 8   | Coefficient of variation   | +     |        |
| 9   | Skewness                   | +     |        |
| 10  | Kurtosis                   | +     |        |
| 11  | Standard error of skewness | +     |        |
| 12  | Standard error of kurtosis | +     |        |
| 13  | Median                     | +     |        |
| 14  | Inter quartile mean        | +     |        |
| 15  | Number of categories       |       | +      |
| 16  | Mode                       |       | +      |
| 17  | Number of modes            |       | +      |
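To build intuition for the scale statistics in rows 1 through 6 and 13, here is a small sketch using only Python's standard library on a hypothetical column of ages (the values are made up for illustration and are not taken from haberman.data):

```python
import statistics

# Hypothetical sample of the "age" attribute; illustrative values only.
ages = [30, 34, 38, 42, 47, 53, 61, 65]

# A few of the scale statistics listed in the table above.
stats = {
    "Minimum": min(ages),
    "Maximum": max(ages),
    "Range": max(ages) - min(ages),
    "Mean": statistics.mean(ages),
    "Variance": statistics.variance(ages),          # sample variance
    "Standard deviation": statistics.stdev(ages),
    "Median": statistics.median(ages),
}
print(stats["Mean"])    # 46.25
```

Univar-Stats.dml computes all seventeen statistics in one pass over the data; this sketch only mirrors a handful of them.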

Example 2 - Binary-class Support Vector Machines

Let’s take the same haberman.data to explore the
binary-class support vector machines algorithm l2-svm.dml.
This example also illustrates how to use the sampling algorithm sample.dml
and the data split algorithm splitXY.dml.

Sampling the Test Data

First we need to use the sample.dml algorithm to separate the input into one
training data set and one data set for model prediction.

Example 3 - Linear Regression

For this example, we’ll use a standalone wrapper executable, bin/systemml, that is available to
be run directly within the project’s source directory when built locally.

After you build SystemML from source (mvn clean package), the standalone mode can be executed
either on Linux or OS X using the ./bin/systemml script, or on Windows using the
.\bin\systemml.bat batch file.

If you run the script from the project root folder ./ or from the ./bin folder, then the
output files from running SystemML will be created inside the ./temp folder to keep them separate
from the SystemML source files managed by Git. The output files for this example will therefore
be created under the ./temp folder.

The runtime behavior and logging behavior of SystemML can be customized by editing the files
./conf/SystemML-config.xml and ./conf/log4j.properties. Both files will be created from their
corresponding *.template files during the first execution of the SystemML executable script.

When invoking ./bin/systemml or .\bin\systemml.bat with any of the prepackaged DML scripts,
you can omit the relative path to the DML script file. The following two commands are equivalent:

In this guide we invoke the command with the relative folder to make it easier to look up the source
of the DML scripts.

Linear Regression Example

As an example of the capabilities and power of SystemML and DML, let’s consider the Linear Regression algorithm.
We require sets of data to train and test our model. To obtain this data, we can either use real data or
generate data for our algorithm. The
UCI Machine Learning Repository is one source of real data.
Use of real data typically involves some degree of data wrangling. In the following example, we will use SystemML to
generate random data to train and test our model.

SystemML is distributed in several packages, including a standalone package. We’ll operate in Standalone mode in this
example.

Run DML Script to Generate Random Data

We can execute the genLinearRegressionData.dml script in Standalone mode using either the systemml or systemml.bat
file.
In this example, we’ll generate a matrix of 1000 rows and 50 columns of test data with
sparsity 0.7 (that is, roughly 70% of the entries are nonzero). In addition, a 51st
column consisting of labels will be appended to the matrix.
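As a rough illustration of what such a generator produces (this is a hypothetical Python stand-in with made-up names, not the genLinearRegressionData.dml script itself), consider:

```python
import random

def gen_data(rows, cols, sparsity, seed=42):
    """Generate a rows x cols matrix where roughly `sparsity` of the
    entries are nonzero, plus an appended label column produced by a
    random linear model."""
    rng = random.Random(seed)
    weights = [rng.uniform(-5, 5) for _ in range(cols)]
    data = []
    for _ in range(rows):
        x = [rng.random() if rng.random() < sparsity else 0.0
             for _ in range(cols)]
        label = sum(w * v for w, v in zip(weights, x))
        data.append(x + [label])  # the 51st column holds the label
    return data

data = gen_data(1000, 50, 0.7)
```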

Divide Generated Data into Two Sample Groups

Next, we’ll create two subsets of the generated data, each of size ~50%. We can accomplish this using the sample.dml
script with the perc.csv file created in the previous step:

0.5
0.5

The sample.dml script will randomly sample rows from the linRegData.csv file and place them into 2 files based
on the percentages specified in perc.csv. This will create two sample groups of roughly 50 percent each.

The file 1 contains the first partition of data, and the file 2 contains the second partition.
An associated metadata file describes the nature of each partition. If we open 1 and 2 and look
at the number of rows, we can see that the partitions are typically not exactly 50% but close
to it. However, the total number of rows in the original data file equals the sum of the number
of rows in 1 and 2.
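Conceptually, this percentage-based partitioning can be sketched in a few lines of Python (a simplified stand-in for sample.dml, with hypothetical function and variable names):

```python
import random

def sample_partition(rows, percentages, seed=7):
    """Randomly assign each row to one partition, choosing partition i
    with probability proportional to percentages[i]."""
    rng = random.Random(seed)
    # Cumulative boundaries, e.g. [0.5, 1.0] for a 50/50 split.
    bounds, total = [], 0.0
    for p in percentages:
        total += p
        bounds.append(total)
    parts = [[] for _ in percentages]
    for row in rows:
        r = rng.random() * total
        for i, b in enumerate(bounds):
            if r <= b:
                parts[i].append(row)
                break
    return parts

rows = list(range(1000))
part1, part2 = sample_partition(rows, [0.5, 0.5])
```

With a 0.5/0.5 split, each partition lands near, but rarely exactly at, half of the rows, while the two partitions together always account for every row.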

Split Label Column from First Sample

The next task is to split the label column from the first sample. We can do this using the splitXY.dml script.
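The effect of the split can be sketched as follows (a conceptual illustration, not the splitXY.dml implementation; here the label is assumed to be the last column):

```python
# Separate the label column (the last column) from the feature columns.
data = [
    [1.0, 2.0, 0.5, 1.0],   # last entry is the label
    [0.3, 1.2, 2.2, 0.0],
]
X = [row[:-1] for row in data]  # feature matrix
y = [row[-1] for row in data]   # label vector
```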

Train Model on First Sample

Now, we can train our model based on the first sample. To do this, we utilize the LinearRegDS.dml (Linear Regression
Direct Solve) script. Note that SystemML also includes a LinearRegCG.dml (Linear Regression Conjugate Gradient)
algorithm for situations where the number of features is large.
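For intuition about what "direct solve" means: when the number of features is small, the normal equations (X^T X) b = X^T y can be solved in closed form. The single-feature case below is a hand-rolled sketch of that idea (not the LinearRegDS.dml code); conjugate gradient instead iterates toward the solution, which scales better as the feature count grows:

```python
# Fit y = b0 + b1 * x by solving the 2x2 normal equations directly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]   # roughly y = 2x
n = len(xs)
sx = sum(xs)
sy = sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
# Closed-form solution for slope b1 and intercept b0.
b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b0 = (sy - b1 * sx) / n
# b1 comes out close to 2 and b0 close to 0, matching the data.
```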

Now that we have our betas.csv, we can test our model with our second set of data.

Test Model on Second Sample

To test our model on the second sample, we can use the GLM-predict.dml script. This script can be used for both
prediction and scoring. Here, we’re using it for scoring since we include the Y named argument. Our betas.csv
file is specified as the B named argument.
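Scoring in this sense can be sketched as: apply the learned betas to the held-out features and measure how far the predictions fall from the known labels. The snippet below is a conceptual illustration with made-up numbers, not the GLM-predict.dml logic:

```python
# Betas learned from training: [intercept, slope], illustrative values.
betas = [0.05, 1.99]
X_test = [[2.5], [3.5]]   # held-out feature rows
y_test = [5.0, 7.1]       # known labels for those rows
# Predict: intercept plus the dot product of the remaining betas with each row.
preds = [betas[0] + sum(b * x for b, x in zip(betas[1:], row))
         for row in X_test]
# Score: sum of squared errors between predictions and known labels.
sse = sum((p - y) ** 2 for p, y in zip(preds, y_test))
# preds land close to the true labels 5.0 and 7.1, so sse is small.
```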