Introduction

SystemML enables flexible, scalable machine learning. This flexibility is achieved through the specification of a high-level declarative machine learning language that comes in two flavors,
one with an R-like syntax (DML) and one with a Python-like syntax (PyDML).

Algorithm scripts written in DML and PyDML can be run on Hadoop, on Spark, or in Standalone mode.
No script modifications are required to change between modes. SystemML automatically performs advanced optimizations
based on data and cluster characteristics, largely reducing or eliminating the need to manually tweak algorithms.
To understand more about DML and PyDML, we recommend that you read Beginner’s Guide to DML and PyDML.

For the convenience of Python users, SystemML exposes several language-level APIs that allow Python users to use SystemML
and its algorithms without needing to know DML or PyDML. We explain these APIs in the sections below.

matrix class

The matrix class is an experimental feature that is often referred to as Python DSL.
It allows the user to perform linear algebra operations in SystemML using a NumPy-like interface.
It implements basic matrix operators and matrix functions, as well as converters to common Python
types (for example: NumPy arrays, PySpark DataFrames, and Pandas DataFrames).

The primary reason for supporting this API is to reduce the learning curve for an average Python user,
who is more likely to know the NumPy library than the DML language.

Lazy evaluation

By default, operations are evaluated lazily to avoid conversion overhead and to maximize the optimization scope.
To disable lazy evaluation, use the set_lazy method:

```python
>>> import systemml as sml
>>> import numpy as np
>>> m1 = sml.matrix(np.ones((3,3)) + 2)
Welcome to Apache SystemML!
>>> m2 = sml.matrix(np.ones((3,3)) + 3)
>>> np.add(m1, m2) + m1
# This matrix (mVar4) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPy() or toDF() or toPandas() methods.
mVar2 = load(" ", format="csv")
mVar1 = load(" ", format="csv")
mVar3 = mVar1 + mVar2
mVar4 = mVar3 + mVar1
save(mVar4, " ")
>>> sml.set_lazy(False)
>>> m1 = sml.matrix(np.ones((3,3)) + 2)
>>> m2 = sml.matrix(np.ones((3,3)) + 3)
>>> np.add(m1, m2) + m1
# This matrix (mVar8) is backed by NumPy array. To fetch the NumPy array, invoke toNumPy() method.
```

Since matrix is backed by lazy evaluation and uses a recursive Depth First Search (DFS),
you may run into RuntimeError: maximum recursion depth exceeded.
Please see the troubleshooting steps below.

Dealing with the loops

It is important to note that this API does not push down loops, which means the
SystemML engine essentially receives an unrolled DML script.
This can lead to two issues:

1. Since matrix is backed by lazy evaluation and uses a recursive Depth First Search (DFS),
you may run into RuntimeError: maximum recursion depth exceeded.
Please see the troubleshooting steps below.
2. Because the engine receives the fully unrolled script, long loops produce large scripts,
which increases parsing and optimization overhead.

The unrolling of the for loop is demonstrated by the following example:

```python
>>> import systemml as sml
>>> import numpy as np
>>> m1 = sml.matrix(np.ones((3,3)) + 2)
Welcome to Apache SystemML!
>>> m2 = sml.matrix(np.ones((3,3)) + 3)
>>> m3 = m1
>>> for i in range(5):
...     m3 = m1 * m3 + m1
...
>>> m3
# This matrix (mVar12) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPy() or toDF() or toPandas() methods.
mVar1 = load(" ", format="csv")
mVar3 = mVar1 * mVar1
mVar4 = mVar3 + mVar1
mVar5 = mVar1 * mVar4
mVar6 = mVar5 + mVar1
mVar7 = mVar1 * mVar6
mVar8 = mVar7 + mVar1
mVar9 = mVar1 * mVar8
mVar10 = mVar9 + mVar1
mVar11 = mVar1 * mVar10
mVar12 = mVar11 + mVar1
save(mVar12, " ")
```

We can reduce the impact of this unrolling by eagerly evaluating the variables inside the loop:
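To see why evaluating inside the loop helps, consider a toy lazy-expression class in plain Python (an illustrative sketch only; LazyExpr is hypothetical and not part of SystemML): with full laziness the pending script grows on every iteration, while evaluating inside the loop collapses it to a single variable.

```python
# Toy sketch of lazy evaluation: a LazyExpr accumulates an unevaluated
# "script" string, much like matrix accumulates unevaluated PyDML.
class LazyExpr:
    def __init__(self, script):
        self.script = script            # pending, unevaluated "script"

    def __mul__(self, other):
        return LazyExpr("(" + self.script + " * " + other.script + ")")

    def __add__(self, other):
        return LazyExpr("(" + self.script + " + " + other.script + ")")

    def eval(self):
        # Pretend we executed the script; the result is a plain variable.
        return LazyExpr("mVar")

m1 = LazyExpr("m1")

# Fully lazy: the pending script grows with every iteration (unrolling).
m3 = m1
for _ in range(5):
    m3 = m1 * m3 + m1
lazy_len = len(m3.script)

# Eager: evaluating inside the loop keeps the pending script small.
m3 = m1
for _ in range(5):
    m3 = (m1 * m3 + m1).eval()
eager_len = len(m3.script)

print(lazy_len > eager_len)  # True
```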

Here is an example that uses the above functions and trains a simple linear regression model:

```python
>>> import numpy as np
>>> from sklearn import datasets
>>> import systemml as sml
>>> # Load the diabetes dataset
>>> diabetes = datasets.load_diabetes()
>>> # Use only one feature
>>> diabetes_X = diabetes.data[:, np.newaxis, 2]
>>> # Split the data into training/testing sets
>>> X_train = diabetes_X[:-20]
>>> X_test = diabetes_X[-20:]
>>> # Split the targets into training/testing sets
>>> y_train = diabetes.target[:-20]
>>> y_test = diabetes.target[-20:]
>>> # Train Linear Regression model
>>> X = sml.matrix(X_train)
>>> y = sml.matrix(np.matrix(y_train).T)
>>> A = X.transpose().dot(X)
>>> b = X.transpose().dot(y)
>>> beta = sml.solve(A, b).toNumPy()
>>> y_predicted = X_test.dot(beta)
>>> print('Residual sum of squares: %.2f' % np.mean((y_predicted - y_test) ** 2))
Residual sum of squares: 25282.12
```

All of the above functions return a two-dimensional matrix; this is especially notable for aggregation functions that take an axis argument.
For example, assuming m1 is a matrix of shape (3, n), NumPy returns a 1d vector of dimension (3,) for the operation m1.sum(axis=1),
whereas SystemML returns a 2d matrix of dimension (3, 1).
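The NumPy side of this difference can be reproduced directly: passing keepdims=True to the aggregation yields the SystemML-style (3, 1) shape.

```python
import numpy as np

# NumPy drops the reduced axis by default, while SystemML's matrix keeps
# it; keepdims=True makes NumPy match the SystemML-style 2d shape.
m1 = np.ones((3, 5))
print(m1.sum(axis=1).shape)                 # (3,)
print(m1.sum(axis=1, keepdims=True).shape)  # (3, 1)
```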

Note: an evaluated matrix contains a data field, computed by the eval
method, as a DataFrame or NumPy array.

Support for NumPy’s universal functions

The matrix class also supports most of NumPy's universal functions (i.e., ufuncs).
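As an illustration of one dispatch mechanism such support can rely on (a hypothetical minimal wrapper, not SystemML's actual implementation), a class that defines __array_ufunc__ receives NumPy ufunc calls made on its instances:

```python
import numpy as np

class MyMat:
    # Hypothetical wrapper: NumPy hands ufunc calls involving MyMat
    # instances to __array_ufunc__ instead of computing them itself.
    def __init__(self, arr):
        self.arr = np.asarray(arr)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap MyMat arguments, apply the ufunc, and re-wrap the result.
        arrs = [x.arr if isinstance(x, MyMat) else x for x in inputs]
        return MyMat(getattr(ufunc, method)(*arrs, **kwargs))

a = MyMat(np.ones((2, 2)))
out = np.add(a, 1)          # dispatches to MyMat.__array_ufunc__
print(type(out).__name__)   # MyMat
```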

Design Decisions of matrix class (Developer documentation)

Until the eval() method is invoked, we create an AST (not exposed to
the user) that consists of unevaluated operations and the data
required by those operations. As an analogy, a Spark user can
treat the eval() method as similar to calling RDD.persist() followed by
RDD.count().

The AST consists of two kinds of nodes: either of type matrix or
of type DMLOp. Both of these classes expose a _visit method, which
helps in traversing the AST in a DFS manner.

A matrix object can either be evaluated or not. If evaluated,
the attribute 'data' is set to one of the supported types (for
example: NumPy array or DataFrame), and the attribute
'op' is set to None. If not evaluated, the attribute 'op' refers
to one of the intermediate nodes of the AST and is of type
DMLOp; in this case, the attribute 'data' is set to None.

DMLOp has an attribute 'inputs', which contains a list of matrix
objects or DMLOps.

To simplify the traversal, every matrix object is considered
immutable, and every matrix operation creates a new matrix object.
As an example: m1 = sml.matrix(np.ones((3,3))) creates a matrix
object backed by 'data=np.ones((3,3))'. m1 = m1 * 2 then
creates a new matrix object backed by 'op=DMLOp(...)'
whose input is the earlier created matrix object.
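The design described above can be sketched as a toy AST in plain Python (illustrative only; the real SystemML classes differ in detail):

```python
# Toy AST mirroring the design: matrix nodes are either evaluated
# ('data' set) or unevaluated ('op' set to a DMLOp); _visit does a DFS.
class DMLOp:
    def __init__(self, inputs, dml):
        self.inputs = inputs    # list of Matrix (or DMLOp) nodes
        self.dml = dml          # fragment of the generated script

class Matrix:
    def __init__(self, data=None, op=None):
        self.data = data        # set iff evaluated
        self.op = op            # set iff unevaluated

    def __mul__(self, other):
        # Immutability: operators return a NEW unevaluated Matrix.
        return Matrix(op=DMLOp([self, other], "*"))

    def _visit(self, lines):
        # DFS over the AST, collecting script fragments bottom-up.
        if self.op is not None:
            for node in self.op.inputs:
                node._visit(lines)
            lines.append(self.op.dml)
        return lines

m1 = Matrix(data=[[1.0]])
m2 = (m1 * m1) * m1             # builds an AST of two DMLOp nodes
print(m2._visit([]))            # ['*', '*']
```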

Left indexing (implemented in the __setitem__ method) is a
special case, where Python expects the existing object to be
mutated. To preserve the immutability property described above, we make a
deep copy of the existing object and point any references to the left-indexed
matrix to the newly created object. Then the left-indexed matrix
is set to be backed by a DMLOp consisting of the following PyDML:
left-indexed-matrix = new-deep-copied-matrix
left-indexed-matrix[index] = value

Please use m.print_ast() and/or type m for debugging.

MLContext API

The Spark MLContext API offers a programmatic interface for interacting with SystemML from Spark using languages such as Scala, Java, and Python.
As a result, it offers a convenient way to interact with SystemML from the Spark Shell and from Notebooks such as Jupyter and Zeppelin.

Usage

```python
from sklearn import datasets, neighbors
from pyspark.sql import DataFrame, SQLContext
import systemml as sml
import pandas as pd
import os, imp

sqlCtx = SQLContext(sc)
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target + 1
n_samples = len(X_digits)
# Split the data into training/testing sets and convert to PySpark DataFrame
X_df = sqlCtx.createDataFrame(pd.DataFrame(X_digits[:int(.9 * n_samples)]))
y_df = sqlCtx.createDataFrame(pd.DataFrame(y_digits[:int(.9 * n_samples)]))
ml = sml.MLContext(sc)
# Get the path of MultiLogReg.dml
scriptPath = os.path.join(imp.find_module("systemml")[1], 'systemml-java', 'scripts', 'algorithms', 'MultiLogReg.dml')
script = sml.dml(scriptPath).input(X=X_df, Y_vec=y_df).output("B_out")
beta = ml.execute(script).get('B_out').toNumPy()
```

mllearn API

The mllearn API is designed to be compatible with scikit-learn and MLlib.
The classes that are part of the mllearn API are LogisticRegression, LinearRegression, SVM, NaiveBayes,
and Caffe2DML.

Please note that when training using the mllearn API (i.e. model.fit(X_df)), SystemML
expects the labels to have been converted to 1-based values.
This avoids unnecessary decoding overhead for large datasets if the label column has already been decoded.
For the scikit-learn API, there is no such requirement.
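A minimal example of preparing such labels (plain NumPy, assuming 0-based labels as produced by common datasets):

```python
import numpy as np

# Shift 0-based class labels (e.g. from scikit-learn datasets) to the
# 1-based labels that mllearn's fit() expects.
y_zero_based = np.array([0, 1, 2, 1, 0])
y_one_based = y_zero_based + 1
print(y_one_based)  # [1 2 3 2 1]
```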

The table below describes the parameters available for the mllearn algorithms.
An 'X' indicates that the parameter is supported by the corresponding algorithm:

| Parameters | Description of the Parameters | LogisticRegression | LinearRegression | SVM | NaiveBayes |
| --- | --- | --- | --- | --- | --- |
| sparkSession | PySpark SparkSession | X | X | X | X |
| penalty | Used to specify the norm used in the penalization (default: 'l2') | only 'l2' supported | - | - | - |
| fit_intercept | Specifies whether to add intercept or not (default: True) | X | X | X | - |
| normalize | This parameter is ignored when fit_intercept is set to False (default: False) | X | X | X | - |
| max_iter | Maximum number of iterations (default: 100) | X | X | X | - |
| max_inner_iter | Maximum number of inner iterations, or 0 if no maximum limit provided (default: 0) | X | - | X | - |
| solver | Supports either 'newton-cg' or 'direct-solve' (default: 'newton-cg'). Depending on the size and the sparsity of the feature matrix, one or the other solver may be more efficient. The 'direct-solve' solver is more efficient when the number of features is relatively small (m < 1000) and input matrix X is either tall or fairly dense; otherwise the 'newton-cg' solver is more efficient. | - | X | - | - |
| is_multi_class | Specifies whether to use binary-class or multi-class classifier (default: False) | - | - | X | - |

Troubleshooting Python APIs

Unable to load SystemML.jar into current pyspark session.

When using SystemML's Python package through pyspark or a notebook, a SparkContext is already available
in the session and the method below is not required. However, if the user wishes to use SystemML through
spark-submit, the user needs to invoke the following function before using the matrix class:

systemml.defmatrix.setSparkContext(sc)

If SystemML was not installed via pip, you may have to download SystemML.jar and provide it to pyspark via --driver-class-path and --jars.

matrix API is running slow when set_lazy(False) or when eval() is called often.

This is a known issue. The matrix API is slow in this scenario due to the slow Py4J conversion from a Java MatrixObject or Java RDD to a Python NumPy array or DataFrame.
To work around this for now, we recommend writing the matrix to the file system and reading it back using the load function.

maximum recursion depth exceeded

SystemML matrix is backed by lazy evaluation and uses a recursive Depth First Search (DFS).
Python can throw RuntimeError: maximum recursion depth exceeded when the depth of the DFS recursion exceeds the limit
set by Python. There are two ways to address it:
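Raising the interpreter's recursion limit with Python's standard sys module is one common remedy (sketched below; the specific limit value is an assumption to tune for your workload), and evaluating intermediate matrices eagerly inside loops, as discussed in the "Dealing with the loops" section, avoids deep ASTs in the first place.

```python
import sys

# Raise Python's recursion limit so that a deep (but finite) DFS over a
# large lazily-built expression tree can complete. 10000 is an example
# value; choose one comfortably above your expression depth.
old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(10000)
print(old_limit, '->', sys.getrecursionlimit())
```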