Overview

SystemML enables flexible, scalable machine learning. This flexibility is achieved
through the specification of a high-level declarative machine learning language
that comes in two flavors, one with an R-like syntax (DML) and one with
a Python-like syntax (PyDML).

Algorithm scripts written in DML and PyDML can be run on Spark, on Hadoop, or
in Standalone mode. SystemML also features an MLContext API that allows SystemML
to be accessed via Scala or Python from a Spark Shell, a Jupyter Notebook, or a Zeppelin Notebook.

This Beginner’s Guide serves as a starting point for writing DML and PyDML
scripts.

Script Invocation

DML and PyDML scripts can be invoked in a variety of ways. Suppose that we have hello.dml and
hello.pydml scripts containing the following:

print('hello ' + $1)

One way to begin working with SystemML is to download a binary distribution of SystemML
and run it in standalone mode using the runStandaloneSystemML.sh script (Linux and macOS) or the
runStandaloneSystemML.bat script (Windows). The name of the DML or PyDML script is passed as the first argument,
followed by any additional arguments. PyDML invocation can be forced by adding the -python flag.

Data Types

SystemML has four value data types. In DML, these are: double, integer,
string, and boolean. In PyDML, these are: float, int,
str, and bool. In normal usage, the data type of a variable is inferred
from its value. Mathematical operations typically operate on
doubles/floats, whereas integers/ints are typically useful for tasks such as
iteration and accessing elements in a matrix.
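
As a small illustration of implicit typing, the DML sketch below assigns one value of each type and uses them together. The variable names and values are made up for this example.

```dml
# Types are inferred from the assigned values.
i = 3              # integer
d = 2.5            # double
s = "product: "    # string
b = TRUE           # boolean

# Arithmetic that mixes an integer with a double
x = i * d

# Booleans are typically used in conditions
if (b) {
    print(s + x)
}
```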

Matrix Basics

Creating a Matrix

A matrix can be created in DML using the matrix() function and in PyDML using the full()
function. An element accessed by indexing is still considered to be of the matrix data type,
so the value must be cast to a scalar in order to print it. Matrix element values are of type double/float.
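
The following DML sketch shows one way to create a matrix and print each element. The 4x3 shape and the values are chosen only for illustration; as.scalar() performs the cast described above.

```dml
# Build a 4x3 matrix from a string of values (filled row by row)
m = matrix("1 2 3 4 5 6 7 8 9 10 11 12", rows=4, cols=3)

# DML indexing is 1-based
for (i in 1:nrow(m)) {
    for (j in 1:ncol(m)) {
        n = m[i,j]   # n is still a 1x1 matrix
        print('[' + i + ',' + j + ']:' + as.scalar(n))
    }
}
```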

For additional information about the matrix() and full() functions, please see the
Matrix Construction section of the Language Reference. For information about the toString() function, see
the Other Built-In Functions section of the Language Reference.

Saving a Matrix

A matrix can be saved using the write() function in DML and the save() function in PyDML. SystemML supports four
different formats: text (i,j,v), mm (Matrix Market), csv (delimiter-separated values), and binary.

Saving a matrix automatically creates a metadata file for each format except Matrix Market, since a Matrix Market file
carries its metadata within the *.mm file itself. All formats are text-based except binary.
Note that the text (i,j,v) and mm (Matrix Market) formats index from 1, even when the matrix is written from PyDML, which
uses 0-based indexing.
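
A minimal DML sketch of saving the same matrix in all four formats follows. The output file names are assumptions made for this example.

```dml
# A 4x3 matrix that includes some zero-valued rows
m = matrix("1 2 3 0 0 0 7 8 9 0 0 0", rows=4, cols=3)

write(m, "m.txt", format="text")       # (i,j,v) text format
write(m, "m.mm", format="mm")          # Matrix Market
write(m, "m.csv", format="csv")        # delimiter-separated values
write(m, "m.binary", format="binary")  # binary format
```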

Loading a Matrix

A matrix can be loaded using the read() function in DML and the load() function in PyDML. As with saving, SystemML supports four
formats: text (i,j,v), mm (Matrix Market), csv (delimiter-separated values), and binary. To read a file, a corresponding
metadata file is normally required, except for the Matrix Market format. The metadata file can also be omitted if a format
parameter is passed to the read() or load() function.
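
The DML sketch below reads a matrix back in and prints a few aggregates. It assumes files named m.txt (with its metadata file) and m.csv already exist in the working directory, for instance from an earlier write() call; those names are illustrative.

```dml
# Read using the accompanying metadata file
m = read("m.txt")

print("min: " + min(m))
print("max: " + max(m))
print("sum: " + sum(m))

# rowSums() returns a column vector with one entry per row
mRowSums = rowSums(m)
for (i in 1:nrow(mRowSums)) {
    print("row " + i + " sum: " + as.scalar(mRowSums[i,1]))
}

# Read a CSV file, stating the format explicitly
c = read("m.csv", format="csv")
print("csv sum: " + sum(c))
```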

Matrix Operations

DML and PyDML offer a rich set of operators and built-in functions to perform various operations on matrices and scalars.
Operators and built-in functions are described in great detail in the Language Reference
(Expressions, Built-In Functions).

In this example, we create a matrix A. We then create another matrix B by adding 4 to each element in A, and flip
B by taking its transpose. We multiply A and B and store the product in matrix C. We create a matrix D with the same number
of rows and columns as C and initialize its elements to 5. Finally, we subtract D from C, divide each element of the result
by 2, and assign the resulting matrix to D.
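
The steps above can be sketched in DML as follows. The contents of A are an assumption for illustration; the text does not specify them.

```dml
A = matrix("1 2 3 4 5 6", rows=3, cols=2)
print(toString(A))

B = A + 4      # add 4 to every element of A
B = t(B)       # transpose: B is now 2x3
print(toString(B))

C = A %*% B    # matrix multiplication: (3x2) %*% (2x3) gives 3x3
print(toString(C))

D = matrix(5, rows=nrow(C), cols=ncol(C))  # same shape as C, filled with 5
D = (C - D) / 2                            # element-wise subtract, then divide
print(toString(D))
```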

Matrix Indexing

The elements in a matrix can be accessed by their row and column indices. In the example below, we have a 3x3 matrix A.
First, we access the element at the third row and third column. Next, we obtain a row slice (vector) of the matrix by
specifying the row and leaving the column blank. We obtain a column slice (vector) by leaving the row blank and specifying
the column. After that, we obtain a submatrix via range indexing, where we specify a range of rows and a range of columns,
each written as a start index and an end index separated by a colon.
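
A DML sketch of these four kinds of indexing is shown here. The values in A and the particular rows and columns selected are chosen only for illustration.

```dml
A = matrix("1 2 3 4 5 6 7 8 9", rows=3, cols=3)
print(toString(A))

B = A[3,3]       # single element (still a 1x1 matrix)
print(toString(B))

C = A[2,]        # row slice: second row
print(toString(C))

D = A[,3]        # column slice: third column
print(toString(D))

E = A[2:3,1:2]   # submatrix: rows 2 to 3, columns 1 to 2
print(toString(E))
```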

Control Statements

DML and PyDML feature three loop statements: while, for, and parfor (parallel for). In the example, note that the
print statements within the parfor loop can occur in any order since the iterations occur in parallel rather than
sequentially as in a regular for loop. The parfor statement can include several optional parameters, as described
in the Language Reference (ParFor Statement).
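
A minimal DML sketch of the three loop statements follows. The matrix A and the printed messages are assumptions for this example, and the while loop also includes an if/else for illustration.

```dml
# while loop with an if/else inside
i = 1
while (i < 3) {
    if (i == 1) {
        print('hello')
    } else {
        print('world')
    }
    i = i + 1
}

A = matrix("1 2 3 4 5 6", rows=3, cols=2)

# for loop: iterations run sequentially
for (i in 1:nrow(A)) {
    print("for A[" + i + ",1]:" + as.scalar(A[i,1]))
}

# parfor loop: iterations may run in parallel, so output order can vary
parfor (i in 1:nrow(A)) {
    print("parfor A[" + i + ",1]:" + as.scalar(A[i,1]))
}
```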

User-Defined Functions

In this example, a 3x2 matrix A of random doubles between 0 and 2 is created using the rand() function.
Additional parameters can be passed to rand() to control sparsity and other matrix characteristics.

Matrix A is passed to the doSomething function. Inside the function, a column of 1 values is concatenated to the matrix,
followed by a column consisting of the values (0, 1, 2). Next, a column consisting of the maximum value of each row
is concatenated, and then a column consisting of the row sums. The resulting
matrix is returned and assigned to variable B. Matrix A is written to the A.csv file and matrix B to the B.csv file.
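
The walkthrough above can be sketched in DML as follows. The parameter and variable names inside doSomething (mat, ret, additionalCol) are assumptions for this example. The seq(0, 2, 1) column has three entries, so the function as written only works for matrices with three rows.

```dml
doSomething = function(matrix[double] mat) return (matrix[double] ret) {
    additionalCol = matrix(1, rows=nrow(mat), cols=1)  # column of 1 values
    ret = cbind(mat, additionalCol)   # append the column of ones
    ret = cbind(ret, seq(0, 2, 1))    # append the column (0, 1, 2)
    ret = cbind(ret, rowMaxs(ret))    # append the maximum of each row
    ret = cbind(ret, rowSums(ret))    # append the sum of each row
}

A = rand(rows=3, cols=2, min=0, max=2)  # random 3x2 matrix with values in [0, 2]
B = doSomething(A)
write(A, "A.csv", format="csv")
write(B, "B.csv", format="csv")
```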

Command-Line Arguments and Default Values

Command-line arguments can be passed to DML and PyDML scripts either as named arguments or as positional arguments. Named
arguments are the preferred technique. Named arguments can be passed utilizing the -nvargs switch, and positional arguments
can be passed using the -args switch.

Default values can be set using the ifdef() function.

In the example below, a matrix is read from the file system using named argument M. The number of rows to print is specified
using the rowsToPrint argument, which defaults to 2 if no argument is supplied. Likewise, the number of columns is
specified using colsToPrint with a default value of 2.
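
A DML sketch of this script is shown below. The script name ex.dml and the input file M.txt in the sample invocation are hypothetical, and the input file is assumed to have a matching metadata file.

```dml
# Hypothetical invocation in standalone mode:
#   ./runStandaloneSystemML.sh ex.dml -nvargs M=M.txt rowsToPrint=1 colsToPrint=3

fileM = $M                                  # required named argument
numRowsToPrint = ifdef($rowsToPrint, 2)     # defaults to 2
numColsToPrint = ifdef($colsToPrint, 2)     # defaults to 2

m = read(fileM)
for (i in 1:numRowsToPrint) {
    for (j in 1:numColsToPrint) {
        print('[' + i + ',' + j + ']:' + as.scalar(m[i,j]))
    }
}
```

If rowsToPrint or colsToPrint is omitted from the command line, ifdef() supplies the default of 2, so only M has to be passed.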