Greenplum MADlib Extension for Analytics

About MADlib

MADlib is an open-source library for scalable in-database analytics. With the Greenplum
Database MADlib extension, you can use MADlib functionality in a Greenplum Database.

MADlib provides data-parallel implementations of mathematical, statistical and
machine-learning methods for structured and unstructured data. It provides a suite of
SQL-based algorithms for machine learning, data mining and statistics that run at scale
within a database engine, with no need for transferring data between Greenplum Database and
other tools.

MADlib can be used with PivotalR, an R package that enables users to interact with data
resident in Greenplum Database using the R client. See About MADlib, R, and PivotalR.

Installing MADlib

For Pivotal Greenplum Database, the MADlib extension is available as a package. Download
the package from Pivotal Network and then install it with the
Greenplum Package Manager (gppkg).

To install MADlib on Greenplum Database, you install the Greenplum MADlib package on
Greenplum Database and then install the MADlib function libraries on the databases that use
MADlib. For the versions of the MADlib extension supported by your version of Greenplum
Database, see the Greenplum Database Release Notes.

The gppkg utility installs Greenplum Database extensions, along with any
dependencies, on all hosts across a cluster. It also automatically installs extensions on
new hosts in the case of system expansion and segment recovery.

Note: On Greenplum Database 4.3.10.0 or later, install MADlib 1.10.

If you install or upgrade to MADlib 1.9.1 on Greenplum Database 4.3.10.0 or later, you must
run the MADlib script fix_madpack.sh, which fixes the madpack utility so that it works with
Greenplum Database 4.3.10.0 or later. MADlib 1.10 does not require the script. You can
provide the path to the MADlib installation with the --prefix option.

$ fix_madpack.sh --prefix madlib-installation-path

If you do not include the --prefix option, the script uses the location $GPHOME/madlib.

For information about gppkg, see the Greenplum Database Utility
Guide.

Installing the Greenplum Database MADlib Package

Before you install the MADlib package, make sure that your Greenplum database is running,
that you have sourced greenplum_path.sh, and that the
$MASTER_DATA_DIRECTORY and $GPHOME variables are set.
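
The variable part of these prerequisites can be checked with a short shell sketch. The
function name check_madlib_prereqs is hypothetical and not part of Greenplum Database. It
only confirms that the two variables are set and non-empty; it does not verify that the
database is running.

```shell
# Hypothetical helper: confirm that the variables named above are set.
# Returns 0 when both are set and non-empty, 1 otherwise.
check_madlib_prereqs() {
    missing=0
    for var in GPHOME MASTER_DATA_DIRECTORY; do
        # Indirect lookup through eval keeps this portable to POSIX sh.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "$var is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

if check_madlib_prereqs; then
    echo "environment variables are set"
fi
```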

Download the MADlib extension package from Pivotal Network, then copy it to the master host.

Install the software package by running the gppkg
command. This example installs the MADlib 1.10 package on a Linux
system:

$ gppkg -i madlib-ossv1.10_pv1.10_gpdb4.3orca-rhel5-x86_64.gppkg

Adding MADlib Functions to a Database

After installing the MADlib package, run the madpack command to add
MADlib functions to Greenplum Database. madpack is in
$GPHOME/madlib/bin.

For example, this command creates MADlib functions in the Greenplum database testdb
running on server mdw on port 5432. The madpack command logs in as the user gpadmin and
prompts for a password. The target schema is madlib.

$ madpack -s madlib -p greenplum -c gpadmin@mdw:5432/testdb install

After installing the functions, the Greenplum Database gpadmin superuser role should
grant all privileges on the target schema (madlib in the example) to
users who will access MADlib functions. Users without access to the functions
get the error ERROR: permission denied for schema MADlib.
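
As a sketch of the grant step, the following shell function prints the SQL statement for
one role. The function name madlib_grant_sql and the role name analyst are hypothetical.
The schema defaults to madlib, as in the madpack example. The function only prints the
statement so that it can be reviewed before a superuser runs it.

```shell
# Hypothetical helper: print a GRANT statement for one role.
# $1 is the role name; $2 is the schema (defaults to madlib).
madlib_grant_sql() {
    role=$1
    schema=${2:-madlib}
    printf 'GRANT ALL ON SCHEMA %s TO %s;\n' "$schema" "$role"
}

# Print the statement for a hypothetical role named analyst.
# To apply it, pipe the output to psql as a superuser, for example:
#   madlib_grant_sql analyst | psql -d testdb
madlib_grant_sql analyst
```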

Upgrading MADlib

If you upgrade to MADlib 1.9.1, you must run
the MADlib fix_madpack.sh script. MADlib 1.10 does not require the
script. See the Note in "Installing MADlib."

Upgrading a MADlib Package

To upgrade to MADlib 1.10, run the gppkg utility with the
-u option. This command upgrades an installed MADlib package to MADlib
1.10.

$ gppkg -u madlib-ossv1.10_pv1.10_gpdb4.3orca-rhel5-x86_64.gppkg

Upgrading MADlib Functions

After you upgrade the MADlib package, you run the madpack command to
upgrade the MADlib functions in Greenplum Database. For this example command, the MADlib
functions are installed in the schema madlib of the Greenplum Database
test. This command upgrades the MADlib functions in the database
schema.
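
The command itself does not appear in this section. As a hedged sketch, the following
assumes that madpack accepts an upgrade command in the same position as the install and
uninstall commands shown elsewhere in this document. The host mdw, port 5432, and user
gpadmin are borrowed from the install example and are assumptions for this step. The
hypothetical function madpack_upgrade_cmd only prints the command line; it does not run
madpack.

```shell
# Hypothetical helper: build the madpack upgrade command line without running it.
# Assumption: "upgrade" is passed the same way as "install" and "uninstall".
# $1 is the target schema; $2 is the connection string user@host:port/database.
madpack_upgrade_cmd() {
    schema=$1
    conn=$2
    printf 'madpack -s %s -p greenplum -c %s upgrade\n' "$schema" "$conn"
}

# Schema madlib in database test; host, port, and user are assumed values.
madpack_upgrade_cmd madlib gpadmin@mdw:5432/test
```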

Uninstalling MADlib

When you remove MADlib support from a database, routines that you created in the database
that use MADlib functionality will no longer work.

Remove MADlib objects from the database

Use the madpack uninstall command to remove MADlib objects from a
Greenplum database. For example, this command removes MADlib objects from the database
testdb.

$ madpack -s madlib -p greenplum -c gpadmin@mdw:5432/testdb uninstall

Uninstall the Greenplum Database MADlib Package

If no databases use the MADlib functions, use the Greenplum gppkg
utility with the -r option to uninstall the MADlib package. When removing
the package you must specify the package and version. This example uninstalls MADlib
package version 1.9.

$ gppkg -r madlib-ossv1.9_pv1.9.5_gpdb4.3orca

You can run the gppkg utility with the options -q --all
to list the installed extensions and their versions.

After you uninstall the package, restart the database.

$ gpstop -r

Example

This example demonstrates the association rules data mining technique on a transactional
data set. Association rule mining is a technique for discovering relationships between
variables in a large data set. This example considers items in a store that are commonly
purchased together. In addition to market basket analysis, association rules are also used
in bioinformatics, web analytics, and other fields.

The example uses the MADlib function MADlib.assoc_rules to analyze purchase information
for seven transactions that are stored in a table. The function assumes that the
data is stored in two columns, with a single item and transaction ID per row. A transaction
with multiple items consists of multiple rows, one row per item.
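
The two-column layout can be sketched as follows. The table name test_data, the column
names trans_id and product, the rows, and the thresholds 0.25 (minimum support) and 0.5
(minimum confidence) are all illustrative assumptions, not data from this document. The
argument order shown for the function follows the MADlib documentation as best recalled
and should be checked against the installed MADlib version. The hypothetical shell
function assoc_rules_sql only prints the SQL.

```shell
# Hypothetical helper: print SQL that loads a small transaction table and
# calls the association rules function. All names and rows are illustrative.
assoc_rules_sql() {
    cat <<'SQL'
CREATE TABLE test_data (trans_id INT, product TEXT);
-- One row per item; a transaction with several items spans several rows.
INSERT INTO test_data VALUES
    (1, 'beer'), (1, 'diapers'), (1, 'chips'),
    (2, 'beer'), (2, 'diapers'),
    (3, 'beer'), (3, 'diapers'),
    (4, 'beer'), (4, 'chips'),
    (5, 'beer'),
    (6, 'beer'), (6, 'diapers'), (6, 'chips'),
    (7, 'beer'), (7, 'diapers');
-- Arguments (assumed order): minimum support, minimum confidence,
-- transaction ID column, item column, input table,
-- output schema (NULL for the current schema), verbose flag.
SELECT * FROM madlib.assoc_rules(0.25, 0.5, 'trans_id', 'product',
                                 'test_data', NULL, TRUE);
SQL
}

# Review the SQL, or pipe it to psql: assoc_rules_sql | psql -d testdb
assoc_rules_sql
```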

References

PivotalR is a first-class R package that enables users to interact with data resident in
Greenplum Database and MADlib using an R client.

About MADlib, R, and PivotalR

The R language is an open-source language that is used for statistical computing.
PivotalR is an R package that enables users to interact with data resident in Greenplum
Database using the R client. Using PivotalR requires that MADlib is installed on the
Greenplum Database.

PivotalR allows R users to leverage the scalability and performance of in-database
analytics without leaving the R command line. The computational work is executed
in-database, while the end user benefits from the familiar R interface. Compared with
the corresponding native R functions, PivotalR offers greater scalability and shorter
running times. Furthermore, data movement, which can take hours for very large data sets,
is eliminated with PivotalR.

Key features of the PivotalR package:

Explore and manipulate data in the database with R syntax. SQL translation is
performed by PivotalR.