Introducing the Joyent Manta Service

Joyent Manta is a highly scalable, distributed object storage service with integrated compute. Developers can store and process any amount of data at any time; a simple web API call replaces the need to spin up instances. Joyent Manta Compute is a complete, high-performance compute environment including R, Python, node.js, Perl, Ruby, Java, C/C++, ffmpeg, grep, awk, and others. Metering is by the second, with zero provisioning, data movement, or scheduling latency costs.

This page describes the service and how to get started. You can also skip
straight to some compute examples.

Sign Up

To use Joyent Manta Storage Service, you need a Joyent Cloud account. If
you don't already have an account, you can create one at
https://my.joyent.com/signup. You will
not be charged for Joyent Manta Storage Service until you use it.

Once you have signed up, you will need to add an SSH public key to your account. Joyent recommends using RSA keys, as the node-manta CLI programs will work with RSA keys both locally and with the SSH agent. DSA keys will only work if the private key is on the same system as the CLI and is not password-protected.
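
If you need to generate a key pair, something like the following will do (2048-bit RSA in the default location):

$ ssh-keygen -t rsa -b 2048
$ ssh-keygen -l -f ~/.ssh/id_rsa.pub

The first command creates ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub; the second prints the key's fingerprint, which you will use when setting up your environment below. The contents of ~/.ssh/id_rsa.pub are what you add to your account.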

An Integral Part of the Joyent Cloud

The Joyent Manta Service is just one of a family of services. Joyent Compute Services range from instances in our standard Persistent Compute Service (metered by the hour, month, or year) to our ephemeral Joyent Manta Compute Service (metered by the second). All are designed to work seamlessly with our Object Storage and Data Services.

Getting Started

This tutorial assumes you've signed up for a Joyent account and have an RSA public SSH key added to your account. We will cover installing the node.js SDK and CLI, setting up your shell environment variables, and then working through examples of creating directories, objects, links and finally running compute jobs on your data.

The CLI is the only tool used in these examples, and the instructions assume you're working from a Mac OS X, SmartOS, Linux, or BSD system and know how to use SSH and a terminal application such as Terminal.app. It helps to be familiar with basic Unix facilities like shells, pipes, stdin, and stdout.

Using the Mac OS X installer

If you do not have node.js installed, you can use the complete installer for Mac. It installs node.js 0.10.x, npm, node-manta, and smartdc.

If You Have node.js Installed

If you have at least node.js 0.8.x installed (0.10.x is recommended), you can install the CLI and SDK from an npm package. All of the examples below work with both node.js 0.8.x and 0.10.x.

sudo npm install manta -g

Additionally, as the API is JSON-based, the examples will refer to the
json tool, which renders JSON output in a more human-readable format. You can install it from npm:

sudo npm install json -g

Lastly, and optionally, if you want verbose debug logging with the
SDK, you will want bunyan:

sudo npm install bunyan -g

Setting Up Your Environment

While you can pass command line switches to all of the node-manta CLI
programs, it is significantly easier to set the common values globally in your
environment. There are four environment variables that all command line tools look for:
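
MANTA_URL           The HTTPS endpoint of the service (for example, https://us-east.manta.joyent.com)
MANTA_USER          Your Joyent Cloud login name
MANTA_KEY_ID        The fingerprint of the SSH key you added to your account
MANTA_TLS_INSECURE  Set to 1 only if your endpoint uses a self-signed certificate

A typical setup adds lines like these to ~/.bash_profile (the values are placeholders; the fingerprint command assumes your key is ~/.ssh/id_rsa.pub):

export MANTA_URL=https://us-east.manta.joyent.com
export MANTA_USER=jill
export MANTA_KEY_ID=$(ssh-keygen -l -f ~/.ssh/id_rsa.pub | awk '{print $2}')

Then source the file:

$ source ~/.bash_profile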

or restart your terminal to pick up the changes you made to ~/.bash_profile or ~/.bashrc.

Everything works if typing mls /$MANTA_USER/ returns the top-level contents:

mls /$MANTA_USER/
jobs/
public/
reports/
stor/

The shortcut ~~ is equivalent to typing /$MANTA_USER.
Since many operations require full Manta paths,
you'll find it useful. We will use it for the remainder
of this document.

mls ~~/
jobs/
public/
reports/
stor/

CLI

This Getting Started guide uses command line tools that are Joyent Manta Storage Service analogs of common Unix tools (e.g. mls == ls). You can find man pages for these tools in the CLI Utilities Reference.

Create Data

Now that you've signed up, installed the CLI, and set your environment variables,
you are ready to create data. In this section we will create an object, create a
subdirectory in which to place another object, and create a SnapLink to one of
those objects. These examples are written so that wherever you see a $, you can copy the command and paste it directly into Terminal.app.

If you're the kind of person who likes understanding "what all this is" before going through examples, you can read about the Storage Architecture in the Object Storage Reference. Feel free to pause here, go read that, and then come right back to this point.

Objects

Objects are the main entity you will use. An object is non-interpreted data of
any size that you read and write to the store. Objects are immutable. You cannot
append to them or edit them in place. When you overwrite an object, you
completely replace it.

By default, objects are replicated to two physical
servers, but you can specify between one and six copies, depending on your
needs. You will be charged for the number of bytes you consume, so specifying
one copy is half the price of two, with the trade-off being a decrease in potential durability and availability.

When you write an object, you give it a name. Object names (keys)
look like Unix file paths. This is how you would create an object named
~~/stor/hello-foo that contains the data in the file hello.txt:
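
For example, using a throwaway local file:

$ echo "Hello, Manta" > hello.txt
$ mput -f hello.txt ~~/stor/hello-foo

Here mput infers the MIME type from the file extension. The service fully supports streaming uploads, so you can also pipe data in without a local file at all (the source URL here is illustrative):

$ curl -sL http://www.gutenberg.org/ebooks/120.txt.utf-8 | mput -H 'content-type: text/plain' ~~/stor/treasure_island.txt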

In the streaming example above, we don't have a local file, so mput doesn't attempt to set the MIME type. To make sure our object is properly readable by a browser, we set the HTTP Content-Type header explicitly.

Now, about ~~/stor. Your "namespace" is /:login/stor.
This is where all of your data that you would like to keep private is stored.
In a moment we'll make some directories, but you can create any number of
objects and directories in this namespace without conflicting with other users.

In addition to /:login/stor, there is also /:login/public, which allows
unauthenticated reads over HTTP and HTTPS. This directory is useful for
hosting world-readable files, such as media assets you would serve through a CDN.
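
Anything under /:login/public can be fetched with a plain HTTP client and no signing. A quick sketch (the endpoint is whatever you set in MANTA_URL):

$ echo "Hello, world" | mput -H 'content-type: text/plain' ~~/public/hello
$ curl -s $MANTA_URL/$MANTA_USER/public/hello
Hello, world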

Directories

All objects can be stored in Unix-like directories. As you have seen, /:login/stor is the top
level directory. You can logically think of it like / in Unix environments.
You can create any number of directories and sub-directories, but a single
directory is limited to 1,000,000 entries. In addition to /:login/stor, there are a few other top-level
"directories" that are available to you.

Directory         Description

/:login/jobs      Job reports. Only you can read and destroy them; they are
                  written only by the system.

/:login/public    Public object storage. Anyone can access objects in this
                  directory and its subdirectories. Only you can create and
                  destroy them.

/:login/reports   Usage and access log reports. Only you can read and destroy
                  them; they are written only by the system.

/:login/stor      Private object storage. Only you can create, destroy, and
                  access objects in this directory and its subdirectories.

Directories are useful when you want to logically group objects (or other
directories) and be able to list them efficiently (including feeding all the objects
in a directory into parallelized compute jobs). Here are a few examples of
creating, listing, and deleting directories:
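
$ mmkdir ~~/stor/stuff
$ mls ~~/stor
hello-foo
stuff/
treasure_island.txt
$ mmkdir -p ~~/stor/stuff/foo/bar/baz
$ mls ~~/stor/stuff
foo/
$ mrmdir ~~/stor/stuff/foo/bar/baz
$ mrm -r ~~/stor/stuff

(The listings assume the objects created earlier. mls marks directories with a trailing slash; mrmdir removes only empty directories, while mrm -r removes a whole tree.)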

SnapLinks

SnapLinks are a concept unique to the Joyent Manta Storage Service. A SnapLink
is similar to a Unix hard link, and because the system is "copy on write," later
data changes are not reflected in the SnapLink. This property makes SnapLinks a
powerful mechanism for building any number of alternate names and versioning
schemes that you like.

As a concrete example, note what the following sequence of steps creates in
the objects foo and bar:
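
$ echo "Hello, Manta" | mput ~~/stor/foo
$ mln ~~/stor/foo ~~/stor/bar
$ echo "Goodbye, Manta" | mput ~~/stor/foo
$ mget ~~/stor/bar
Hello, Manta
$ mget ~~/stor/foo
Goodbye, Manta

Even though foo was overwritten, bar still returns the data foo held at the moment the SnapLink was created.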

Running Compute on Data

You have now seen how to work with objects, directories, and SnapLinks. Now it is
time to do some text processing.

The jobs facility is designed to support operations on an arbitrary number
of arbitrarily large objects. While performance considerations may dictate the
optimal object size, the system can scale to very large datasets.

You perform arbitrary compute tasks in an isolated OS instance, using MapReduce
to manage distributed processing. MapReduce is a technique for dividing work
across distributed servers, and dramatically reduces network bandwidth as the
code you want to run on objects is brought to the physical server that holds the
object(s), rather than transferring data to a processing host.

The MapReduce implementation is unique in that you are given a full
OS environment that allows you to run any code, as opposed to being bound to a
particular framework/language. To demonstrate this, we will compose a MapReduce
job purely using traditional Unix command line tools in the following examples.

Upload some datasets

First, let's get a few more books into our data collection so we're processing
more than one file:
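
The five downloads below are a sketch; the Project Gutenberg URLs are illustrative, and any plain-text files will do:

$ mmkdir ~~/stor/books
$ curl -sL http://www.gutenberg.org/ebooks/345.txt.utf-8 | mput -H 'content-type: text/plain' ~~/stor/books/dracula.txt
$ curl -sL http://www.gutenberg.org/ebooks/76.txt.utf-8 | mput -H 'content-type: text/plain' ~~/stor/books/huck_finn.txt
$ curl -sL http://www.gutenberg.org/ebooks/2701.txt.utf-8 | mput -H 'content-type: text/plain' ~~/stor/books/moby_dick.txt
$ curl -sL http://www.gutenberg.org/ebooks/1661.txt.utf-8 | mput -H 'content-type: text/plain' ~~/stor/books/sherlock_holmes.txt
$ curl -sL http://www.gutenberg.org/ebooks/120.txt.utf-8 | mput -H 'content-type: text/plain' ~~/stor/books/treasure_island.txt
$ mls ~~/stor/books
dracula.txt
huck_finn.txt
moby_dick.txt
sherlock_holmes.txt
treasure_island.txt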

mfind is powerful like Unix find, in that you specify a starting point and use
basic regular expressions to match on names. This is another way to list the
names of all the objects (-t o) that end in txt:
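
$ mfind -t o -n 'txt$' ~~/stor/books
/$MANTA_USER/stor/books/dracula.txt
/$MANTA_USER/stor/books/huck_finn.txt
/$MANTA_USER/stor/books/moby_dick.txt
/$MANTA_USER/stor/books/sherlock_holmes.txt
/$MANTA_USER/stor/books/treasure_island.txt

(mfind prints your actual login in place of $MANTA_USER.) With data in place, let's run our first job:

$ echo ~~/stor/books/dracula.txt | mjob create -o -m 'grep -ci vampire'
added 1 input to 7b39e12b-bb87-42a7-8c5f-deb9727fc362
32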

This command instructs the system to run grep -ci vampire on
~~/stor/books/dracula.txt. The -o
flag tells mjob create to wait for the job to complete and then fetch
and print the contents of the output objects. In this example, the result is 32.

In more detail: this command creates a job to run the user script grep -ci vampire on each input object, and then submits
~~/stor/books/dracula.txt as the only input to the job. The name of the
job is (in this case) 7b39e12b-bb87-42a7-8c5f-deb9727fc362. When the job completes,
the result is placed in an output object, which you can see with the mjob outputs
command:
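
$ mjob outputs 7b39e12b-bb87-42a7-8c5f-deb9727fc362
/$MANTA_USER/jobs/7b39e12b-bb87-42a7-8c5f-deb9727fc362/stor/$MANTA_USER/stor/books/dracula.txt.0.3b0b4ec6-08b1-4d02-a101-48c6a6f9fab8
$ mget $(mjob outputs 7b39e12b-bb87-42a7-8c5f-deb9727fc362)
32

(The exact name of the output object will differ.)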

Errors

If the grep command exits with a non-zero status (as grep does when it finds
no matches in the input stream) or fails in some other way (e.g., dumps core),
you'll see an error instead of an output object. You can get details on the
error, including a link to stdout, stderr, and the core file (if any), using the
mjob errors command.
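
For example, to inspect any errors from the job above (json -g groups the newline-separated error records into an array for readability):

$ mjob errors 7b39e12b-bb87-42a7-8c5f-deb9727fc362 | json -g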

Multiple phases and reduce tasks

We've just described the "map" phase of traditional map-reduce computations.
The "map" phase performs the same computation on each of the input objects. The
reduce phase typically combines the outputs from the map phase to produce a
single output.

One of the earlier examples computed the number of times the word "human" appeared
in each book. We can use a simple awk script in the reduce phase to get
the total number of times "human" appears in all the books.
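
A sketch of that job (the job id and total shown are illustrative):

$ mfind -t o -n 'txt$' ~~/stor/books | mjob create -o -m 'grep -ci human' -r "awk '{s+=\$1} END {print s}'"
added 5 inputs to 16a2d6e4-0e41-4d2f-96cf-6604b3f4cd27
84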

This job has two phases: the map phase runs grep -ci human on each input
object, then the reduce phase runs the awk script on the concatenated output
from the first phase. awk '{s+=$1} END {print s}' sums a list of numbers; here, those are the per-book counts emitted by the first phase. You can combine several map and reduce phases. The
outputs of any non-final phases become inputs for the next phase, and the
outputs of the final phase become job outputs.

While map phases always create one task for each input, reduce phases have a
fixed number of tasks (just one by default). While map tasks get the contents of
the input object on stdin as well as in a local file, reduce tasks only get a
concatenated stream of all inputs. The inputs may be combined in any order, but
data from separate inputs are never interleaved.

In the next example, we'll introduce ^ and ^^ as alternatives to the -m and -r flags, and see the first appearance of maggr.

Run a MapReduce Job to calculate the average word count

Now we have 5 classic novels uploaded, on which we can perform some basic
data analysis using nothing but Unix utilities. Let's first just see what
the "average" length is (by number of words), which we can do using just the
standard wc and the maggr command.
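
Here's the one-liner (the numeric result is illustrative):

$ mfind -t o -n 'txt$' ~~/stor/books | mjob create -o 'wc -w' ^^ 'maggr mean'
added 5 inputs to 38471b29-f52b-4f51-8d99-ec1d0be44b5b
147088.6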

Let's break down what just happened in that magical one-liner. First, we'll
look at the mjob create command. mjob create -o submits a new job, waits for
the job to finish, and then fetches and concatenates the outputs for you,
which is very useful for interactive ad-hoc queries. 'wc -w' ^^ 'maggr mean'
is a MapReduce definition with a "map" phase of wc -w and a "reduce"
phase of maggr mean. maggr is one of several tools in the compute instances that mirror similar Unix tools.

A "phase" is simply a command (or chain of commands) to execute on data. There
are two types of phases: map and reduce. Map phases run the given command on
every input object and stream the output to the next phase, which may be another
map phase or, more likely, a reduce phase. Reduce phases are run once, and concatenate
all data output from the previous phase.

The system runs your map-reduce
commands by invoking them in a new
bash shell. By default
your input data is available to your shell over stdin, and if you simply write
output data to stdout, it is captured and moved to the next phase (this is how
almost all standard Unix utilities work).

mjob create uses the symbols ^ and ^^ to act like the standard
Unix | (pipe) operator. The single ^
character indicates that the following command is part of the map phase. The double ^^
indicates that the following command is a reduce phase.

In this syntax, the first phase is
always a map phase. So the string 'wc -w' ^^ 'maggr mean' means
"execute wc -w on all objects given to the job" and
"then run maggr mean on the data output from wc -w."
maggr is a basic math utility function that is
part of the compute environment.
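
If you'd rather avoid maggr, ordinary Unix tools can compute the same mean. A sketch (the escaped \$ and \" survive the outer double quotes, so the reduce phase sees a normal awk program):

$ mfind -t o -n 'txt$' ~~/stor/books | mjob create -o 'wc -w' ^^ "awk '{s = s \$1 \"+\"} END {print \"(\" s \"0)/\" NR}' | bc"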

This builds a mathematical string that bc can evaluate: it sums the word counts
and then divides by the number of inputs, which awk retrieves dynamically via
NR. (bc does integer arithmetic by default, so the mean is truncated.)

Running Jobs Using Assets

Although the compute facility provides a full Joyent SmartOS environment,
your jobs may require special software, additional configuration information, or
any other static file that is useful. You can make these available as assets,
which are objects that are copied into the compute environment when your
job is run.

For example, suppose you want to do a word frequency count using scripts
that contain your map and reduce logic. We can do this with two awk scripts, so let's write
them and upload them as assets.
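
A sketch (the script names and the top-ten cutoff are our choices):

$ cat > map.awk <<'EOF'
{ for (i = 1; i <= NF; i++) counts[tolower($i)]++ }
END { for (w in counts) print counts[w], w }
EOF
$ cat > red.awk <<'EOF'
{ counts[$2] += $1 }
END { for (w in counts) print counts[w], w }
EOF
$ mput -f map.awk ~~/stor/map.awk
$ mput -f red.awk ~~/stor/red.awk
$ mfind -t o -n 'txt$' ~~/stor/books | mjob create -o \
    -s ~~/stor/map.awk -m "awk -f /assets/$MANTA_USER/stor/map.awk" \
    -s ~~/stor/red.awk -r "awk -f /assets/$MANTA_USER/stor/red.awk | sort -rn | head -10"

The map script emits "count word" pairs for each book, and the reduce script sums the counts across books before picking out the ten most frequent words.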

Note that assets are made available to you in the compute environment under the
path /assets/$MANTA_USER/stor/.... A more
sophisticated program would likely use a list of stopwords to get rid of
common words like "and" and "the", which could also be mapped in as an
asset.

Advanced Usage

This introduction gave you a basic overview of Joyent Manta Storage Service: how
to work with objects and how to use the system's compute environment to operate on
those objects. The system provides many more sophisticated features, including:

Running custom programs (instead of just the built-in programs)

Emitting multiple outputs from a task (useful for chunking up large objects)

Controlling the names of output objects (useful for jobs with side effects,
like creating image thumbs)

Multiple reducers in a single phase

Emitting objects by reference

Let's walk through some simple examples
of running node.js applications directly on the object store.
We'll be using some assets that are present in the mantademo account.
This is also a good example of how you can
run compute with and on data other people have
made available in their ~~/public directories.

We'll start with a "Hello, Manta" demo using node.js. You can see the script with an mget:
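
$ mget /mantademo/public/hello-manta-node.js
#!/usr/bin/env node
console.log('hello, manta!');

(The script body shown is representative.) If we create a map job with the script as an asset but don't provide any keys, the job just waits for input, as mjob get shows (the job id is illustrative):

$ mjob create -s /mantademo/public/hello-manta-node.js -m 'node /assets/mantademo/public/hello-manta-node.js'
3ec32136-b125-11e2-8487-1b418dd6974b
$ mjob get 3ec32136-b125-11e2-8487-1b418dd6974b | json state inputDone
running
false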

The inputDone field is "false" because we asked mjob to create a map phase, which requires at least one key, but we did not provide any keys. This is an artifact of the hello world example, but it makes an important point.

Let's cancel this job. In fact, let's cancel all jobs, so we can clean up anything we've left running from the examples above.

$ mjob list -s running | xargs mjob cancel

This also highlights that the CLI tools behave like normal Unix tools. The following two commands are equivalent.

$ mjob get `mjob list`
$ mjob list | xargs mjob get

Back to the node.js example: we can pipe hello-manta-node.js in as a key and run it as a map phase with the -m flag:
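
$ echo /mantademo/public/hello-manta-node.js | mjob create -o -s /mantademo/public/hello-manta-node.js -m 'node /assets/mantademo/public/hello-manta-node.js'
added 1 input to 70f2a3e9-e195-4f7e-9cbe-e5c52e0a69a3
hello, manta!

(Output is representative.)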

If you instead want to create a job without feeding it any keys, redirect stdin from /dev/null (mjob create -o ... </dev/null); that way mjob create knows not to attempt to read any additional keys.

Now let's take it up one more level. Let's look at what's inside a simple node.js application that capitalizes all the text in an input file (and swaps each period for an exclamation point).

$ mget /mantademo/public/capitalizer.js
#!/usr/bin/env node
process.stdin.on('data', function(d) {
    process.stdout.write(d.toString().replace(/\./g, '!').toUpperCase());
});
process.stdin.resume();
$ mget /mantademo/public/manta-desc.txt
Joyent Manta Storage Service is a cloud service that offers both a highly
available, highly durable object store and integrated compute. Application
developers can store and process any amount of data at any time, from any
location, without requiring additional compute resources.
$ echo /mantademo/public/manta-desc.txt | mjob create -o -s /mantademo/public/capitalizer.js -m 'node /assets/mantademo/public/capitalizer.js'
added 1 input to 2aa8a0a9-92e9-47f3-8b66-acf2a22d25a8
JOYENT MANTA STORAGE SERVICE IS A CLOUD SERVICE THAT OFFERS BOTH A HIGHLY
AVAILABLE, HIGHLY DURABLE OBJECT STORE AND INTEGRATED COMPUTE! APPLICATION
DEVELOPERS CAN STORE AND PROCESS ANY AMOUNT OF DATA AT ANY TIME, FROM ANY
LOCATION, WITHOUT REQUIRING ADDITIONAL COMPUTE RESOURCES!