In this guide you’re going to create a Pachyderm pipeline to process
transaction logs from a fruit stand. We’ll use two standard Unix tools, grep
and awk, to do our processing. Thanks to Pachyderm’s processing system we’ll
be able to run the pipeline in a distributed, streaming fashion. As new data is
added, the pipeline will automatically process it and materialize the results.

A repo is the highest-level primitive in the Pachyderm file system (pfs). Like all primitives in pfs, it shares its name with a primitive in Git and is designed to behave analogously. Generally, repos should be dedicated to a single source of data such as log messages from a particular service, a users table, or training data for an ML model. Repos are dirt cheap so don’t be shy about making tons of them.

For this demo, we’ll simply create a repo called
“data” to hold the data we want to process:

$ pachctl create-repo data
# See the repo we just created
$ pachctl list-repo
data

Now that we’ve created a repo it’s time to add some data. In Pachyderm, you write data to an explicit commit (again, similar to Git). Commits are immutable snapshots of your data which give Pachyderm its version control properties. Files can be added, removed, or updated in a given commit and then you can view a diff of those changes compared to a previous commit.

Let’s start by just adding a file to a new commit. We’ve provided a sample data file for you to use in our GitHub repo – it’s a list of purchases from a fruit stand.

We’ll use the put-file command along with two flags, -c and -f. The -f flag can take either a local file or a URL; in our case, it points to the sample data on GitHub.

Unlike Git though, commits in Pachyderm must be explicitly started and finished as they can contain huge amounts of data and we don’t want that much “dirty” data hanging around in an unpersisted state. The -c flag we used above specifies that we want to start a new commit, add data, and finish the commit in a convenient one-liner.

Finally, we can see the data we just added to Pachyderm.

# If we list the repos, we can see that there is now data
$ pachctl list-repo
NAME CREATED SIZE
data 12 minutes ago 874 B
# We can view the commit we just created
$ pachctl list-commit data
BRANCH REPO/ID PARENT STARTED FINISHED SIZE
master data/master/0 <none> 6 minutes ago 6 minutes ago 874 B
# We can also view the contents of the file that we just added
$ pachctl get-file data master sales
orange 4
banana 2
banana 9
orange 9
...

Now that we’ve got some data in our repo, it’s time to do something with it.
Pipelines are the core primitive for Pachyderm’s processing system (pps) and
they’re specified with a JSON encoding. For this example, we’ve already created the pipeline for you and it can be found at examples/fruit_stand/pipeline.json on GitHub. Please open a new tab to view the pipeline while we talk through it.

When you want to create your own pipelines later, you can refer to the full Pipeline Specification to use more advanced options. This includes building your own code into a container instead of just using simple shell commands as we’re doing here.

For now, we’re going to create a pipeline with two transformations in it. The first transformation filters the sales logs into separate records for apples,
oranges and bananas. The second transformation sums these sales numbers into a final sales count.

In the first step of this pipeline, we are grepping for the terms “apple”, “orange”, and “banana” and writing each matching line to the corresponding file. Notice we read data from /pfs/data (/pfs/[input_repo_name]) and write data to /pfs/out/. These are special local directories that Pachyderm creates within the container for you. All the input data will be found in /pfs/[input_repo_name] and your code should always write to /pfs/out.
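
To get a feel for what the filter step does, here is a minimal sketch you can run locally without Pachyderm. It is not the contents of pipeline.json. The function name filter_sales, the temporary directories standing in for /pfs/data and /pfs/out, and the sample lines are all assumptions made for illustration.

```shell
#!/bin/sh
# filter_sales SALES_FILE OUT_DIR  (hypothetical helper, not part of Pachyderm)
# Writes every line that mentions a fruit to OUT_DIR/<fruit>.
# Inside a Pachyderm container the arguments would be a file under
# /pfs/data and the directory /pfs/out; local paths are used here so the
# sketch runs anywhere.
filter_sales() {
    for fruit in apple orange banana; do
        # grep exits non-zero when nothing matches; that is not an error here.
        grep "$fruit" "$1" > "$2/$fruit" || true
    done
}

# Usage with made-up sample lines in the same shape as the sales file.
work=$(mktemp -d)
mkdir -p "$work/data" "$work/out"
printf 'orange 4\nbanana 2\nbanana 9\napple 3\n' > "$work/data/sales"
filter_sales "$work/data/sales" "$work/out"
cat "$work/out/banana"
```

With the sample lines above, the “banana” file ends up holding the two banana purchases, and “apple” and “orange” hold one line each.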

The second step of this pipeline takes each file, removes the fruit name, and sums up the purchases. The output of our complete pipeline is three files, one for each type of fruit with a single number showing the total quantity sold.
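
The summing step can be sketched the same way. The snippet below is a local illustration under stated assumptions, not the command from pipeline.json: sum_fruit is a hypothetical helper, and the input file is made-up sample data in the “fruit quantity” shape shown earlier.

```shell
#!/bin/sh
# sum_fruit FILE  (hypothetical helper, not part of Pachyderm)
# Ignores the fruit name in column 1 and prints the total of column 2.
sum_fruit() {
    # "total + 0" makes awk print 0 rather than an empty line for an empty file.
    awk '{ total += $2 } END { print total + 0 }' "$1"
}

# Usage: sum a filtered file that contains only banana purchases.
sum_work=$(mktemp -d)
printf 'banana 2\nbanana 9\n' > "$sum_work/banana"
sum_fruit "$sum_work/banana"
```

For the two sample lines this prints 11, which is the kind of single number each output file of the “sum” step contains.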

Creating a pipeline tells Pachyderm to run your code on every finished
commit in a repo as well as all future commits that happen after the pipeline is created. Our repo already had a commit, so Pachyderm automatically
launched a job to process that data.

Every pipeline creates a corresponding repo with the same name where it stores its output results. In our example, the “filter” transformation created a repo called “filter” which was the input to the “sum” transformation. The “sum” repo contains the final output files.

Pipelines will also automatically process the data from new commits as they are
created. Think of pipelines as being subscribed to any new commits that are
finished on their input repo(s). Also similar to Git, commits have a parental
structure that tracks how files change over time. In this case we’re going to be adding more data to the same file, “sales”.

In our fruit stand example, this could be making a commit every hour with all the new purchases that happened in that timeframe.

Let’s create a new commit with our previous commit as the parent and add more sample data (set2.txt) to “sales”:

Adding a new commit of data will automatically trigger the pipeline to run on
the new data we’ve added. We’ll see a corresponding commit to the output
“sum” repo with files “apple”, “orange” and “banana”, each containing the cumulative total of purchases. Let’s read the “apple” file again and see the new total number of apples sold.

$ pachctl get-file sum e4060e15948c4b7b89947a02eace5dca/1 apple
324

One thing that’s interesting to note is that our pipeline is completely incremental. Since grep is a map operation, Pachyderm will only grep the new data from set2.txt instead of re-filtering all the data. If you look back at the “sum” pipeline, you’ll notice its method setting and that our code uses /pfs/prev to compute the sum incrementally based upon our previous commit. You can learn more about incrementality in our advanced Incrementality docs.
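
The idea behind the incremental sum can be sketched locally as “previous total plus the sum of only the new lines.” This is an illustration of the concept, not the actual pipeline code: incremental_sum is a hypothetical helper, and the way the previous total is stored in a plain file is an assumption (the real layout of /pfs/prev is described in the Incrementality docs).

```shell
#!/bin/sh
# incremental_sum NEW_FILE PREV_TOTAL_FILE  (hypothetical helper)
# Adds the quantities in NEW_FILE to the total stored in PREV_TOTAL_FILE.
# In a pipeline, PREV_TOTAL_FILE would play the role of a file under
# /pfs/prev; if it does not exist, the previous total is taken to be 0.
incremental_sum() {
    prev_total=0
    if [ -f "$2" ]; then
        prev_total=$(cat "$2")
    fi
    awk -v prev="$prev_total" '{ total += $2 } END { print prev + total }' "$1"
}

# Usage with made-up numbers: an earlier total of 10, then two new purchases.
inc_work=$(mktemp -d)
printf '10\n' > "$inc_work/prev_apple"
printf 'apple 3\napple 4\n' > "$inc_work/new_apple"
incremental_sum "$inc_work/new_apple" "$inc_work/prev_apple"
```

With these sample values the result is 17. Only the two new lines are read, so the older purchases never have to be summed again.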

Another nifty feature of Pachyderm is that you can mount the file system locally to poke around and explore your data using FUSE. FUSE comes pre-installed on most Linux distributions. For OS X, you’ll need to install OSX FUSE.

The first thing we need to do is mount Pachyderm’s filesystem (pfs).

First create the mount point:

$ mkdir ~/pfs

And then mount it:

# We background this process because it blocks.
$ pachctl mount ~/pfs &

This will mount pfs on ~/pfs. You can then inspect the filesystem like you would any
other local filesystem, for example by using ls or pointing your browser at it.