Pachyderm File System (PFS) allows you to store arbitrary data in files. These files can be as large as you’d like, and store any kind of information.

We wanted to use an interface to data that is familiar to everyone. Reading/writing data to a file is as familiar as you get.

Doing this on big data sets gets interesting, but a simple underlying interface makes interacting with the data more intuitive and more accessible to developers, whatever their language of choice.

We store each commit as only the data that changed from the prior commit, which is where PFS differs from Git. Storing your data this way enables incrementality and keeps PFS space efficient.
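To make the idea of delta-only commits concrete, here is a minimal sketch in Python. It is not how PFS is implemented; the `Commit` type, the `materialize` function, and the use of `None` to mark a deleted file are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical model (not the PFS data model): a commit maps a file path to
# its new contents, or to None if the file was deleted in that commit.
Commit = Dict[str, Optional[bytes]]


def materialize(commits: List[Commit], upto: int) -> Dict[str, bytes]:
    """Rebuild the files visible at commit index `upto` by replaying deltas."""
    state: Dict[str, bytes] = {}
    for delta in commits[: upto + 1]:
        for path, contents in delta.items():
            if contents is None:
                state.pop(path, None)   # file removed in this commit
            else:
                state[path] = contents  # file added or changed
    return state


commits: List[Commit] = [
    {"a.txt": b"hello", "b.txt": b"world"},  # commit 0: two new files
    {"b.txt": b"WORLD"},                     # commit 1: only b.txt changed
    {"a.txt": None, "c.txt": b"new"},        # commit 2: a.txt deleted, c.txt added
]
```

Each commit holds only what changed, so unchanged files cost no extra space, and any commit's full view can still be recovered by replaying the deltas in order.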

Under the hood, we store your files in sets of Blocks. These are smaller (usually ~8MB) chunks of your file. By storing your data in smaller chunks, we can more efficiently read and write your data in parallel.
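As a rough illustration of fixed-size chunking, the sketch below splits a byte string into blocks. The function name `split_into_blocks` and the constant `BLOCK_SIZE` are hypothetical, and the real block boundaries chosen by PFS may differ.

```python
from typing import Iterator

BLOCK_SIZE = 8 * 1024 * 1024  # roughly the ~8MB block size mentioned above


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Yield consecutive chunks of `data`, each at most `block_size` bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]


# A tiny block size keeps the example readable.
blocks = list(split_into_blocks(b"abcdefghij", block_size=4))
# blocks == [b"abcd", b"efgh", b"ij"]
```

Because each chunk is independent of the others, separate readers or writers can work on different chunks at the same time.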

Blocks also determine the smallest indivisible chunk of your data. When performing a map job, each File is seen by multiple containers. Each container sees one or more Blocks of a file.

This is important because this also determines the granularity of how the data is exposed as an input. Specifically, during a map job, each container will see a slice of your data file. That slice will be one or more Blocks.
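One way to picture the slices is as contiguous ranges of block indices handed to containers. The sketch below is only an assumption about what such a partition could look like; `assign_blocks` is a hypothetical helper, and the actual assignment made by Pachyderm is not specified here.

```python
from typing import List


def assign_blocks(num_blocks: int, num_containers: int) -> List[range]:
    """Split block indices 0..num_blocks-1 into contiguous, non-empty slices."""
    if num_containers <= 0:
        raise ValueError("num_containers must be positive")
    # Never use more containers than there are blocks: a block is indivisible.
    workers = min(num_blocks, num_containers)
    slices: List[range] = []
    start = 0
    for i in range(workers):
        # Spread the remainder over the first few containers.
        size = num_blocks // workers + (1 if i < num_blocks % workers else 0)
        slices.append(range(start, start + size))
        start += size
    return slices


slices = assign_blocks(7, 3)
# [list(s) for s in slices] == [[0, 1, 2], [3, 4], [5, 6]]
```

Note that a file with two blocks can be split across at most two containers in this model, which is why block boundaries set the granularity of the input.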

You can quickly see why simple line delimiting will not work for JSON data. If a block happens to end in the middle of a JSON object, a map job will see a partial, invalid JSON object.

To make sure your JSON data is delimited correctly, just make sure the file in question has a .json suffix. This tells PFS that the data being stored is JSON, and Pachyderm will make sure each Block consists of whole JSON objects.
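The following Python sketch shows one way whole-object blocks could be built from a stream of concatenated JSON objects. It is an illustration only, not the PFS implementation: `split_json_blocks` and `max_block_chars` are hypothetical names, and the size limit is measured in characters rather than bytes for simplicity.

```python
import json
from typing import List


def split_json_blocks(text: str, max_block_chars: int) -> List[str]:
    """Pack whole JSON values into blocks; a block is never cut mid-object."""
    decoder = json.JSONDecoder()
    blocks: List[str] = []
    current = ""
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        _, end = decoder.raw_decode(text, pos)  # end = index just past the value
        current += text[pos:end] + "\n"
        pos = end
        # Close the block only at an object boundary.
        if len(current) >= max_block_chars:
            blocks.append(current)
            current = ""
    if current:
        blocks.append(current)
    return blocks


# The first object spans two lines, so splitting on newlines would break it.
stream = '{"id": 1,\n "tags": ["a", "b"]}\n{"id": 2}\n{"id": 3}\n'
json_blocks = split_json_blocks(stream, max_block_chars=20)
```

With this input, the multi-line first object fills a block by itself and the two short objects share the second block, so every block parses cleanly.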

Since binary data doesn’t always have a static size, and can be quite large, delimiting binary data works a bit differently.

We enable this by treating every single write to that file as a separate block, no matter what the size. E.g. if you open /pfs/out/foo.bin and within your code write to it several times, each time you write the data will be treated as a separate block. This guarantees that a map job consuming your data will always see it at least at the granularity you have provided by your writes.
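If you write such a file from Python, keep in mind that the default buffered file object may merge several `write()` calls into one OS-level write. The sketch below, with the hypothetical helper `write_records`, opens the file unbuffered so that each record is issued as its own write. Whether each of those writes becomes exactly one block is determined by PFS as described above, not by this code.

```python
from typing import Iterable


def write_records(path: str, records: Iterable[bytes]) -> int:
    """Write each record with its own write() call; return the number of writes."""
    writes = 0
    # buffering=0 (allowed only in binary mode) disables Python's userspace
    # buffer, so the writes below are not coalesced before reaching the OS.
    with open(path, "wb", buffering=0) as f:
        for record in records:
            f.write(record)
            writes += 1
    return writes


# Inside a job, the call might look like this (path taken from the example above):
# write_records("/pfs/out/foo.bin", [b"first record", b"second record"])
```

Grouping the bytes that belong together into a single record before calling `write()` is what preserves the granularity you intend.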

To require PFS to delimit blocks in this fashion, make sure your file has the .bin suffix.

Mounting PFS locally is a great way to debug an issue, or poke around PFS to understand how it works.

To mount locally, run:

$ mkdir ~/pfs
$ pachctl mount ~/pfs &

(If ~/pfs already exists, you may need to umount it first.)

Now you can look around the local mount using ls or just point your browser at the local files:

# This is equivalent to `pachctl list-repo`
$ ls ~/pfs
foo
# This is equivalent to `pachctl list-commit foo`
$ ls ~/pfs/foo
master/0 master/1
# This is equivalent to a call to `pachctl get-file ...`
$ cat ~/pfs/foo/master/0/test.txt
# And this is similar to `pachctl list-file ...`. It allows you to see all files in a commit:
$ ls ~/pfs/foo/master/0/
test.txt

Using this interface, you can grep, touch, ls, etc. the files as you normally would. The one exception is that you cannot write data to a commit that is finished.
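Because the mount behaves like an ordinary directory tree, regular file APIs work on it too. As a small sketch, the hypothetical helper `grep_tree` below searches every file under a directory for a substring. It uses only the Python standard library and assumes nothing about PFS beyond the mount being readable.

```python
import os
from typing import Iterator, Tuple


def grep_tree(root: str, needle: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (path, line_number, line) for each line under `root` containing `needle`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going if a file is not valid text.
            with open(path, "r", errors="replace") as f:
                for lineno, line in enumerate(f, start=1):
                    if needle in line:
                        yield path, lineno, line.rstrip("\n")


# With the mount from the example above, a call might look like:
# for hit in grep_tree(os.path.expanduser("~/pfs/foo/master/0"), "error"):
#     print(hit)
```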