Seq Artifact – Overview and Guide

About

Seq is a programming language for computational genomics and bioinformatics. With a Python-compatible syntax and a host of domain-specific features and optimizations, Seq makes writing high-performance genomics software as easy as writing Python code, and achieves performance comparable to (and in many cases better than) C/C++.

We plan to release Seq by the end of Summer 2019. In the meantime, you can play around with an alpha version by following this guide!

Overview

This document describes the artifact submitted alongside the Seq paper and how to use it to run the benchmarks given in Section 6, for each of the various evaluated languages/compilers.

Performance results are difficult to reproduce accurately inside a VM, although in our experience performance in the VM has been roughly comparable to the paper’s results. We hope that this artifact will at least allow users to get familiar with Seq, play around with some real-world code examples, and see how it compares to the alternatives.

Vagrantfile Configuration

The Seq VM requires at least 2 GB of RAM to run. For experiments such as fasta and knucleotide, you should grant the VM at least 6 GB of RAM. snap requires 32 GB of RAM, while sga requires 64 GB of RAM. Allocate at least two cores to the VM if you wish to run parallel Seq code.
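The memory figures above can be captured in a small lookup, shown here as a minimal shell sketch. The function name required_ram_gb is hypothetical and not part of the artifact, and the 2 GB fallback for benchmarks without a stated figure is an assumption.

```shell
# Hypothetical helper (not part of the artifact): prints the minimum VM RAM
# in GB for a benchmark, using the figures quoted in this guide.
required_ram_gb() {
    case "$1" in
        fasta|knucleotide) echo 6 ;;
        snap)              echo 32 ;;
        sga)               echo 64 ;;
        # Assumption: benchmarks with no stated figure need only the 2 GB VM minimum.
        *)                 echo 2 ;;
    esac
}

required_ram_gb sga   # prints 64
```

The printed value is the amount to grant the VM in the Vagrantfile's memory setting before booting it.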

Datasets

Several benchmarks require large datasets to reproduce the results from the paper. Specifically, 16mer, rc, cpg, snap and sga require the whole-genome HG00123 reads available at http://cb.csail.mit.edu/cb/seq/HG00123.tar.bz2 (4.5 GB; 24 GB uncompressed). The VM already comes with a truncated HG00123 toy dataset that allows the benchmarks to run; download the full dataset only if you wish to reproduce the exact experiments outlined in the paper.

Note: All files are tarred and bzipped by default. Remember to untar them before running the experiments via tar jxvf archive.tar.bz2!

Note: All data files must be placed (or untarred) in the $HOME/data directory.
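Both notes can be combined into one step: extract the archive directly into the data directory. The following is a minimal sketch; the function name unpack_dataset is hypothetical and not part of the artifact.

```shell
# Hypothetical helper (not part of the artifact): unpacks a .tar.bz2 archive
# into the data directory that the benchmarks read from.
unpack_dataset() {
    archive="$1"
    data_dir="${2:-$HOME/data}"   # default location required by this guide

    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi

    mkdir -p "$data_dir" || return 1
    # x = extract, j = bzip2, f = archive file; -C selects the target directory
    tar -xjf "$archive" -C "$data_dir"
}

# Example (downloads 4.5 GB; -c resumes an interrupted download):
#   wget -c http://cb.csail.mit.edu/cb/seq/HG00123.tar.bz2
#   unpack_dataset HG00123.tar.bz2
```

Keep in mind that the full dataset expands to 24 GB, so the VM disk needs that much free space.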

Code

The source code of all benchmarks is located in the benchmarks/<benchmark_name> directory. For example, a C++ implementation of the cpg benchmark is available in benchmarks/cpg/cpg.cc; an idiomatic Seq implementation would be benchmarks/cpg/cpg.id.seq, while a parallel Seq version is located in benchmarks/cpg/cpg.par.seq.

Clang and g++ share the same implementations, as do Python, PyPy and Nuitka. In some cases, the Shedskin implementations differ slightly from the Python ones due to the feature gap between Python and Shedskin. Shedskin implementations can be found in <benchmark_name>_shed.py.
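The naming scheme described above can be sketched as a lookup from benchmark and variant to source file. The function name benchmark_source is hypothetical, and the sketch assumes that the cpg naming pattern carries over to the other benchmarks; not every benchmark has every variant, so a printed path may not exist.

```shell
# Hypothetical helper (not part of the artifact): maps a benchmark and a
# variant to the source file naming pattern described in this guide.
benchmark_source() {
    name="$1"
    variant="$2"
    case "$variant" in
        g++|clang) echo "benchmarks/$name/$name.cc" ;;       # shared C++ source
        seq-id)    echo "benchmarks/$name/$name.id.seq" ;;   # idiomatic Seq
        seq-par)   echo "benchmarks/$name/$name.par.seq" ;;  # parallel Seq
        # Assumption: the Shedskin file sits in the benchmark's own directory.
        shedskin)  echo "benchmarks/$name/${name}_shed.py" ;;
        *)         echo "no documented path for variant: $variant" >&2
                   return 1 ;;
    esac
}

benchmark_source cpg seq-id   # prints benchmarks/cpg/cpg.id.seq
```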

Step by Step Instructions

Aside from the datasets/indices described above, the VM comes preloaded with everything needed to run the benchmarks from Section 6 of the Seq paper. Once the VM boots, you can use the run.sh script to run any of the experiments:

./run.sh <experiment_name> <compiler>

experiment_name is one of the benchmarks presented in the paper, namely:

fasta

revcomp

knucleotide

cpg

16mer

rc

sga

snap

compiler is one of the tested compilers, namely:

all (runs the experiment with all compilers)

g++

clang

seq

seq-id

seq-par

python

nuitka

shedskin

pypy

julia

As noted in the paper, seq-id refers to idiomatic Seq code (i.e. code using features that are not Python-compatible). seq-par runs the parallel Seq versions of the cpg, 16mer and sga benchmarks. For the snap and sga benchmarks, seq-id indicates the use of prefetching.

Importantly, not all combinations of experiment+compiler exist! In particular (as also shown in the paper’s results):

For fasta, revcomp and knucleotide, we do not have separate idiomatic Seq implementations, so these all run with regular seq.

snap and sga only have Seq (seq, seq-id; sga also has seq-par) and C++ (clang, g++) implementations.

Parallel implementations (seq-par) only exist for cpg, 16mer and sga.
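The restrictions above can be encoded as a quick check to run before calling run.sh. This is a minimal sketch: the function name has_implementation is hypothetical, it does not validate the experiment or compiler names themselves, and it assumes that all is accepted for every experiment. run.sh remains the source of truth.

```shell
# Hypothetical helper (not part of the artifact): returns 0 when the
# experiment/compiler pair is one this guide describes as available.
has_implementation() {
    experiment="$1"
    compiler="$2"

    case "$compiler" in
        all)
            # Assumption: "all" is accepted for every experiment.
            return 0 ;;
        seq-par)
            # Parallel versions exist only for cpg, 16mer and sga.
            case "$experiment" in
                cpg|16mer|sga) return 0 ;;
                *)             return 1 ;;
            esac ;;
        seq-id)
            # fasta, revcomp and knucleotide have no separate idiomatic version.
            case "$experiment" in
                fasta|revcomp|knucleotide) return 1 ;;
                *)                         return 0 ;;
            esac ;;
        seq|g++|clang)
            return 0 ;;
        *)
            # Remaining compilers (python, nuitka, shedskin, pypy, julia):
            # snap and sga have only Seq and C++ implementations.
            case "$experiment" in
                snap|sga) return 1 ;;
                *)        return 0 ;;
            esac ;;
    esac
}

# List the pairs that would be worth running for sga.
for c in g++ seq seq-id seq-par python; do
    if has_implementation sga "$c"; then
        echo "./run.sh sga $c"
    fi
done
```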

Reproducing the Tables

In the home directory, there is a reproduce/ folder containing scripts to reproduce the tables in the paper. If you don’t see this folder (perhaps due to using an older VM instance), you can download it by running the following in the home directory:

wget -c http://cb.csail.mit.edu/cb/seq/reproduce.tgz -O - | tar -xz

For example, to reproduce Table 1, simply use

reproduce/table1

Likewise, use reproduce/table2 and reproduce/table3 for Tables 2 and 3, respectively. Note that Figure 15 simply shows the results from Tables 1 and 2 as bar charts.
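To run all three table scripts in one go and keep their output, a loop along the following lines may help. This is a minimal sketch: the function name reproduce_tables and the .log file names are hypothetical, and it assumes the scripts print their results to standard output.

```shell
# Hypothetical helper (not part of the artifact): runs each table script found
# in a directory and stores its output in <out_dir>/<script>.log.
reproduce_tables() {
    script_dir="${1:-reproduce}"   # relative to the home directory by default
    out_dir="${2:-.}"

    for table in table1 table2 table3; do
        if [ -x "$script_dir/$table" ]; then
            "$script_dir/$table" > "$out_dir/$table.log" 2>&1 \
                || echo "$table failed; see $out_dir/$table.log" >&2
        else
            echo "skipping $table: $script_dir/$table not found" >&2
        fi
    done
}

# Example, from the home directory:
#   reproduce_tables
```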

Playing with Seq

If you wish to play with the Seq compiler, you can run Seq code through the JIT by typing:

seq myfile.seq <program args>

Alternatively, you can produce a compiled executable by running

seq-compile myfile.seq myexec

and then run the executable via

./myexec <program args>

All other tools (g++, clang, nuitka, pypy, julia, shedskin and python) are already loaded in the default PATH.

Note: Parallelism will not work with the default Seq compiler. Parallel Seq is provided separately (executables are located in the $HOME/seq-tapir directory) due to ongoing dependency issues (LLVM+Tapir) that we hope to resolve in the near future. The environment variable OMP_NUM_THREADS controls the number of threads that Seq programs are allowed to consume. For more details, consult the run_seq_par function in run.sh.
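As an illustration of the OMP_NUM_THREADS setting, the sketch below sets the variable for a single command only. The function name with_threads is hypothetical and not part of the artifact; whether run.sh honors a value set this way depends on its run_seq_par function.

```shell
# Hypothetical helper (not part of the artifact): runs a command with
# OMP_NUM_THREADS set for that command only.
with_threads() {
    threads="$1"
    shift
    env OMP_NUM_THREADS="$threads" "$@"
}

with_threads 2 sh -c 'echo "threads: $OMP_NUM_THREADS"'   # prints threads: 2
```

For example, with_threads 2 ./run.sh cpg seq-par would request two threads, assuming run.sh does not override the variable.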

Note: The Seq build provided in the VM is a debug build, so you may see warnings when e.g. using atomic operations or shadowing variables/functions. It is safe to ignore these!

Troubleshooting

I see an error message about temporary space/directories/files.

The VM is probably out of space. You can remove output files from previous runs via rm -rf ~/out/*; this should free up some space.
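To confirm that disk space is the problem before deleting anything, the free space can be checked first. This is a minimal sketch; the function name free_kb is hypothetical, and the 1 GB threshold is an arbitrary illustrative value.

```shell
# Hypothetical helper (not part of the artifact): prints the free space, in
# KB, on the filesystem that holds the given directory.
free_kb() {
    # -P forces one output line per filesystem; -k reports 1024-byte blocks.
    df -Pk "${1:-$HOME}" | awk 'NR == 2 { print $4 }'
}

# Assumption: 1 GB (1048576 KB) is an arbitrary illustrative threshold.
if [ "$(free_kb "$HOME")" -lt 1048576 ]; then
    echo "low on disk space; consider: rm -rf ~/out/*"
fi
```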

I see a Seq assertion failure in list.seq.

The most likely cause is passing too few arguments to a program, which raises an exception when the argv list is accessed.