ARepA Tutorial

ARepA QUICKSTART

Welcome to the ARepA quickstart tutorial. For a more thorough description of ARepA, always refer to the full ARepA README (http://huttenhower.sph.harvard.edu/arepa/manual). Here, we will walk through a small toy example to get you used to ARepA's various functionalities. First, download and extract ARepA, making sure that all required dependencies are installed on your machine. Installing Sleipnir (http://huttenhower.sph.harvard.edu/content/getting-started-sleipnir) is highly recommended if you want the complete set of ARepA features.

What is the ARepA build process?

The scons command in the root directory of ARepA launches all processes across all submodules (repositories). For instructional purposes, however, ARepA is better understood by looking at the subcomponents of its complete build process. We will break down the build process into sequential components.

1. Build components necessary for submodules

For you to be able to fetch data from a certain repository, say Bacteriome, you will first need to tell ARepA to build certain components that are shared across all the repositories. This process only needs to be completed once per change in the taxonomy input.

For this tutorial we will be getting E. coli data. The organism of interest is specified in the etc/taxa file.

The command "scons -k tmp/" instructs ARepA to only build files that will be saved in the tmp directory. Any output following "scons:" in the terminal signifies a message from the build process of ARepA (provided by SCons, a make-like software build tool that ARepA utilizes to handle its hierarchical dependency tracking). In particular, you will always see "scons: done building targets" after some process in ARepA has finished.
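Because build messages carry the "scons:" prefix, they are easy to separate from everything else printed to the terminal. The following is a minimal sketch in Python, assuming the build output has already been captured as text; the function names are hypothetical helpers and not part of ARepA, and the sample text is illustrative rather than a real ARepA log.

```python
def scons_messages(output):
    """Return the lines of captured build output that come from SCons itself."""
    stripped = (line.strip() for line in output.splitlines())
    return [line for line in stripped if line.startswith("scons:")]


def build_finished(output):
    """Return True if SCons reported that it finished building its targets."""
    return any("done building targets" in msg for msg in scons_messages(output))


# Illustrative text only, not a real ARepA log.
sample = (
    "scons: Building targets ...\n"
    "output from some other tool\n"
    "scons: done building targets.\n"
)
print(scons_messages(sample))
print(build_finished(sample))
```

A check like this only tells you that the build process ended, not that every target was built successfully.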

2. Build an external submodule

An internal submodule is a submodule that is associated with a repository; this is where data handling for a specific repository is done (e.g. Bacteriome). An external submodule is one that performs tasks that apply globally across ARepA. One example is the external submodule dedicated to the standardization of gene identifiers ("gene mapping"), called the "GeneMapper" submodule. As before, its process is launched by typing the "scons" command.

3. Get data from Bacteriome

You should now be familiar with how to launch a submodule. To get data from Bacteriome, simply (you guessed it) launch scons in the Bacteriome submodule:

$ cd Bacteriome
$ scons -kj4

Important: the -k flag ensures that ARepA continues to build when it encounters errors; the -j4 flag tells it to run up to 4 jobs at once.
In general, it is not advisable to run more jobs than the number of cores in the machine. For instance, if you have a dual-core processor, you would type scons -kj2.
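The rule of thumb above can be sketched in Python: pick a value for -j that never exceeds the number of cores reported by the operating system. This is only an illustration; scons_command is a hypothetical helper, not something ARepA provides, and it passes -k and -j as separate arguments, which is equivalent to the combined -kj4 form.

```python
import os


def scons_command(max_jobs=None):
    """Build a scons argument list with -k and a job count capped at the core count."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    jobs = cores if max_jobs is None else max(1, min(max_jobs, cores))
    return ["scons", "-k", f"-j{jobs}"]


# Ask for at most 4 jobs; fewer are used on machines with fewer cores.
print(" ".join(scons_command(4)))
```

The resulting list can be handed to subprocess.run from inside the submodule directory if you prefer to launch the build from a script.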

The final output data file is always named after the repository (or dataset), with either a .dat or .pcl extension. The output metadata file has a .pkl extension. Here we assume that Sleipnir is correctly installed on the machine.
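To check what a build produced, the naming convention can be applied directly to a directory listing. The sketch below is a hedged illustration in Python; classify_outputs and the file names in the demonstration are made up for this example and do not come from ARepA.

```python
import os
import tempfile

DATA_EXTENSIONS = {".dat", ".pcl"}  # final output data
METADATA_EXTENSION = ".pkl"         # output metadata


def classify_outputs(directory):
    """Split the file names in a directory into data and metadata by extension."""
    data, metadata = [], []
    for name in sorted(os.listdir(directory)):
        ext = os.path.splitext(name)[1]
        if ext in DATA_EXTENSIONS:
            data.append(name)
        elif ext == METADATA_EXTENSION:
            metadata.append(name)
    return data, metadata


# Demonstration on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("example.dat", "example.pkl", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    data, metadata = classify_outputs(tmp)
    print(data, metadata)
```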

4. Get data from GEO

GEO is the most complex ARepA module, allowing for the construction of very flexible pipelines to download and process data. In particular, you can specify the names of individual GSE/GDS datasets without having to download the entirety of the datasets for that particular taxonomy (E. coli is the running example). Let's take a look at its configuration file.

License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.