To follow along with this tutorial, you should have working knowledge of the Unix command line and Bash.
If you are not familiar with these, we have included links to several free tutorials below.
Note that they are sorted in ascending order of time investment.

You should also be somewhat familiar with Amazon Web Services (AWS), and have an AWS account.
If you do not feel comfortable with using AWS, we recommend that you consider reading through this tutorial.
An Azure account may also be used.
You can learn more about using Azure here.

To run Aether, you will need a copy of Python (Version 2.7).
If you have not already installed Python, we recommend installing Anaconda, which comes with Python and several core packages.
Instructions for installing Anaconda are available here.

To install Aether, you first need to acquire a copy of it from GitHub.
To do this, you may either clone the git repository, or download an archive.
To clone the git repository, run the following in the command line:

git clone git@github.com:kosticlab/aether.git

Alternatively, you can download an archive of Aether here.
To unpack the archive, run the following:

tar -xvf aether-master.tar.gz
mv aether-master aether

Once you have acquired a copy of Aether, change into its directory.
We herein refer to this directory, or folder, as the "Aether directory".
If you have Anaconda installed, you can set up Aether's Python dependencies from within the Aether directory.
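The exact setup command may differ between releases; a minimal sketch, assuming Aether ships a standard requirements.txt (an assumption; check the repository for the actual instructions), is:

pip install -r requirements.txt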

Aether can be run in several different modes.
The three primary modes are interactive mode, non-interactive mode, and "dry run" mode.
The interactive mode will prompt the user for the various inputs, whereas the non-interactive mode will pull this information from the command line arguments.
Lastly, the "dry run" mode will show the user what resources Aether would bid on, as well as their cost.
However, "dry run" mode will not actually run anything on the cloud, and thus will not result in any cloud resources being used.
Information about running Aether with these modes is located below.
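For example, a dry run can be started from the Aether directory with the documented --dry-run flag:

aether --dry-run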

When Aether is run in interactive mode, it will prompt you for the various program parameters.
Because Aether's optimization method requires a number of parameters, the interactive mode is highly recommended for new users.
To run interactive mode, simply run the following in the Aether directory:
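
aether --interactive

The short form aether -I is equivalent, per the --help output below.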

The non-interactive mode will not prompt the user for any input information, and is generally not recommended for new users.
To run Aether in non-interactive mode, simply run the following in the Aether directory:
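
aether --input-file args.txt --script job.sh --bin-dir bin --data data --name myproject --processors 2 --memory 4 --region us-east-1 --key-ID YOUR_KEY_ID --key YOUR_KEY

The values above are illustrative placeholders; substitute your own, and see the --help output below for what each option means.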

As always, additional information about Aether's command line arguments may be found by running:

aether --help

This will output the following:

Usage: aether [OPTIONS]

  The Aether Command Line Interface

Options:
  -I, --interactive             Enables interactive mode.
  --dry-run                     Runs Aether in dry-run mode. This shows what
                                cloud computing resources Aether would use,
                                but does not actually use them or perform any
                                computation.
  -A, --input-file TEXT         The name of a text file, wherein each line
                                corresponds to an argument passed to one of
                                the distributed batch jobs.
  -L, --provisioning-file TEXT  Filename of the provisioning file.
  -P, --processors INTEGER      The number of cores that each batch job
                                requires.
  -M, --memory INTEGER          The amount of memory, in Gigabytes, that each
                                batch job will require.
  -N, --name TEXT               The name of the project. This should be
                                unique, as an S3 bucket is created on Amazon
                                for this project, and they must have unique
                                names.
  -E, --key-ID TEXT             Cloud CLI Access Key ID.
  -K, --key TEXT                Cloud CLI Access Key.
  -R, --region TEXT             The region/datacenter that the pipeline
                                should be run in (e.g. "us-east-1").
  -B, --bin-dir TEXT            The directory with applications runnable on
                                the cloud image that are dependencies for
                                your batch jobs. Paths in your scripts must
                                be reachable from the top level of this
                                directory.
  -S, --script TEXT             The script to be run for every line in
                                input-file and distributed across the
                                cluster.
  -D, --data TEXT               The directory of any data that the job
                                script will need to access.
  --help                        Show this message and exit.

Not all of the data that Aether needs to run can be securely accessed automatically.
In particular, we do not access your private AWS account information, and instead require you to input it yourself.
We provide details on how to find this data in the sections below.

When you run Aether, you will be prompted for some information on account limits, as AWS does not allow them to be programmatically retrieved.
In the .gif below, we show a demonstration of where to access this information on the AWS website.

These account limits are automatically saved in the instances.p file and can be supplied to the bidder on subsequent runs to save time.

Once you have entered account limits into Aether, it will begin solving the multi-objective optimization problem of selecting the optimal bidding strategy.
This is a computationally intensive process, during which Aether iteratively performs a number of high-dimensional convex optimizations.
On a lightweight computer (e.g., an older laptop), this may impact the performance of other running programs.

After running Aether, you will find that it has generated a new file, named prov.psv.
This file contains the provisioning information for the batch processing pipeline, which is the second component of Aether.
We turn now to the details of prov.psv.
Each line in prov.psv is a list delimited by the vertical bar character.
In order, the columns represent TYPE, PROCESSORS, RAM, STORAGE, and BIDDABLE, which we explain in the table below.

Column Name   Definition
TYPE          The type of instance that is being requested. This is a name defined by the provider.
PROCESSORS    The number of processors that are available to this type of instance.
RAM           The amount of Random Access Memory (RAM) that is available to this type of instance.
STORAGE       Whether there is ephemeral storage available to this instance during use.
BIDDABLE      Whether this instance should be bid upon or purchased at its standard market price.

As an example, if prov.psv contains the following:

c4.large|2|3|false|true
c4.large|2|3|false|false

Then Aether would instruct the batch processing pipeline to spin up two c4.large instances, bidding on the first and purchasing the second at its standard market price.
As an aside, it is possible to run the batch processing pipeline without the bidder if you already have a provisioning file (e.g., you generated it manually or are reusing one).
To do so, simply pass your own file, e.g. YourFileHere.psv, via the --provisioning-file argument in place of the generated prov.psv.
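A hedged sketch of such an invocation, reusing the illustrative arguments from the non-interactive example above:

aether --provisioning-file YourFileHere.psv --input-file args.txt --script job.sh --bin-dir bin --data data --name myproject --region us-east-1 --key-ID YOUR_KEY_ID --key YOUR_KEY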

Each time you submit a batch of jobs to Aether, it provides some information about the environment to the job that is being executed.
The values provided are described below, and a copy of the template is located in examples/template.
Additionally, the bin folder that was uploaded to the cloud should be available to the job script at bin.
Note that this is not the global /bin, but instead relative to the job script's execution location.
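For instance, a job script can invoke a dependency shipped in the uploaded bin folder like so (mytool is a hypothetical placeholder for one of your own dependencies):

./bin/mytool --input "$1"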

Among the values provided to each job are the memory fraction of the current node that the task can safely utilize, along with the following:

PRIMARYHOST    The static IP address of the node controlling batch processing. Will be notified upon task completion.
OUTPUTFOLDER   The location in the cloud file system to store the job outputs.
LOCALIP        The local IP of this machine.
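
A hedged sketch of how a job script might consume these values, assuming they are exposed to the script as shell variables (consult the copy in examples/template for the exact mechanism):

# upload this task's result to the designated output location
aws s3 cp result.txt "$OUTPUTFOLDER/result.txt"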

Finally, note that any item in the DATA folder may be downloaded from the cloud file system (e.g., S3) during job run-time using the AWS or Azure Command Line Interface.
These locations can be embedded in pre-generated arguments, since the S3 bucket is named after the previously chosen project name.
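For example, a job script could fetch an input file with the standard AWS CLI (paths illustrative, following the bucket naming convention above):

aws s3 cp s3://nameofproject/data/sample.fastq.gz sample.fastq.gz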

We have included several examples of using Aether below.
Because distributed systems are inherently complex and nondeterministic, we strongly recommend reading through these examples before running Aether on your own.

In examples/basic there are two files, args.txt and test.sh.
If the batch processing pipeline is run in interactive mode with args.txt as the input-file argument and test.sh as the script argument, the bidder will run first.
On the distributed compute that is then spun up, each replica node will, upon receipt of a task, write the passed line from args.txt to a file, wait for 30 seconds, upload the file to S3, and then tell the primary node that the job is complete and that another job can begin.
If there are no new jobs, the instances terminate themselves automatically.
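A hedged sketch of what a script like test.sh might contain, based on this description (the S3 upload and the completion message to the primary node may be handled by the pipeline rather than by the script itself):

#!/bin/bash
# write the passed line from args.txt to a file, then simulate work
echo "$1" > output.txt
sleep 30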

Output can be accessed in an S3 bucket that bears the same name as the project passed in via the name argument, e.g. s3://projectname.
Logs that are automatically generated are also placed in this bucket.

This is the example use of the pipeline presented in the paper, which demonstrates more complex uses.
This example is located in examples/metagenome_assembly.
To run this example in interactive mode, place assemble_sample.py in a folder with built versions of prokka and megahit, and pass the location of this folder as the bin-dir argument.
Pass batch_script.sh, which is essentially just a wrapper to run the Python script in bin, as the script argument.
Pass the location of a folder containing fastq files of paired-end metagenomic reads as the data argument.
Finally, for the input-file argument, pass a text file where each line contains the S3 locations (parameterized by the name given to the project) of two matching paired-end reads, separated by a comma, e.g. s3://nameofproject/data/samplexpe1.fastq.gz,s3://nameofproject/data/samplexpe2.fastq.gz.
For the analysis done in the paper, 1572 samples were assembled, but this script can be used on any number of samples.
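A hedged non-interactive sketch of this invocation (all values are illustrative; see aether --help for each option):

aether --input-file samples.txt --script batch_script.sh --bin-dir tools --data reads --name mymetagenomes --processors 4 --memory 16 --region us-east-1 --key-ID YOUR_KEY_ID --key YOUR_KEY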

This section discusses a number of Aether's useful, but advanced, features.
Many of these features are more involved than those discussed thus far, so we recommend that you first read through the documentation above.

Please note that your hardware must have a static IP address and must not be operated under any sort of scheduler.
To add your own hardware to a currently executing Aether pipeline, run bin/local.sh [args], where [args] are the following 9 arguments in sequential order (an example invocation follows the table):

Argument Number   Information
1                 AWS CLI Key ID
2                 AWS CLI Key
3                 AWS Region
4                 Batch Jobs Your Hardware Can Handle Concurrently
5                 Processors Needed Per Batch Job
6                 Fraction Of Node Memory That One Task Will Utilize
7                 S3 Location Where Output Should Be Uploaded To
8                 Static IP Of Primary Node
9                 For Compatibility With Scripts For AWS, Always Make This Argument "false"
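
A hedged example invocation, with all nine arguments in order (values illustrative):

bin/local.sh YOUR_KEY_ID YOUR_KEY us-east-1 4 2 0.25 s3://myproject/output 203.0.113.10 false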