microbiomes: Picking Interesting Taxonomic Abundance

microPITA is a computational tool for selecting samples in two-stage (tiered) studies. Two-stage designs can allocate resources more efficiently, reducing study costs and maximizing the use of samples. Samples from a survey study can be selected to target various microbial communities, including:

Samples with the most diverse community (maximum diversity);

Samples dominated by specific microbes (targeted feature);

Samples with microbial communities representative of the survey (representative dissimilarity);

Samples with the most extreme microbial communities in the survey (most dissimilar);

Given a phenotype (like disease state), samples at the border of phenotypes (discriminant) or samples typical of each phenotype (distinct).

Additionally, methods can leverage clinical metadata by stratifying samples into groups. This enables the use of microPITA in cohort studies.

Expected input file.

I. PCL file definition
Although some defaults can be changed, microPITA expects a PCL file as input. Several PCL files are supplied by default in the input directory. A PCL file is a delimited text file, similar to an Excel spreadsheet, with the following characteristics:

1. Rows represent metadata and features (bugs); columns represent samples.
2. By default, the first row should contain the sample IDs.
3. Metadata rows should come next.
4. Rows containing feature (bug) measurements, such as abundance, should come after the metadata rows.
5. The first column should contain the ID describing each row. For metadata this may be, for example, "Age" for a row containing the age of the patients donating the samples. For measurements, this should be the feature name (bug name).
6. By default, the file is expected to be TAB delimited.
7. If a consensus lineage or taxonomic hierarchy is contained in the feature name, the default delimiter between clades is the pipe ("|").
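
The layout described above can be sketched in a few lines of Python. This is an illustrative reader, not microPITA's own parser: the function name parse_pcl, its parameters, and the example rows ("Age", "Label", the two bug names) are assumptions made up for the sketch.

```python
import csv
import io

def parse_pcl(text, last_metadata_id, delimiter="\t", feature_delimiter="|"):
    """Split PCL text into sample IDs, metadata rows, and feature rows.

    Hypothetical helper: it only mirrors the layout rules listed above.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    sample_ids = rows[0][1:]  # first row: an ID cell, then one sample ID per column
    metadata, features = {}, {}
    in_metadata = True
    for row in rows[1:]:
        row_id, values = row[0], row[1:]
        if in_metadata:
            metadata[row_id] = values
            # The row whose ID equals last_metadata_id is the final metadata row.
            if row_id == last_metadata_id:
                in_metadata = False
        else:
            # Feature names may carry a lineage, e.g. "Bacteria|Firmicutes".
            clades = tuple(row_id.split(feature_delimiter))
            features[clades] = [float(value) for value in values]
    return sample_ids, metadata, features

# Made-up example data, not one of the supplied input files.
example = (
    "ID\tSample1\tSample2\n"
    "Age\t34\t51\n"
    "Label\tcase\tcontrol\n"
    "Bacteria|Firmicutes\t0.6\t0.2\n"
    "Bacteria|Bacteroidetes\t0.4\t0.8\n"
)
sample_ids, metadata, features = parse_pcl(example, last_metadata_id="Label")
print(sample_ids)
print(sorted(metadata))
print(sorted(features))
```

The last_metadata_id argument plays the role of the keyword that marks the final metadata row; everything after that row is treated as a feature.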

II. Targeted feature file
If using the targeted feature methodology, you will need to provide a text file listing the feature(s) of interest. Each feature should be on its own line and should be written as found in the input PCL file.

Basic unsupervised methods

Please note, all calls to microPITA should work interchangeably with PCL or BIOM files. BIOM files do not require the --lastmeta or --id arguments.

There are four unsupervised methods which can be performed:
diverse (maximum diversity), extreme (most dissimilar), representative (representative dissimilarity) and features (targeted feature).

The first three methods are performed as follows (selecting a default 10 samples):

Each of the previous commands is made up of the following pieces:
1. python MicroPITA.py, which calls the microPITA script.
2. --lastmeta, which indicates the keyword (first column value) of the last row that contains metadata (PCL input only).
3. -m, which indicates the method to use in selection.
4. input/Test.pcl or input/Test.biom, which is the first positional argument and indicates the input file.
5. output.txt, which is the second positional argument and indicates where to write the output file.
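
As a hedged illustration, the pieces listed above can be assembled into command lines with a small Python helper. The helper build_command is hypothetical, and "Label" is only a placeholder for the keyword of the last metadata row; substitute the first-column value from your own file.

```python
import shlex

def build_command(method, input_path, output_path, last_metadata=None):
    """Assemble a microPITA call from the pieces listed above (illustrative only)."""
    args = ["python", "MicroPITA.py"]
    if last_metadata is not None:
        # --lastmeta applies to PCL input only.
        args += ["--lastmeta", last_metadata]
    args += ["-m", method, input_path, output_path]
    return shlex.join(args)

# "Label" is a placeholder keyword, not necessarily the last metadata row of Test.pcl.
commands = [
    build_command(method, "input/Test.pcl", "output.txt", last_metadata="Label")
    for method in ("diverse", "extreme", "representative")
]
for command in commands:
    print(command)
```

Each printed line has the form python MicroPITA.py --lastmeta KEYWORD -m METHOD INPUT OUTPUT, which can be pasted into a shell once the keyword is adjusted.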

Selecting specific features has additional arguments to consider: --targets (required) and --feature_method (optional).

These additional arguments are described as:
1. --targets The path to the file that has the features (bugs or clades) of interest. Make sure they are written as they appear in your input file!
2. --feature_method The method of selection, which can be based on ranked abundance ("rank") or abundance ("abundance"). The default value is rank.
The two methods differ as follows: rank tends to select samples in which the feature dominates the sample regardless of its abundance.
Abundance tends to select samples in which the feature is most abundant, without a guarantee that the feature is the most abundant feature in the sample.
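
The difference between the two feature methods can be shown on a toy example. The sketch below is not microPITA's scoring code; the function select_by_feature, the tie-breaking rule, and the counts are assumptions chosen only to make the contrast visible.

```python
def select_by_feature(samples, target, count, method="rank"):
    """Pick `count` sample IDs for one target feature (illustrative only).

    samples maps sample ID -> {feature name: abundance}.
    """
    def rank_of_target(profile):
        # Rank 1 means the target is the most abundant feature in the sample.
        return 1 + sum(1 for value in profile.values() if value > profile[target])

    if method == "rank":
        # Best rank first; ties broken by higher abundance (an assumption).
        key = lambda sid: (rank_of_target(samples[sid]), -samples[sid][target])
    elif method == "abundance":
        key = lambda sid: -samples[sid][target]
    else:
        raise ValueError("method must be 'rank' or 'abundance'")
    return sorted(samples, key=key)[:count]

# Made-up counts for three samples and three bugs.
samples = {
    "S1": {"BugA": 300, "BugB": 250, "BugC": 450},  # BugA abundant but not dominant
    "S2": {"BugA": 20, "BugB": 10, "BugC": 5},      # BugA dominant at low abundance
    "S3": {"BugA": 5, "BugB": 600, "BugC": 350},
}
by_rank = select_by_feature(samples, "BugA", 1, method="rank")
by_abundance = select_by_feature(samples, "BugA", 1, method="abundance")
print(by_rank, by_abundance)
```

Here the rank method picks S2, where BugA is the top feature despite a low count, while the abundance method picks S1, where BugA has the highest count but is outnumbered by BugC.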

Basic supervised methods

Two supervised methods are also available:
distinct and discriminant

These methods require an additional argument, --label, which is the first column keyword of the row used to classify samples for the supervised methods.
These methods can be performed as follows:

Custom alpha- and beta-diversities

The default alpha-diversity for the maximum diversity sampling method is inverse Simpson; the default beta-diversity for representative and most dissimilar selection is Bray-Curtis dissimilarity. There are several mechanisms that allow one to change this. You may:

1. Choose from a selection of alpha-diversity metrics.
Note that supplying an alpha-diversity affects the maximum diversity sampling method only. Please make sure to use a diversity metric where a larger number indicates a higher diversity. If this is not the case, make sure to use the -f or --invertDiversity flag to invert the metric. The inversion is multiplicative (1/alpha-metric).

2. Choose from a selection of beta-diversity metrics.
Note that supplying a beta-diversity affects both the representative and most dissimilar sampling methods. The metric as given is used for the representative method, while 1-beta-metric is used for the most dissimilar method.
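
For reference, the two default metrics and the two inversions mentioned above can be written out directly. This is a minimal sketch using the standard textbook formulas, not microPITA's implementation; the function names and the example counts are assumptions.

```python
def inverse_simpson(abundances):
    """Inverse Simpson index: 1 / sum(p_i^2) over relative abundances p_i."""
    total = sum(abundances)
    return 1.0 / sum((a / total) ** 2 for a in abundances)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum(|a_i - b_i|) / sum(a_i + b_i)."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator

# Made-up counts for two samples over the same four features.
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]

alpha_even = inverse_simpson(even)      # larger value: more diverse
alpha_skewed = inverse_simpson(skewed)  # close to 1: dominated by one feature
beta = bray_curtis(even, skewed)

# The two inversions described in the text.
inverted_alpha = 1.0 / alpha_even  # multiplicative inversion (1/alpha-metric)
inverted_beta = 1.0 - beta         # 1-beta-metric, used for most dissimilar
print(alpha_even, alpha_skewed, beta, inverted_alpha, inverted_beta)
```

The even community scores higher on inverse Simpson than the skewed one, which is the "larger number means more diverse" orientation the maximum diversity method expects.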

Note on using UniFrac: both weighted and unweighted UniFrac are available. Make sure to supply the associated tree (-o, --tree) and environment (-i, --envr) files, and indicate the use of UniFrac with (-b, --beta).

When using a supervised method, the number of samples to select (-n) indicates how many samples will be selected per class of sample. For example, if you are performing supervised selection of 6 samples (-n 6) on a dataset with 2 classes (values) in its label row, you will get 6 x 2 = 12 samples. If a class has fewer than 6 samples, you will get the maximum possible for that class. For instance, if you are selecting 6 samples (-n 6) from two classes but one class has only 3 samples, you will get 6 + 3 = 9 selected samples.
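
The arithmetic for -n under supervised selection reduces to capping each class at its own size. The helper below is a hypothetical illustration of that rule, with made-up class names and sizes.

```python
def supervised_selection_size(class_sizes, per_class):
    """Total samples selected when up to `per_class` are taken from each class."""
    return sum(min(size, per_class) for size in class_sizes.values())

# Two classes with at least 6 samples each: 6 x 2 = 12.
full = supervised_selection_size({"case": 10, "control": 8}, per_class=6)
# One class has only 3 samples: 6 + 3 = 9.
short = supervised_selection_size({"case": 10, "control": 3}, per_class=6)
print(full, short)
```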

Stratification:
To stratify any method, use the --stratify argument, which is the first column keyword of the metadata row used to stratify samples before selection occurs. (Selection occurs independently within each stratum.) This example stratifies diverse selection by the "Label".
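
Conceptually, stratification groups samples by a metadata value and then runs the chosen selection separately inside each group. The sketch below illustrates that idea only; stratified_select, the diversity scores, and the labels are assumptions, and microPITA's internals may differ.

```python
from collections import defaultdict

def stratified_select(sample_ids, strata_values, select, count):
    """Apply `select(members, count)` independently within each stratum."""
    groups = defaultdict(list)
    for sample_id, stratum in zip(sample_ids, strata_values):
        groups[stratum].append(sample_id)
    return {stratum: select(members, count) for stratum, members in groups.items()}

# Made-up diversity scores and a made-up "Label" metadata row.
scores = {"S1": 3.2, "S2": 1.1, "S3": 2.7, "S4": 4.0, "S5": 0.9}
labels = ["case", "case", "control", "control", "control"]

def most_diverse(members, count):
    # Stand-in for maximum diversity selection: highest scores first.
    return sorted(members, key=scores.get, reverse=True)[:count]

picked = stratified_select(list(scores), labels, most_diverse, count=1)
print(picked)
```

Without stratification the single most diverse sample would be S4; with it, one sample is returned per label value.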

Changing PCL file defaults:
Some PCL files have feature metadata. These are columns of data that comment on bug features (rows) in the file. An example could be a taxonomic clade for each bug feature. If this type of data exists, please use -w or --lastFeatureMetadata to indicate the last column of feature metadata before the first sample column. For an example, see PCL-Description.txt in docs. This only applies to PCL files.

MicroPITA assumes the first row of the input file contains the sample IDs. If it does not, you may use --id to indicate the row.
--id expects the entry in the first column of your input file that identifies the row used as sample IDs. See the input file and the following command as an example.
This only applies to PCL files.

MicroPITA assumes the input file is TAB delimited, and we strongly recommend you use this convention. If not, you can use --delim to change the delimiter used to read in the file.
Here is an example of reading the comma-delimited file micropita/input/CommaDelim.pcl.
This only applies to PCL files.

MicroPITA assumes that feature names containing a consensus lineage or full taxonomic hierarchy are delimited with a pipe ("|"). We strongly recommend you use this default. The delimiter of the feature name can be changed using --featdelim. Here is an example of reading in a file with periods as the delimiter.
This only applies to PCL files.