This pack contains 3 workflows that perform and validate Bayesian phylogenetic inference; they differ in the kind of input they accept. The pack is called "short" because the workflows require the user to keep the Taverna engine running for the whole duration of the analysis. This can be problematic for large analyses; in that case, see the "Perform Long Bayesian Phylogenetic Inference" pack.

3) PartitionFinder defines the best candidate partitioned evolutionary model, based on a maximal set of possible partitions proposed by the user

Note: a partitioned model is a model that allows different groups of sites (i.e. columns of the MSA) to follow the rules or parameter sets of different evolutionary models. MrBayes allows some parameters to be shared across partitions while others differ. Submit workflow 2, based on the GUI, takes full advantage of this feature, whereas submit workflow 3, based on PartitionFinder, optionally shares branch lengths across partitions and always shares the topology.

All 3 workflows perform 2 validations on the inference: one on the numerical integration (GeoKS) and one on the fit of the model (posterior predictive test).

In this pack, the steps of a complete phylogenetic inference are divided as follows:

1. Define model framework
2. Estimate parameters of the model framework with a Markovian Integration within a Bayesian framework
3. Validate the convergence of the Markovian Integration with a test on the overlap of the tree posterior distribution of two or more independent runs (GeoKS)
4. Test the adequacy or goodness of fit of the model with a posterior predictive test

**3.2.1 Define model framework:**

MrBayes allows the user to define a partitioned model of evolution for the characters of the MSA in order to perform a given phylogenetic inference. A partitioned model is a model that allows different groups of sites (i.e. columns of the MSA) to follow the rules or parameter sets of different evolutionary models.

Within MrBayes it is possible to define 5 groups of parameters, each of which can be shared or not across user-defined groups of sites.

Workflow 1 "All is ready to run" leaves the user the duty of defining and formatting a model with the criteria and tools they prefer.

Workflow 2 "Select Model", based on a GUI, helps the user to define a model in which each of the 5 groups of parameters is independently shared or not across user-defined partitions, and takes care of the formatting.

Workflow 3 "Select Model for Me", based on PartitionFinder ([http://www.robertlanfear.com/partitionfinder/][7]), defines the best partitioning of the sites starting from a maximal partitioning proposed by the user. However, the decision to share or not across partitions is taken jointly for parameter groups 1, 2 and 3; branch lengths can be shared across all partitions or none (they cannot be shared across only some partitions); and the topology must be shared across all sites.

**3.2.2 Estimate parameters of the model framework with a Markovian Integration within a Bayesian framework:**

All 3 submit workflows send their input, defined and formatted in different ways, to the same service, which starts a MrBayes run (refs 1, 2, 3).

**3.2.3 Validate the convergence of the Markovian Integration with a test on the overlap of the tree posterior distribution of two or more independent runs (GeoKS):**

The results of the MrBayes service are loaded by the retrieval workflow, which performs a convergence test based on the overlap of the tree posterior distributions of two or more independent runs. Details on this link

**3.2.4 Test the adequacy or goodness of fit of the model with a posterior predictive test:**

The service, written in Python, reads each set of parameter estimates from a sub-sample of the overall posterior distribution and simulates a new MSA from it (using the evolver utility from the PAML 1.4 package). The distribution of the simulated MSAs is compared with the original MSA based on the sum of the site entropies, as proposed by Bollback (2002) [ref 4].

A histogram is drawn of the distribution of the complexities (log of the sum of site entropies, or maximal possible log-likelihood score) of the data simulated from the posterior distribution of the parameters, and compared with the complexity of the observed data. The 1-alpha highest posterior density (HPD) region of the distribution shows where the complexity of the simulated data matches the observed one. A larger observed complexity indicates that the model is too simplistic, while the contrary indicates overparametrization of the model.
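As a rough illustration of the "maximal possible log-likelihood score", the sketch below computes the unconstrained (multinomial) log-likelihood of an alignment from its site-pattern counts. It assumes that this is the statistic meant by the text (following Bollback 2002); the function name `multinomial_complexity` is hypothetical and is not taken from the service's code.

```python
import math
from collections import Counter


def multinomial_complexity(msa):
    """Unconstrained (multinomial) log-likelihood of an alignment.

    msa: list of aligned sequences of equal length, one string per taxon.
    Assumption: the statistic is sum_i N_i*ln(N_i) - N*ln(N), where N_i is
    the count of the i-th distinct site pattern and N the number of sites.
    """
    n_sites = len(msa[0])
    if any(len(seq) != n_sites for seq in msa):
        raise ValueError("all sequences must have the same length")
    # A site pattern is one alignment column read across all taxa.
    patterns = Counter(zip(*msa))
    return (sum(c * math.log(c) for c in patterns.values())
            - n_sites * math.log(n_sites))


# Toy alignment: 3 taxa, 6 sites, made up for illustration only.
observed = multinomial_complexity(["ACGTAC", "ACGTAC", "ACGAAC"])
print(observed)
```

In a posterior predictive test the same function would be applied to every simulated alignment, and the observed value would then be located within the resulting distribution.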

The Newick representation of the consensus tree of the inference is sent to the iTOL web service ([see details at http://itol.embl.de/][9]). The service displays an interactive graphical representation of the tree. After manipulation and editing, the tree can be exported in several graphical formats or downloaded in Newick or phyloXML format. Reloading the tree in one of these formats together with a user-defined annotation table allows very powerful graphical representations of the tree (see details at [http://itol.embl.de/help/help.shtml][10]).

All 3 workflows require a local Rserve instance to draw the graphical output of the posterior predictive test.

Workflow 3, based on PartitionFinder, requires a Python interpreter installed on the path and accessible to the Taverna engine. The script requires no additional Python modules and, although tested only with Python 2.7, should also work with older Python versions.

Depending on which of the 3 scenarios presented in 3.2.1 applies, the user will choose one of the 3 submit workflows.

**5.2. Input data**

**5.2.1. Data preparation/format**

Workflow 1: the input (called "NexusFile") is in NEXUS format, with a data block and a mrbayes block. The data block specifies the MSA together with the type of possible character states, while in the mrbayes block the user needs to define the evolutionary model and the parameters of the Markovian integration (number of generations, temperature, number of chains for the Metropolis-coupled part of the algorithm, and number of replicate runs used to check convergence). See details in [http://mrbayes.sourceforge.net/wiki/index.php/Manual_3.2][13]
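To make the layout of such a file concrete, the following Python 3 sketch assembles a minimal NEXUS text with a data block and a mrbayes block. The helper `build_nexus`, the GTR+gamma model line and the default values are illustrative assumptions, not the settings expected by the workflow; the MrBayes manual linked above remains the authority on the commands and on what the service accepts.

```python
def build_nexus(sequences, ngen=100000, nruns=2, nchains=4, temp=0.1):
    """Return a minimal NEXUS text for MrBayes (illustrative helper).

    sequences: dict mapping taxon name -> aligned DNA sequence.
    """
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned (same length)")
    nchar = lengths.pop()

    lines = [
        "#NEXUS",
        "",
        "begin data;",
        f"  dimensions ntax={len(sequences)} nchar={nchar};",
        "  format datatype=dna missing=? gap=-;",
        "  matrix",
    ]
    lines += [f"  {name}  {seq}" for name, seq in sequences.items()]
    lines += [
        "  ;",
        "end;",
        "",
        "begin mrbayes;",
        # Example model only: GTR with gamma-distributed rate variation.
        "  lset nst=6 rates=gamma;",
        # Generations, replicate runs, coupled chains and heating temperature.
        f"  mcmc ngen={ngen} nruns={nruns} nchains={nchains} temp={temp};",
        "end;",
    ]
    return "\n".join(lines) + "\n"


print(build_nexus({"taxon1": "ACGTACGT", "taxon2": "ACGTACGA", "taxon3": "ACGAACGA"}))
```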

Workflow 3 also requires a text in which the user defines the maximal set of parts of the alignment, i.e. the finest subdivision into groups that the sites could have. Note that PartitionFinder tries all possible combinations of these parts, so it is not advised to propose more than 14 parts.

Ranges with a step (e.g. “every third base”) are expressed with a backslash (e.g. gene1 = 10-30\3;)

So a complex but realistic example could be:

utr5 = 1-29;

cds_pos1 = 30-200\3 400-500\3;

cds_pos2 = 31-200\3 399-500\3;

cds_pos3 = 32-200\3 398-500\3;

intron = 201-397;

utr3 = 501-600;
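Because a site should belong to only one part, it can help to expand a definition into explicit site lists before submitting it. The Python 3 sketch below does this; `parse_parts` and `find_overlaps` are hypothetical helper names, and accepting either `\` or `/` as the step separator is a convenience of this sketch rather than documented PartitionFinder behaviour.

```python
import re


def parse_parts(text):
    """Return {part_name: sorted list of 1-based site positions}."""
    parts = {}
    for statement in text.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        name, ranges = (s.strip() for s in statement.split("=", 1))
        sites = set()
        for token in ranges.split():
            # start, optional "-end", optional step after "\" or "/"
            m = re.fullmatch(r"(\d+)(?:-(\d+)(?:[\\/](\d+))?)?", token)
            if m is None:
                raise ValueError(f"cannot parse range {token!r}")
            start = int(m.group(1))
            end = int(m.group(2) or start)
            step = int(m.group(3) or 1)
            sites.update(range(start, end + 1, step))
        parts[name] = sorted(sites)
    return parts


def find_overlaps(parts):
    """Return {site: [part names]} for sites claimed by more than one part."""
    owners = {}
    for name, sites in parts.items():
        for site in sites:
            owners.setdefault(site, []).append(name)
    return {site: names for site, names in owners.items() if len(names) > 1}


# Two parts that both claim site 30 (made-up input for illustration).
parts = parse_parts(r"a = 1-30; b = 30-200\3;")
print(find_overlaps(parts))
```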

**5.2.2. Other input**

file name: name for the NEXUS file produced on the basis of the PartitionFinder results

number of MCMCMC generations: integer indicating the number of generations to be used in the Markovian integration

number of runs: integer indicating how many independent runs of MrBayes need to be performed (the more runs, the better convergence is detected)

branch lengths linked or unlinked across partitions: see 3.2.1 for details

criterion to be used by PartitionFinder: the criteria are AIC, AICc and BIC, all of which are information criteria. In a general statistical framework AICc should always be preferred to AIC, but in phylogenetics there is a dispute on how to count constant sites. If your MSA has very few constant sites, or you selected your loci randomly, use AICc without doubt. For large MSAs, AIC and AICc give similar results.
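For reference, the three criteria can be computed from the log-likelihood, the number of free parameters and the sample size, as in the Python 3 sketch below. The function name is hypothetical, and taking the sample size `n` to be the alignment length is only one convention; as noted above, how sites should be counted is debated.

```python
import math


def information_criteria(log_likelihood, k, n):
    """Return AIC, AICc and BIC for a fitted model.

    log_likelihood: maximised log-likelihood of the model
    k: number of free parameters
    n: sample size (assumed here to be the number of alignment sites)
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    aic = 2 * k - 2 * log_likelihood
    # Small-sample correction; it vanishes as n grows relative to k.
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    bic = k * math.log(n) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


# Made-up numbers: the AICc correction shrinks as the alignment gets longer.
print(information_criteria(-1000.0, k=10, n=500))
print(information_criteria(-1000.0, k=10, n=50000))
```

The model with the lowest value of the chosen criterion is the preferred one.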

**5.3. Select Dialogue boxes**

For each web service called, a message tells the user the name of the service and the job id. The message disappears when the user pushes any of the buttons, or when another web service is called before any action is taken. The message lets the user know which point of the workflow has been reached and gives the job id that allows the service centre to identify the job in case of failure.

Workflow 2 has a large and complex, but self-explanatory, web page. To start, paste an aligned multi-FASTA file into the only visible text window and click confirm. The following choices are explained by the yellow buttons beside each question or pull-down menu.

Result of the posterior predictive test. Histogram of the distribution of the complexities (log of the sum of site entropies, or maximal possible log-likelihood score) of the data simulated from the posterior distribution of the parameters, compared with the complexity of the observed data. The 1-alpha highest posterior density (HPD) region of the distribution shows where the complexity of the simulated data matches the observed one. A larger observed complexity indicates that the model is too simplistic, while the contrary indicates overparametrization of the model.

**Newicktree**

Description:

A small XML document with a res tag containing one or more tree tags, each with a body that contains a tree in Newick format.

Each tree represents the consensus for a given partition or group of partitions, in the same order in which they are cited in the NEXUS input file of MrBayes.
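A minimal Python 3 sketch for reading this output with the standard library follows. It assumes that `res` is the root element and that `tree` elements have no attributes or namespaces, which the description above suggests but does not guarantee; the sample XML is made up.

```python
import xml.etree.ElementTree as ET


def extract_newick_trees(xml_text):
    """Return the Newick strings found in the <tree> elements, in order."""
    root = ET.fromstring(xml_text)
    if root.tag != "res":
        raise ValueError(f"unexpected root element: {root.tag!r}")
    return [(tree.text or "").strip() for tree in root.iter("tree")]


# Made-up example of the expected structure.
sample = """<res>
  <tree>((A:0.1,B:0.2):0.05,C:0.3);</tree>
  <tree>(A:0.2,(B:0.1,C:0.1):0.3);</tree>
</res>"""

for newick in extract_newick_trees(sample):
    print(newick)
```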

**Geoks**

Description:

Result of the GeoKS test of convergence. XML format

**Phyloinferenceoutput***

Description:

The output of the consensus service is a path where a zipped folder can be obtained; the folder includes all output of the phylogenetic inference and of the consensus calculation.

**Viewtree***

Description:

Link to visualize the consensus tree of the phylogenetic inference. There can be more than one link if the model assumes more than one tree (combination of topology + branch length set).

Use of the figure in the link below should always be accompanied by the appropriate citation:

Main output of PartitionFinder, where the preferred partitioned model is described

**Log**

Description:

Log file of all applications used in the workflow. For each external web service, it reports the name of the application, the job id required to track errors at the service provider, and the last 200 lines of the standard output + standard error of the application as seen by the local system where the job was run.

**Partitionfinderdetails***

Description:

Path where all the details of the PartitionFinder run can be retrieved

**Geoksdetails***

Description:

Path where all details of the convergence test calculation can be obtained.

Outputs marked with * are also exposed as a downloadable link in the web browser as soon as they are produced

**5.5. Results analysis**

If the GeoKS test fails (the p-value is lower than the risk accepted by the user), the workflow stops and performs neither the posterior predictive test nor the consensus tree calculation. The user is invited to look at the GeoKS result and increase the number of generations by a factor of at least 1.5.

If the GeoKS test does not fail (the p-value is higher than the risk accepted by the user), the user is invited to check that GeoKS does not advise more generations. The program advises more generations if fewer than 300 Effective Sample Size (ESS) trees are present in the posterior distribution; with fewer than 300 ESS trees the test is not guaranteed to have sufficient power to detect convergence correctly. The user should then look at the plot output and check whether the red line (observed MSA complexity) lies within the HPD of the distribution. If the red line is to the right of the distribution, a more complex model needs to be considered. If the red line is to the left of the distribution, a simpler model could be entertained, and branch supports are excessively conservative.
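The decision rules of the two paragraphs above can be summarised in a small Python 3 sketch. The function `next_step` and its return labels are hypothetical and only restate the text; the p-value and the ESS count would have to be read from the GeoKS output, whose exact format is not described here.

```python
import math


def next_step(p_value, alpha, ess_trees, generations, min_ess=300, growth=1.5):
    """Suggest what to do after the GeoKS convergence test.

    p_value: GeoKS p-value; alpha: risk accepted by the user;
    ess_trees: effective sample size of trees in the posterior;
    generations: number of generations used in the run just finished.
    Returns (label, suggested minimum number of generations or None).
    """
    if p_value < alpha:
        # Test failed: rerun with at least 1.5 times more generations.
        return ("rerun", math.ceil(generations * growth))
    if ess_trees < min_ess:
        # Test passed but may lack power: follow the GeoKS advice.
        return ("more generations advised", None)
    return ("inspect posterior predictive plot", generations)


print(next_step(p_value=0.01, alpha=0.05, ess_trees=500, generations=100000))
print(next_step(p_value=0.40, alpha=0.05, ess_trees=120, generations=100000))
print(next_step(p_value=0.40, alpha=0.05, ess_trees=450, generations=100000))
```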

Once both tests are passed, the user can inspect the consensus tree on iTOL, decorate it at will and save it to a file for publication. The bare Newick tree can also be extracted using the save tree option of iTOL or directly from the newick output port.

Parameter Inputs (7)

Email (text/plain)

File_name (text/plain)

Linking_branches (text/plain)

Should models assume that branches in the tree are linked or unlinked across alignment parts?
Two possible strings:
linked
unlinked

Example value:

linked

Number_of_MCMCMC_generations (text/plain)

Description:

Number of generations for each run of the Markovian integration required for the phylogenetic inference. The convergence test (GeoKS) will tell post hoc whether the number was sufficient. If it was not, try a larger number following the educated guess of GeoKS.

Example value:

100000

Number_of_runs (text/plain)

Description:

Number of independent runs used to check convergence of the Markovian integration required for Bayesian phylogenetic inference. Must be an integer; minimum value 2.

Example value:

2

SampleSize (text/plain)

Description:

Size of the sample drawn from the posterior distribution to test the fit of the model to the data. This integer should be smaller than the number of generations divided by 100.
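The constraints stated for the inputs above can be checked before launching a run, for instance with the Python 3 sketch below. The function `check_inputs` is a hypothetical helper that takes already-converted values; it is not part of the workflow.

```python
def check_inputs(linking_branches, generations, runs, sample_size):
    """Return a list of problems found in the workflow inputs (empty if none)."""
    problems = []
    if linking_branches not in ("linked", "unlinked"):
        problems.append("Linking_branches must be 'linked' or 'unlinked'")
    generations_ok = isinstance(generations, int) and generations > 0
    if not generations_ok:
        problems.append("Number_of_MCMCMC_generations must be a positive integer")
    if not isinstance(runs, int) or runs < 2:
        problems.append("Number_of_runs must be an integer >= 2")
    if not isinstance(sample_size, int) or sample_size <= 0:
        problems.append("SampleSize must be a positive integer")
    elif generations_ok and sample_size >= generations / 100:
        problems.append("SampleSize must be smaller than generations / 100")
    return problems


# With the example values given above (100000 generations, 2 runs),
# any SampleSize below 1000 satisfies the stated constraint.
print(check_inputs("linked", 100000, 2, 500))
print(check_inputs("both", 100000, 1, 5000))
```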