An exercise for learning to use PAUP*, with emphasis on model selection

The data set we will use for this exercise is simulated-dna-data.nex. It is a simulated data set consisting of DNA sequences for 9 organisms and 2500 sites (taxon 9 is the outgroup). I won’t give away much more than that about the simulation conditions for now, but you can assume that the simulation assumed a homogenous Markov process (i.e., all sites evolving according to the same model, and the model does not change over the tree).

In this key, (terse) answers to the questions can be obtained by hovering your mouse momentarily over the word "answer". Lines beginning with “paup> “ are PAUP* input command lines.

Part 1: Basic data manipulation and parsimony analysis

Note: “$” represents the Linux/MacOS prompt. All other commands are entered at the prompt issued by PAUP* ("paup>").

If you are using the cluster version of paup you can download the dataset into your home directory by using

(Note: the "-f" flag used above is not ordinarily needed, but is being used to work around a problem with the installation on the MBL cluster.)

Conduct an exact (branch-and-bound) tree search to the most parsimonious tree under simple parsimony with equal weights.

bandb;

Alternatively, you can do an "exhaustive" search that evaluates all possible trees:

alltrees;

What is the length of the optimal tree? answer

Examine the topology of the tree.

showtrees all;

An alternative command for seeing the tree that provides more information is the DescribeTrees command.

describe all;

Mac and Windows users can also see and manipulate a graphical view of the tree using the Trees:Print/Preview Trees
menu command. Note: this will not work on the MBL cluster (which most of you will be using)

Change the outgroup status so that the only outgroup taxon is taxon 9 and re-examine the tree topology. You can

refer to taxa by either their numbers or their names.

outgroup 9;

Now show the trees again using the ShowTrees or DescribeTrees command. What happens to the shape of the tree?
answer

Use the PScores command (or Trees:Tree Scores:Parsimony menu command) to calculate the length (= parsimony score) of the tree.

pscores;

Did the tree length change? answer

Do a heuristic search using PAUP*’s default settings using the HSearch command or the Analysis:Heuristic Search menu command).

hsearch;

Perform another heuristic search using the random-addition-sequence method with 100 replicates.

hsearch addseq=random nrep=100;

Does it look like the optimal trees are hard to find for this data set? answer

Describe the most-parsimonious tree as a phylogram (branch lengths proportional to number of changes assigned by parsimony).

describe/plot=p;

Note that the slash character ('/') is needed in the command to separate the (optional> tree list from the remaining command options. Many commands in PAUP* work this way.

At this point, it will be useful to make a quick drawing of the parsimony tree for later reference. (Ordinarily, I would have you print the tree, but printing is not available in the MBL course.)

Use the bootstrap method for assessing confidence on the groups inferred by parsimony analysis. We will use 1,000 pseudo-replicated data sets and 10 random-addition-sequence replicates per bootstrap replicate.

bootstrap nreps=1000/addseq=random nreps=10;

Which grouping on the most-parsimonious tree is least well supported as assessed by the bootstrap? answer

Perform a constrained search that forces taxa 1 and 6 to be monophyletic. First we need to define a constrain tree:

constraints 1and6 = ((1,6));

Note that the name 1and6' is a name that you choose to identify the constaint. You can use the ShowConstraints command (or Analysis:Show Constraints menu command) to verify that the constrain tree has been entered correctly:

showconstr;

Note that the above syntax for specifying the topology of the constraint tree is a shorthand for a longer equivalent command:

constraints 1and6 = ((1,6),2,4,5,7,8,9);

The "shorthand" is that taxa connected directly to the root can be omitted from the Newick-format tree description.

What is the shortest tree that makes taxa 1 and 6 monophyletic with respect to the remaining taxa? answer

hsearch enforce constraints=1and6

Part 2: Distance analyses

Note: this section is optional. If time is short, you should probably proceed to the likelihood section and come back to this section later.

Reset all program options to their “factory defaults” (PAUP* remembers most option settings until they are changed again, and sometimes it's easiest to just start "fresh" rather than worrying about each individual setting.)

How does the distance chosen affect the tree found by NJ for this data set? answer

How does the distance chosen affect the magnitude of the pairwise distance estimates? answer

Are smaller distances less affected or more affected by the choice of a distance? answer

Using the Tamura-Nei distance, search for a tree under the minimum evolution (distance) criterion:

set criterion=distance;
dset objective=me dist=tamnei;
hsearch;

What is the tree score under this criterion? answer

Evaluate the same tree under the Fitch-Margoliash weighted least-squares criterion (inverse squared weighting)?

dscores/objective=ls power=2;

What is the WLS score of the tree? answer

("Wtd SS" refers to the "average percent standard deviation" defined by Fitch and Margoliash, and is a normalization of the raw sum of squared deviations.)

Part 3: Likelihood analysis and model comparison

Perform a heuristic search for an optimal tree under the likelihood criterion using PAUP*’s default model settings.

The default model is an HKY model with a transition/transversion ratio of 2.0 and unequal base frequencies (estimated empirically).

set criterion=likelihood;
hsearch;

“Describe” the tree and examine the topology.

describe;

How does it compare to the trees found previously using parsimony and distance methods? answer

What is the likelihood score for this tree? answer

Set up a model with unequal base frequencies (to be estimated as parameters) and two substitution types with ti/tv ratio estimated from the data (LSet command or Analysis:Likelihood Settings menu command). Then use the LScores (Trees:Tree Scores:Likelihood menu command) to estimate the ti/tv ratio on the tree found in the previous step.

lset tratio=estimate basefreq=estimate;
lscores;

Note that you can combine the LSet and LScores commands into a single command, which is convenient when
trying alternative models:

lscores/tratio=estimate basefreq=estimate;

What is the estimate of the ti/tv ratio? answer

How does it compare to the previous value (i.e., the default setting)? answer

What effect did optimization of the ti/tv ratio have on the likelihood score for the tree? answer

Fix the ti/tv ratio and base frequencies to the values you estimated in the step above. Perform another heuristic search using the modified settings.

lset fixall;
hsearch;
describe;

Did the tree topology change? answer

Using the tree topology found in the above search, fit models that assume that rates across sites are (1) gamma distributed, (2) equal rates at variable sites but with some sites invariable, and (3) both gamma-distributed rates and invariable sites.

Now we can assess whether either of these models is justified according to a likelihood ratio test. (I typically prefer using the AIC for model selection, but this just shows how the LRT would be performed.)

The HKY model is strongly rejected in favor of Tamura-Nei. GTR model is unnecessarily complex (fails to reject Tamura-Nei). The AIC would also prefer the Tamura-Nei model (3 fewer parameters, only about 1.5 log likelihood units worse)

Search for an optimal tree using the model identified above. Note that we first fix all model parameters to their maximum-likelihood estimates before doing the search.