Abstract

Some phylogenetic datasets omit data matrix positions at which all taxa share the same state. For sequence data this may be because of a focus on single nucleotide polymorphisms (SNPs), or because of the use of a technique such as restriction site-associated DNA sequencing (RADseq) that concentrates on variable regions. With morphological data, it is common to omit characters that show no variation across the taxa studied. It is already known that failing to correct for the ascertainment bias introduced by omitting constant positions can lead to overestimates of evolutionary divergence, because the lack of constant sites is interpreted as high divergence rather than as the result of deliberate data selection. Previous approaches that correct the likelihood function to avoid ascertainment bias have either required knowledge of the omitted positions, or have modified the likelihood function to reflect the omitted data. In this paper we show that the technique used to date for this latter approach is a conditional maximum likelihood (CML) method. An alternative approach, unconditional maximum likelihood (UML), is also possible. We investigate the performance of CML and UML and find them to have almost identical performance in the phylogenetic SNP dataset context. We also make some observations about the nucleotide frequencies observed in SNP datasets, indicating that these can differ systematically from the overall equilibrium base frequencies of the substitution process. This suggests that model parameters representing base frequencies should be estimated by maximum likelihood, and not by empirical (counting) methods.

Copyright

The copyright holder for this preprint is the author/funder. All rights reserved. No reuse allowed without permission.