Two R functions for working with DNA alignments

Recently I wrote a couple of small functions as a result of work done by myself and others in my lab group. The first is a function that determines what sites in a sequence alignment are ambiguous (i.e. not A, G, C or T).

require(ape)data(woodmouse)

is.ambig x bases ambig ambig > 0}

is.ambig(woodmouse)

This function utilises the bit-level coding scheme that Emmanuel Paradis developed for encoding sequences in R. The unambiguous bases A, G, C and T have the numeric values 136, 72, 40 and 24 respectively. This function figures out which sites don’t have these values and returns a vector of TRUEs and FALSEs, TRUEs being ambiguous bases.

The second function is an implementation of Tajima’s K, published as equation A3 in Tajima 1983

tajima.K res if(prop) res res}

tajima.K(woodmouse)

This function calculates the mean number of sites that are different between any two sequences. As a default, it returns the result as a proportion of the length of the alignment. Setting prop = FALSE will return the result as the actual number of sites.