By Kevin Keenan

To say I don't like Apple Macs would be unfair. For example, I own an iPhone, an iPod and an iPad (I know, I know, that sounds very like the old canard, "I'm not X, some of my best friends are Y", but for tech geeks). The reason I've never owned a Mac computer is probably some combination of their lack of popularity among my peers and their cost relative to PCs of similar specs. I never thought my lack of familiarity with Mac OS X would cause me any problems though; it is based on UNIX after all. Recently, however, I have been struggling with an apparent bug in the diveRsity package source code, and it quickly became clear that the users having trouble were all Mac OS X peeps. I couldn't reproduce their problem on my Windows XP, Windows 7 or Ubuntu (Linux) operating systems, so I asked a colleague to test the problem on their Mac (thanks Deirdre!). Sadly, the users weren't imagining it. The problem was mine to solve. I quickly went about trying to install Mac OS X on my PC (which isn't the most straightforward process, by the way), and managed to get it working using a combination of Ubuntu and VirtualBox (instructions can be found here).

After installing Mac OS X I went about the hefty task of finding an ambiguous bug in just shy of 9,000 lines of poorly annotated code (I wrote most of diveRsity late at night when being meticulous was the least of my worries). I finally localised the bug to the chunk of code below:

The code basically ensures that all elements in a given matrix have the same number of characters, making downstream processing with 'substr' easier. For example, if one element in matrix x is 'NA', when passed to the sprintf function with the argument "%06g", it becomes '    NA'. So the sprintf function simply adds 'padding' to the character string (in this case four space/empty characters, making the element a total of six characters in length). Well, that's the way it was supposed to work anyway.
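As a minimal sketch of the intended padding (this is not the diveRsity source; `pad_left` is a hypothetical helper and the example matrix is made up), the same fixed-width result can be built in R without relying on any format flags:

```r
# Hypothetical helper: left-pad every element to a fixed width with spaces.
pad_left <- function(x, width = 6L) {
  dims <- dim(x)                       # remember the matrix shape, if any
  x <- as.character(x)                 # note: this drops the dim attribute
  x[is.na(x)] <- "NA"                  # treat missing values as the literal "NA"
  pad <- strrep(" ", pmax(0L, width - nchar(x)))
  out <- paste0(pad, x)
  dim(out) <- dims                     # restore the shape (NULL leaves a vector)
  out
}

# Made-up example data: two genotypes and two missing values.
x <- matrix(c("101102", "NA", "0304", "NA"), nrow = 2)
padded <- pad_left(x, width = 6L)
```

Because every element now has exactly six characters, a call such as `substr(padded, 1, 3)` picks out the same positions in every cell.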

Apparently, because R's 'sprintf' is just a wrapper for the C-level function of the same name, the arguments "%06g" and "%04g" have undefined behaviour when the value being formatted is not a number (such as NA), and the result is OS dependent. The above example holds true in Windows and Linux (i.e. space/empty character padding), but in Mac OS X, the function results in 'leading zero padding', not leading spaces. This means that where the code downstream was expecting either '    NA' (four spaces and NA) or '  NA' (two spaces and NA), in Mac OS X the actual string was either '0000NA' or '00NA'.
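A quick way to see which behaviour a given machine has is to format a missing value and inspect the result. This is only a sketch; the variable names are made up for illustration, and no expected output is shown because the result depends on the platform:

```r
# Format a missing value with the zero flag and see what this platform does.
probe <- sprintf("%06g", NA_real_)

zero_padded  <- grepl("^0+NA$", probe)   # TRUE where NA is padded with zeros
space_padded <- grepl("^ +NA$", probe)   # TRUE where NA is padded with spaces

print(probe)
```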

This bug resulted in the creation of two completely new alleles at a locus (where missing data should actually be), leading to erroneous calculation of allele frequencies downstream. To solve the problem without having to modify too much code, I simply added a conditional argument to the above code. See below:
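One way such a conditional could look is sketched here. This is not the actual diveRsity fix, just a minimal illustration under the assumption that genotypes are stored as whole numbers with NA for missing data; `format_genotypes` and the example values are hypothetical:

```r
# Hypothetical sketch: zero-pad real genotypes, but always pad NA with spaces.
format_genotypes <- function(x, width = 6L) {
  out <- character(length(x))
  is_missing <- is.na(x)
  # Numbers go through sprintf, where the "0" flag is well defined.
  out[!is_missing] <- sprintf(paste0("%0", width, "d"),
                              as.integer(x[!is_missing]))
  # Missing values are padded by hand, so the result is the same on every OS.
  out[is_missing] <- paste0(strrep(" ", width - 2L), "NA")
  dim(out) <- dim(x)                   # keep the matrix shape, if any
  out
}

# Made-up example data.
genos <- matrix(c(101102, NA, 30405, NA), nrow = 2)
formatted <- format_genotypes(genos)
```

The point of the design is that the platform-dependent case (a non-number combined with the zero flag) never reaches sprintf at all.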

The latest version of diveRsity (v1.5.0) contains a fix for this bug and should be available on CRAN over the next couple of days.

Thanks to Mariah Meek and Andy Jasonowicz for first bringing this problem to my attention, and for all of their helpful information on the problem.

I forgot to mention that a paper introducing the diveRsity package has been accepted for publication in Methods in Ecology and Evolution. Users wishing to cite diveRsity in their own research should use this resource. An 'early view' copy of the manuscript can be downloaded for free from here. R users can also type 'citation("diveRsity")' into the console to see the citation information (incl. bibtex) for the package.


About the authors

Kevin Keenan is currently working towards a PhD in population and evolutionary genetics. His general research interests include small-scale intraspecific genetic divergence, speciation and phylogeography.

Uncle Mick is currently reading genetics at the University of Manchester. His interests include the role of non-coding RNA in regulatory processes, molecular processes associated with sequencing technology, and the interactions between genotype and phenotype.