Genotype imputation is a common and useful practice that allows GWAS researchers to analyze untyped SNPs without the cost of genotyping millions of additional SNPs. In the Services Department at Golden Helix, we often perform imputation on client data, and we have our own software preferences for a variety of reasons. However, other imputation software packages have their own advantages as well. This motivated us to perform some tests to assess certain performance features, such as accuracy and computation time, of a few common imputation software programs.

Study Design
For this comparison, we tested three imputation software packages: BEAGLE, IMPUTE2, and Minimac. With BEAGLE and IMPUTE2, imputation was performed both with and without pre-phasing the sample data. Minimac is an implementation of the MaCH method that utilizes pre-phasing. We did not run MaCH without pre-phasing due to computational constraints. Pre-phasing means phasing the sample data before running imputation, as opposed to phasing it during imputation. It can significantly reduce computation time at the cost of a slight loss in accuracy.

The variables measured include imputation accuracy (concordance rates), imputation quality, computation time, and memory usage. Concordance for each SNP was measured as the number of correctly imputed genotypes (comparing the imputed data against the full dataset) divided by the total number of genotypes, that is, the number of samples. Quality was determined from the per-SNP quality metrics provided by each program. These metrics differ between programs, so the recommended threshold for each metric was applied separately. Computation time was measured by running each program on a 64-bit Linux computer with 16GB of memory.
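To make the concordance definition concrete, here is a minimal Python sketch. It is not the script used for this comparison. The function names and the "A/G" genotype string format are assumptions made for illustration, and genotypes are treated as unordered allele pairs.

```python
from typing import Sequence, Tuple


def normalize(genotype: str) -> Tuple[str, ...]:
    """Treat a genotype such as 'G/A' as an unordered allele pair."""
    return tuple(sorted(genotype.split("/")))


def snp_concordance(imputed: Sequence[str], truth: Sequence[str]) -> float:
    """Matching genotypes divided by the number of samples, for one SNP."""
    if len(imputed) != len(truth) or len(truth) == 0:
        raise ValueError("need one imputed and one true genotype per sample")
    matches = sum(normalize(i) == normalize(t) for i, t in zip(imputed, truth))
    return matches / len(truth)


def mean_concordance(per_snp: Sequence[float]) -> float:
    """Average the per-SNP concordance rates over all SNPs."""
    if len(per_snp) == 0:
        raise ValueError("no SNPs to average")
    return sum(per_snp) / len(per_snp)
```

For example, a SNP where three of four samples are imputed correctly has a concordance of 0.75, and averaging that with a perfectly imputed SNP gives a mean of 0.875.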

The baseline study data included 141 unrelated HapMap samples genotyped on Illumina Omni1, representing the three major HapMap population groups. We imputed these samples based on the 1000 Genomes Phase 1 v3 reference panel. In order to simulate how a researcher would typically perform imputation on their own data, the reference datasets were downloaded directly from each imputation program’s website and were not modified. Each data provider filters the reference data in a slightly different way, so the reference datasets were not identical, even though all were derived from the same original dataset. The sample data was limited to SNPs on chromosome 20.

The following Venn diagram represents the overlap of genomic positions among the three reference datasets and the original 1000 Genomes dataset. Because only unique positions were compared, the total number of rows found in each dataset is slightly more than the number displayed on the diagram, since some variants share a position.

This Venn diagram displays how markers in the reference panels for each imputation program and the original 1000 Genomes data overlap on chromosome 20. Unique lists of genomic positions were compared across datasets. The original dataset and the MaCH reference panels provided genomic positions in VCF files. For the BEAGLE reference panel, genomic positions were taken from the .markers files, and for IMPUTE2, from the legend file.

An interesting point to note about this diagram is the existence of markers in the IMPUTE2 and BEAGLE reference dataset at genomic positions that were not found in the original 1000 Genomes dataset. Upon further investigation, most of these could be attributed to one-off position differences with some indels reported in the 1000 Genomes dataset. This demonstrates how different data processing pipelines handle complex genotype information in slightly different ways. For the same reason, the total number of markers at unique positions differs in each version of the reference dataset.
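The position comparison can be sketched with plain Python sets. This is only an illustration under stated assumptions: the function names are ours, positions are assumed to be already parsed into integers from each file format, and "one-off" is interpreted as a position exactly 1 bp away from a position in the original dataset.

```python
from typing import Dict, Iterable, Set, Tuple


def unique_positions(positions: Iterable[int]) -> Set[int]:
    """Collapse marker positions to a unique set (duplicate positions dropped)."""
    return set(positions)


def compare_to_original(
    panels: Dict[str, Set[int]], original: Set[int]
) -> Tuple[Set[int], Dict[str, Set[int]]]:
    """Return positions shared by every dataset, and each panel's positions
    that are absent from the original dataset."""
    shared_by_all = original.intersection(*panels.values())
    extra = {name: pos - original for name, pos in panels.items()}
    return shared_by_all, extra


def near_original(extra: Set[int], original: Set[int]) -> Set[int]:
    """Positions absent from the original but exactly 1 bp from an original position."""
    return {p for p in extra if (p - 1) in original or (p + 1) in original}
```

Running `near_original` on the extra positions of a panel gives a quick count of how many discrepancies could be explained by a one-base shift, for example from differing indel coordinate conventions.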

Results
Each program outperformed the others in at least one area. Across the metrics measured, IMPUTE2 showed the greatest accuracy and quality, although other programs performed better in other areas.

As expected, pre-phasing the original dataset drastically improved the total compute time. When the data was pre-phased, IMPUTE2 ran the quickest, followed by Minimac, and then BEAGLE. Without pre-phasing, IMPUTE2 was much faster than BEAGLE.

IMPUTE2 also had superior concordance rates, although all software programs performed well in this area. Minimac had the lowest concordance rate at 96.25%.

| Software | Total Compute Time* | Mean SNP Concordance | Total # SNPs | # High Quality SNPs | % High Quality Imputed |
|---|---|---|---|---|---|
| IMPUTE2 | 23 hours | 99.98% | 668,180 | 620,792 | 92.9% |
| BEAGLE | 213 hours | 98.43% | 484,023 | 320,991 | 66.3% |
| IMPUTE2 with Pre-phasing | 8 hours | 99.92% | 668,180 | 297,196 | 44.5% |
| BEAGLE with Pre-phasing | 34 hours | 98.05% | 484,023 | 293,890 | 60.7% |
| Minimac | 18 hours | 96.25% | 667,870 | 450,790 | 67.5% |

*includes all steps required
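The last column of the results is simply the number of high quality SNPs divided by the total number of SNPs for each run. The short Python check below recomputes it from the counts reported above. The dictionary layout and function name are our own choices for illustration.

```python
# (total SNPs, high quality SNPs) as reported in the results above
COUNTS = {
    "IMPUTE2": (668_180, 620_792),
    "BEAGLE": (484_023, 320_991),
    "IMPUTE2 with Pre-phasing": (668_180, 297_196),
    "BEAGLE with Pre-phasing": (484_023, 293_890),
    "Minimac": (667_870, 450_790),
}


def percent_high_quality(total: int, high_quality: int) -> float:
    """Percentage of imputed SNPs that passed the quality threshold, to one decimal."""
    return round(100 * high_quality / total, 1)


PERCENTAGES = {name: percent_high_quality(*counts) for name, counts in COUNTS.items()}
```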

Without pre-phasing, IMPUTE2 had the highest quality imputation, but after pre-phasing, the certainty metric provided in the IMPUTE2 output dropped dramatically (see first figure below). The R^2 accuracy value given by BEAGLE was also lower in the output based on pre-phased data, but the change was not nearly as dramatic (see second figure below).

IMPUTE2 certainty metric using unphased data and using pre-phased data.

BEAGLE R^2 metric using unphased data and using pre-phased data.

An unfortunate side effect of IMPUTE2 was its intensive memory usage. IMPUTE2 used all available RAM (16 GB), making it impossible to perform any other tasks. BEAGLE and Minimac, on the other hand, used far less memory (although they took longer to finish). BEAGLE was run using the “lowmem” option for more efficient memory usage, which also had the effect of increasing runtime.

All of the 141 test samples are also included in the 1000 Genomes reference panel. We recognize that this may bias the accuracy of the results, but it was acceptable for our purposes. The concordance rates represent how well each imputation program was able to reproduce genotypes for samples where the correct answer was already present in the reference panel. The algorithms used in each program may be more or less appropriate for this situation.

Another metric not discussed previously is the availability of documentation. In this category, BEAGLE wins. Not only does it have a nice PDF manual, but we have also had great success in asking the authors specific questions and getting thorough responses in a timely manner.

In summary, choosing the most appropriate imputation program to use depends on the qualities most important to the researcher and the hardware available. An important factor in our testing was that we chose to run the entire length of chromosome 20 in a single batch. The performance of the various tools, particularly with regard to compute time, would likely be quite different had we run the imputation in smaller batches.

Hi Matthew, thanks for the question. The reference population included all of the 1092 samples and was thus of mixed race. The sample data also contained samples from each population represented in 1kG.

Hi Autumn. Excellent work, and of great interest since I have used both IMPUTE2 and minimac for different projects. Is it possible to expand on a few?
1) For each imputation, what was the total number of input genotyped SNPs, and was there a minimum minor allele frequency?
2) What threshold was used for high quality imputation?
3) What were the minor allele frequency characteristics of the ~187k SNPs that are in 1000G but not imputed by any of the programs? Am I right in thinking most of them are rare relative to the input genotypes?

The same baseline Illumina dataset was used as input into each imputation program and this dataset contained approximately 23K SNPs in chromosome 20. The SNPs were filtered by call rate (> 95%) but not minor allele frequency.

A high-quality threshold value of 0.5 was used for Beagle and Minimac (R^2) while a threshold value of 1 was used for Impute2 (certainty).
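As a rough sketch of how these thresholds translate into a high quality count, the Python below applies a per-program cutoff to a list of per-SNP scores. The names are hypothetical, and treating the thresholds as inclusive (score >= threshold) is our assumption here, since a certainty threshold of 1 can only be met by equality.

```python
from typing import Sequence, Tuple

# Thresholds stated above: R^2 for BEAGLE and Minimac, certainty for IMPUTE2
THRESHOLDS = {"BEAGLE": 0.5, "Minimac": 0.5, "IMPUTE2": 1.0}


def high_quality(program: str, scores: Sequence[float]) -> Tuple[int, float]:
    """Return (number of SNPs passing, fraction of SNPs passing).

    Assumes an inclusive comparison: a SNP passes when score >= threshold.
    """
    if len(scores) == 0:
        raise ValueError("no per-SNP scores supplied")
    threshold = THRESHOLDS[program]
    passed = sum(score >= threshold for score in scores)
    return passed, passed / len(scores)
```

Because the two metrics have different ranges, the same numeric cutoff would not mean the same thing for both, which is why each program gets its own entry.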

You’re spot on in regards to the ~187K SNPs in 1kG. All of those SNPs had at most 1 copy of the minor allele. Both Impute2 and Mach remove monomorphic SNPs and singletons from their reference panels while Beagle used a more conservative filter (< 5 copies of minor allele) to create its reference panel.

Thanks for your questions. It is true that those metrics are correlated (Beagle R^2 seems to be as well), but the values have very different ranges. The R^2 values typically range from 0 to 1 while the certainty metric was observed between approximately 0.7 and 1.

The per-SNP concordance rate is essentially the percentage agreement between the imputed and genotyped data over all samples in the Illumina dataset for each SNP. These values were then averaged over the entire imputed dataset to find the mean concordance over all SNPs.

Thanks for your question. For the imputation parameters, I used the recommended parameters or the parameters used in the example documentation for each program. For MaCH phasing and Minimac imputation (http://genome.sph.umich.edu/wiki/Minimac) that means “--rounds 20 --states 200” and “--rounds 5 --states 200” respectively.

Thanks for the suggestion too. I’m running some more tests and I’ll play with this parameter to see if I can improve my results.

When you ran IMPUTE2 prephasing, did you use IMPUTE2's own prephasing, the shapeit program, or the shapeit2 program? We have been running shapeit as the pre-phasing and did not observe this drop in quality.

Very nice piece of work! Just a comment: the difference in imputation quality you observe between the two scenarios using Impute2 is likely due to, I quote: “An important factor in our testing was that we chose to run the entire length of chromosome 20 in a single batch”. Performance of the Impute2 phasing machinery decreases dramatically as the length of the studied region increases. On the website of the authors, it is advised not to go beyond ~5Mb chunks. Impute2 chooses the best conditioning haplotypes locally using Hamming distance: this strategy performs really well when the region is smaller than 5Mb, but very poorly at the whole chromosome scale. Two solutions to avoid this problem: (1) run prephasing with Impute2 in chunks smaller than 5Mb, or (2) run shapeit2 on the whole chromosome (its performance is independent of the length of the region studied). Even your Impute2 results obtained by integrating over uncertainty will be better.
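For readers who want to try the chunked approach suggested above, splitting a chromosome into windows of at most 5 Mb can be sketched as follows. This is an illustrative helper, not part of any imputation program. The function name is hypothetical, and the chromosome 20 length in the usage note (63,025,520 bp, GRCh37) is an assumption to check against your own build.

```python
from typing import List, Tuple


def chunk_intervals(start: int, end: int, max_len: int = 5_000_000) -> List[Tuple[int, int]]:
    """Split the inclusive range [start, end] into consecutive,
    non-overlapping intervals of at most max_len base pairs."""
    if max_len <= 0 or start > end:
        raise ValueError("need max_len > 0 and start <= end")
    intervals = []
    lower = start
    while lower <= end:
        upper = min(lower + max_len - 1, end)
        intervals.append((lower, upper))
        lower = upper + 1
    return intervals
```

Calling `chunk_intervals(1, 63_025_520)` yields 13 intervals, each of which could then be passed to a separate pre-phasing or imputation run. Real analyses often add a buffer region around each chunk, which this sketch leaves out.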

There’s a problem with this analysis: the HapMap samples are all part of 1000 Genomes, so you’re trying to impute samples that have a perfect match in the reference panel. I think this explains why the results for IMPUTE2 are impossibly good — 93% of all SNPs imputed with a quality of 1. The results for Beagle and Minimac are closer to what I would expect; I guess these algorithms are less able to exploit very long-range matches between the test data and the reference panel. But all the results will be biased due to not using an out-of-sample test set.