Under favourable conditions, DNA can survive in tissue remainsfor several millennia and in some cases over 100 000 years (1). Over such long time periods, DNA is invariably affectedby degradation and modifications. A ubiquitous feature of ancientDNA is the presence of miscoding lesions that causes incorrectnucleotides to be incorporated during DNA amplification. Theidentities and underlying biochemical processes of nucleotidemisincorporations in ancient DNA have been the subject of somedebate (2–5). However, it has now been established (4,6,7) that the vast majority of damage-derived errors in ancientDNA sequences are caused by hydrolytic deamination of cytosineto uracil or possibly hydroxyuracil, which leads to apparentCT or GA substitutions in the DNA sequences determined.

Uracil–DNA glycosylase (UDG), which removes uracil residuesfrom DNA to leave abasic sites (8), has been shown to reduceC/GT/A misincorporations from ancient DNA (4). However, abasicsites prevent replication by Taq polymerase used in the polymerasechain reaction (PCR) (9). Consequently, the use of UDG has notbeen widely adopted in ancient DNA studies since template moleculesin ancient DNA extracts are often limited in number such thatUDG treatment may destroy all amplifiable templates availablein an extract (4). This is particularly the case when directPCR is used to retrieve DNA sequences since comparatively longtemplates are needed to accommodate two primers; furthermore,investigators often target the longest possible templates byPCR in order to reduce work and costs. UDG treatment is thenparticularly likely to be detrimental since longer templatemolecules are especially rare in ancient DNA (10) and are morelikely than short molecules to contain at least one uracil residue.

High-throughput direct sequencing (11,12), which sequences DNAmolecules of all lengths, has opened new possibilities to analyzeancient DNA (13–15). In this approach, the ends of ancientDNA fragments are made amenable to ligation by treatment withT4 DNA polymerase and T4 polynucleotide kinase (PNK). Subsequently,DNA adaptors are ligated to the fragment ends and then usedto amplify individual molecules and initiate sequencing reactionson a highly parallelized platform. In theory, UDG treatmentwould be less harmful with this method than with direct PCR,since most ancient DNA molecules are very short and so are lesslikely to contain uracil than the few longer molecules accessibleto direct PCR. However, direct sequencing of ancient DNA moleculesgenerates many C/GT/A errors at fragment ends (6). These misincorporationsprobably derive from single-stranded overhangs of a few basesin the ancient DNA (6,7), where cytosine deamination is muchfaster than in double-stranded DNA (16). Thus, in some extracts,up to 60% of all endogenous DNA fragments are estimated to containat least one uracil (Supplementary Table S1) and would be excludedfrom sequencing if the template was treated with UDG. It wouldtherefore be valuable if DNA fragments could be repaired followingthe removal of uracils by UDG.

Here, we demonstrate that a simple modification to high-throughputsequencing library preparation removes uracil residues fromancient DNA and subsequently repairs the DNA fragments, greatlyincreasing the accuracy of the DNA sequences determined whilemaintaining DNA sequence yield from precious DNA sources. Inaddition, we show that remaining C/GT/A misincorporations aredue to in vivo methylation of cytosine in Neandertal DNA, demonstratingthe survival of this epigenetic modification in DNA over 38000 years old and thus potentially allowing its study in ancientorganisms.

Figure 1. Predicted activity of UDG and endoVIII in 454 library preparation of ancient DNA. T4 PNK phosphorylates 5′-ends leaving 5′-phosphate groups. UDG removes uracils, which are concentrated in short 5′- and 3′-overhangs in ancient DNA, leaving abasic sites. EndoVIII then cleaves on both sides of the abasic sites, leaving 5′- and 3′-phosphate groups. T4 polymerase fills in remaining 5′-overhangs and chews back 3′-overhangs, possibly aided by the 3′-phosphatase activity of PNK. Blunt-end ligation and fill-in of sequencing adaptors can then take place.

Figure 2. Demonstration of UDG/endoVIII repair on synthetic oligonucleotides. 1 µg of each of four synthetic double-stranded oligos A-D (top) was subjected to 454 adaptor ligation after enzymatic repair in four conditions: (1) Incubation with PNK followed by addition of T4 polymerase; (2) PNK and UDG followed by T4 polymerase; (3) PNK, UDG and endoVIII followed by T4 polymerase; (4) UDG and endoVIII followed by T4 polymerase (i.e. no PNK). Products were first visualized on agarose gels (middle). The first lane on each gel after the ladder is the untreated oligo. Major bands in the other lanes correspond to the unligated oligos (62–67 bp), the oligos plus one 44-base adaptor and the oligos plus two 44-base adaptors. For some products higher weight bands are visible that probably indicate end-to-end chimeras of the oligos and adaptors. The cause of the faint, diffuse bands seen between 150 and 200 bp in the untreated oligos are unknown but may be artifacts of oligo synthesis. Ligated products were also quantified by qPCR without (dark brown) or with (light brown) prior incubation with UDG. The products marked a and b were sequenced directly on the 454 platform (bottom).

Adaptor fill-in
Each purified ligation product of 12 µl was added to a40 µl adaptor fill-in reaction containing 1x Thermopolbuffer (New England Biolabs), 300 µM each of dATP, dCTP,dGTP and dTTP, and 16 U Bst DNA polymerase (New England Biolabs).The reaction was incubated for 30 min at 37°C. Note thatunlike previous studies (17) we performed fill-in in solutionand did not use streptavidin beads, as they are unnecessary(18) and likely to cause some loss of material.

Product visualization, quantification and sequencing
Twenty microliters of products were visualized on a 2.5% agarosegel (Figure 2). In order to quantify ligation success and tomeasure the extent to which uracils had been successfully removedfrom the templates, all products were quantified by quantitative(q) PCR using 454 adaptor primers, in the presence or absenceof 1 U UDG. Apart from the addition of UDG and a 30 min, 37°Cincubation step at the start of the qPCR, conditions for theqPCR were exactly as described (19). Two of the products (Figure 2 and Supplementary Figure S2) were subjected to emulsion PCRand sequenced on a 16th lane of the 454/Roche FLX platform accordingto the manufacturer’s instructions.

Mammoth DNA
DNA was extracted from 280 mg of a 43 000-year-old mammoth bonefrom Siberia (20) as described (21). The DNA extract was preparedfor 454 adaptor ligation and sequencing under different conditionsas follows.

Ligation
All purified products of the standard blunt end repair, no-CIP-treatmentdamage repair and CIP-treatment damage repair reactions weresubjected to the same 454 adaptor ligation and fill-in procedureas follows: 12 µl of each purified repair product wasmixed with 1 µl 454 A and B adaptor mix (12), diluted1:10 relative to the manufacturer’s instructions due tothe low amounts of template DNA in this experiment. The mixturewas added to 27 µl ligation mix, making a 40 µlreaction containing 1x NEB Quick Ligation buffer and 1 µlNEB Quick ligase. The mixture was incubated at 25°C for15 min, after which products were purified with QIAGEN MinElutecolumns and eluted in 15 µl buffer EB.

Adaptor fill-in
Each purified ligation product of 12 µl was added to a40 µl adaptor fill-in reaction containing 1x NEB Thermopolbuffer, 300 µM each of dATP, dCTP, dGTP and dTTP, and16 U Bst DNA polymerase (New England Biolabs). The reactionwas incubated for 30 min at 37°C followed by 20 min at 80°Cto inactivate the Bst polymerase. As with the synthetic oligonucleotideexperiment described above, streptavidin beads were not usedin this step.

Quantification and sequencing
Each product of 1 µl was quantified directly using qPCRas described (19). Products were then subjected to emPCR at2 copies per bead, and each sequenced on one 16th lane of the454/Roche FLX platform. Sequences were aligned to the draftelephant genome (Loxafr2.0, AAGU00000000.2/GI:202071911) usinga custom mapper, ANFO (22) (freely available for download athttp://bioinf.eva.mpg.de/anfo/).

Neandertal DNA
DNA was extracted from 150 mg of a 38 000-year-old Neandertalbone from Vindija cave, Croatia (23) as described (21). Thirty-twomicroliters of the 100 µl total extract was split equallyinto four 50 µl reactions containing 1x NE Buffer 2, 1mM ATP, 0.1 mg ml–1 BSA, 250 uM each of dATP, dCTP, dGTPand dTTP, 20 U PNK, and either (1) no repair enzyme or (2) 5U UDG and 20 U endoVIII. Reactions were incubated for 3 h at37°C, before 6 U T4 DNA polymerase was added, followed by30 min at 25°C. Products were purified with QIAGEN MinElutecolumns and eluted in 14 µl EB. Ligation and fill-in wereperformed as described in (17), including use of a modified454 adaptor containing a Neandertal project-specific key (6).Note that in this experiment, streptavidin beads were used inthe fill-in step unlike in the synthetic oligonucleotide andmammoth DNA experiments; this was because we had not switchedat that point to the more efficient no-beads protocol (18).To make use of the greater parallel sequencing capacity of theIllumina Genome Analyzers (GA) platform relative to 454/Roche,the library was first universally amplified for 5 PCR cyclesusing the protocol described in (24). The amplified productwas subsequently converted into an Illumina-amenable libraryby a 7 cycle re-amplification under similar conditions exceptthat tailed PCR primers were used that attach the Illumina P5and P7 grafting sequences outside the 454 adaptor sequences(sequences available on request). This allowed subsequent bridgeamplification and 2 x 51 bp paired end sequencing runs to beperformed on the GAII platform, following to the manufacturer’sinstructions except that sequencing primers were used that annealto the 454 adaptor sequences. The sequencing run was analyzedstarting from raw images using the Illumina GA pipeline 1.3.2.To overcome issues introduced by identical key sequences atthe beginning of the first read, for cluster identificationthe first five sequencing cycles were used. The Ibis basecallingprogram was used (25). Raw sequences for the two paired endreads of each sequencing cluster were merged by checking fora minimum 11nt overlap between the first and the second read.For bases in the overlapping sequence, a consensus was calledby considering the base with the higher quality score or, incase of agreement, summing up the quality scores observed. Forfurther analysis, only successfully merged sequences were consideredand aligned to the human genome (NCBI 36.1/hg18), all CpG islandsannotated by UCSC (http://genome.ucsc.edu/cgi-bin/hgTables;table: cpgIslandExt) and the mtDNA sequence of this individual(AM948965) using a custom mapper, ANFO (22). Custom scriptswere used to count alignment mismatch frequencies.

A UDG/endoVIII repair scheme
In library preparation protocols for high-throughput sequencingof ancient DNA, PNK and T4 DNA polymerase are used to generateends amenable to DNA ligation (17). This is achieved by phosphorylationof non-phosphorylated 5′-ends by PNK, and removal of 3′-overhangsand fill-in of 5′-overhangs by T4 polymerase, producing bluntended, 5′-phosphorylated molecules. A large proportion of nucleotidemisincorporations generated from ancient DNA libraries are causedby uracils present in short 5′-overhangs in ancient DNA fragments,which are filled in during this end-repair reaction (6). Wereasoned that if DNA is incubated with UDG and endoVIII priorto T4 DNA polymerase treatment, this would result in the removalof uracil residues from DNA by UDG, and cleavage on the 5′-and 3′-sides of the resulting abasic sites by endoVIII. PNKand T4 polymerase would then remove 3′-phosphate groups andgenerate blunt ends (Figure 1), thus avoiding the loss of thesemolecules from the library.

Repair of oligonucleotides
In order to test the ability of UDG and endoVIII to repair ancientDNA, we designed four double-stranded oligonucleotides thatcarry short overhanging ends (Figure 2). Whereas oligos A andB contain exclusively the four natural DNA bases, C and D carryuracil bases in their overhanging ends as predicted to frequentlyoccur in ancient DNA (6). Oligos A and C carry 5′-overhangswhereas oligos B and D carry 3′-overhangs. Oligo C carries auracil base in the second position of the 5′-overhangs and istherefore the type of fragment that the UDG/endoVIII protocolis designed to repair. Oligo D carries a uracil base in thesecond position of the 3′-overhangs. In 3′-overhangs, deaminatedcytosine should not lead to nucleotide misincorporations inancient DNA sequences because T4 polymerase removes 3′-overhangsbefore adaptor ligation. However, we wanted to investigate ifT4 polymerase activity may be blocked by the abasic sites leftafter UDG treatment or the 3′-phosphates left by endoVIII treatment(26) (Figure 1).

The three oligonucleotides were treated in four different ways:(i) With PNK to phosphorylate 5′-ends, followed by T4 polymeraseto generate blunt ends; (ii) as 1 with the addition of UDG togetherwith PNK; (iii) as 2 with the addition of endoVIII with theUDG and PNK and (iv) as 3 except that PNK was not used. Subsequently,454 sequencing adaptors were ligated to the oligos. The ligationreactions were visualized on ethidium-stained agarose gels,and analyzed by quantitative (q) PCR with primers specific forthe 454 adaptors, either with or without prior treatment withUDG to gauge the amount of uracil residues in the ligation product(Figure 2).

For oligos A and B, which contain no uracils, gel and qPCR resultsshow that adaptor ligation was similarly efficient for treatments1–3 (Figure 2). This indicates that UDG and endoVIII donot interfere with the generation of blunt ends in non-damagedDNA. The observed level of between-treatment variation is expectedgiven that two spin column purification steps were performedduring the procedure.

Oligo C, which contains uracil residues in 5′-overhanging ends,showed high ligation efficiency following treatment 1. However,the ligated product showed greatly reduced copy number in qPCRif treated with UDG prior to amplification. This is expectedgiven that the ligated product should contain one uracil oneach strand. Ligation efficiency was very low after treatment2, consistent with UDG creating abasic sites and preventingT4 polymerase fill-in of ends. After treatment 3, ligation efficiencywas high, and in this case the product was insensitive to UDGtreatment prior to qPCR. This indicates that the uracils hadbeen removed prior to ligation, as predicted by the scheme inFigure 1.

Oligo D, which contains uracil residues in 3′-overhanging ends,showed a pattern similar to oligos A, B and C in that ligationefficiency was high after treatments 1–3. This indicatesthat uracil-containing 3′-overhangs can be successfully removedby T4 polymerase after UDG and endoVIII treatment. The productswere largely insensitive to UDG treatment prior to qPCR, asexpected since uracils in 3′-overhangs should be removed beforeligation. The apparent 40% sensitivity to UDG after treatment1 is surprising, but may be due to uracil misincorporation duringmanufacture of this oligo. Since oligos are synthesized in the3′- to 5′-direction, uracil was added in the first couplingstep and carry over to later synthesis steps would lead to UDGsensitivity of the ligated product.

When PNK is omitted (treatment 4), only oligo C shows a full-lengthligation product. This demonstrates that the UDG/endoVIII repairreaction generates ligatable phosphorylated 5′-ends after removalof the uracils and cleavage of phosphodiester bonds by endoVIII(Figure 1).

To confirm that UDG/endoVIII repair of oligo C removes uracil-derivederrors from DNA sequences as predicted (Figure 1), we sequencedoligo C products of treatments 1 and 3. As expected, after treatment1, the oligo C product is full length and carries nucleotidemisincorporations where uracil residues have been replaced bythymine residues. In contrast, after treatment 3, the oligoC product has been shortened by two bases at both ends, removingthe uracil positions from the sequence (Figure 2).

Repair of mammoth DNA
We next tested the UDG/endoVIII protocol on DNA extracted froma 43 000-year-old woolly mammoth bone. The extract was subjectedto various repair reactions in duplicate, followed by 454 adaptorligation, qPCR quantification and determination of between 783and 20 935 sequences per sample on the 454 platform. The resultswere analyzed with respect to numbers of molecules in the 454libraries generated as well as the percent of sequences in thelibraries that show similarity to the elephant genome and arethus assumed to be of mammoth origin. In most ancient remains,only a small proportion of sequences retrieved are from theorganism under study. For this extract, it is 3%, the rest presumablystemming from microorganisms that have colonized the sampleafter the death of the individual. If the endogenous mammothDNA carried more DNA modifications than the environmental DNA,the relative amount of mammoth DNA might be changed by the repairreactions. Other relevant aspects such as the length of themammoth sequences and the frequency and type of nucleotide misincorporationswere also analyzed.

One issue that potentially complicates the application of theUDG/endoVIII protocol to ancient DNA extracts is that DNA modificationsother than deaminated cytosines may be present (27,28). Therefore,we first tested if mammoth DNA retrieval was affected by thelonger incubation (3 h) at a higher temperature (37°C) thatdiffers from the standard library preparation (30 min at 25°C)and thus could conceivably cause DNA degradation by affectingsome unknown modifications. Neither in terms of numbers of moleculesin the resultant 454 library, the percent of mammoth sequences,the length of mammoth sequences or patterns of misincorporationsdid the incubation conditions strongly influence the results(Supplementary Figure S3).

We next tested the three treatments previously performed onthe oligonucleotides, i.e. (i) PNK + T4 polymerase; (ii) PNK+ T4 polymerase + UDG; (iii) PNK + T4 polymerase + UDG + endoVIII.In a second series of experiments, the mammoth DNA extract wasdephosphorylated with CIP, purified, and then subjected to thesame three treatments except that PNK was never used. Dephosphorylationshould prevent the DNA from successful 454 adaptor ligationexcept in cases where UDG and endoVIII generates phosphorylated5′-ends where uracil bases exist close to the 5′-ends (c.f.oligo B in Figure 2). Thus, these experiments tested if endoVIIIis active on the mammoth DNA.

Total library yield, which includes mammoth and microbial sequences,is reduced by 30% upon UDG treatment (Figure 3a), consistentwith uracil removal from damaged DNA fragments and blockageof T4 polymerase fill-in or Taq polymerase amplification bythe resulting abasic sites. When endoVIII is added, librarycopy numbers increase by 40% relative to the standard condition.This is unexpected since endoVIII is expected to rescue moleculescarrying abasic sites generated by UDG but not to increase copynumbers relative to the standard treatment. We repeated thisexperiment with more replicates per condition, and again observeda reduction in yield upon UDG treatment but an increase in yieldafter UDG/endoVIII treatment only to approximately the no repaircondition (data not shown). We conclude that UDG reduces copynumber of a mammoth DNA library relative to the no repair condition,while UDG together with endoVIII does not, suggesting the repairof uracil-containing DNA fragments. None of the treatments differedin the percent of mammoth DNA sequence, suggesting that uracilor any other unknown modifications that might be affected bythe treatments did not differ substantially in frequency betweenthe mammoth DNA and the environmental DNA.

Figure 3. The UDG/endoVIII repair protocol applied to mammoth DNA. (a) A mammoth DNA extract was subjected to 454 adaptor ligation after enzymatic repair in replicates in three conditions: (1) Incubation with PNK followed by T4 polymerase; (2) PNK and UDG followed by T4 polymerase; (3) PNK, UDG and endoVIII followed by T4 polymerase. The resulting libraries were quantified by qPCR. The same reactions without PNK were applied to mammoth DNA that had first been dephosphorylated by CIP (treatments 4–6). (b) All libraries were sequenced on 454. The proportions of sequences that aligned to the elephant genome are shown. (c) For the sequences resulting from treatments 1–3 and 6 the base composition of aligned elephant sequences is shown for 10 bases either side of the 5′-start point of mammoth sequences.

When the DNA was dephosphorylated prior to the experiments,a small amount of amplifiable material was present after treatments1 and 2, probably representing both incomplete dephosphorylationof naturally phosphorylated 5′-ends by CIP and 454 adaptor artifacts.This is supported by the observation that only a very smallproportion of the sequences determined after this treatmentare of mammoth origin (Figure 3b). In contrast, treatment 3yielded 3.5 times more amplifiable library molecules than treatments1 and 2, consistent with direct selection of DNA fragments thathad uracil near the 5′-ends of both strands. Notably, aftertreatment 3, the proportion of mammoth sequences was 10–12%,four times higher than in the libraries made from non-CIP treatedmammoth DNA in the presence of PNK. This suggests that the mammothDNA carried more uracil residues on average than the environmentalDNA, although this is probably a modest difference because selectingfor damage at both fragment ends will multiply the effect ofany difference in damage between sequence populations. Thus,the observed 4-fold enrichment of mammoth DNA suggests a maximum2-fold difference between the fraction of mammoth DNA strandsand environmental DNA strands that carry uracil residues closeto their 5′-ends. This is a small enough effect to be compatiblewith the fact that no obvious differences in mammoth DNA percentagewere observed among the non-CIP-treated libraries where PNKwas used.

The 5′-start points of the mammoth sequences represent the terminalbases of the ancient DNA fragments at the point of adaptor ligation(6). We plotted the base composition of the elephant referencesequence aligned to the mammoth sequences for 10 bases on eitherside of the mammoth 5′-ends. UDG/endoVIII treatment should leadto an increase in cytosine frequency at the base position immediatelypreceding 5′-ends (i.e. the –1 position in Figure 3c),as DNA fragments containing deaminated cytosine will be truncatedto the 3′-side of such positions (Figure 1). Consistent withthis prediction, we observed in the non-CIP-treated librariesan increased frequency of cytosine at position –1 in theUDG/endoVIII condition relative to the no-repair and UDG-treatedconditions. This effect was much stronger in the CIP-treated,UDG/endoVIII repaired sample, where 84% of 5′-ends are precededby cytosines. This is predicted since this library should predominantlycontain fragments where ligatable ends were generated by excisionand repair at positions of deaminated cytosine.

The average length of mammoth sequences is around 70–80nt (Supplementary Figure S4), as is typical of ancient DNA (29). After UDG treatment, fragments are 10 nt shorter. Thisis expected since longer fragments are more likely to containuracil and thus to be left with an abasic site after UDG treatment.When endoVIII is added, length seems to be intermediate, perhapsrepresenting rescue of some longer fragments by endoVIII treatment.If the DNA is dephosphorylated before UDG and endoVIII treatment,length is 10 nt shorter than after UDG and endoVIII treatmentwithout CIP-treatment, as expected if all of the ligated fragmentsin this condition have been truncated at both ends due to UDG/endoVIIIrepair, as opposed to the non-CIP-treated libraries where onlya fraction of the fragments will have been truncated.

We investigated the effect of the repair protocol on nucleotidemisincorporations, by plotting for each library the frequencyof each of the 12 possible mismatches observed in the mammoth-to-elephantalignments. Ancient DNA sequences generally show an excess ofCT and GA substitutions due to uracil-derived nucleotide misincorporations(4,6,7). As expected, this pattern is seen when the sample isnot subjected to UDG treatment (Supplementary Figure S4). Bytaking the excess of C/GT/A substitutions relative to T/AC/Gsubstitutions as an estimate of the amount of uracil-derivedmisincorporations in each sample, we found that uracil-derivedmisincorporations were reduced 4- to 10-fold in samples whereUDG was used (Supplementary Figure S4).

In summary, UDG–endoVIII treatment allows more mammothDNA sequences to be retrieved than UDG treatment alone, whilegenerating much lower rates of nucleotide misincorporationsthan if UDG is not used.

Repair of Neandertal DNA
Mammoth and African elephant DNA diverged from each other morethan 7 million years ago (30,31). Therefore genuine sequencedifferences will dominate in pairwise alignments, reducing theability to study patterns of nucleotide misincorporations. Incontrast, Neandertals and humans diverged much more recently.Furthermore, in contrast to the elephant genome the human genomeis of finished quality allowing better resolution of nucleotidemisincorporations. We therefore applied the UDG/endoVIII repairprotocol to DNA extracted from a 38 000-year-old Neandertalbone from Croatia and performed deep sequencing of the libraries.An additional advantage of this specimen is that its completemitochondrial (mt) sequence is known (23), allowing detailedanalysis of sequence errors in mtDNA. This may be particularlyinteresting as mtDNA does not carry 5-methylcytosine (5′-m-C),a naturally occurring modification in vertebrate nuclear DNAwhich when deaminated is converted to thymine (32). Since UDGwill not remove thymine from DNA, deamination-derived misincorporationsmay still occur at sites of cytosine methylation after UDG treatmentof the ancient DNA.

Between 10 and 11 million raw DNA sequence reads were generatedon the Illumina GAII platform from four Neandertal DNA librariesprepared in the standard condition or with the UDG/endoVIIIprotocol (each in two replicates). In order to improve sequencingaccuracy, which decreases substantially toward the 3′-end ofIllumina reads (25), we sequenced into both ends of each moleculesand merged the paired reads, requiring overlaps of at least11 bases, and discarded all sequences that could not be mergedto their paired read. Since the single read length was 48 bases,this creates a maximum read length of 85 bp. However, previouswork has shown that the majority of Neandertal fragments inthis bone are shorter than this (33).

Sequences were aligned to the previously published completemtDNA sequence of this Neandertal (23), and to the human nucleargenome (hg18) using a custom mapping program (ANFO) (22). Ineach library, 0.007–0.009% of sequences aligned well tothe mtDNA and 2.0–2.2% to the nuclear genome. Variationin these proportions was as high within as between treatments.Data from replicates were then pooled for each treatment andpatterns of nucleotide mismatches between the sequences andthe references were analyzed.

When analyzing Neandertal DNA sequences, contamination of experimentswith contemporary human DNA is a potential problem (10,34).However, the level of such contamination in a Neandertal DNAlibrary can be assessed by counting the ratio of Neandertalversus contaminant fragments at nucleotide positions where Neandertalsdiffer from all or almost all present-day humans (33). The mtDNAof this Neandertal carries 133 such diagnostic positions (23).The ‘no repair’ dataset yielded 139 mtDNA fragmentsthat overlapped such positions; 138 carried the Neandertal basewhile one matched modern human mtDNA. The UDG/endoVIII treateddataset yielded 128 informative fragments, of which all werethe Neandertal type. Thus, the mtDNA in all libraries was almostcompletely free of contamination by modern human mtDNA, evenafter treatment with UDG and endoVIII. Since the ratio of mitochondrialto nuclear DNA may differ between the contaminating and theNeandertal DNA, this estimate is strictly applicable only tothe mtDNA (33). However, the estimate of mtDNA contaminationin these libraries is low enough that within even a few-foldvariation in mtDNA:nuclear DNA ratios between the Neandertaland contaminating DNA, sequences aligning to the human nucleargenome will be predominantly of Neandertal origin.

To analyze patterns of nucleotide misincorporations, we plottedthe frequency of each of the 12 possible nucleotide differencesin alignments as a function of distance from the ends of DNAfragments for each dataset (Figure 4, top panels). In the ‘norepair’ results, the transitions CT and GA are drasticallyelevated toward the 5′- and 3′-ends ends of fragments, respectively,with up to 40–50% error frequencies at the fragment ends.This occurs in both mtDNA and nuclear DNA fragments, as shownpreviously (6,23). The GA substitutions at 3′-ends of fragmentsoriginate during the fill-in of 5′-overhangs containing deaminatedcytosines (6,7). After UDG/endoVIII treatment C/GT/A differenceswere drastically reduced, consistent with the removal of uracilsby UDG. In mtDNA this removal seems to be almost complete, whereasa small amount remains in nuclear DNA (see below).

Figure 4. Effect of UDG/endoVIII treatment on all nucleotide misincorporation rates and on CT rates at CpN dinucleotides along Neandertal DNA sequences. DNA from a Neandertal bone was subjected to 454 adaptor ligation after enzymatic repair with PNK and T4 polymerase (‘no repair’) as well as those enzymes plus UDG and endoVIII. Approximately 12 million sequences were generated on the Illumina GAII platform for each condition, and sequences were identified that aligned best to the complete mtDNA sequence of this bone AM948969 [GenBank] (947 and 1084 sequences for the no repair and UDG/endoVIII conditions, respectively, left panels) and to the human nuclear genome sequence hg18 (242 070 and 286 737 sequences, right panels). (a) All 12-nt mismatch frequencies are plotted as a function of position along the aligned fragments for the no repair and UDG/endoVIII conditions. (b) The rate of CT misincorporations along DNA fragments is shown separately for the four dinucleotides that contain cytosine in the 5′-position.

Neandertal CpG methylation
In vertebrates, DNA methylation occurs at the 5′-position ofcytosine residues at 80% of CpG dinucleotides in somatic tissues(35). When 5′-m-C is deaminated it becomes a thymine residuethat is not recognized as an unnatural base by UDG in vertebratecells. As a consequence, CpG dinucleotides are preferentiallylost from vertebrate genomes with the result that the frequencywith which CpG dinucleotides occur genome wide is 1%, while4–5% would be expected from the base composition of thegenome (36). In some regions of the genome, CpG content is lessreduced and can be 60% or more of the statistically expectedfrequency. Such ‘CpG islands’ represent areas wheremethylation is reduced or absent. About half of CpG islandsoverlap transcriptional start sites where their methylationin a tissue is positively correlated with suppression of transcription.

We noted that after UDG/endoVIII treatment 5% of C/GT/A substitutionsremained at the first and last positions of Neandertal DNA fragmentsin nuclear DNA whereas this did not seem to be the case in themtDNA (Figure 4). Since cytosine methylation does not occurin mtDNA we speculated that this might be due to cytosine methylationat CpG dinucleotides. We therefore analyzed the rates of CTmismatches seen along the DNA molecules separately for the fourdinucleotides that carry cytosine at their first position (Figure 4, lower panels). Without UDG/endoVIII treatment, CT differencesto the human nuclear genome are common in all four dinucleotidecontexts, but slightly more common in CpG dinucleotides (e.g.52.4% at position 1 of DNA fragments) than in the other dinucleotides(38%–40% at position 1). This is likely due to the higherdeamination susceptibility of 5′-m-C relative to unmethylatedcytosine (32). With UDG/endoVIII treatment, CT substitutionsin the three non-CpG dinucleotides are greatly reduced (to 2–5%at position 1 of DNA fragments), whereas they remain abundantin CpG dinucleotides (40.6% at position 1 of DNA fragments).Thus, at CpG sites, 80% of CT substitutions are resistant toUDG/endoVIII treatment, consistent with findings that 80% ofCpG sites are methylated (35). At CpG sites in mtDNA, no suchresistance of substitutions to UDG/endoVIII treatment is seen.This is consistent with the absence of CpG methylation in mitochondria.

In order to further investigate if the UDG/endoVIII protocolcan be used to detect CpG methylation present in the Neandertalindividual when alive, we analyzed CpGTpG mismatch rates inCpG islands, where methylation is reduced relative to the genome-wideaverage (37) (Figure 5). As described, across the genome 52.4%of CpGs at 5′-ends of fragments are read as TpG without UDG/endoVIIItreatment, and this is reduced to 40.6% upon UDG/endoVIII treatment.At CpG islands, 51.7% of 5′-terminal CpGs are read as TpG withoutUDG/endoVIII treatment and this is reduced to 23.4% with UDG/endoVIIItreatment. Thus, deaminated CpG sites in CpG islands are moresusceptible to UDG/endoVIII treatment than deaminated CpG siteselsewhere in the genome. This is consistent with a reduced frequencyof cytosine methylation in CpG islands. Together with the observationthat no resistance to UDG treatment is observed at deaminatedCpG sites in mtDNA, we conclude that in vivo Neandertal DNAmethylation patterns have survived over 38 000 years in thebone, and can be detected by the UDG/endoVIII protocol.

Figure 5. CpGTpG misincorporation rate along Neandertal DNA molecules, genome-wide and inside CpG islands. DNA from a Neandertal bone was subjected to 454 adaptor ligation after enzymatic repair with PNK and T4 polymerase (‘no repair’) as well as those enzymes plus UDG and endoVIII. Approximately 12 million sequences were generated on the Illumina GAII platform for each condition, and sequences were identified that aligned best to the human nuclear genome sequence hg18 (left panels) or specifically to CpG islands in the human genome (right panels) (35). The rates of CpGTpG mismatches are shown for the first 30 bases of DNA fragments.

Neandertal DNA sequence accuracy
Although UDG/endoVIII treatment of ancient DNA clearly reducesnucleotide misincorporations, a remaining limitation to obtainingmaximal sequence accuracy from ancient DNA molecules is thehigh rate of machine sequencing errors of current high-throughputplatforms. The ‘merged-paired-end’ sequencing approachused here helps to decrease machine errors by requiring thatthe 3′-ends of fragments, where errors are most common (25),are sequenced in two directions. However, near-elimination ofmachine sequencing errors should be possible if libraries arefirst amplified (24,29), as they were here, and sequenced deeplyenough so that multiple copies of each original molecule aresequenced, allowing a consensus to be generated from such replicatesequences, which can be identified by their strand orientation,start and end points (6,13).

To investigate the accuracy, that can be achieved when combininga deep sequencing approach with UDG/endoVIII repair, we determinedthe overall rates of each Neandertal nucleotide misassignmentin (i) all sequences from the ‘no-repair’ dataset;(ii) all sequences from the UDG/endoVIII dataset; and (iii)all ‘multi-pass’ sequences from the UDG/endoVIIIdataset, which we generated by taking consensus sequences ofall distinct fragments that were sequenced more than once. Figure 6 shows that for mitochondrial sequences, overall error rateper base (Figure 6) was 2.20% for ‘no repair’ sequences;0.40% for UDG/endoVIII sequences and 0.09% for multi-pass UDG/endoVIII-treatedsequences. Thus, UDG/endoVIII treatment alone results in a 5.5-foldreduction in error rates while deep sequencing results in anadditional 4.4-fold reduction. In combination this results ina 22-fold reduction in errors. In nuclear DNA, for which weremoved CpG sites from the analysis due to the effect of methylationdescribed above, a similar pattern is observed (Figure 6) althoughthe true error rate at these low levels cannot be accuratelycalculated due to genuine sequence differences between thisNeandertal and the human reference.

Figure 6. Effects of UDG/endoVIII treatment and multiple-pass sequencing on ancient DNA sequence error rate. DNA from a Neandertal bone was subjected to 454 adaptor ligation after enzymatic repair with PNK and T4 polymerase (no repair) as well as those enzymes plus UDG and endoVIII. Approximately 12 million sequences were generated on the Illumina GAII platform for each condition, and sequences were aligned to the complete mtDNA sequence of this bone and to the human nuclear genome (hg18). Overall rates of each alignment mismatch type, and total mismatch rates, are shown for sequences from each condition, plus the result achieved when only fragments that were sequenced more than once (‘multi-pass’) are considered. For nuclear DNA, positions of CpG in the reference were excluded from the analysis since misincorporations remain frequent at these sites even after UDG/endoVIII treatment (see Figure 4).

The damaged state of DNA recovered from ancient remains causestwo major problems for the determination of ancient DNA sequences.First, the DNA quantities recovered are often extremely low.Second, chemical modification of bases, in particular deaminationof cytosine to uracil, leads to frequent sequence errors. Becauseeven in highly damaged DNA, only a minority of cytosines aredeaminated (4,6,7), and deaminated cytosines are largely randomlydistributed in ancient DNA sequences (4), deamination-derivederrors can usually be excluded at positions where three or moreoverlapping molecules are sequenced, and a consensus is used(4,38). However, in sequencing libraries made from ancient DNA,sequence coverage is often so low that deamination-derived errorsrepresent a problem.

When nuclear DNA sequences in ancient diploid organisms aredetermined, a further problem is caused by the need to accuratelydetermine both alleles at heterozygous positions, or to selectone correctly determined allele in an unbiased way. When sequencingancient nuclear DNA without removing uracil, it will be hardto distinguish true C/T or G/A heterozygote sites from sitesof cytosine deamination. Since homozygote sites greatly outnumberheterozygote sites [1000:1 in present-day humans (39)], eventswhere multiple damaged and undamaged templates are sequencedat a true homozygote site may be more common than the occurrenceof true heterozygote sites. This could create a high proportionof false-positive heterozygote calls. If just one allele isrequired to be accurately called, errors could in theory beavoided by always taking the C or G allele at apparently C/T-or G/A-heterozygote sites. However, this will introduce a severebias in population genetic analyses by systematically callingonly one possible allele at heterozygote sites. Thus, it wouldbe very valuable to reduce error rates when ancient nuclearDNA sequences are determined.

The UDG/endoVIII protocol presented here removes the vast majorityof errors caused by miscoding lesions. When combined with machineerror removal by sequencing multiple copies of each originalmolecule, the overall per-base error rate is reduced over 20-foldrelative to single-pass sequencing of unrepaired DNA, at leastfor Neandertal mtDNA. We are unable to measure this reductionaccurately for nuclear DNA due to genuine sequence differencesbetween the Neandertal and the human reference sequence. However,outside CpG sites the error rate in nuclear DNA is presumablysimilar to that in mtDNA, at roughly 1 error per 1000 basesin UDG/endoVIII treated, multi-pass sequences.

It has been previously reported (4) that UDG treatment removesC/GT/A errors from ancient DNA sequences. However, since UDGtreatment causes molecules to be lost, use of the enzyme hasnot been widely adopted, with some exceptions (40). UDG is particularlyproblematic for direct PCR studies, which usually target thelongest possible template molecules; longer templates are morelikely to contain at least one uracil and so be destroyed byUDG treatment. Fortunately, the introduction of direct high-throughputsequencing of ancient DNA has allowed very short (<100 bp)DNA fragments, which make up the vast majority of ancient DNA,to be efficiently sequenced in large quantities. Since uracilbases are concentrated close to the 5′- and 3′-ends in single-strandedoverhangs (6,23,29), endoVIII will rescue many of the fragmentsthat would have been excluded from sequencing after UDG treatmentalone (Figure 1). The UDG/endoVIII protocol also repairs sitesof uracil in double-stranded parts of molecules, although inour hands this repair is only 20% efficient (data not shown).Further work exploring the effects of other repair enzymes onancient DNA library yield and sequence accuracy may be useful.However, it will be desirable for future repair approaches toavoid any extra purification steps, as these inevitably causeloss of material.

The UDG/endoVIII protocol has the additional advantage thatany remaining C/GT/A misincorporations can be attributed todeamination of methylated cytosine residues (provided that UDGtreatment is complete). The observation that CT substitutionsin CpG dinucleotides, but not in other dinucleotides, are resistantto repair by UDG and that this pattern does not occur in mtDNAbut in nuclear DNA where it is reduced in CpG islands, stronglysuggests that the signal stems from in vivo Neandertal methylationas opposed to postmortem DNA modifications. In order to determinethe methylation status of any particular CpG site by this approach,very high sequence coverage would obviously be needed. However,the methylation status of a region such as a particular geneor CpG island will be measurable with more moderate sequencecoverage. Furthermore, emerging techniques such as bisulphitetreatment of small DNA quantities (e.g. Genetic Signatures’MethylEasy Xceed kit) or single-molecule sequencing technologiesthat can distinguish 5′-m-C from C (41) may soon allow high-resolutionanalysis of methylation in ancient DNA. It might therefore becomepossible to investigate the activity of genes and phenomenasuch as X chromosomal inactivation and genetic imprinting inextinct species, provided that the signals are manifested incells present in bones.

Received September 28, 2009. Revised October 30, 2009. Accepted
November 24, 2009.

DNA sequences determined from ancient organisms have high error rates,
primarily due to uracil bases created by cytosine deamination. We use
synthetic oligonucleotides, as well as DNA extracted from mammoth and
Neandertal remains, to show that treatment with uracil–DNA–glycosylase
and endonuclease VIII removes uracil residues from ancient DNA and
repairs most of the resulting abasic sites, leaving undamaged parts of
the DNA fragments intact. Neandertal DNA sequences determined with
this protocol have greatly increased accuracy. In addition, our
results demonstrate that Neandertal DNA retains in vivo patterns of
CpG methylation, potentially allowing future studies of gene
inactivation and imprinting in ancient organisms.