Abstract

Cell-free fetal DNA is present in the plasma of pregnant women. It consists of short DNA fragments among primarily maternally derived DNA fragments. We sequenced a maternal plasma DNA sample at up to 65-fold genomic coverage. We showed that the entire fetal and maternal genomes were represented in maternal plasma at a constant relative proportion. Plasma DNA molecules showed a predictable fragmentation pattern reminiscent of nuclease-cleaved nucleosomes, with the fetal DNA showing a reduction in a 166–base pair (bp) peak relative to a 143-bp peak, when compared with maternal DNA. We constructed a genome-wide genetic map and determined the mutational status of the fetus from the maternal plasma DNA sequences and from information about the paternal genotype and maternal haplotype. Our study suggests the feasibility of using genome-wide scanning to diagnose fetal genetic disorders prenatally in a noninvasive way.

Introduction

During pregnancy, a median of 10% of the DNA in the plasma of pregnant women is fetally derived (1, 2), offering opportunities for noninvasive prenatal diagnosis (3). Thus far, detection of paternally inherited traits [for example, sex (4) and rhesus D blood group status (5)] and fetal chromosomal aneuploidies (6, 7) is the main application. Yet, little is known about the physical and biological characteristics of fetal DNA in maternal plasma. Circulating fetal DNA is consistently reported to be shorter than maternal DNA (8), but the molecular basis of this observation is not known. Better understanding of this size difference might allow one to develop methods for the selective enrichment of fetal DNA from maternal plasma. It is also not known whether the entire fetal genome is represented in maternal plasma. Complete representation might make it possible to deduce a whole-genome genetic map, or even the entire genomic sequence, of a fetus noninvasively. However, this is a technically challenging task because most (about 90%) of the DNA in maternal plasma is derived from the mother, and the DNA molecules in plasma are short fragments (8). Here, we have used paired-end (PE) massively parallel sequencing to study the genomic sequence and size distribution of fetal DNA in maternal plasma. We further constructed a genome-wide genetic map of a fetus from the maternal plasma DNA sequences and from information about the paternal genotype and maternal haplotype.

Results

Clinical case

We recruited a pregnant couple attending an obstetrics clinic for the prenatal diagnosis of β-thalassemia. The father was a carrier of the -CTTT 4–base pair (bp) deletion of codons 41/42, and the pregnant mother was a carrier of the A→G mutation at nucleotide −28 of the HBB gene (9). Blood samples were taken from the father and from the mother before chorionic villus sampling (CVS) at 12 weeks of gestation. A portion of the CVS DNA was stored for the study.

Single-nucleotide polymorphism genotyping

Genome-wide single-nucleotide polymorphism (SNP) genotyping for ~900,000 SNPs was performed for DNA extracted from paternal and maternal buffy coat samples, and the CVS sample, with the Affymetrix Genome-Wide Human SNP Array 6.0 system (table S1). The SNPs were classified into different categories (Fig. 1). We defined category 1 SNPs as those for which the father and mother were both homozygous, but for a different allele each. Category 2 SNPs were those in which the father and mother were both homozygous, but for the same allele. Category 3 SNPs were those in which the father was heterozygous and the mother was homozygous. Category 4 SNPs were those in which the father was homozygous and the mother was heterozygous. Category 5 SNPs were those in which both the father and the mother were heterozygous.
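The five parental genotype categories defined above can be expressed as a small lookup. The following is a minimal sketch, not the authors' pipeline; the function name `classify_snp` and the two-character genotype encoding (for example "AA" or "AC") are assumptions made for illustration.

```python
def classify_snp(father, mother):
    """Return the parental SNP category (1-5) for two diploid genotypes.

    Genotypes are two-character strings such as "AA" or "AC"
    (a hypothetical encoding chosen for this sketch).
    """
    father_hom = father[0] == father[1]
    mother_hom = mother[0] == mother[1]
    if father_hom and mother_hom:
        # Both homozygous: category 1 if the alleles differ, category 2 if the same.
        return 1 if father[0] != mother[0] else 2
    if mother_hom:
        # Father heterozygous, mother homozygous.
        return 3
    if father_hom:
        # Father homozygous, mother heterozygous.
        return 4
    # Both parents heterozygous.
    return 5


for pair in [("AA", "CC"), ("CC", "CC"), ("AC", "CC"), ("CC", "AC"), ("AC", "AC")]:
    print(pair, "-> category", classify_snp(*pair))
```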

Noninvasive fetal genomic analysis from maternal plasma DNA. Parental SNP combinations can be grouped into five categories. Categories 1, 2, and 3 allow the basic parameters for maternal plasma DNA sequencing to be established, including the percentage coverage of the fetal genome, fractional concentration of fetal DNA, and sequencing error rate. Category 3 also allows the fetal inheritance status of SNP alleles unique to the father to be studied. Mutations uniquely carried by the father can be regarded as category 3. Category 4 allows the inheritance status of the maternal haplotype to be studied. One application is the tracking of fetal inheritance of a haplotype block close to a mutation carried by the mother. Here, noninvasive fetal genomic analysis was carried out for a family undergoing prenatal diagnosis for β-thalassemia. Asterisk denotes that information on the maternal haplotype is required for the RHDO analysis. Category 5 SNPs were not analyzed in this study, but might be useful for the prenatal diagnosis of autosomal recessive disorders with consanguineous parents or genetic diseases with a strong founder effect.

Sequencing of plasma DNA

We performed PE sequencing, 50 bp for each end, on DNA extracted from maternal plasma. A total of 3.931 billion reads, equivalent to an average of 65-fold coverage of a human genome, were aligned to the non–repeat-masked reference human genome (Hg18 NCBI.36). For each of the 45,392 category 1 SNPs in this family, the fetus was an obligate heterozygote. The fetal SNP allele inherited from the father should therefore be readily detectable as a unique sequence in maternal plasma and could be used to study the distribution of fetal DNA sequences across the genome in maternal plasma.

Figure 2A shows the number of times the paternally inherited fetal alleles for the category 1 SNPs were observed in maternal plasma as the depth of sequencing increased. With data from 3.931 billion reads, a fetal allele was observed at least once for 93.94% of these SNPs (table S2). These results were consistent with Poisson distribution predictions assuming that the whole fetal genome was evenly distributed in maternal plasma (fig. S1).
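The Poisson comparison can be illustrated with a simple model. This is a sketch under stated assumptions rather than the calculation behind fig. S1: it assumes that a heterozygous fetal genome contributes its paternal allele in half of the fetal molecules, so that the expected number of fetal-specific reads at a locus is depth × f / 2. The function name and the example values (a fetal fraction of 10%, the median quoted in the Introduction, and a few arbitrary depths) are illustrative only.

```python
import math


def prob_seen_at_least(k, depth, fetal_fraction):
    """Poisson probability of observing a fetal-specific allele at least k times.

    Assumes the mean number of fetal-specific reads at a locus is
    depth * fetal_fraction / 2 (the fetus is heterozygous at the locus).
    """
    lam = depth * fetal_fraction / 2.0
    # P(X < k) is the sum of the first k Poisson terms.
    p_fewer = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - p_fewer


for depth in (10, 30, 65):
    once = prob_seen_at_least(1, depth, 0.10)
    twice = prob_seen_at_least(2, depth, 0.10)
    print(f"depth {depth:>2}: seen >=1 time {once:.3f}, >=2 times {twice:.3f}")
```

Real data are expected to deviate from this idealized model, for example because of the GC-related coverage variation described in the text.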

Sequencing of fetal and total DNA in maternal plasma. (A) Depth of coverage of fetal-specific SNP alleles versus the number of sequenced reads. (B) Sequencing depth and GC content across the whole genome. Chromosome ideograms (outer ring) are oriented pter-qter in a clockwise direction (centromeres are shown in yellow). Other tracks (from outside to inside): GC content (green; range, 30 to 55%), total sequencing depth (red; range, 40 to 100 reads per SNP), and fetal-specific read sequencing depth (blue; range, 1 to 8 reads per SNP). (C) Size distribution of fetal DNA (blue curve), total DNA (red curve), and mitochondrial DNA (green broken curve). Numbers denote the DNA size at the peaks. Schematic illustrations of the structural organization of a nucleosome are shown above the graph. From left to right, DNA double helix wound around a nucleosomal core unit with the sites for nuclease cleavage shown; a nucleosome core unit with ~146 bp of DNA (red tape) wound around it; and a nucleosomal core unit with an intact ~20-bp linker sequence.

The fractional fetal DNA concentration in the maternal plasma, f, can be calculated from the sequencing data:

$$f = \frac{2p}{p + q}$$

where p is the number of sequenced reads of the fetal-specific allele (the A allele for the category 1 SNP in Fig. 1) and q is the read count of the other allele, which is shared by the maternal and fetal genomes (the C allele for the category 1 SNP in Fig. 1). The values of f determined for every chromosome were highly consistent (Table 1). The depth of coverage of fetal and maternal sequences (in 1-Mb windows) across the genome is plotted in Fig. 2B. It correlated with the GC content of each genomic window (fig. S2). The number of fetal sequences as a proportion of the total sequences in each window was consistent with the fractional fetal DNA concentration determined on a chromosomal level (Table 1). These data indicated that the relative proportion of fetal and maternal DNA was largely constant across the entire genome. Previous data have suggested that the GC bias affecting the measurement of total DNA in maternal plasma is likely to be an analytical artifact related to the sequencing platform used (6, 7, 10, 11), rather than an indication of the differential representation of DNA molecules of different GC content. Our data therefore suggest that the distribution of the fetal and maternal genomes is relatively even in maternal plasma.
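As a worked illustration of the fractional concentration calculation (twice the fetal-specific read count divided by the total read count at the locus), the sketch below applies it to the read counts reported for the HBB codon 41/42 site in the β-thalassemia section of this paper (12 reads carrying the paternal deletion and 118 wild-type reads). The function name `fetal_fraction` is an assumption made for this example.

```python
def fetal_fraction(p, q):
    """Fractional fetal DNA concentration f = 2p / (p + q).

    p: reads carrying the fetal-specific (paternally inherited) allele
    q: reads carrying the allele shared by the mother and the fetus
    """
    if p < 0 or q < 0 or p + q == 0:
        raise ValueError("read counts must be non-negative and not both zero")
    return 2.0 * p / (p + q)


# Counts reported for the paternal codon 41/42 deletion in this family.
print(f"{fetal_fraction(12, 118):.1%}")  # 18.5%
```

The factor of 2 reflects that only half of the fetal molecules at a heterozygous fetal locus carry the fetal-specific allele.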

Table 1

Fractional concentrations of fetal DNA calculated based on the analysis of category 1 SNPs for different chromosomes.

High-resolution plasma DNA size analysis

The size of each sequenced plasma DNA molecule can be deduced from the genome coordinates of the ends of the PE reads. The sizes of the fetal and total sequences were determined for the whole genome (Fig. 2C) and individually for each chromosome (fig. S3). The most abundant total sequences (predominantly maternal) were 166 bp in length. The most significant difference in the size distribution between the fetal and the total DNA was that fetal DNA exhibited a reduction in the 166-bp peak (Fig. 2C) and a relative prominence of the 143-bp peak. The latter likely corresponded to the trimming of a ~20-bp linker fragment from a nucleosome to its core particle of ~146 bp (12). From ~143 bp and below, the distributions of both fetal and total DNA demonstrated a 10-bp periodicity reminiscent of nuclease-cleaved nucleosomes (12). These data suggest that plasma DNA fragments are derived from the enzymatic processing of DNA from apoptotic cells. In contrast, size analysis of reads that mapped to the non–histone-bound mitochondrial genome did not show this nucleosomal pattern (Fig. 2C). These results provide a molecular explanation for the previously reported size differences between fetal and maternal DNA using Y chromosome and selected polymorphic genetic markers (8, 13, 14), and show that such size differences exist across the entire genome.
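The size deduction from paired-end coordinates can be sketched as follows. This is a minimal illustration, not the authors' software: it assumes 1-based, inclusive coordinates for the outermost aligned base of each end on the same chromosome, and the function names are hypothetical. Sizes are tallied in 1-bp bins, as in the analysis described in the Discussion.

```python
from collections import Counter


def fragment_size(left_start, right_end):
    """Fragment length in bp from the outermost aligned coordinates.

    Assumes 1-based inclusive positions on the same chromosome, with
    left_start belonging to one read and right_end to its mate.
    """
    if right_end < left_start:
        raise ValueError("right_end must not precede left_start")
    return right_end - left_start + 1


def size_histogram(pairs):
    """Count fragments per size using 1-bp bins."""
    return Counter(fragment_size(start, end) for start, end in pairs)


# Hypothetical coordinate pairs for three plasma DNA molecules.
hist = size_histogram([(1000, 1165), (5000, 5165), (9000, 9142)])
print(hist.most_common(1))  # [(166, 2)]
```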

General principles for constructing a fetal genetic map

After having demonstrated that the entire fetal genome was evenly represented in maternal plasma, we attempted to construct a genome-wide genetic map of the fetus. Maternal plasma DNA molecules are short fragments and the fetal sequences are in the minority. Here, we used the genetic structure of the parental genomes as scaffolds for assembling the fetal genetic map from the maternal plasma DNA sequences. The map resolution depends on the known resolution of the parental genomes.

First, we used the category 2 SNPs (Fig. 1), in which the father and mother were both homozygous for the same allele, to estimate the error rate of plasma DNA sequencing. For the 500,457 category 2 SNPs in this family, the fetus would be homozygous for the alleles concerned. The sequencing error rate was expressed as the number of reads with an unexpected allele as a proportion of all reads covering the category 2 SNP loci and was 0.303% (99,467/32,828,899). These unexpected alleles were seen in 4.04% of the category 2 SNP loci. If an allele had to be seen more than once to be scored, only 0.55% of such SNPs had a false allele seen in at least two reads, resulting in a specificity of 99.45%. However, when applied to the detection of fetal-specific alleles, the requirement of two reads reduced the fetal allele detection sensitivity. The paternally inherited fetal allele was seen at least twice in only 81.06% of category 1 SNPs, where both parents were homozygous for different alleles (table S2).
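The arithmetic behind these figures can be written out explicitly. The first line restates the reported error rate and two-read specificity; the second is a derived consistency check (not stated in the text) showing that the read total over the category 2 loci corresponds to a mean depth close to the 65-fold genomic coverage.

```latex
\[
\text{error rate} = \frac{99{,}467}{32{,}828{,}899} \approx 0.303\%,
\qquad
\text{specificity (two-read rule)} = 100\% - 0.55\% = 99.45\%
\]
\[
\text{mean depth per category 2 SNP} = \frac{32{,}828{,}899}{500{,}457} \approx 65.6\ \text{reads}
\]
```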

Deducing the paternal inheritance of the fetus

We deduced the fetal inheritance from each parent in a stepwise fashion. To determine the paternal allele that the fetus had inherited, we studied the 129,835 category 3 SNPs where the father was heterozygous and the mother was homozygous for one of the alleles. Each of the two paternal alleles had a 50% chance of being inherited by the fetus.

The paternal-specific allele (as illustrated by the C allele in SNP category 3; Fig. 1) was detected at least once among the sequenced reads covering 63,962 category 3 loci and at least twice covering 53,070 loci. The CVS genotype data indicated that the fetus inherited the paternal-specific alleles in 65,018 category 3 SNPs. Such paternally inherited fetal alleles were observed at least once in 61,049 (that is, 93.90%) and at least twice in 52,697 (that is, 81.05%) loci, in good agreement with the category 1 SNP values (table S2). If we assume that the genotyping was perfect, the differences between genotyping and sequencing meant that the paternal-specific alleles observed at least once in 2913 category 3 loci, and at least twice in 373 category 3 loci, were false positives. Given that the CVS genotyping data indicated that the fetus inherited the same allele from the father as the homozygous maternal allele in 64,817 loci, the specificities for fetal allele detection using the one- and two-read criteria were 95.51 and 99.42%, respectively (table S2). These error rates are consistent with the category 2 SNP results.
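The sensitivity and specificity values quoted for the category 3 SNPs follow directly from the locus counts in the paragraph above. The sketch below reproduces that arithmetic; the function and parameter names are assumptions made for illustration, and the CVS genotype is treated as the reference standard, as in the text.

```python
def detection_metrics(inherited, detected_true, detected_total, not_inherited):
    """Sensitivity and specificity of paternal-specific allele detection.

    inherited:      loci where the fetus carries the paternal-specific allele (CVS)
    detected_true:  of those, loci where the allele was seen in plasma
    detected_total: all loci where the allele was seen in plasma
    not_inherited:  loci where the fetus does not carry the allele (CVS)
    """
    false_positives = detected_total - detected_true
    sensitivity = detected_true / inherited
    specificity = 1 - false_positives / not_inherited
    return sensitivity, specificity, false_positives


# Locus counts reported in the text for the one-read and two-read criteria.
for label, true_hits, all_hits in [("one read", 61049, 63962), ("two reads", 52697, 53070)]:
    sens, spec, fp = detection_metrics(65018, true_hits, all_hits, 64817)
    print(f"{label}: sensitivity {sens:.2%}, specificity {spec:.2%}, false positives {fp}")
```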

Deducing the maternal inheritance of the fetus

For maternal inheritance, we analyzed the category 4 SNPs (Fig. 1), where the mother was heterozygous and the father was homozygous, and asked whether a slight allelic imbalance was present in maternal plasma. An imbalance would indicate that the fetus was homozygous for one maternal allele. This analysis could, in principle, be carried out for each SNP with locus-specific approaches such as digital polymerase chain reaction (PCR) (15). However, for genome-wide random sequencing, the depth of coverage needed and hence the costs would be prohibitive for clinical use. Using nearby SNP alleles on the same maternal chromosome as a haplotype, we developed a new approach to determine whether there was a relative haplotype dosage (RHDO) imbalance in maternal plasma. Because of meiotic recombination, the final maternally derived haplotype inherited by the fetus is a mosaic of the two original maternal haplotypes. Using RHDO analysis, the combination of alleles inherited by the fetus from its mother can then be deduced as a series of inheritance blocks. The resolution for detecting this depends on the number and distribution of genetic markers known for the mother’s genome.

In this proof-of-concept study, we deduced the maternal haplotype information needed for RHDO analysis with genotype information obtained from microarray analysis of the CVS. This precluded the direct observation of maternal meiotic recombinations, but we do show later in the study that the approach can detect “artificial” maternal meiotic recombinations. If RHDO is used clinically, the maternal haplotype can be deduced without any fetal information by comparison with genotype information for other family members.

Figure 3 shows the RHDO process. The two maternal haplotypes are Hap I and Hap II (Fig. 3A). Hap I is the actual recombinant maternal chromosome inherited by the fetus. We investigated whether the sequencing data supported fetal inheritance of maternal Hap I. Haplotype information from the father was not necessary for this analysis.

Relative haplotype dosage (RHDO) analysis. (A) In type α SNPs, paternal alleles are identical to the maternal alleles on Hap I. In type β SNPs, paternal alleles are identical to the maternal alleles on Hap II. If the fetus inherits Hap I from the mother, it is homozygous for type α and heterozygous for type β SNPs. (B) For type α SNPs, Hap I is overrepresented in maternal plasma. (C) For type β SNPs, there is no significant difference between the cumulative counts for Hap I and Hap II SNPs. Given that the fetus in this case inherits Hap II from the father, the sequential probability ratio test (SPRT) deduces the inheritance of Hap I from the mother.

The category 4 SNPs were divided into two types that required different analyses. We defined type α SNPs as those in which the paternal alleles were the same as those on maternal Hap I (Fig. 3A). For these SNPs, fetal inheritance of Hap I caused an overrepresentation of Hap I, relative to Hap II, in maternal plasma (Fig. 3B). If the fetus inherited Hap II, no overrepresentation would be seen. We defined type β SNPs as those in which the paternal alleles were the same as those on maternal Hap II (Fig. 3A). For these SNPs, fetal inheritance of Hap I maintained an equal representation of Hap I and Hap II in maternal plasma (Fig. 3C). However, if the fetus inherited Hap II, Hap II would be overrepresented. A sequential probability ratio test (SPRT) determines the statistical significance of any allelic imbalance seen (16). SPRT allows hypothesis testing as data accumulate (17, 18). When the classification threshold for SPRT is reached, the fetal inheritance of a particular regional maternal haplotype would be established (Fig. 4).
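To make the sequential classification concrete, the sketch below implements a generic Wald SPRT on cumulative Hap I and Hap II read counts. It is an illustration under stated assumptions and is not claimed to match the thresholds or parameterization used in this study. The assumptions are: a simple mixture model in which a fetal fraction f gives an expected Hap I read fraction of (1 + f)/2 when the fetus is homozygous for the Hap I allele, (1 − f)/2 when it is homozygous for the Hap II allele, and 0.5 when it is heterozygous; error rates `alpha` and `beta` of 0.001, chosen arbitrarily; and hypothetical function and parameter names.

```python
import math


def sprt_rhdo(snp_counts, f, snp_type, alpha=0.001, beta=0.001):
    """Classify a maternal haplotype block with a Wald SPRT.

    snp_counts: (Hap I reads, Hap II reads) per SNP, in the direction of analysis
    f:          fractional fetal DNA concentration (0 < f < 1)
    snp_type:   "alpha" (paternal allele matches Hap I) or "beta" (matches Hap II)
    Returns (classification, number of SNPs used); classification is None
    if neither boundary is reached.
    """
    if not 0 < f < 1:
        raise ValueError("f must lie strictly between 0 and 1")
    if snp_type == "alpha":
        # Expected Hap I read fraction if the fetus inherited Hap I / Hap II.
        p_if_hap1, p_if_hap2 = (1 + f) / 2, 0.5
    elif snp_type == "beta":
        p_if_hap1, p_if_hap2 = 0.5, (1 - f) / 2
    else:
        raise ValueError("snp_type must be 'alpha' or 'beta'")

    upper = math.log((1 - beta) / alpha)   # accept "Hap I inherited"
    lower = math.log(beta / (1 - alpha))   # accept "Hap II inherited"
    llr = 0.0
    for used, (n1, n2) in enumerate(snp_counts, start=1):
        # Log-likelihood ratio of Hap I inheritance versus Hap II inheritance.
        llr += n1 * math.log(p_if_hap1 / p_if_hap2)
        llr += n2 * math.log((1 - p_if_hap1) / (1 - p_if_hap2))
        if llr >= upper:
            return "Hap I", used
        if llr <= lower:
            return "Hap II", used
    return None, len(snp_counts)


# Hypothetical counts: Hap I overrepresented at type alpha SNPs.
print(sprt_rhdo([(40, 30)] * 50, f=0.10, snp_type="alpha"))
# Hypothetical counts: balanced representation at type beta SNPs.
print(sprt_rhdo([(33, 33)] * 50, f=0.10, snp_type="beta"))
```

Because counts from neighboring SNPs on the same haplotype are pooled, a decision can be reached at a sequencing depth at which no single SNP would be informative on its own.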

SPRT classification. (A and B) SPRT classification process for RHDO analysis of (A) type α and (B) type β SNPs in a region close to the pter of chromosome 1. The classification process runs in the direction from the telomeric end to the centromere. See also Tables 2 and 3.

Data consistency was examined in two ways. Haplotype assignments from separately analyzed type α and β SNPs should be the same and should be independent of the direction along the chromosome for which RHDO was used. RHDO analyses of a segment of type α and a segment of type β SNPs close to the telomeric end of chromosome 1 are shown in Fig. 4 and Tables 2 and 3. Both the type α and the type β SNP analyses deduced the inheritance of Hap I by the fetus. Tables S3 and S4 illustrate the RHDO analysis for chromosomes 1 and 22. The chromosome 1 analysis proceeded from the telomeric end of the p arm (pter) to the telomeric end of the q arm (qter), and the reverse. The chromosome 22 analysis proceeded from the centromere to qter, and the reverse. Haplotype classifications for the RHDO segments analyzed from both directions were consistent. The pter to qter chromosome 1 RHDO analysis required an average of 17 type α SNPs and 18 type β SNPs to determine the maternal haplotype inherited by the fetus. Three hundred and fourteen type α and 267 type β RHDO classifications were made for chromosome 1 (table S3).

Table 2

SPRT classification process for RHDO analysis of type α SNPs in a region close to the pter of chromosome 1.

Table 4 summarizes the RHDO data for the whole genome. There were 3863 and 3469 classified segments for type α and β SNPs, respectively. Because Hap I was the maternal haplotype passed on to the fetus, any Hap II RHDO segments had been classified incorrectly. If all the RHDO classifications were interpreted directly as the haplotype inherited by the fetus, there were 25 and 43 wrong RHDO classifications for the type α and β SNPs, respectively (0.6 and 1.2% of these classifications). With the current sequencing coverage, the mean sizes of type α and β classification segments were 659,000 and 768,000 bp, respectively. The presence of two meiotic recombinations within such distances would be unlikely (19). Therefore, we proposed to accept a switch in haplotype only when two consecutive RHDO segments of the same type (that is, α or β) showed the same haplotype classification. Using this consecutive-block algorithm, we obtained three and six incorrect classifications for the type α and β SNPs, respectively (0.08 and 0.18% of these classifications) (Table 4). This degree of resolution is sufficient for most purposes, because meiotic recombinations occur about once per chromosome arm per generation (19).
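The consecutive-block rule described above can be sketched as a filter over an ordered list of segment classifications of one SNP type. This is one possible reading of the rule, not the authors' implementation: the first call is taken as the starting haplotype, a switch is accepted only when the next segment repeats the new call, and an isolated differing call (including one at the very end of the list) is overridden. The function name is hypothetical.

```python
def consecutive_block_filter(calls):
    """Accept a haplotype switch only if two consecutive segments agree on it.

    calls: ordered RHDO classifications for one SNP type, e.g. ["Hap I", "Hap II", ...]
    Returns the list of accepted haplotypes, one per segment.
    """
    if not calls:
        return []
    accepted = calls[0]          # assumption: the first call is taken at face value
    filtered = []
    for i, call in enumerate(calls):
        if call != accepted:
            following = calls[i + 1] if i + 1 < len(calls) else None
            if following == call:
                accepted = call  # switch confirmed by the next segment
        filtered.append(accepted)
    return filtered


# An isolated Hap II call is treated as a misclassification ...
print(consecutive_block_filter(["Hap I", "Hap I", "Hap II", "Hap I", "Hap I"]))
# ... whereas two consecutive Hap II calls are accepted as a switch.
print(consecutive_block_filter(["Hap I", "Hap I", "Hap II", "Hap II", "Hap II"]))
```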

To demonstrate that the RHDO process could potentially detect maternal recombination, we introduced two arbitrary artificial recombinations on chromosome 1 (at positions 163,000,000 and 204,815,000) and one on chromosome 22 (at position 34,835,000) by changing the fetally derived maternal haplotypes at these positions. Figure 5 is a schematic illustration of how the location of recombination is pinpointed by RHDO analysis. RHDO analysis was carried out for a chromosomal region in both directions (from the telomeric end to the centromere, and from the centromere to the telomeric end). When RHDO analysis was performed in the first direction (telomeric end to the centromere, as shown in Fig. 5), the RHDO classifications for a number of segments would indicate one particular haplotype (that is, Hap I) (Fig. 5). When the RHDO segment encompassing the recombination site was reached, the RHDO classification would change to the other haplotype (that is, Hap II), as shown in Fig. 5. However, the ending SNP for the RHDO segment just before (that is, more telomeric to) the change in RHDO classification was located before the actual recombinant site. RHDO analysis was then performed in the reverse direction (that is, centromere to telomeric end). On this occasion, the RHDO segments would initially reveal the classification of Hap II. When the recombinant RHDO segment was reached, the classification would change to Hap I. Again, the ending SNP for the RHDO segment just before (that is, centromeric to) the change in RHDO classification was located before the actual recombination site. The genomic locations of the ending SNP in each RHDO segment before the change in classification identified by the forward and reverse directions would enclose the actual recombination site.

Principles of determining the recombination site using RHDO analysis. (A) RHDO analysis is carried out for a chromosomal region in both directions. Small arrows, SPRT-classified segments; block arrows, segments classified as having one identical maternal haplotype; blue block arrows, Hap I; red block arrows, Hap II. (B) SPRT curve for the segment inside the green oval. Sequential changes in the fraction of total reads contributed by Hap I alleles. For data points on the left, the alleles on Hap I are inherited by the fetus, leading to an overrepresentation of Hap I. Therefore, the fraction would approach the upper classification threshold when data points accumulate. However, for SNPs distal to the recombination site, the alleles on Hap II are inherited by the fetus, leading to an equal representation of Hap I and Hap II reads in maternal plasma. Thus, the fraction would approach 0.5 when more data points accumulate. When the fraction is lower than the lower classification threshold, an SPRT classification of Hap II is made. The recombination site is located between the last SNP of the RHDO segments just before the change in haplotype classification as identified by the RHDO analyses performed in both directions.

The RHDO analyses for the recombinant chromosomes 1 and 22 (tables S5 and S6) correctly revealed changes in haplotype classifications from the RHDO segment encompassing the introduced recombination sites. In the chromosome 1 RHDO analysis, the fetus appeared to have inherited Hap I from the mother from the centromere to SNP rs1489331 located at 162,956,807 (table S5) but maternal Hap II distal to this point. This indicated that a recombination had occurred. The full RHDO analysis for chromosome 1 showed two recombinations between positions 162,956,807 and 163,133,120, and between 204,791,360 and 204,869,063 (table S5). The RHDO analysis for chromosome 22 showed one recombination between positions 34,636,212 and 34,835,716 (table S6). For both chromosomes, the deduced recombination spots were close to the ones artificially introduced.
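The bracketing of a recombination site by analyses run in opposite directions can be sketched as below. In each direction, the function takes the end position of the last segment before the classification changes; the two positions then enclose the site. The segment lists are hypothetical except for the two bracketing coordinates (162,956,807 and 163,133,120), which are the values reported above for the first artificial recombination on chromosome 1; the function names are assumptions.

```python
def last_end_before_switch(segments):
    """End position of the segment preceding the first change in classification.

    segments: (end_position, call) tuples in the order in which they were analysed.
    Returns None if the classification never changes.
    """
    for previous, current in zip(segments, segments[1:]):
        if previous[1] != current[1]:
            return previous[0]
    return None


def recombination_interval(forward, reverse):
    """Interval enclosing the recombination site, from two analysis directions."""
    a = last_end_before_switch(forward)
    b = last_end_before_switch(reverse)
    if a is None or b is None:
        return None
    return min(a, b), max(a, b)


# Hypothetical segments; only the two bracketing coordinates come from the text.
forward = [(162_500_000, "Hap I"), (162_956_807, "Hap I"), (163_600_000, "Hap II")]
reverse = [(164_000_000, "Hap II"), (163_133_120, "Hap II"), (162_400_000, "Hap I")]
print(recombination_interval(forward, reverse))  # (162956807, 163133120)
```

The resulting interval contains position 163,000,000, where the first artificial recombination was introduced.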

Noninvasive prenatal diagnosis of β-thalassemia

We next applied the approach outlined above for the prenatal diagnosis of β-thalassemia, an autosomal recessive blood disease. The disease, characterized by severe anemia, is due to mutations in the HBB gene on chromosome 11 that encodes the β subunit of hemoglobin. An affected fetus must inherit mutant alleles from both parents. Figure 6A illustrates the location of the paternal and maternal mutations in the HBB gene. The DNA sequencing data showed 12 reads with the paternal codon 41/42 mutation (Fig. 6B), indicating that the fetus had inherited this mutation. One hundred and eighteen wild-type sequence reads at codons 41/42 (fig. S4) were seen. The apparent fractional fetal DNA concentration was 18%, compatible with category 1 SNP estimates (Table 1).

Prenatal diagnosis of β-thalassemia by sequencing from maternal plasma. (A) Locations of the maternal (nucleotide −28 A→G) and paternal (-CTTT deletion at codons 41/42) mutations. Blue lines, blocks of type α SNPs; pink lines, blocks of type β SNPs; green boxes, genes in the β-globin cluster. (B) Observed sequences carrying the -CTTT deletion. (C) RHDO analysis. Blue and pink boxes represent SPRT-classified blocks for type α (blue) and type β (pink) SNPs. Within each box, the number of SNPs (first row), numbers of cumulated reads for Hap I and Hap II within the RHDO block (second row), and the classification result (third row) are shown. The locations of the boxes correspond to the regions shown in (A).

RHDO analysis was performed to determine whether the fetus had inherited the maternal nucleotide −28 A→G mutation. In this family, the −28 mutation was on maternal Hap II; the wild-type allele was on Hap I. Details of the type α and β RHDO analyses (table S7) both showed fetal inheritance of Hap I from the mother (Fig. 6C). This indicated that the fetus had inherited the maternal wild-type allele. Thus, the fetus was a heterozygous β-thalassemia carrier.
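The final inference combines the paternal and maternal results by counting mutant alleles. The sketch below is a simplified illustration with hypothetical names. It assumes that detection of the paternal mutation in plasma implies fetal inheritance and, more strongly, that its absence implies non-inheritance; in practice the second assumption would depend on sequencing depth, given the detection sensitivities reported earlier.

```python
def fetal_status(paternal_mutation_detected, inherited_maternal_hap, maternal_mutant_hap):
    """Infer fetal status for an autosomal recessive disorder.

    paternal_mutation_detected: True if the paternal mutation is seen in plasma
    inherited_maternal_hap:     maternal haplotype deduced by RHDO, e.g. "Hap I"
    maternal_mutant_hap:        maternal haplotype that carries the mutation
    """
    mutant_alleles = int(paternal_mutation_detected)
    mutant_alleles += int(inherited_maternal_hap == maternal_mutant_hap)
    return (
        "no mutant allele",
        "heterozygous carrier",
        "affected (two mutant alleles)",
    )[mutant_alleles]


# This family: paternal deletion detected; maternal mutation on Hap II; fetus inherited Hap I.
print(fetal_status(True, "Hap I", "Hap II"))  # heterozygous carrier
```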

Discussion

The discovery of cell-free fetal DNA in maternal plasma in 1997 has opened up new possibilities for noninvasive prenatal diagnosis (1). However, previous work in this field had typically focused on the detection of one or a small number of fetal DNA targets in maternal plasma (5, 20). Before the present work, it was not clear whether the entire fetal genome is represented in maternal plasma, and whether the relative proportions of fetal and maternal DNA in maternal plasma are constant. For example, Puszyk et al. used real-time PCR to assess the amounts of a small number of genomic loci in human plasma and reported that their relative amounts were different (21). In another example, Fan et al. also cautioned that the representation of sequences from different loci in maternal plasma might not be equal (7). In this report, we used massively parallel sequencing to demonstrate that the entire fetal genome is represented in maternal plasma, and is present in a constant relative proportion to maternal DNA in maternal plasma (Table 1). This information is important because it demonstrates that a noninvasive genome-wide scan of the fetal genome from maternal plasma is possible. Here, we used genome-wide sequencing of maternal plasma DNA for the prenatal diagnosis of β-thalassemia in a noninvasive way. Thus, a genome-wide scan for diagnosing fetal genetic disorders is also possible. With further reductions in the error rate of massively parallel sequencing platforms, it may be possible that de novo mutations in the fetal genome could be detected in a cost-effective manner by maternal plasma DNA sequencing.

The genome-wide fetal genetic analysis described here is much more complicated than the use of massively parallel sequencing for the detection of fetal chromosomal aneuploidies (6, 7). For aneuploidy detection, one is essentially only interested in two parameters, namely, the number of sequence tags and the mapping of these tags to various chromosomes. In comparison, to deduce a genome-wide fetal genetic map, one has to analyze the sequencing data in the context of the parental genetic maps and to assemble these data in a series of fetal inheritance blocks.

The resolution of such fetal genetic analysis is limited only by the depth of the sequencing and the resolution of the parental genetic maps. In this proof-of-concept study, the maternal haplotype was deduced from the fetal genotype information, which was obtained through analysis of a chorionic villus sample. In real diagnostic scenarios, the latter information would not be available, and the maternal haplotype could be deduced in a number of ways: (i) by genotype information from other family members; (ii) by probabilistic deduction using the known haplotype information within a population; and (iii) by methods that allow direct haplotype information to be generated, for example, by single molecule analysis (22, 23). With the increasing accessibility of individual whole-genome sequencing, it is possible that the complete genomic sequences of the parents would be available, or could be generated relatively inexpensively with third-generation sequencing technologies. In this case, a close-to-complete genomic sequence of the fetus could be deduced. The recent advances in generating haplotype information from whole-genome sequence data would facilitate developments in this direction (24).

The RHDO approach for elucidating fetal inheritance from the mother has significant advantages over the relative mutation dosage (RMD) method that is based on digital PCR (15). The digital PCR–based method attempts to make a classification based on a single mutation or polymorphism. It would typically require several thousand-fold coverage of a target region to achieve this. This would be impractical for application on a genome-wide scale, because the costs of producing such genome-wide coverage would be prohibitive. Conversely, in RHDO analysis, the statistical power of the method has been enhanced by combining the counts from all alleles present in the same haplotype. As a result, we were able to achieve a genome-wide analysis with 65-fold genomic coverage.

Analysis of the category 5 SNPs (both parents heterozygous) (Fig. 1) was not performed for the current data set because of the lack of paternal haplotype information. However, for future studies in which both parental haplotypes are known, such SNPs would be useful for prenatal diagnosis of autosomal recessive disorders with consanguineous parents or genetic diseases having a strong founder effect. Because of their genetic relatedness in the disease locus, each parent would have one copy of the disease haplotype. RHDO analysis in maternal plasma would reveal which haplotype(s) the fetus has inherited.

The noninvasive nature of our approach makes it safer than conventional procedures that require invasive sampling of fetal tissue (for example, amniocentesis and CVS). However, the new approach also raises a number of ethical, legal, and social issues that require active discussion among clinicians, scientists, ethicists, and the community. Examples of such issues include the most appropriate way to provide genetic counseling for such a relatively complex test, informed consent, the spectrum of fetal genetic characteristics or abnormalities that can be ethically tested, and equity issues concerning the provision of a relatively expensive test. Such discussions are already under way (25–27), but will need to be expanded.

One direction for future development would be to apply the approach described here specifically to multiple disease-related genomic regions by targeted sequencing approaches (28, 29). In this way, one could target genetic diseases that are prevalent in a particular population. Such a targeted approach would greatly reduce the cost of sequencing and would allow more samples to be analyzed per sequencing run. Such a development would also allow the test subjects to be counseled specifically for a focused group of disorders.

Our study helps to elucidate the biology of plasma DNA by revealing the size difference between circulating fetal and total DNA with high resolution. It is well established that circulating fetal DNA molecules are generally shorter than maternal DNA in maternal plasma (8). However, the molecular basis of this observation has been unclear. Fan et al. recently used PE sequencing to study the size distributions of fetal and total DNA in maternal plasma (14). However, because their study only involved 1 × 10^7 reads per sample and used a bioinformatics algorithm that analyzed data in 20-bp “bins,” they were only able to replicate the established observation that fetal DNA was shorter than maternal DNA in maternal plasma (8), but were not able to arrive at new mechanistic insights. In our current study, we generated 3.931 × 10^9 reads in the plasma study sample and used 1-bp bins in our bioinformatics analysis. As a consequence of this higher resolution, we were able to observe that the most significant difference between fetal and maternal DNA in maternal plasma is the reduction in the 166-bp peak relative to the 143-bp peak (Fig. 2C). The most likely explanation for this difference is that circulating fetal DNA consists of more molecules in which the ~20-bp linker fragment has been trimmed from a nucleosome. Because histone H1 binds to the linker fragment, it would be interesting to explore whether antibodies targeting H1 might preferentially bind to the maternally derived DNA in maternal plasma, allowing enrichment of circulating fetal DNA by negative selection. Furthermore, H1 is known to have a number of variants, some of them exhibiting tissue-specific variations in expression (30). These variants might be further exploited to differentiate fetal DNA (predominantly placental) from maternal DNA (predominantly hematopoietic) (31).

Finally, our data potentially have implications for other fields involving the detection of plasma nucleic acids, such as cancer diagnosis (32) and the monitoring of tissue transplants (33). For example, it would be interesting to investigate whether key features of the high-resolution size profile for circulating fetal DNA can also be seen in circulating tumor DNA and donor graft–derived DNA. Variations on the RHDO method might also have applications for the detection of tumor-associated genetic alterations.

Materials and Methods

Samples and processing

The project was approved by the Joint Chinese University of Hong Kong–Hospital Authority New Territories East Cluster Clinical Research Ethics Committee. The blood samples were collected from the pregnant woman and her husband before CVS with informed consent. Peripheral blood samples were centrifuged at 1600g for 10 min at 4°C, and the plasma portion was recentrifuged at 16,000g for 10 min at 4°C (34, 35). The blood cell portion was recentrifuged at 2500g, and any residual plasma was removed. DNA from plasma (4 ml) and buffy coat was extracted following the blood and body fluid protocol of the QIAamp DSP DNA Blood Mini Kit (Qiagen). The plasma DNA was concentrated by a SpeedVac Concentrator (Thermo, SAVANT DNA120) into a final volume of 40 μl per case for subsequent DNA sequencing library preparation. DNA was extracted from the CVS sample with the QIAamp Tissue Kit (Qiagen).

Microarray-based genotyping

The buffy coat DNA of the couple and the CVS DNA were genotyped with the Affymetrix Genome-Wide Human SNP Array 6.0 system as previously described (36). Briefly, genomic DNA was first digested with the restriction enzyme Sty I. It was then ligated to a common adaptor with T4 DNA ligase. After ligation, the template was PCR-amplified. The products were subjected to another round of digestion with Nsp I and linker PCR. The PCR products were then biotin-labeled and hybridized to arrays, which were then fluorescently labeled and scanned to yield a measurement of hybridization intensity for each probe with the GeneChip Scanner 3000 7G (Affymetrix).

The Affymetrix GeneChip Genome-Wide SNP 6.0 arrays were used in conjunction with Affymetrix Genotyping Console version 2.1. For the paternal, maternal, and CVS samples, 99.1, 94.5, and 98.6%, respectively, of the quality control probes passed the default quality control parameters. The microarray signal for the 906,600 SNP loci for each of the three samples was then analyzed with the Birdseed v2.0 algorithm (37). The Birdseed call rate was >99% for each sample. Of the 906,600 SNPs, 898,365 (99.09%) were successfully called for all members of the trio (table S1A). These SNPs were then subjected to further filtering for potential genotyping errors. First, SNPs with biologically impossible genotype combinations among the trio were removed (details described in the legend to table S1). Because most of the DNA molecules in maternal plasma are derived from the mother, we additionally removed SNPs with evidence of genotyping errors for the mother as deduced from the sequencing data (details described in the legend to table S1). Deduction of such errors was performed separately for SNPs initially labeled according to the Affymetrix data as categories 1, 3, and 4. A total of 0.60% (5407/898,365) of the SNPs initially called by Birdseed were filtered out. The data reported in this study were derived from the resultant list of 892,958 SNPs (99.40% of the SNPs successfully called by Birdseed) (table S1).
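The first filtering step can be sketched as a Mendelian-consistency check. This is a minimal illustration, not the filter actually used: the exact criteria are given in the legend to table S1, which is not reproduced here. The sketch assumes autosomal, biallelic SNPs with genotypes encoded as two-character strings such as "AB"; the function names are hypothetical.

```python
from typing import Dict, Tuple

Trio = Tuple[str, str, str]  # (father, mother, child) genotypes, e.g. ("AA", "AB", "AA")


def mendelian_consistent(father: str, mother: str, child: str) -> bool:
    """True if the child's alleles can be formed from one paternal and one maternal allele."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def filter_trio_snps(calls: Dict[str, Trio]) -> Dict[str, Trio]:
    """Keep only SNPs whose trio genotype combination is biologically possible."""
    return {snp: trio for snp, trio in calls.items() if mendelian_consistent(*trio)}
```

For example, a child called "AB" with both parents called "AA" would be removed, because neither parent can have contributed the B allele.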

Massively parallel sequencing

Sequencing libraries were constructed from the extracted plasma DNA with the Paired-End Sequencing Sample Preparation Kit (Illumina) mostly according to the manufacturer’s instructions. Because plasma DNA molecules are short fragments by nature (8), we omitted the steps of fragmentation and size selection by gel electrophoresis. Briefly, DNA molecules were end-repaired with T4 DNA polymerase and Klenow polymerase, and with T4 polynucleotide kinase to phosphorylate the 5′ ends. A 3′ overhang was created with a 3′-5′ exonuclease–deficient Klenow fragment. Adaptor oligonucleotides were ligated to the sticky ends. The adaptor-ligated DNA was purified directly with spin columns of the QIAquick PCR purification kit (Qiagen) and enriched by a 15-cycle PCR with Illumina primers.

The DNA library was diluted so that 36 pM of the DNA library was subjected to hybridization onto the PE sequencing flow cells. DNA clusters were generated with an Illumina cluster station with Paired-End Cluster Generation Kit v2 (Illumina), followed by 51 × 2 cycles of sequencing on a Genome Analyzer IIx (Illumina) with Sequencing Kit v3 (Illumina). Genome Analyzer Sequencing Control Software (SCS) v2.5 was used to carry out real-time image analysis and base calling during the chemistry and imaging cycles of each sequencing run. We used the default parameters within the data analysis software (SCS v2.5) from Illumina to filter poor-quality reads. Chastity, defined as the ratio of the brightest intensity to the sum of the brightest and second-brightest intensities, was calculated for each base/cycle. In the default setting, a read is removed if a chastity of less than 0.6 is observed on two or more bases among the first 25 bases.
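The chastity filter described above can be sketched in Python as follows. This is an illustration of the stated rule, not Illumina's implementation; the input format (one tuple of four channel intensities per cycle) and the function names are assumptions.

```python
from typing import Sequence


def chastity(intensities: Sequence[float]) -> float:
    """Brightest intensity divided by the sum of the brightest and second brightest."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def passes_chastity_filter(read_intensities: Sequence[Sequence[float]],
                           threshold: float = 0.6,
                           window: int = 25,
                           max_failures: int = 1) -> bool:
    """Keep a read unless two or more of its first `window` bases have chastity < threshold."""
    failures = sum(1 for base in read_intensities[:window] if chastity(base) < threshold)
    return failures <= max_failures
```

A read with a single low-chastity base among its first 25 cycles is retained; a second such base within that window causes the read to be discarded, and low-chastity bases after cycle 25 do not count.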

Sequence alignment and filtering

The PE sequencing data were analyzed by means of the Short Oligonucleotide Alignment Program 2 (SOAP2) in the PE mode (38). For each PE read, 50 bp from each end was aligned to the non–repeat-masked reference human genome (Hg18 NCBI.36). For the alignment of each end, up to two nucleotide mismatches were allowed. The genomic coordinates of these potential alignments for the two ends were then analyzed to determine whether any combination would allow the two ends to be aligned to the same chromosome with the correct orientation, spanning an insert size no more than 600 bp, and mapping to a single location in the reference human genome.
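The pairing criteria can be sketched as below. This is not SOAP2's code or output format: the `EndHit` record, the 0-based leftmost coordinates, and the function names are hypothetical, chosen only to make the stated criteria (same chromosome, correct orientation, insert size of at most 600 bp, single location) concrete.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

READ_LEN = 50     # bases aligned from each end
MAX_INSERT = 600  # maximum accepted insert size (bp)


@dataclass(frozen=True)
class EndHit:
    chrom: str
    start: int   # leftmost 0-based coordinate of the aligned end (assumed convention)
    strand: str  # '+' or '-'


def insert_span(a: EndHit, b: EndHit) -> Optional[Tuple[str, int, int]]:
    """Return (chrom, start, end) if the two end hits form a proper pair, else None."""
    if a.chrom != b.chrom or a.strand == b.strand:
        return None
    fwd, rev = (a, b) if a.strand == '+' else (b, a)
    end = rev.start + READ_LEN
    # Correct orientation: forward end upstream, reverse end downstream, facing inward.
    if rev.start < fwd.start or end - fwd.start > MAX_INSERT:
        return None
    return (fwd.chrom, fwd.start, end)


def unique_proper_pair(hits1: List[EndHit], hits2: List[EndHit]) -> Optional[Tuple[str, int, int]]:
    """Accept a PE read only if exactly one combination of end hits is a proper pair."""
    spans = set()
    for a in hits1:
        for b in hits2:
            span = insert_span(a, b)
            if span is not None:
                spans.add(span)
    return spans.pop() if len(spans) == 1 else None
```

The returned span gives the start and end of the sequenced plasma DNA molecule, from which its size is obtained as `end - start`.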

We then screened these PE reads for duplicates. A duplicated read was defined as a PE read where the insert DNA molecule showed identical start and end locations on the human genome. We removed all but one of the duplicated PE reads because they were likely to be generated by the PCR process. As a result, 16.7% of the total (maternal plus fetal) reads and 17.4% of the fetal-specific reads were removed as duplicate reads. For reads that spanned an SNP locus, we further removed reads that revealed a biologically impossible allele according to the paternal and maternal genotypes determined by Affymetrix microarray genotyping.
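Duplicate removal under this definition amounts to keeping one PE read per distinct insert. The sketch below assumes each PE read has already been reduced to a hypothetical `(chromosome, start, end)` tuple; the function name is likewise an assumption.

```python
from typing import Iterable, List, Tuple

Fragment = Tuple[str, int, int]  # (chromosome, insert start, insert end)


def remove_pcr_duplicates(fragments: Iterable[Fragment]) -> List[Fragment]:
    """Keep the first PE read for each distinct (chromosome, start, end) insert."""
    seen = set()
    unique: List[Fragment] = []
    for frag in fragments:
        if frag not in seen:
            seen.add(frag)
            unique.append(frag)
    return unique
```

The fraction of reads removed is then `1 - len(unique) / total`, which corresponds to the duplicate rates reported above.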

RHDO analysis

RHDO was performed on category 4 SNPs where the mother was heterozygous and the father was homozygous (Fig. 3). RHDO analysis could start from any genomic location. At the starting location, the number of sequenced reads covering the two alleles at the starting SNP locus was first examined. SPRT (see “SPRT analysis” below) (16–18) was performed at the starting SNP locus to determine whether the ratio between the counts for the two alleles was in allelic balance or imbalance, or whether the amount of data was insufficient to reach a statistical conclusion at this point. If the data indicated allelic balance or imbalance, the SNP locus would be scored as Hap I or Hap II depending on whether it was a type α or β SNP (see below). However, given the depth of sequencing performed (tens of reads per SNP locus), there would be insufficient statistical confidence for a genotype call to be made with data from just one SNP. Hence, the data would be deemed “unclassified” by SPRT. The SPRT process would then continue to accumulate read counts from the next SNP locus on the haplotype. The process continued until the cumulative data for SNPs along an inheritance block reached sufficient statistical confidence for Hap I or Hap II to be scored. The RHDO analysis near the chromosome 1p telomere illustrates the SPRT process (Fig. 4 and Tables 2 and 3). In short, evidence for allelic balance or imbalance is strengthened by accumulating data along the chromosome from multiple SNPs.

The known genomic order of SNPs included in the haplotype to be interrogated was used to move unidirectionally along the chromosome. Type α and β SNPs were analyzed separately (Fig. 3). In type α SNPs, the paternal alleles were the same as those on maternal Hap I. Fetal inheritance of Hap I caused an overrepresentation of Hap I, relative to Hap II, in maternal plasma. If the fetus inherited Hap II, no overrepresentation would be seen. In type β SNPs, the paternal alleles were the same as those on maternal Hap II, and fetal inheritance of Hap I maintained an equal representation of Hap I and Hap II in maternal plasma. However, if the fetus inherited Hap II, Hap II would be overrepresented. After a haplotype call was made for a segment of SNPs, the next SNP of the same type, α or β, was used to restart the RHDO analysis. The process was repeated until the entire chromosome or genomic region of interest had been analyzed.
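The accumulate-and-restart procedure can be sketched as follows for SNPs of one type (α or β). This is a minimal illustration, not the authors' program: it uses a cumulative log-likelihood ratio compared with ±ln(1200), which is an assumed restatement that is algebraically equivalent to the proportion-based SPRT boundaries described in the SPRT analysis section. The tuple layout and function name are hypothetical; q0 and q1 are the expected Hap I proportions if the fetus inherited Hap II or Hap I, respectively.

```python
import math
from typing import Iterable, List, Tuple

SnpCounts = Tuple[int, int, int]   # (position, Hap I reads, Hap II reads)
Block = Tuple[int, int, str, int]  # (first position, last position, call, total reads)


def rhdo_blocks(snps: Iterable[SnpCounts], q0: float, q1: float,
                odds_ratio: float = 1200.0) -> List[Block]:
    """Walk SNPs of one type in genomic order and emit one haplotype call per block."""
    threshold = math.log(odds_ratio)
    w1 = math.log(q1 / q0)              # evidence contributed by each Hap I read
    w2 = math.log((1 - q1) / (1 - q0))  # evidence contributed by each Hap II read
    blocks: List[Block] = []
    start = None
    hap1 = hap2 = 0
    for pos, n1, n2 in snps:
        if start is None:
            start = pos
        hap1 += n1
        hap2 += n2
        log_lr = hap1 * w1 + hap2 * w2
        if log_lr >= threshold:
            call = "Hap I"
        elif log_lr <= -threshold:
            call = "Hap II"
        else:
            continue  # unclassified: keep accumulating reads from the next SNP
        blocks.append((start, pos, call, hap1 + hap2))
        start, hap1, hap2 = None, 0, 0  # restart at the next SNP of the same type
    return blocks
```

Because q1 is by definition the Hap I proportion expected when the fetus inherited Hap I, crossing the upper threshold yields a Hap I call and crossing the lower threshold yields a Hap II call for both SNP types; only the values of q0 and q1 differ between type α and type β.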

SPRT analysis

The SPRT-based classification of RHDO was performed with a program written in Python (http://www.python.org/). An odds ratio of 1200 was used to calculate the thresholds for accepting the null or the alternative hypothesis. The null hypothesis for each SPRT analysis was the absence of dosage imbalance between the read counts for the two maternal haplotypes, Hap I and Hap II. For type α SNPs, the alternative hypothesis was the overrepresentation of Hap I. For type β SNPs, the alternative hypothesis was the underrepresentation of Hap I. A high odds ratio was used for SPRT classification to minimize the chance of incorrect classification, considering the large number of SPRT classifications to be made for the whole genome. The upper and lower boundaries of the SPRT curves were calculated as previously described (16, 39), with the following equations:

$$\text{upper boundary} = \frac{(\ln 1200)/N - \ln d}{\ln g}$$

$$\text{lower boundary} = \frac{(\ln (1/1200))/N - \ln d}{\ln g}$$

where

$$d = \frac{1 - q_1}{1 - q_0} \quad \text{and} \quad g = \frac{q_1 (1 - q_0)}{q_0 (1 - q_1)}$$

Here, q0 is the proportion of the total counts contributed by read counts from Hap I if the fetus had inherited Hap II from the mother, and q1 is the proportion of the total counts contributed by read counts from Hap I if the fetus had inherited Hap I from the mother. N is the total number of read counts for the classified segment. ln denotes the natural logarithm (that is, loge).
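A minimal Python sketch of the boundary calculation is given below. It is not the program used in this study; the function names and the returned labels are assumptions. It computes d = (1 − q1)/(1 − q0) and g = q1(1 − q0)/[q0(1 − q1)], then the two boundaries [ln(1200)/N − ln d]/ln g and [ln(1/1200)/N − ln d]/ln g, and compares the observed Hap I proportion against them.

```python
import math
from typing import Tuple


def sprt_boundaries(n_reads: int, q0: float, q1: float,
                    odds_ratio: float = 1200.0) -> Tuple[float, float]:
    """Return (lower, upper) boundaries for the Hap I proportion after n_reads reads."""
    d = (1 - q1) / (1 - q0)
    g = (q1 * (1 - q0)) / (q0 * (1 - q1))
    upper = (math.log(odds_ratio) / n_reads - math.log(d)) / math.log(g)
    lower = (math.log(1 / odds_ratio) / n_reads - math.log(d)) / math.log(g)
    return lower, upper


def sprt_classify(hap1_reads: int, total_reads: int, q0: float, q1: float,
                  odds_ratio: float = 1200.0) -> str:
    """Classify a segment from its cumulative Hap I read count."""
    lower, upper = sprt_boundaries(total_reads, q0, q1, odds_ratio)
    proportion = hap1_reads / total_reads
    if proportion > upper:
        return "Hap I"   # data favor q1 (fetus inherited Hap I)
    if proportion < lower:
        return "Hap II"  # data favor q0 (fetus inherited Hap II)
    return "unclassified"
```

With q0 = 0.5 and q1 = 0.5572, the two boundaries are far apart for small N (a segment of 50 reads cannot be classified) and converge toward −ln d/ln g, approximately 0.529, as N grows, which is why read counts must be accumulated over many SNPs before a call is made.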

For type α SNPs, the fetus would be heterozygous if it had inherited Hap II from the mother. Hence, q0 was equal to 0.5. Alternatively, if the fetus had inherited Hap I from the mother, it would be homozygous for the allele inherited from the father. Because the mother was heterozygous for the SNPs used for the RHDO analysis, the maternal allele the fetus had inherited would be overrepresented in maternal plasma. The expected degree of overrepresentation was determined by the fractional concentration of fetal DNA in the plasma. In this case, the fetal DNA concentration was 11.43%. Therefore, the value of q1 would be 0.5572 (= 0.5 + 0.1143/2).

For type β SNPs, the fetus would be heterozygous if it had inherited Hap I from the mother. Hence, q1 was equal to 0.5. If the fetus had inherited Hap II from the mother, the alleles on Hap I would be underrepresented and the degree of underrepresentation would be dependent on the fractional fetal DNA concentration in maternal plasma. Therefore, the value of q0 would be 0.4429 (= 0.5 − 0.1143/2).
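The values of q0 and q1 for the two SNP types follow directly from the fractional fetal DNA concentration, as the short sketch below shows. The function name and the "alpha"/"beta" labels are assumptions made for illustration.

```python
from typing import Tuple


def expected_hap1_proportions(fetal_fraction: float, snp_type: str) -> Tuple[float, float]:
    """Return (q0, q1): expected Hap I proportion if the fetus inherited Hap II or Hap I."""
    half_shift = fetal_fraction / 2
    if snp_type == "alpha":
        # Paternal allele matches Hap I: inheriting Hap I makes the fetus homozygous,
        # so Hap I is overrepresented; inheriting Hap II leaves the alleles balanced.
        return 0.5, 0.5 + half_shift
    if snp_type == "beta":
        # Paternal allele matches Hap II: inheriting Hap II makes the fetus homozygous,
        # so Hap I is underrepresented; inheriting Hap I leaves the alleles balanced.
        return 0.5 - half_shift, 0.5
    raise ValueError("snp_type must be 'alpha' or 'beta'")


q0_alpha, q1_alpha = expected_hap1_proportions(0.1143, "alpha")
q0_beta, q1_beta = expected_hap1_proportions(0.1143, "beta")
```

At the fetal DNA concentration of 11.43% reported here, this gives q1 ≈ 0.5572 for type α SNPs and q0 ≈ 0.4429 for type β SNPs, matching the values stated above.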

Funding: This study was supported by the University Grants Committee of the Government of the Hong Kong Special Administrative Region, China, under the Areas of Excellence Scheme (AoE/M-04/06), the General Research Fund Scheme of the Hong Kong Research Grants Council (CUHK463109), a sponsored research agreement with Sequenom (San Diego, CA), and the Private Practice Fund of the Department of Obstetrics and Gynaecology of The Chinese University of Hong Kong. Y.M.D.L. is supported by an endowed chair from the Li Ka Shing Foundation.

Author contributions: Y.M.D.L., K.C.A.C., and R.W.K.C. conceived and designed the study. C.R.C. refined the study design. H.S., E.Z.C., and P.J. performed the primary bioinformatics analysis. F.M.F.L. and Y.W.Z. performed the sequencing and other bench work. T.Y.L. and T.K.L. performed the clinical characterization and clinical-molecular correlation of the studied family. All authors contributed to the data analysis and preparation of the manuscript, and approved the final version.

Competing interests: Y.M.D.L., K.C.A.C., F.M.F.L., Y.W.Z., C.R.C., and R.W.K.C. have filed patent applications and hold patents on the analysis of fetal nucleic acids in maternal plasma. Part of this patent portfolio has been licensed to Sequenom. Y.M.D.L. is a consultant to, on the clinical advisory board of, and holds equities in Sequenom. C.R.C. is the Chief Scientific Officer of Sequenom and holds equities in Sequenom. R.W.K.C. has received travel support from, and Y.M.D.L. has spoken at conferences sponsored by, Illumina and Life Technologies, manufacturers of next-generation DNA sequencing machines.

Accession numbers: An application has been made to deposit the sequence data at the National Center for Biotechnology Information database of Genotypes and Phenotypes (dbGaP). Until the data are available in dbGaP, sequence information can be obtained from the authors.