Research & Scholarship

Current Research and Scholarly Interests

Research in our laboratory develops and applies statistical methods for analyzing patterns of human genetic variation, which underlie the phenotypic diversity of our species. We are collaborating on various genome-wide studies focusing on stratified or recently admixed populations. These studies offer unique opportunities to elucidate the evolutionary forces that have shaped the patterns of genetic variation in humans, to uncover the genetic basis of complex traits, and to shed light on the mechanisms that lead to diverse phenotypes and disparate disease risks among populations.

Abstract

Blood lipid concentrations are heritable risk factors associated with atherosclerosis and cardiovascular diseases. Lipid traits exhibit considerable variation among populations of distinct ancestral origin as well as between individuals within a population. We performed association analyses to identify genetic loci influencing lipid concentrations in African American and Hispanic American women in the Women's Health Initiative SNP Health Association Resource. We validated one African-specific high-density lipoprotein cholesterol locus at CD36 as well as 14 known lipid loci that have been previously implicated in studies of European populations. Moreover, we demonstrate striking similarities in genetic architecture (loci influencing the trait, direction and magnitude of genetic effects, and proportions of phenotypic variation explained) of lipid traits across populations. In particular, we found that a disproportionate fraction of lipid variation in African Americans and Hispanic Americans can be attributed to genomic loci exhibiting statistical evidence of association in Europeans, even though the precise genes and variants remain unknown. At the same time, we found substantial allelic heterogeneity within shared loci, characterized both by population-specific rare variants and variants shared among multiple populations that occur at disparate frequencies. The allelic heterogeneity emphasizes the importance of including diverse populations in future genetic association studies of complex traits such as lipids; furthermore, the overlap in lipid loci across populations of diverse ancestral origin argues that additional knowledge can be gleaned from multiple populations.

Abstract

Variation in human skin and eye color is substantial and especially apparent in admixed populations, yet the underlying genetic architecture is poorly understood because most genome-wide studies are based on individuals of European ancestry. We study pigmentary variation in 699 individuals from Cape Verde, where extensive West African/European admixture has given rise to a broad range in trait values and genomic ancestry proportions. We develop and apply a new approach for measuring eye color, and identify two major loci (HERC2[OCA2] P = 2.3 × 10^-62, SLC24A5 P = 9.6 × 10^-9) that account for both blue versus brown eye color and varying intensities of brown eye color. We identify four major loci (SLC24A5 P = 5.4 × 10^-27, TYR P = 1.1 × 10^-9, APBA2[OCA2] P = 1.5 × 10^-8, SLC45A2 P = 6 × 10^-9) for skin color that together account for 35% of the total variance, but the genetic component with the largest effect (~44%) is average genomic ancestry. Our results suggest that adjacent cis-acting regulatory loci for OCA2 explain the relationship between skin and eye color, and point to an underlying genetic architecture in which several genes of moderate effect act together with many genes of small effect to explain ~70% of the estimated heritability.

Abstract

Pigmentation of the skin, hair, and eyes varies both within and between human populations. Identifying the genes and alleles underlying this variation has been the goal of many candidate gene and several genome-wide association studies (GWAS). Most GWAS for pigmentary traits to date have been based on subjective phenotypes using categorical scales, but skin, hair, and eye pigmentation vary continuously. Here, we seek to characterize quantitative variation in these traits objectively and accurately and to determine their genetic basis. Objective and quantitative measures of skin, hair, and eye color were made using reflectance or digital spectroscopy in Europeans from Ireland, Poland, Italy, and Portugal. A GWAS was conducted for the three quantitative pigmentation phenotypes in 176 women across 313,763 SNP loci, and replication of the most significant associations was attempted in a sample of 294 European men and women from the same countries. We find that the pigmentation phenotypes are highly stratified along axes of European genetic differentiation. The country of sampling explains approximately 35% of the variation in skin pigmentation, 31% of the variation in hair pigmentation, and 40% of the variation in eye pigmentation. All three quantitative phenotypes are correlated with each other. In our two-stage association study, we reproduce the association of rs1667394 at the OCA2/HERC2 locus with eye color, but we do not identify new genetic determinants of skin and hair pigmentation. This supports the lack of major genes affecting skin and hair color variation within Europe and suggests that not only careful phenotyping but also larger cohorts are required to understand the genetic architecture of these complex quantitative traits. Interestingly, we also see that in each of these four populations, men are more lightly pigmented in the unexposed skin of the inner arm than women, a fact that is underappreciated and may vary across the world.

Abstract

For most of the world, human genome structure at a population level is shaped by interplay between ancient geographic isolation and more recent demographic shifts, factors that are captured by the concepts of biogeographic ancestry and admixture, respectively. The ancestry of non-admixed individuals can often be traced to a specific population in a precise region, but current approaches for studying admixed individuals generally yield coarse information in which genome ancestry proportions are identified according to continent of origin. Here we introduce a new analytic strategy for this problem that allows fine-grained characterization of admixed individuals with respect to both geographic and genomic coordinates. Ancestry segments from different continents, identified with a probabilistic model, are used to construct and study "virtual genomes" of admixed individuals. We apply this approach to a cohort of 492 parent-offspring trios from Mexico City. The relative contributions from the three continental-level ancestral populations (Africa, Europe, and America) vary substantially between individuals, and the distribution of haplotype block length suggests an admixing time of 10-15 generations. The European and Indigenous American virtual genomes of each Mexican individual can be traced to precise regions within each continent, and they reveal a gradient of Amerindian ancestry between indigenous people of southwestern Mexico and Mayans of the Yucatan Peninsula. This contrasts sharply with the African roots of African Americans, which have been characterized by a uniform mixing of multiple West African populations. We also use the virtual European and Indigenous American genomes to search for the signatures of selection in the ancestral populations, and we identify previously known targets of selection in other populations, as well as new candidate loci. The ability to infer precise ancestral components of admixed genomes will facilitate studies of disease-related phenotypes and will allow new insight into the adaptive and demographic history of indigenous people.

Abstract

Human genetic diversity is shaped by both demographic and biological factors and has fundamental implications for understanding the genetic basis of diseases. We studied 938 unrelated individuals from 51 populations of the Human Genome Diversity Panel at 650,000 common single-nucleotide polymorphism loci. Individual ancestry and population substructure were detectable with very high resolution. The relationship between haplotype heterozygosity and geography was consistent with the hypothesis of a serial founder effect with a single origin in sub-Saharan Africa. In addition, we observed a pattern of ancestral allele frequency distributions that reflects variation in population dynamics among geographic regions. This data set allows the most comprehensive characterization to date of human genetic variation.

Abstract

A chromosome in an individual of recently admixed ancestry resembles a mosaic of chromosomal segments, or ancestry blocks, each derived from a particular ancestral population. We consider the problem of inferring ancestry along the chromosomes in an admixed individual and thereby delineating the ancestry blocks. Using a simple population model, we infer gene-flow history in each individual. Compared with existing methods, which are based on a hidden Markov model, the Markov-hidden Markov model (MHMM) we propose has the advantage of accounting for the background linkage disequilibrium (LD) that exists in ancestral populations. When there are more than two ancestral groups, we allow each ancestral population to admix at a different time in history. We use simulations to illustrate the accuracy of the inferred ancestry as well as the importance of modeling the background LD; not accounting for background LD between markers may mislead us to false inferences about mixed ancestry in an indigenous population. The MHMM makes it possible to identify genomic blocks of a particular ancestry by use of any high-density single-nucleotide-polymorphism panel. One application of our method is to perform admixture mapping without genotyping special ancestry-informative-marker panels.
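The baseline that the abstract contrasts with the MHMM, a plain hidden Markov model over ancestry states, can be sketched as follows. This is only an illustration under stated assumptions: two ancestral populations, haploid 0/1 alleles, known allele frequencies, and a single hypothetical `switch_prob` between adjacent markers. It deliberately ignores background LD, which is exactly the limitation the MHMM is designed to address, and it is not the authors' implementation.

```python
import math

def viterbi_ancestry(alleles, freqs, switch_prob=0.01):
    """Most likely ancestry path (0 or 1 per marker) under a plain two-state HMM.

    alleles: list of 0/1 haploid alleles, one per marker.
    freqs:   list of (p_pop0, p_pop1) tuples, the frequency of allele 1
             in each ancestral population (assumed known).
    switch_prob: assumed probability of an ancestry switch between
             adjacent markers (hypothetical, constant along the chromosome).
    """
    def emit(state, j):
        # Log-probability of the observed allele given the ancestry state.
        p = min(max(freqs[j][state], 1e-6), 1.0 - 1e-6)
        return math.log(p if alleles[j] == 1 else 1.0 - p)

    stay, switch = math.log(1.0 - switch_prob), math.log(switch_prob)
    score = [math.log(0.5) + emit(s, 0) for s in range(2)]
    back = []
    for j in range(1, len(alleles)):
        new_score, ptr = [], []
        for s in range(2):
            cands = [score[r] + (stay if r == s else switch) for r in range(2)]
            best = 0 if cands[0] >= cands[1] else 1
            new_score.append(cands[best] + emit(s, j))
            ptr.append(best)
        score = new_score
        back.append(ptr)

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Toy chromosome: five markers that look like population 0, then five like population 1.
toy_path = viterbi_ancestry([1] * 5 + [0] * 5, [(0.9, 0.1)] * 10)
```

In this toy input the inferred path changes state once, at the boundary between the two runs of alleles, which is the "ancestry block" structure the abstract describes.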

Abstract

Gene expression differs among individuals and populations and is thought to be a major determinant of phenotypic variation. Although variation and genetic loci responsible for RNA expression levels have been analysed extensively in human populations, our knowledge is limited regarding the differences in human protein abundance and the genetic basis for this difference. Variation in messenger RNA expression is not a perfect surrogate for protein expression because the latter is influenced by an array of post-transcriptional regulatory mechanisms, and, empirically, the correlation between protein and mRNA levels is generally modest. Here we used isobaric tag-based quantitative mass spectrometry to determine relative protein levels of 5,953 genes in lymphoblastoid cell lines from 95 diverse individuals genotyped in the HapMap Project. We found that protein levels are heritable molecular phenotypes that exhibit considerable variation between individuals, populations and sexes. Levels of specific sets of proteins involved in the same biological process covary among individuals, indicating that these processes are tightly regulated at the protein level. We identified cis-pQTLs (protein quantitative trait loci), including variants not detected by previous transcriptome studies. This study demonstrates the feasibility of high-throughput human proteome quantification that, when integrated with DNA variation and transcriptome information, adds a new dimension to the characterization of gene expression regulation.

Abstract

Both genes and environment have been implicated in determining the complex body composition phenotypes in individuals of European ancestry; however, few studies have been conducted in other race/ethnic groups. We conducted a genome-wide admixture mapping study in an attempt to localize novel genomic regions associated with genetic ancestry. We selected a sample of 842 African-American women from the Women's Health Initiative single nucleotide polymorphism (SNP) Health Association Resource for whom several dual-energy X-ray absorptiometry (DXA)-derived bone mineral density (BMD) and fat mass phenotypes were available. We derived both global and local ancestry estimates for each individual from Affymetrix 6.0 data and analyzed the correlation of DXA phenotypes with global African ancestry. For each phenotype, we examined the association between local genetic ancestry (number of African ancestral alleles at each marker) and the DXA phenotype at 570,282 markers across the genome in additive models with adjustment for important covariates. We identified statistically significant correlations of whole-body fat mass, trunk fat mass, and all 6 measures of BMD with the proportion of African ancestry. Genome-wide (admixture) significance for femoral neck BMD was achieved across 2 regions of approximately 3.7 Mb and 0.3 Mb on chromosome 19q13; similarly, total hip and intertrochanter BMD were associated with local ancestry in these regions. Trunk fat was the most significant fat mass phenotype, showing strong but not genome-wide significant associations on chromosome Xp22. Our results suggest that genomic regions contributing to variance in BMD and fat mass exist in postmenopausal African-American women and warrant further study.

Abstract

Genetic variants in 296 genes in regions identified through admixture mapping of hypertension, BMI, and lipids were assessed for association with hypertension, blood pressure (BP), BMI, and high-density lipoprotein cholesterol (HDL-C). This study examined coding SNPs, identified from HapMap2 data, that were located in genes on chromosomes 5, 6, 8, and 21, wherein ancestry association evidence for hypertension, BMI, or HDL-C was identified in previous admixture mapping studies. Genotyping was performed in 1733 unrelated African-Americans from the National Heart, Lung and Blood Institute's Family Blood Pressure Program, and gene-based association analyses were conducted for hypertension, SBP, DBP, BMI, and HDL-C. A gene score based on the number of minor alleles of each SNP in a gene was created and used for gene-based regression analyses, adjusting for age, age squared, sex, local marker ancestry, and BMI, as applicable. An individual's African ancestry estimated from 2507 ancestry-informative markers was also adjusted for to eliminate any confounding due to population stratification. CXADR (rs437470) on chromosome 21 was associated with SBP and DBP with or without adjusting for local ancestry (P

Abstract

Personalized medicine is expected to benefit from combining genomic information with regular monitoring of physiological states by multiple high-throughput methods. Here, we present an integrative personal omics profile (iPOP), an analysis that combines genomic, transcriptomic, proteomic, metabolomic, and autoantibody profiles from a single individual over a 14 month period. Our iPOP analysis revealed various medical risks, including type 2 diabetes. It also uncovered extensive, dynamic changes in diverse molecular components and biological pathways across healthy and diseased conditions. Extremely high-coverage genomic and transcriptomic data, which provide the basis of our iPOP, revealed extensive heteroallelic changes during healthy and diseased states and an unexpected RNA editing mechanism. This study demonstrates that longitudinal iPOP can be used to interpret healthy and diseased states by connecting genomic information with additional dynamic omics activity.

Abstract

The outcome of exposure to infectious microbes or their toxins is influenced by both microbial and host genes. Some host genes encode defense mechanisms, whereas others assist pathogen functions. Genomic analyses have associated host gene mutations with altered infectious disease susceptibility, but evidence for causality is limited. Here we demonstrate that human genetic variation affecting capillary morphogenesis gene 2 (CMG2), which encodes a host membrane protein exploited by anthrax toxin as a principal receptor, dramatically alters toxin sensitivity. Lymphoblastoid cells derived from a HapMap Project cohort of 234 persons of African, European, or Asian ancestry differed in sensitivity mediated by the protective antigen (PA) moiety of anthrax toxin by more than four orders of magnitude, with 99% of the cohort showing a 250-fold range of sensitivity. We find that relative sensitivity is an inherited trait that correlates strongly with CMG2 mRNA abundance in cells of each ethnic/geographical group and in the combined population pool (P = 4 × 10^-11). The extent of CMG2 expression in transfected murine macrophages and human lymphoblastoid cells affected anthrax toxin binding, internalization, and sensitivity. A CMG2 single-nucleotide polymorphism (SNP) occurring frequently in African and European populations independently altered toxin uptake, but was not statistically associated with altered sensitivity in HapMap cell populations. Our results reveal extensive human diversity in cell lethality dependent on PA-mediated toxin binding and uptake, and identify individual differences in CMG2 expression level as a determinant of this diversity. Testing of genomically characterized human cell populations may offer a broadly useful strategy for elucidating effects of genetic variation on infectious disease susceptibility.

Abstract

Current genome-wide association studies (GWAS) often involve populations that have experienced recent genetic admixture. Genotype data generated from these studies can be used to test for association directly, as in a non-admixed population. As an alternative, these data can be used to infer chromosomal ancestry, and thus allow for admixture mapping. We quantify the contribution of allele-based and ancestry-based association testing under a family design, and demonstrate that the two tests can provide non-redundant information. We propose a joint testing procedure, which efficiently integrates the two sources of information. The efficiencies of the allele, ancestry and combined tests are compared in the context of a GWAS. We discuss the impact of population history and provide guidelines for future design and analysis of GWAS in admixed populations.

Abstract

We sought to replicate the association between the kinesin-like protein 6 (KIF6) Trp719Arg polymorphism (rs20455) and clinical coronary artery disease (CAD). Recent prospective studies suggest that carriers of the 719Arg allele in KIF6 are at increased risk of clinical CAD compared with noncarriers. The KIF6 Trp719Arg polymorphism (rs20455) was genotyped in 19 case-control studies of nonfatal CAD either as part of a genome-wide association study or in a formal attempt to replicate the initial positive reports. A total of 17,000 cases and 39,369 controls of European descent as well as a modest number of South Asians, African Americans, Hispanics, East Asians, and admixed cases and controls were successfully genotyped. None of the 19 studies demonstrated an increased risk of CAD in carriers of the 719Arg allele compared with noncarriers. Regression analyses and fixed-effects meta-analyses ruled out with a high degree of confidence an increase of ≥2% in the risk of CAD among European 719Arg carriers. We also observed no increase in the risk of CAD among 719Arg carriers in the subset of Europeans with early-onset disease (younger than 50 years of age for men and younger than 60 years of age for women) compared with similarly aged controls, as well as in all non-European subgroups. The KIF6 Trp719Arg polymorphism was not associated with the risk of clinical CAD in this large replication study.

Abstract

Morphological diversity within closely related species is an essential aspect of evolution and adaptation. Mutations in the Melanocortin 1 receptor (Mc1r) gene contribute to pigmentary diversity in natural populations of fish, birds, and many mammals. However, melanism in the gray wolf, Canis lupus, is caused by a different melanocortin pathway component, the K locus, that encodes a beta-defensin protein that acts as an alternative ligand for Mc1r. We show that the melanistic K locus mutation in North American wolves derives from past hybridization with domestic dogs, has risen to high frequency in forested habitats, and exhibits a molecular signature of positive selection. The same mutation also causes melanism in the coyote, Canis latrans, and in Italian gray wolves, and hence our results demonstrate how traits selected in domesticated species can influence the morphological diversity of their wild relatives.

Abstract

Accurate, high-throughput genotyping allows the fine characterization of genetic ancestry. Here we applied recently developed statistical and computational techniques to the question of African ancestry in African Americans by using data on more than 450,000 single-nucleotide polymorphisms (SNPs) genotyped in 94 Africans of diverse geographic origins included in the HGDP, as well as 136 African Americans and 38 European Americans participating in the Atherosclerotic Disease Vascular Function and Genetic Epidemiology (ADVANCE) study. To focus on African ancestry, we reduced the data to include only those genotypes in each African American determined statistically to be African in origin. From cluster analysis, we found that all the African Americans are admixed in their African components of ancestry, with the majority contributions being from West and West-Central Africa, and only modest variation in these African-ancestry proportions among individuals. Furthermore, by principal components analysis, we found little evidence of genetic structure within the African component of ancestry in African Americans. These results are consistent with historic mating patterns among African Americans that are largely uncorrelated with African ancestral origins, and they cast doubt on the general utility of mtDNA or Y-chromosome markers alone to delineate the full African ancestry of African Americans. Our results also indicate that the genetic architecture of African Americans is distinct from that of Africans, and that the greatest source of potential genetic stratification bias in case-control studies of African Americans derives from the proportion of European ancestry.

Abstract

A susceptibility locus for coronary artery disease (CAD) at chromosome 9p21 has recently been reported, which may influence the age of onset of CAD. We sought to replicate these findings among white subjects and to examine whether these results are consistent with other racial/ethnic groups by genotyping three single nucleotide polymorphisms (SNPs) in the risk interval in the Atherosclerotic Disease, Vascular Function, and Genetic Epidemiology (ADVANCE) study. One or more of these SNPs was associated with clinical CAD in whites, U.S. Hispanics and U.S. East Asians. None of the SNPs were associated with CAD in African Americans although the power to detect an odds ratio (OR) in this group equivalent to that seen in whites was only 24-30%. ORs were higher in Hispanics and East Asians and lower in African Americans, but in all groups the 95% confidence intervals overlapped with ORs observed in whites. High-risk alleles were also associated with increased coronary artery calcification in controls and the magnitude of these associations by racial/ethnic group closely mirrored the magnitude observed for clinical CAD. Unexpectedly, we noted significant genotype frequency differences between male and female cases (P = 0.003-0.05). Consequently, men tended towards a recessive and women tended towards a dominant mode of inheritance. Finally, an effect of genotype on the age of onset of CAD was detected but only in men carrying two versus one or no copy of the high-risk allele and presenting with CAD at age >50 years. Further investigations in other populations are needed to confirm or refute our findings.

Abstract

Estimation of the allele frequency at genetic markers is a key ingredient in biological and biomedical research, such as studies of human genetic variation or of the genetic etiology of heritable traits. As genetic data becomes increasingly available, investigators face a dilemma: when should data from other studies and population subgroups be pooled with the primary data? Pooling additional samples will generally reduce the variance of the frequency estimates; however, used inappropriately, pooled estimates can be severely biased due to population stratification. Because of this potential bias, most investigators avoid pooling, even for samples with the same ethnic background and residing on the same continent. Here, we propose an empirical Bayes approach for estimating allele frequencies of single nucleotide polymorphisms. This procedure adaptively incorporates genotypes from related samples, so that more similar samples have a greater influence on the estimates. In every example we have considered, our estimator achieves a mean squared error (MSE) that is smaller than either pooling or not, and sometimes substantially improves over both extremes. The bias introduced is small, as is shown by a simulation study that is carefully matched to a real data example. Our method is particularly useful when small groups of individuals are genotyped at a large number of markers, a situation we are likely to encounter in a genome-wide association study.
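The trade-off described above, between using only the primary sample and pooling with a related one, can be illustrated with a generic shrinkage estimator. The sketch below is not the estimator proposed in the paper: it assumes a simple normal approximation, treats the auxiliary frequencies as known, and estimates a single between-sample variance `tau2` by the method of moments across markers. All names (`eb_allele_frequencies`, `aux_freqs`, `tau2`) are hypothetical.

```python
def eb_allele_frequencies(primary_counts, n_chrom, aux_freqs):
    """Shrink primary-sample allele frequencies toward auxiliary frequencies.

    primary_counts: allele counts per marker in the primary sample.
    n_chrom:        number of chromosomes sampled (same for every marker here).
    aux_freqs:      allele frequencies from a related sample, treated as fixed.
    Returns (estimates, tau2), where tau2 is the estimated between-sample
    variance: the closer the two samples, the smaller tau2 and the stronger
    the shrinkage toward the auxiliary frequencies.
    """
    p_hat = [x / n_chrom for x in primary_counts]
    # Binomial sampling variance of each primary estimate.
    samp_var = [p * (1.0 - p) / n_chrom for p in p_hat]
    m = len(p_hat)
    mean_sq_diff = sum((p - a) ** 2 for p, a in zip(p_hat, aux_freqs)) / m
    # Method of moments: observed divergence minus what sampling noise explains.
    tau2 = max(0.0, mean_sq_diff - sum(samp_var) / m)

    estimates = []
    for p, v, a in zip(p_hat, samp_var, aux_freqs):
        # Weight on the primary estimate; fall back to it if both variances are zero.
        w = 1.0 if tau2 + v == 0.0 else tau2 / (tau2 + v)
        estimates.append(w * p + (1.0 - w) * a)
    return estimates, tau2
```

With `tau2` near zero the result approaches the pooled-style answer (the auxiliary frequency), and with a large `tau2` it approaches the unpooled primary estimate, so the two extremes in the abstract appear as limiting cases of one weight.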

Abstract

Recent studies have used dense markers to examine the human genome in ancestrally homogeneous populations for hallmarks of selection. No genomewide studies have focused on recently admixed groups--populations that have experienced admixing among continentally divided ancestral populations within the past 200-500 years. New World admixed populations are unique in that they represent the sudden confluence of geographically diverged genomes with novel environmental challenges. Here, we present a novel approach for studying selection by examining the genomewide distribution of ancestry in the genetically admixed Puerto Ricans. We find strong statistical evidence of recent selection in three chromosomal regions, including the human leukocyte antigen region on chromosome 6p, chromosome 8q, and chromosome 11q. Two of these regions harbor genes for olfactory receptors. Interestingly, all three regions exhibit deficiencies in the European-ancestry proportion.

Abstract

Integrated liquid-chromatography mass-spectrometry (LC-MS) is becoming a widely used approach for quantifying the protein composition of complex samples. The output of the LC-MS system measures the intensity of a peptide with a specific mass-charge ratio and retention time. In the last few years, this technology has been used to compare complex biological samples across multiple conditions. One challenge for comparative proteomic profiling with LC-MS is to match corresponding peptide features from different experiments. In this paper, we propose a new method, Peptide Element Alignment (PETAL), that uses raw spectrum data and detected peaks to simultaneously align features from multiple LC-MS experiments. PETAL creates spectrum elements, each of which represents the mass spectrum of a single peptide in a single scan. Peptides detected in different LC-MS data sets are aligned if they can be represented by the same elements. By considering each peptide separately, PETAL enjoys greater flexibility than time warping methods. While most existing methods process multiple data sets by sequentially aligning each data set to an arbitrarily chosen template data set, PETAL treats all experiments symmetrically and can analyze all experiments simultaneously. We illustrate the performance of PETAL on example data sets.

Abstract

Obligate pathogenic bacteria lose more genes relative to facultative pathogens, which, in turn, lose more genes than free-living bacteria. It was suggested that the increased gene loss in obligate pathogens may be due to a reduction in the effectiveness of purifying selection. Less attention has been given to the causes of increased gene loss in facultative pathogens. We examined in detail the rate of gene loss in two groups of facultative pathogenic bacteria: pathogenic Escherichia coli, and Shigella. We show that Shigella strains are losing genes at an accelerated rate relative to pathogenic E. coli. We demonstrate that a genome-wide reduction in the effectiveness of selection contributes to the observed increase in the rate of gene loss in Shigella. When compared with their closely related pathogenic E. coli relatives, the more niche-limited Shigella strains appear to be losing genes at a significantly accelerated rate. A genome-wide reduction in the effectiveness of purifying selection plays a role in creating this observed difference. Our results demonstrate that differences in the effectiveness of selection contribute to differences in rate of gene loss in facultative pathogenic bacteria. We discuss how the lifestyle and pathogenicity of Shigella may alter the effectiveness of selection, thus influencing the rate of gene loss.

Abstract

While high-throughput genotyping technologies are becoming readily available, the merit of using these technologies to perform genome-wide association studies has not been established. One major concern is that for studies of complex diseases and traits, the whole-genome approach requires such large sample sizes that both recruitment and genotyping pose considerable challenges. Here we propose a novel statistical method that boosts the effective sample size by combining data obtained from several studies. Specifically, we consider a situation in which various studies have genotyped non-overlapping subjects at largely non-overlapping sets of markers. Our approach, which exploits the local linkage disequilibrium structure without assuming an explicit population model, opens up the possibility of improving statistical power by incorporating existing data into future association studies.

Abstract

As wild organisms adapt to the laboratory environment, they become less relevant as biological models. It has been suggested that a commonly used S. cerevisiae strain has rapidly accumulated mutations in the lab. We report a low-to-intermediate rate of protein evolution in this strain relative to wild isolates.

Abstract

The transmission/disequilibrium test statistic has been used for assessing genetic association in affected-parent trios. In the presence of multiple tightly linked marker loci where local dependency may exist, haplotypes are reconstructed statistically to estimate the joint effects of these markers. In this manuscript, we propose an alternative to the haplotype approach by taking a weighted average of multiple loci, where the weight is proportional to the product of (1 - 2 × recombination fraction) and the linkage disequilibrium between markers. As an illustration, we applied the method to the simulated Aipotu data.
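The weighting rule stated above can be written out in a few lines. This is a sketch under assumptions: the abstract does not say which per-marker quantity is averaged or how the weights are normalized, so the code below averages an arbitrary per-marker statistic and normalizes the weights to sum to one. The function name and the example values are hypothetical.

```python
def weighted_locus_average(stats, recomb_fractions, ld_values):
    """Average per-marker statistics with weights proportional to (1 - 2*theta) * LD.

    stats:            one statistic per marker (what is averaged is an assumption).
    recomb_fractions: recombination fraction theta for each marker.
    ld_values:        linkage disequilibrium measure for each marker.
    Returns (combined, weights) with weights normalized to sum to 1.
    """
    raw = [(1.0 - 2.0 * theta) * ld
           for theta, ld in zip(recomb_fractions, ld_values)]
    total = sum(raw)
    if total == 0:
        raise ValueError("all weights are zero")
    weights = [w / total for w in raw]
    combined = sum(w * s for w, s in zip(weights, stats))
    return combined, weights

# Hypothetical values for three markers.
combined, weights = weighted_locus_average(
    stats=[2.0, 1.0, 0.5],
    recomb_fractions=[0.0, 0.1, 0.5],
    ld_values=[1.0, 0.5, 0.8],
)
```

A marker with a recombination fraction of 0.5 (unlinked) receives zero weight regardless of its LD value, while a completely linked marker in strong LD dominates the average.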

Abstract

The Gypsies (a misnomer, derived from an early legend about Egyptian origins) defy the conventional definition of a population: they have no nation-state, speak different languages, belong to many religions and comprise a mosaic of socially and culturally divergent groups separated by strict rules of endogamy. Referred to as "the invisible minority", the Gypsies have for centuries been ignored by Western medicine, and their genetic heritage has only recently attracted attention. Common origins from a small group of ancestors characterise the 8-10 million European Gypsies as an unusual trans-national founder population, whose exodus from India played the role of a profound demographic bottleneck. Social and economic pressures within Europe led to gradual fragmentation, generating multiple genetically differentiated subisolates. The string of population bottlenecks and founder effects has shaped a unique genetic profile, whose potential for genetic research can be met only by study designs that acknowledge cultural tradition and self-identity.

Abstract

The genome of an admixed individual represents a mixture of alleles from different ancestries. In the United States, the two largest minority groups, African-Americans and Hispanics, are both admixed. An understanding of the admixture proportion at an individual level (individual admixture, or IA) is valuable for both population geneticists and epidemiologists who conduct case-control association studies in these groups. Here we present an extension of a previously described frequentist (maximum likelihood or ML) approach to estimate individual admixture that allows for uncertainty in ancestral allele frequencies. We compare this approach both to prior partial-likelihood-based methods and to more recently described Bayesian MCMC methods. Our full ML method demonstrates increased robustness when compared to an existing partial ML approach. Simulations also suggest that this frequentist estimator achieves efficiency, measured by the mean squared error criterion, similar to that of Bayesian methods but requires just a fraction of the computational time to produce point estimates, allowing for extensive analysis (e.g., simulations) not possible by Bayesian methods. Our simulation results demonstrate that inclusion of ancestral populations or their surrogates in the analysis is required by any method of IA estimation to obtain reasonable results.
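For orientation, the simplest form of ML individual admixture estimation can be sketched as below. It is the fixed-frequency case only: two ancestral populations, unlinked biallelic markers, and ancestral allele frequencies treated as known. The extension described in the abstract, which allows for uncertainty in those frequencies, is not implemented here, and the grid search and all names are illustrative assumptions.

```python
import math

def log_likelihood(q, genotypes, freqs_a, freqs_b):
    """Log-likelihood of admixture proportion q (fraction of ancestry from population A).

    genotypes: allele counts (0, 1, or 2) per marker.
    freqs_a, freqs_b: allele frequencies in the two ancestral populations,
                      assumed known without error.
    """
    ll = 0.0
    for g, pa, pb in zip(genotypes, freqs_a, freqs_b):
        # Expected allele frequency in an individual with admixture q.
        p = q * pa + (1.0 - q) * pb
        p = min(max(p, 1e-9), 1.0 - 1e-9)
        ll += math.log(math.comb(2, g)) + g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return ll

def estimate_admixture(genotypes, freqs_a, freqs_b, steps=1000):
    """Grid-search ML estimate of q on [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: log_likelihood(q, genotypes, freqs_a, freqs_b))
```

For example, an individual who is heterozygous at every marker, with frequencies 0.9 and 0.1 in the two ancestral populations, gets an estimate of 0.5. A grid search is used only for transparency; any one-dimensional optimizer would serve.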

Abstract

We have analyzed genetic data for 326 microsatellite markers that were typed uniformly in a large multiethnic population-based sample of individuals as part of a study of the genetics of hypertension (Family Blood Pressure Program). Subjects identified themselves as belonging to one of four major racial/ethnic groups (white, African American, East Asian, and Hispanic) and were recruited from 15 different geographic locales within the United States and Taiwan. Genetic cluster analysis of the microsatellite markers produced four major clusters, which showed near-perfect correspondence with the four self-reported race/ethnicity categories. Of 3,636 subjects of varying race/ethnicity, only 5 (0.14%) showed genetic cluster membership different from their self-identified race/ethnicity. On the other hand, we detected only modest genetic differentiation between different current geographic locales within each race/ethnicity group. Thus, ancient geographic ancestry, which is highly correlated with self-identified race/ethnicity--as opposed to current residence--is the major determinant of genetic structure in the U.S. population. Implications of this genetic structure for case-control association studies are discussed.

Abstract

Human genetic linkage maps are based on rates of recombination across the genome. These rates in humans vary by the sex of the parent from whom alleles are inherited, by chromosomal position, and by genomic features, such as GC content and repeat density. We have examined--for the first time, to our knowledge--racial/ethnic differences in genetic maps of humans. We constructed genetic maps based on 353 microsatellite markers in four racial/ethnic groups: whites, African Americans, Mexican Americans, and East Asians (Chinese and Japanese). These maps were generated using 9,291 subjects from 2,900 nuclear families who participated in the National Heart, Lung, and Blood Institute-funded Family Blood Pressure Program, the largest sample used for map construction to date. Although the maps for the different groups are generally similar, we did find regional and genomewide differences across ethnic groups, including a longer genomewide map for African Americans than for other populations. Some of this variation was explained by genotyping artifacts--namely, null alleles (i.e., alleles with null phenotypes) at a number of loci--and by ethnic differences in null-allele frequencies. In particular, null alleles appear to be the likely explanation for the excess map length in African Americans. We also found that nonrandom missing data biases map results. However, we found regions on chromosome 8p and telomeric segments with significant ethnic differences and a suggestive interval on chromosome 12q that were not due to genotype artifacts. The difference on chromosome 8p is likely due to a polymorphic inversion in the region. The results of our investigation have implications for inferences of possible genetic influences on human recombination as well as for future linkage studies, especially those involving populations of nonwhite ethnicity.

Abstract

The presence of four lysosomal storage diseases (LSDs) at increased frequency in the Ashkenazi Jewish population has suggested to many the operation of natural selection (carrier advantage) as the driving force. We compare LSDs and nonlysosomal storage diseases (NLSDs) in terms of the number of mutations, allele-frequency distributions, and estimated coalescence dates of mutations. We also provide new data on the European geographic distribution, in the Ashkenazi population, of seven LSD and seven NLSD mutations. No differences in any of the distributions were observed between LSDs and NLSDs. Furthermore, no regular pattern of geographic distribution was observed for LSD versus NLSD mutations, with some being more common in central Europe and others being more common in eastern Europe, within each group. The most striking disparate pattern was the geographic distribution of the two primary Tay-Sachs disease mutations, with the first being more common in central Europe (and likely older) and the second being exclusive to eastern Europe (primarily Lithuania and Russia) (and likely much younger). The latter demonstrates a pattern similar to two other recently arisen Lithuanian mutations, those for torsion dystonia and familial hypercholesterolemia. These observations provide compelling support for random genetic drift (chance founder effects, one approximately 11 centuries ago that affected all Ashkenazim and another approximately 5 centuries ago that affected Lithuanians), rather than selection, as the primary determinant of disease mutations in the Ashkenazi population.

Abstract

A debate has arisen regarding the validity of racial/ethnic categories for biomedical and genetic research. An epidemiologic perspective on the issue of human categorization in biomedical and genetic research strongly supports the continued use of self-identified race and ethnicity.

Abstract

This article proposes a method of estimating the time to the most recent common ancestor (TMRCA) of a sample of DNA sequences. The method is based on the molecular clock hypothesis, but avoids assumptions about population structure. Simulations show that in a wide range of situations, the point estimate has small bias and the confidence interval has at least the nominal coverage probability. We discuss conditions that can lead to biased estimates. Performance of this estimator is compared with existing methods based on coalescent theory. The method is applied to sequences of Y chromosomes and mtDNAs to estimate the coalescent times of human male and female populations.

Abstract

In the comparison of DNA and protein sequences between species or between paralogues or among individuals within a species or population, there is often some indication that different regions of the sequence are divergent or polymorphic to different degrees, indicating differential constraint or diversifying selection operating in different regions of the sequence. The problem is to test statistically whether the observed regional differences in the density of variant sites represent real differences and then to estimate as accurately as possible the location of the differential regions. A method is given for testing and locating regions of differential variation. The method consists of calculating G(x(k)) = k/n - x(k)/N, where x(k) is the position of the kth variant site along the sequence, n is the total number of variant sites, and N is the total sequence length. The estimated region is the longest stretch of adjacent sequence for which G(x(k)) is monotonically increasing (a hot spot) or decreasing (a cold spot). Critical values of this length for tests of significance are given, a sequential method is developed for locating multiple differential regions, and the power of the method against various alternatives is explored. The method locates the endpoints of hot spots and cold spots of variation with high accuracy.
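The statistic defined above is simple enough to sketch directly. The following is a minimal illustration, not the published procedure: it computes G(x(k)) = k/n - x(k)/N at each variant site and returns the longest run of consecutive variant sites over which G is strictly increasing (a candidate hot spot) or strictly decreasing (a candidate cold spot). Measuring a run by the sequence distance between its first and last variant site is an assumption made here, and the critical values, significance tests, and sequential search for multiple regions mentioned in the abstract are not reproduced. The function names are hypothetical.

```python
def g_statistic(positions, seq_length):
    """Return G(x_k) = k/n - x_k/N for each variant site, in sequence order."""
    xs = sorted(positions)
    n = len(xs)
    return [k / n - x / seq_length for k, x in enumerate(xs, start=1)]


def longest_monotone_run(positions, seq_length, increasing=True):
    """Find the longest stretch over which G is strictly monotone.

    Returns (start_position, end_position) of the run spanning the most
    sequence, or None if no run covers at least two variant sites.
    Assumption: run length is measured as end_position - start_position.
    """
    xs = sorted(positions)
    g = g_statistic(xs, seq_length)
    best = None
    start = 0
    for i in range(1, len(xs) + 1):
        # A run ends at the last site or where the monotone direction breaks.
        if i == len(xs):
            broken = True
        elif increasing:
            broken = g[i] <= g[i - 1]
        else:
            broken = g[i] >= g[i - 1]
        if broken:
            if i - 1 > start:
                span = xs[i - 1] - xs[start]
                if best is None or span > best[1] - best[0]:
                    best = (xs[start], xs[i - 1])
            start = i
    return best


# Hypothetical example: seven variant sites on a 100-site sequence,
# five of them clustered between positions 50 and 58.
sites = [10, 50, 52, 54, 56, 58, 90]
hot = longest_monotone_run(sites, 100, increasing=True)
cold = longest_monotone_run(sites, 100, increasing=False)
```

Because G(x(k)) - G(x(k-1)) = 1/n - (x(k) - x(k-1))/N, G rises between two adjacent variant sites exactly when the gap between them is smaller than the average spacing N/n. That is why a run of increasing values marks a region of locally dense variation and a run of decreasing values marks a sparse one.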