Research & Scholarship

Current Research and Scholarly Interests

Statistical models and reasoning are key to our understanding of the genetic basis of human traits. Modern high-throughput technology presents us with new opportunities and challenges. We develop statistical approaches for high-dimensional data, aiming to improve our understanding of the molecular basis of health-related traits.

Clinical Trials

This pilot clinical trial studies perfusion computed tomography (CT) in predicting response
to treatment in patients with advanced kidney cancer. Comparing results of diagnostic
procedures done before, during, and after targeted therapy may help doctors predict a
patient's response to treatment and help plan the best treatment.

Abstract

Abnormalities in sleep and circadian rhythms are central features of bipolar disorder (BP), often persisting between episodes. We report here, to our knowledge, the first systematic analysis of circadian rhythm activity in pedigrees segregating severe BP (BP-I). By analyzing actigraphy data obtained from members of 26 Costa Rican and Colombian pedigrees [136 euthymic (i.e., interepisode) BP-I individuals and 422 non-BP-I relatives], we delineated 73 phenotypes, of which 49 demonstrated significant heritability and 13 showed significant trait-like association with BP-I. All BP-I-associated traits related to activity level, with BP-I individuals consistently demonstrating lower activity levels than their non-BP-I relatives. We analyzed all 49 heritable phenotypes using genetic linkage analysis, with special emphasis on phenotypes judged to have the strongest impact on the biology underlying BP. We identified a locus for interdaily stability of activity, at a threshold exceeding genome-wide significance, on chromosome 12pter, a region that also showed pleiotropic linkage to two additional activity phenotypes.

Abstract

We consider resequencing studies of associated loci and the problem of prioritizing sequence variants for functional follow-up. Working within the multivariate linear regression framework helps us to account for the joint effects of multiple genes, and adopting a Bayesian approach leads to posterior probabilities that coherently incorporate all information about the variants' function. We describe two novel prior distributions that facilitate learning the role of each variable site by borrowing evidence across phenotypes and across mutations in the same gene. We illustrate their potential advantages with simulations and by reanalyzing a data set of sequencing variants.

Abstract

Using genome-wide genotypes, we characterized the genetic structure of 103,006 participants in the Kaiser Permanente Northern California multi-ethnic Genetic Epidemiology Research on Adult Health and Aging Cohort and analyzed the relationship to self-reported race/ethnicity. Participants endorsed any of 23 race/ethnicity/nationality categories, which were collapsed into seven major race/ethnicity groups. By self-report the cohort is 80.8% white and 19.2% minority; 93.8% endorsed a single race/ethnicity group, while 6.2% endorsed two or more. Principal component (PC) and admixture analyses were generally consistent with prior studies. Approximately 17% of subjects had genetic ancestry from more than one continent, and 12% were genetically admixed, considering only nonadjacent geographical origins. Self-reported whites were spread on a continuum along the first two PCs, indicating extensive mixing among European nationalities. Self-identified East Asian nationalities correlated with genetic clustering, consistent with extensive endogamy. Individuals of mixed East Asian-European genetic ancestry were easily identified; we also observed a modest amount of European genetic ancestry in individuals self-identified as Filipinos. Self-reported African Americans and Latinos showed extensive European and African genetic ancestry, and Native American genetic ancestry for the latter. Among 3741 genetically identified parent-child pairs, 93% were concordant for self-reported race/ethnicity; among 2018 genetically identified full-sib pairs, 96% were concordant; the lower rate for parent-child pairs was largely due to intermarriage. The parent-child pairs revealed a trend toward increasing exogamy over time; the presence in the cohort of individuals endorsing multiple race/ethnicity categories creates interesting challenges and future opportunities for genetic epidemiologic studies.

Abstract

Recent theories regarding the pathophysiology of bipolar disorder suggest contributions of both neurodevelopmental and neurodegenerative processes. While structural neuroimaging studies indicate disease-associated neuroanatomical alterations, the behavioural correlates of these alterations have not been well characterized. Here, we investigated multi-generational families genetically enriched for bipolar disorder to: (i) characterize neurobehavioural correlates of neuroanatomical measures implicated in the pathophysiology of bipolar disorder; (ii) identify brain-behaviour associations that differ between diagnostic groups; (iii) identify neurocognitive traits that show evidence of accelerated ageing specifically in subjects with bipolar disorder; and (iv) identify brain-behaviour correlations that differ across the age span. Structural neuroimages and multi-dimensional assessments of temperament and neurocognition were acquired from 527 (153 bipolar disorder and 374 non-bipolar disorder) adults aged 18-87 years in 26 families with heavy genetic loading for bipolar disorder. We used linear regression models to identify significant brain-behaviour associations and test whether brain-behaviour relationships differed: (i) between diagnostic groups; and (ii) as a function of age. We found that total cortical and ventricular volume had the greatest number of significant behavioural associations, and included correlations with measures from multiple cognitive domains, particularly declarative and working memory and executive function. Cortical thickness measures, in contrast, showed more specific associations with declarative memory, letter fluency and processing speed tasks. 
While the majority of brain-behaviour relationships were similar across diagnostic groups, increased cortical thickness in ventrolateral prefrontal and parietal cortical regions was associated with better declarative memory only in bipolar disorder subjects, and not in non-bipolar disorder family members. Additionally, while age had a relatively strong impact on all neurocognitive traits, the effects of age on cognition did not differ between diagnostic groups. Most brain-behaviour associations were also similar across the age range, with the exception of cortical and ventricular volume and lingual gyrus thickness, which showed weak correlations with verbal fluency and inhibitory control at younger ages that increased in magnitude in older subjects, regardless of diagnosis. Findings indicate that neuroanatomical traits potentially impacted by bipolar disorder are significantly associated with multiple neurobehavioural domains. Structure-function relationships are generally preserved across diagnostic groups, with the notable exception of ventrolateral prefrontal and parietal association cortex, volumetric increases in which may be associated with cognitive resilience specifically in individuals with bipolar disorder. Although age impacted all neurobehavioural traits, we did not find any evidence of accelerated cognitive decline specific to bipolar disorder subjects. Regardless of diagnosis, greater global brain volume may represent a protective factor for the effects of ageing on executive functioning.

Abstract

Obsessive-compulsive disorder (OCD) and Tourette's syndrome are highly heritable neurodevelopmental disorders that are thought to share genetic risk factors. However, the identification of definitive susceptibility genes for these etiologically complex disorders remains elusive. The authors report a combined genome-wide association study (GWAS) of Tourette's syndrome and OCD. The authors conducted a GWAS in 2,723 cases (1,310 with OCD, 834 with Tourette's syndrome, 579 with OCD plus Tourette's syndrome/chronic tics), 5,667 ancestry-matched controls, and 290 OCD parent-child trios. GWAS summary statistics were examined for enrichment of functional variants associated with gene expression levels in brain regions. Polygenic score analyses were conducted to investigate the genetic architecture within and across the two disorders. Although no individual single-nucleotide polymorphisms (SNPs) achieved genome-wide significance, the GWAS signals were enriched for SNPs strongly associated with variations in brain gene expression levels (expression quantitative trait loci, or eQTLs), suggesting the presence of true functional variants that contribute to risk of these disorders. Polygenic score analyses identified a significant polygenic component for OCD (p=2×10⁻⁴), predicting 3.2% of the phenotypic variance in an independent data set. In contrast, Tourette's syndrome had a smaller, nonsignificant polygenic component, predicting only 0.6% of the phenotypic variance (p=0.06). No significant polygenic signal was detected across the two disorders, although the sample is likely underpowered to detect a modest shared signal. Furthermore, the OCD polygenic signal was significantly attenuated when cases with both OCD and co-occurring Tourette's syndrome/chronic tics were included in the analysis (p=0.01). Previous work has shown that Tourette's syndrome and OCD have some degree of shared genetic variation. 
However, the data from this study suggest that there are also distinct components to the genetic architectures of these two disorders. Furthermore, OCD with co-occurring Tourette's syndrome/chronic tics may have different underlying genetic susceptibility compared with OCD alone.
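
In its simplest form, a polygenic score of the kind used in this analysis is a weighted sum of risk-allele counts across SNPs. The sketch below is illustrative only: the function name, the weights, and the genotypes are hypothetical and are not taken from the study.

```python
def polygenic_score(allele_counts, weights):
    """Weighted sum of risk-allele counts (0, 1, or 2 per SNP)."""
    if len(allele_counts) != len(weights):
        raise ValueError("one weight per SNP is required")
    return sum(g * w for g, w in zip(allele_counts, weights))


# Hypothetical individual genotyped at three SNPs, with made-up effect weights.
score = polygenic_score([0, 1, 2], [0.10, -0.05, 0.20])
print(score)
```

In practice the weights would come from association estimates in a discovery sample, and the score would be evaluated in an independent target sample.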

Abstract

IMPORTANCE Genetic factors contribute to risk for bipolar disorder (BP), but its pathogenesis remains poorly understood. A focus on measuring multisystem quantitative traits that may be components of BP psychopathology may enable genetic dissection of this complex disorder, and investigation of extended pedigrees from genetically isolated populations may facilitate the detection of specific genetic variants that affect BP as well as its component phenotypes. OBJECTIVE To identify quantitative neurocognitive, temperament-related, and neuroanatomical phenotypes that appear heritable and associated with severe BP (bipolar I disorder [BP-I]) and therefore suitable for genetic linkage and association studies aimed at identifying variants contributing to BP-I risk. DESIGN, SETTING, AND PARTICIPANTS Multigenerational pedigree study in 2 closely related, genetically isolated populations: the Central Valley of Costa Rica and Antioquia, Colombia. A total of 738 individuals, all from Central Valley of Costa Rica and Antioquia pedigrees, participated; among them, 181 have BP-I. MAIN OUTCOMES AND MEASURES Familial aggregation (heritability) and association with BP-I of 169 quantitative neurocognitive, temperament, magnetic resonance imaging, and diffusion tensor imaging phenotypes. RESULTS Of 169 phenotypes investigated, 126 (75%) were significantly heritable and 53 (31%) were associated with BP-I. About one-quarter of the phenotypes, including measures from each phenotype domain, were both heritable and associated with BP-I. Neuroimaging phenotypes, particularly cortical thickness in prefrontal and temporal regions as well as volume and microstructural integrity of the corpus callosum, represented the most promising candidate traits for genetic mapping related to BP based on strong heritability and association with disease. 
Analyses of phenotypic and genetic covariation identified substantial correlations among the traits, at least some of which share a common underlying genetic architecture. CONCLUSIONS AND RELEVANCE To our knowledge, this is the most extensive investigation of BP-relevant component phenotypes to date. Our results identify brain and behavioral quantitative traits that appear to be genetically influenced and show a pattern of BP-I association within families that is consistent with expectations from case-control studies. Together, these phenotypes provide a basis for identifying loci contributing to BP-I risk and for genetic dissection of the disorder.

Abstract

Genome-wide association studies (GWAS) have identified >500 common variants associated with quantitative metabolic traits, but in aggregate such variants explain at most 20-30% of the heritable component of population variation in these traits. To further investigate the impact of genotypic variation on metabolic traits, we conducted re-sequencing studies in >6,000 members of a Finnish population cohort (The Northern Finland Birth Cohort of 1966 [NFBC]) and a type 2 diabetes case-control sample (The Finland-United States Investigation of NIDDM Genetics [FUSION] study). By sequencing the coding sequence and 5' and 3' untranslated regions of 78 genes at 17 GWAS loci associated with one or more of six metabolic traits (serum levels of fasting HDL-C, LDL-C, total cholesterol, triglycerides, plasma glucose, and insulin), and conducting both single-variant and gene-level association tests, we obtained a more complete understanding of phenotype-genotype associations at eight of these loci. At all eight of these loci, the identification of new associations provides significant evidence for multiple genetic signals to one or more phenotypes, and at two loci, in the genes ABCA1 and CETP, we found significant gene-level evidence of association to non-synonymous variants with MAF<1%. Additionally, two potentially deleterious variants that demonstrated significant associations (rs138726309, a missense variant in G6PC2, and rs28933094, a missense variant in LIPC) were considerably more common in these Finnish samples than in European reference populations, supporting our prior hypothesis that deleterious variants could attain high frequencies in this isolated population, likely due to the effects of population bottlenecks. Our results highlight the value of large, well-phenotyped samples for rare-variant association analysis, and the challenge of evaluating the phenotypic impact of such variants.

Abstract

Tourette's syndrome (TS) is a developmental disorder that has one of the highest familial recurrence rates among neuropsychiatric diseases with complex inheritance. However, the identification of definitive TS susceptibility genes remains elusive. Here, we report the first genome-wide association study (GWAS) of TS in 1285 cases and 4964 ancestry-matched controls of European ancestry, including two European-derived population isolates, Ashkenazi Jews from North America and Israel and French Canadians from Quebec, Canada. In a primary meta-analysis of GWAS data from these European ancestry samples, no markers achieved a genome-wide threshold of significance (P<5 × 10⁻⁸); the top signal was found in rs7868992 on chromosome 9q32 within COL27A1 (P=1.85 × 10⁻⁶). A secondary analysis including an additional 211 cases and 285 controls from two closely related Latin American population isolates from the Central Valley of Costa Rica and Antioquia, Colombia also identified rs7868992 as the top signal (P=3.6 × 10⁻⁷ for the combined sample of 1496 cases and 5249 controls following imputation with 1000 Genomes data). This study lays the groundwork for the eventual identification of common TS susceptibility variants in larger cohorts and helps to provide a more complete understanding of the full genetic architecture of this disorder.

Abstract

Genomic copy number variations (CNVs) and increased parental age are both associated with the risk to develop a variety of clinical neuropsychiatric disorders such as autism, schizophrenia and bipolar disorder. At the same time, it has been shown that the rate of transmitted de novo single nucleotide mutations is increased with paternal age. To address whether paternal age also affects the burden of structural genomic deletions and duplications, we examined various types of CNV burden in a large population sample from the Netherlands. Healthy participants with parental age information (n = 6,773) were collected at different University Medical Centers. CNVs were called with the PennCNV algorithm using Illumina genome-wide SNP array data. We observed no evidence in support of a paternal age effect on CNV load in the offspring. Our results were negative for global measures as well as several proxies for de novo CNV events in this unique sample. While recent studies suggest de novo single nucleotide mutation rate to be dominated by the age of the father at conception, our results strongly suggest that at the level of global CNV burden there is no influence of increased paternal age. While it remains possible that local genomic effects may exist for specific phenotypes, this study indicates that global CNV burden and increased father's age may be independent disease risk factors.

Abstract

Variations in DNA copy number carry information on the modalities of genome evolution and mis-regulation of DNA replication in cancer cells. Their study can help localize tumor suppressor genes, distinguish different populations of cancerous cells, and identify genomic variations responsible for disease phenotypes. A number of different high throughput technologies can be used to identify copy number variable sites, and the literature documents multiple effective algorithms. We focus here on the specific problem of detecting regions where variation in copy number is relatively common in the sample at hand. This problem encompasses the cases of copy number polymorphisms, related samples, technical replicates, and cancerous sub-populations from the same individual. We present a segmentation method named generalized fused lasso (GFL) to reconstruct copy number variant regions. GFL is based on penalized estimation and is capable of processing multiple signals jointly. Our approach is computationally very attractive and leads to sensitivity and specificity levels comparable to those of state-of-the-art specialized methodologies. We illustrate its applicability with simulated and real data sets. The flexibility of our framework makes it applicable to data obtained with a wide range of technology. Its versatility and speed make GFL particularly useful in the initial screening stages of large data sets.
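
For orientation, the generic single-signal fused-lasso criterion that penalized segmentation builds on can be written in a few lines. This is a sketch of that standard criterion (squared error plus L1 penalties on the fitted values and on their successive differences), not the GFL objective itself, whose joint multi-signal penalty is not spelled out in the abstract. The names `lam1` and `lam2` and the toy intensities are assumptions.

```python
def fused_lasso_objective(y, beta, lam1, lam2):
    """0.5 * sum (y_i - beta_i)^2 + lam1 * sum |beta_i| + lam2 * sum |beta_i - beta_{i-1}|."""
    fit = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    sparsity = lam1 * sum(abs(b) for b in beta)
    smoothness = lam2 * sum(abs(b - a) for a, b in zip(beta, beta[1:]))
    return fit + sparsity + smoothness


# Toy intensities: a flat baseline with one elevated segment (made-up numbers).
y = [0.1, -0.1, 1.1, 0.9, 1.0, 0.0]
piecewise = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]  # candidate fit with two jumps

cost_raw = fused_lasso_objective(y, y, lam1=0.1, lam2=1.0)
cost_piecewise = fused_lasso_objective(y, piecewise, lam1=0.1, lam2=1.0)
print(cost_raw, cost_piecewise)
```

With these penalties the piecewise-constant candidate scores lower than the raw signal, which is the behavior that makes the criterion useful for segmentation.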

Abstract

Temperament has a strongly heritable component, yet multiple independent genome-wide studies have failed to identify significant genetic associations. We have assembled the largest sample to date of persons with genome-wide genotype data, who have been assessed with Cloninger's Temperament and Character Inventory. Sum scores for novelty seeking, harm avoidance, reward dependence and persistence have been measured in over 11,000 persons collected in four different cohorts. Our study had >80% power to identify genome-wide significant loci (P<1.25 × 10⁻⁸, with correction for testing four scales) accounting for ≥0.4% of the phenotypic variance in temperament scales. Using meta-analysis techniques, gene-based tests and pathway analysis we have tested over 1.2 million single-nucleotide polymorphisms (SNPs) for association to each of the four temperament dimensions. We did not discover any SNPs, genes, or pathways to be significantly related to the four temperament dimensions, after correcting for multiple testing. Less than 1% of the variability in any temperament dimension appears to be accounted for by a risk score derived from the SNPs showing strongest association to the temperament dimensions. Elucidation of genetic loci significantly influencing temperament and personality will require potentially very large samples, and/or a more refined phenotype. Item response theory methodology may be a way to incorporate data from cohorts assessed with multiple personality instruments, and might be a method by which a large sample of a more refined phenotype could be acquired.
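
The significance threshold quoted in this abstract is consistent with a Bonferroni-style division of the conventional genome-wide level of 5 × 10⁻⁸ by the four temperament scales. The short check below assumes that reading; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Conventional genome-wide level, corrected for four temperament scales.
threshold = bonferroni_threshold(5e-8, 4)
print(threshold)
```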

Abstract

Since 2008, multiple studies have reported on copy number variations (CNVs) in schizophrenia. However, many regions are unique events with minimal overlap between studies. This makes it difficult to gain a comprehensive overview of all CNVs involved in the etiology of schizophrenia. We performed a systematic CNV study on the basis of a homogeneous genome-wide dataset aiming at all CNVs ≥ 50 kilobase pairs. We complemented this analysis with a review of cytogenetic and chromosomal abnormalities for schizophrenia reported in the literature with the purpose of combining classical genetic findings and our current understanding of genomic variation. We investigated 834 Dutch schizophrenia patients and 672 Dutch control subjects. The CNVs were included if they were detected by QuantiSNP (http://www.well.ox.ac.uk/QuantiSNP/) as well as PennCNV (http://www.neurogenome.org/cnv/penncnv/) and contain known protein coding genes. The integrated identification of CNV regions and cytogenetic loci indicates regions of interest (cytogenetic regions of interest [CROIs]). In total, 2437 CNVs were identified with an average number of 2.1 CNVs/subject for both cases and control subjects. We observed significantly more deletions but not duplications in schizophrenia cases versus control subjects. The CNVs identified coincide with loci previously reported in the literature, confirming well-established schizophrenia CROIs 1q42 and 22q11.2 as well as indicating a potentially novel CROI on chromosome 5q35.1. Chromosomal deletions are more prevalent in schizophrenia patients than in healthy subjects and therefore confer a risk factor for pathogenicity. The combination of our CNV data with previously reported cytogenetic abnormalities in schizophrenia provides an overview of potentially interesting regions for positional candidate genes.

Abstract

Glioblastoma (GBM) is among the most lethal of all cancers. GBMs consist of a heterogeneous population of tumor cells, among which a tumor-initiating and treatment-resistant subpopulation, here termed GBM stem cells, has been identified as a primary therapeutic target. Here, we describe a high-throughput small molecule screening approach that enables the identification and characterization of chemical compounds that are effective against GBM stem cells. The paradigm uses a tissue culture model to enrich for GBM stem cells derived from human GBM resections and combines a phenotype-based screen with gene target-specific screens for compound identification. We used 31,624 small molecules from 7 chemical libraries that we characterized and ranked based on their effect on a panel of GBM stem cell-enriched cultures and their effect on the expression of a module of genes whose expression negatively correlates with clinical outcome: MELK, ASPM, TOP2A, and FOXM1b. Of the 11 compounds meeting criteria for exerting differential effects across cell types used, 4 compounds showed selectivity by inhibiting multiple GBM stem cell-enriched cultures compared with nonenriched cultures: emetine, n-arachidonoyl dopamine, n-oleoyldopamine (OLDA), and n-palmitoyl dopamine. ChemBridge compounds #5560509 and #5256360 inhibited the expression of the 4 mitotic module genes. OLDA, emetine, and compounds #5560509 and #5256360 were chosen for more detailed study and inhibited GBM stem cells in self-renewal assays in vitro and in a xenograft model in vivo. These studies show that our screening strategy provides potential candidates and a blueprint for lead compound identification in larger scale screens or screens involving other cancer types.

Abstract

Phenotype mining is a novel approach for elucidating the genetic basis of complex phenotypic variation. It involves a search of rich phenotype databases for measures correlated with genetic variation, as identified in genome-wide genotyping or sequencing studies. An initial implementation of phenotype mining in a prospective unselected population cohort, the Northern Finland 1966 Birth Cohort (NFBC1966), identifies neurodevelopment-related traits (intellectual deficits, poor school performance and hearing abnormalities), which are more frequent among individuals with large (>500 kb) deletions than among other cohort members. Observation of extensive shared single nucleotide polymorphism haplotypes around deletions suggests an opportunity to expand phenotype mining from cohort samples to the populations from which they derive.

Abstract

Recent advances in genomics have underscored the surprising ubiquity of DNA copy number variation (CNV). Fortunately, modern genotyping platforms also detect CNVs with fairly high reliability. Hidden Markov models and algorithms have played a dominant role in the interpretation of CNV data. Here we explore CNV reconstruction via estimation with a fused-lasso penalty as suggested by Tibshirani and Wang [Biostatistics 9 (2008) 18-29]. We mount a fresh attack on this difficult optimization problem by the following: (a) changing the penalty terms slightly by substituting a smooth approximation to the absolute value function, (b) designing and implementing a new MM (majorization-minimization) algorithm, and (c) applying a fast version of Newton's method to jointly update all model parameters. Together these changes enable us to minimize the fused-lasso criterion in a highly effective way. We also reframe the reconstruction problem in terms of imputation via discrete optimization. This approach is easier and more accurate than parameter estimation because it relies on the fact that only a handful of possible copy number states exist at each SNP. The dynamic programming framework has the added bonus of exploiting information that the current fused-lasso approach ignores. The accuracy of our imputations is comparable to that of hidden Markov models at a substantially lower computational cost.
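
The idea of imputing copy numbers by discrete optimization can be illustrated with a small dynamic program. The sketch below assumes a simple criterion (squared distance to a state-specific mean intensity plus a penalty on changes between neighboring SNPs); this criterion, the state means, and the toy signal are assumptions for illustration and are not the authors' exact formulation.

```python
def impute_copy_numbers(y, state_means, jump_penalty):
    """Choose one copy-number state per SNP by dynamic programming, minimizing
    sum (y_i - mean[c_i])^2 + jump_penalty * sum |c_i - c_{i-1}|."""
    if not y:
        return []
    states = sorted(state_means)
    # cost[s]: lowest total cost of any state path that ends in state s
    cost = {s: (y[0] - state_means[s]) ** 2 for s in states}
    back = []  # one dict of back-pointers per SNP after the first
    for yi in y[1:]:
        new_cost, pointers = {}, {}
        for s in states:
            prev = min(states, key=lambda p: cost[p] + jump_penalty * abs(s - p))
            pointers[s] = prev
            new_cost[s] = (cost[prev] + jump_penalty * abs(s - prev)
                           + (yi - state_means[s]) ** 2)
        back.append(pointers)
        cost = new_cost
    # Trace the optimal path backwards from the cheapest final state.
    path = [min(states, key=lambda s: cost[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical mean intensities for 1, 2, and 3 copies, and a toy signal.
means = {1: -0.5, 2: 0.0, 3: 0.4}
signal = [0.05, -0.02, -0.45, -0.55, -0.5, 0.03]
calls = impute_copy_numbers(signal, means, jump_penalty=0.1)
print(calls)
```

Because only a handful of states are considered at each SNP, the work grows linearly with the number of SNPs, which is the source of the low computational cost in this kind of approach.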

Abstract

Plasma concentrations of total cholesterol, low-density lipoprotein cholesterol, high-density lipoprotein cholesterol and triglycerides are among the most important risk factors for coronary artery disease (CAD) and are targets for therapeutic intervention. We screened the genome for common variants associated with plasma lipids in >100,000 individuals of European ancestry. Here we report 95 significantly associated loci (P < 5 × 10⁻⁸), with 59 showing genome-wide significant association with lipid traits for the first time. The newly reported associations include single nucleotide polymorphisms (SNPs) near known lipid regulators (for example, CYP7A1, NPC1L1 and SCARB1) as well as in scores of loci not previously implicated in lipoprotein metabolism. The 95 loci contribute not only to normal variation in lipid traits but also to extreme lipid phenotypes and have an impact on lipid traits in three non-European populations (East Asians, South Asians and African Americans). Our results identify several novel loci associated with plasma lipids that are also associated with CAD. Finally, we validated three of the novel genes (GALNT2, PPP1R3B and TTC39B) with experiments in mouse models. Taken together, our findings provide the foundation to develop a broader biological understanding of lipoprotein metabolism and to identify new therapeutic opportunities for the prevention of CAD.

Abstract

In many organisms the expression levels of each gene are controlled by the activation levels of known "Transcription Factors" (TF). A problem of considerable interest is that of estimating the "Transcription Regulation Networks" (TRN) relating the TFs and genes. While the expression levels of genes can be observed, the activation levels of the corresponding TFs are usually unknown, greatly increasing the difficulty of the problem. Based on previous experimental work, it is often the case that partial information about the TRN is available. For example, certain TFs may be known to regulate a given gene or in other cases a connection may be predicted with a certain probability. In general, the biology of the problem indicates there will be very few connections between TFs and genes. Several methods have been proposed for estimating TRNs. However, they all suffer from problems such as unrealistic assumptions about prior knowledge of the network structure or computational limitations. We propose a new approach that can directly utilize prior information about the network structure in conjunction with observed gene expression data to estimate the TRN. Our approach uses L(1) penalties on the network to ensure a sparse structure. This has the advantage of being computationally efficient as well as making many fewer assumptions about the network structure. We use our methodology to construct the TRN for E. coli and show that the estimate is biologically sensible and compares favorably with previous estimates.
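
To see why an L1 penalty yields a sparse network, consider a generic lasso fit by cyclic coordinate descent, where weak connections are set exactly to zero by soft-thresholding. The sketch below is a plain lasso on a toy design in which the regulator activities are treated as known; it is not the authors' estimation procedure (in their setting the TF activation levels are unobserved), and all names and numbers are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho, norm = 0.0, 0.0
            for i in range(n):
                # Prediction from every feature except j (the partial residual).
                others = sum(X[i][k] * b[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                norm += X[i][j] ** 2
            b[j] = soft_threshold(rho, lam) / norm if norm > 0 else 0.0
    return b


# Toy data: rows are experiments, columns are two candidate regulators.
X = [[1, 0], [0, 1], [1, 0], [0, 1]]
y = [2.0, 0.2, 2.0, 0.1]
sparse_fit = lasso_coordinate_descent(X, y, lam=0.5)
unpenalized_fit = lasso_coordinate_descent(X, y, lam=0.0)
print(sparse_fit, unpenalized_fit)
```

With the penalty switched on, the weak second connection is removed entirely while the strong first one is only shrunk, which is the sparsity behavior the abstract relies on.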

Abstract

Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.
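
Variance component methods of this kind start from a matrix of pairwise relatedness estimated from high-density markers. The sketch below computes one common estimator, the average product of standardized genotypes; whether EMMAX uses this particular estimator is not stated in the abstract, so the choice is an assumption, as are the function name and the toy genotypes.

```python
def kinship_matrix(genotypes):
    """genotypes[i][m] is the allele count (0, 1, or 2) of individual i at marker m.
    Returns the n x n matrix of average products of standardized genotypes."""
    n, m = len(genotypes), len(genotypes[0])
    columns = []
    for j in range(m):
        col = [genotypes[i][j] for i in range(n)]
        mean = sum(col) / n
        var = sum((g - mean) ** 2 for g in col) / n
        if var > 0:  # monomorphic markers carry no information and are skipped
            sd = var ** 0.5
            columns.append([(g - mean) / sd for g in col])
    if not columns:
        raise ValueError("no polymorphic markers")
    used = len(columns)
    return [[sum(c[a] * c[b] for c in columns) / used for b in range(n)]
            for a in range(n)]


# Three hypothetical individuals typed at four markers (the last is monomorphic).
toy_genotypes = [[0, 1, 2, 1],
                 [0, 1, 1, 1],
                 [2, 0, 0, 1]]
K = kinship_matrix(toy_genotypes)
print(K)
```

A mixed model would then include a random effect whose covariance is proportional to this matrix, so that related or similarly structured individuals are not treated as independent observations.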

Abstract

Previous studies have implicated DTNBP1 as a schizophrenia susceptibility gene and its encoded protein, dysbindin, as a potential regulator of synaptic vesicle physiology. In this study, we found that endogenous levels of the dysbindin protein in the mouse brain are developmentally regulated, with higher levels observed during embryonic and early postnatal ages than in young adulthood. We obtained biochemical evidence indicating that the bulk of dysbindin from brain exists as a stable component of biogenesis of lysosome-related organelles complex-1 (BLOC-1), a multi-subunit protein complex involved in intracellular membrane trafficking and organelle biogenesis. Selective biochemical interaction between brain BLOC-1 and a few members of the SNARE (soluble N-ethylmaleimide-sensitive factor attachment protein receptor) superfamily of proteins that control membrane fusion, including SNAP-25 and syntaxin 13, was demonstrated. Furthermore, primary hippocampal neurons deficient in BLOC-1 displayed neurite outgrowth defects. Taken together, these observations suggest a novel role for the dysbindin-containing complex, BLOC-1, in neurodevelopment, and provide a framework for considering potential effects of allelic variants in DTNBP1--or in other genes encoding BLOC-1 subunits--in the context of the developmental model of schizophrenia pathogenesis.

Abstract

We previously reported linkage of bipolar disorder to 5q33-q34 in families from two closely related population isolates, the Central Valley of Costa Rica (CVCR) and Antioquia, Colombia (CO). Here we present follow-up results from fine-scale mapping in large CVCR and CO families segregating severe bipolar disorder, BP-I, and in 343 population trios/duos from CVCR and CO. Employing densely spaced SNPs to fine map the prior linkage peak region increases linkage evidence and clarifies the position of the putative BP-I locus. We performed two-point linkage analysis with 1134 SNPs in an approximately 9 Mb region between markers D5S410 and D5S422. Combining pedigrees from CVCR and CO yields a LOD score of 4.9 at SNP rs10035961. Two other SNPs (rs7721142 and rs1422795) within the same 94 kb region also displayed LOD scores greater than 4. This linkage peak coincides with our prior microsatellite results and suggests a narrowed BP-I susceptibility region in these families. To investigate if the locus implicated in the familial form of BP-I also contributes to disease risk in the population, we followed up the family results with association analysis in duo and trio samples, obtaining signals within 2 Mb of the peak linkage signal in the pedigrees; rs12523547 and rs267015 (P = 0.00004 and 0.00016, respectively) in the CO sample and rs244960 in the CVCR sample and the combined sample, with P = 0.00032 and 0.00016, respectively. It remains unclear whether these association results reflect the same locus contributing to BP susceptibility within the extended pedigrees.
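
For readers unfamiliar with LOD scores, the textbook two-point case can be computed directly: the score is the base-10 logarithm of the likelihood at a recombination fraction theta relative to the likelihood under no linkage (theta = 0.5). The sketch below assumes phase-known, fully informative meioses, which is a strong simplification; the pedigree analysis reported here works with likelihoods over whole pedigrees, and the counts used are hypothetical.

```python
import math


def two_point_lod(recombinants, meioses, theta):
    """log10 of L(theta) / L(0.5) for phase-known, fully informative meioses."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    if theta == 0.0 and recombinants > 0:
        return float("-inf")  # any recombinant is impossible under complete linkage
    log_likelihood = 0.0
    if recombinants:
        log_likelihood += recombinants * math.log10(theta)
    if meioses - recombinants:
        log_likelihood += (meioses - recombinants) * math.log10(1.0 - theta)
    return log_likelihood - meioses * math.log10(0.5)


# Hypothetical counts: 1 recombinant among 20 informative meioses.
lod = two_point_lod(recombinants=1, meioses=20, theta=0.05)
print(round(lod, 2))
```

By convention a LOD score of 3 or more is taken as significant evidence of linkage, which puts the reported values above 4 in context.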

Abstract

Down Syndrome cell adhesion molecule (Dscam) genes encode neuronal cell recognition proteins of the immunoglobulin superfamily. In Drosophila, Dscam1 generates 19,008 different ectodomains by alternative splicing of three exon clusters, each encoding half or a complete variable immunoglobulin domain. Identical isoforms bind to each other, but rarely to isoforms differing at any one of the variable immunoglobulin domains. Binding between isoforms on opposing membranes promotes repulsion. Isoform diversity provides the molecular basis for neurite self-avoidance. Self-avoidance refers to the tendency of branches from the same neuron (self-branches) to selectively avoid one another. To ensure that repulsion is restricted to self-branches, different neurons express different sets of isoforms in a biased stochastic fashion. Genetic studies demonstrated that Dscam1 diversity has a profound role in wiring the fly brain. Here we show how many isoforms are required to provide an identification system that prevents non-self branches from inappropriately recognizing each other. Using homologous recombination, we generated mutant animals encoding 12, 24, 576 and 1,152 potential isoforms. Mutant animals with deletions encoding 4,752 and 14,256 isoforms were also analysed. Branching phenotypes were assessed in three classes of neurons. Branching patterns improved as the potential number of isoforms increased, and this was independent of the identity of the isoforms. Although branching defects in animals with 1,152 potential isoforms remained substantial, animals with 4,752 isoforms were indistinguishable from wild-type controls. Mathematical modelling studies were consistent with the experimental results that thousands of isoforms are necessary to ensure acquisition of unique Dscam1 identities in many neurons. 
We conclude that thousands of isoforms are essential to provide neurons with a robust discrimination mechanism to distinguish between self and non-self during self-avoidance.

Abstract

Deletions within the neurexin 1 gene (NRXN1; 2p16.3) are associated with autism and have also been reported in two families with schizophrenia. We examined NRXN1, and the closely related NRXN2 and NRXN3 genes, for copy number variants (CNVs) in 2977 schizophrenia patients and 33 746 controls from seven European populations (Iceland, Finland, Norway, Germany, The Netherlands, Italy and UK) using microarray data. We found 66 deletions and 5 duplications in NRXN1, including a de novo deletion: 12 deletions and 2 duplications occurred in schizophrenia cases (0.47%) compared to 49 and 3 (0.15%) in controls. There was no common breakpoint and the CNVs varied from 18 to 420 kb. No CNVs were found in NRXN2 or NRXN3. We performed a Cochran-Mantel-Haenszel exact test to estimate association between all CNVs and schizophrenia (P = 0.13; OR = 1.73; 95% CI 0.81-3.50). Because the penetrance of NRXN1 CNVs may vary according to the level of functional impact on the gene, we next restricted the association analysis to CNVs that disrupt exons (0.24% of cases and 0.015% of controls). These were significantly associated with a high odds ratio (P = 0.0027; OR 8.97, 95% CI 1.8-51.9). We conclude that NRXN1 deletions affecting exons confer risk of schizophrenia.
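The carrier frequencies quoted in this abstract follow directly from the reported counts. The short check below is only an arithmetic sketch: the variable and function names are illustrative, and it does not attempt to reproduce the stratified Cochran-Mantel-Haenszel odds ratio, which depends on per-population counts not given here.

```python
# Counts quoted in the abstract above.
cases_total, controls_total = 2977, 33746
case_carriers = 12 + 2        # deletions + duplications in cases
control_carriers = 49 + 3     # deletions + duplications in controls


def carrier_percent(carriers, total):
    """Carrier frequency expressed as a percentage."""
    return 100.0 * carriers / total


case_pct = carrier_percent(case_carriers, cases_total)
control_pct = carrier_percent(control_carriers, controls_total)
print(f"cases: {case_pct:.2f}%  controls: {control_pct:.2f}%")
# prints "cases: 0.47%  controls: 0.15%"
```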

Abstract

Genome-wide association studies (GWAS) of longitudinal birth cohorts enable joint investigation of environmental and genetic influences on complex traits. We report GWAS results for nine quantitative metabolic traits (triglycerides, high-density lipoprotein, low-density lipoprotein, glucose, insulin, C-reactive protein, body mass index, and systolic and diastolic blood pressure) in the Northern Finland Birth Cohort 1966 (NFBC1966), drawn from the most genetically isolated Finnish regions. We replicate most previously reported associations for these traits and identify nine new associations, several of which highlight genes with metabolic functions: high-density lipoprotein with NR1H3 (LXRA), low-density lipoprotein with AR and FADS1-FADS2, glucose with MTNR1B, and insulin with PANK1. Two of these new associations emerged after adjustment of results for body mass index. Gene-environment interaction analyses suggested additional associations, which will require validation in larger samples. The currently identified loci, together with quantified environmental exposures, explain little of the trait variation in NFBC1966. The association observed between low-density lipoprotein and an infrequent variant in AR suggests the potential of such a cohort for identifying associations with both common, low-impact and rarer, high-impact quantitative trait loci.

Abstract

Illumina genotyping arrays provide information on DNA copy number. Current methodology for their analysis assumes linkage equilibrium across adjacent markers. This is unrealistic, given the markers' high density, and can result in reduced specificity. Another limitation of current methods is that they cannot be directly applied to the analysis of multiple samples with the goal of detecting copy number polymorphisms and their association with traits of interest. We propose a new Hidden Markov Model for Illumina genotype data that takes into account linkage disequilibrium between adjacent loci. Our framework also allows for location-specific deletion/duplication rates. When multiple samples are available, we describe a methodology for their analysis that simultaneously reconstructs the copy number states in each sample and identifies genomic locations with increased variability in copy number in the population. This approach can be extended to test association between copy number variants and a disease trait. We show that taking into account linkage disequilibrium between adjacent markers can increase the specificity of an HMM in reconstructing copy number variants, especially single copy deletions. Our multisample approach is computationally practical and can increase the power of association studies.

Abstract

Schizophrenia is a severe psychiatric disease with complex etiology, affecting approximately 1% of the general population. Most genetics studies so far have focused on disease association with common genetic variation, such as single-nucleotide polymorphisms (SNPs), but it has recently become apparent that large-scale genomic copy-number variants (CNVs) are involved in disease development as well. To assess the role of rare CNVs in schizophrenia, we screened 54 patients with deficit schizophrenia using Affymetrix's GeneChip 250K SNP arrays. We identified 90 CNVs in total, 77 of which have been reported previously in unaffected control cohorts. Among the genes disrupted by the remaining rare CNVs are MYT1L, CTNND2, NRXN1, and ASTN2, genes that play an important role in neuronal functioning but--except for NRXN1--have not been associated with schizophrenia before. We studied the occurrence of CNVs at these four loci in an additional cohort of 752 patients and 706 normal controls from The Netherlands. We identified eight additional CNVs, of which the four that affect coding sequences were found only in the patient cohort. Our study supports a role for rare CNVs in schizophrenia susceptibility and identifies at least three candidate genes for this complex disorder.

Abstract

Reduced fecundity, associated with severe mental disorders, places negative selection pressure on risk alleles and may explain, in part, why common variants have not been found that confer risk of disorders such as autism, schizophrenia and mental retardation. Thus, rare variants may account for a larger fraction of the overall genetic risk than previously assumed. In contrast to rare single nucleotide mutations, rare copy number variations (CNVs) can be detected using genome-wide single nucleotide polymorphism arrays. This has led to the identification of CNVs associated with mental retardation and autism. In a genome-wide search for CNVs associating with schizophrenia, we used a population-based sample to identify de novo CNVs by analysing 9,878 transmissions from parents to offspring. The 66 de novo CNVs identified were tested for association in a sample of 1,433 schizophrenia cases and 33,250 controls. Three deletions at 1q21.1, 15q11.2 and 15q13.3 showing nominal association with schizophrenia in the first sample (phase I) were followed up in a second sample of 3,285 cases and 7,951 controls (phase II). All three deletions significantly associate with schizophrenia and related psychoses in the combined sample. The identification of these rare, recurrent risk variants, having occurred independently in multiple founders and being subject to negative selection, is important in itself. CNV analysis may also point the way to the identification of additional and more prevalent risk variants in genes and pathways involved in schizophrenia.

Abstract

To investigate the clinical features and natural history of mal de debarquement (MdD). Retrospective case review with follow-up questionnaire and telephone interviews. University Neurotology Clinic. Patients seen between 1980 and 2006 who developed a persistent sensation of rocking or swaying for at least 3 days after exposure to passive motion. Clinical features, diagnostic testing, and questionnaire responses. Of 64 patients (75% women) identified with MdD, 34 completed follow-up questionnaires and interviews in 2006. Most patients had normal neurological exams, ENGs and brain MRIs. The average age of the first MdD episode was 39 +/- 13 years. A total of 206 episodes were experienced by 64 patients. Of these, 104 episodes (51%) lasted >1 month; 18%, >1 year; 15%, >2 years; 12%, >4 years, and 11%, >5 years. Eighteen patients (28%) subsequently developed spontaneous episodes of MdD-like symptoms after the initial MdD episode. There was a much higher rate of migraine in patients who went on to develop spontaneous episodes (73%) than in those who did not (22%). Subsequent episodes were longer than earlier ones in most patients who had multiple episodes. Re-exposure to passive motion temporarily decreased symptoms in most patients (66%). Subjective intolerance to visual motion increased (10% to 66%) but self-motion sensitivity did not (37% to 50%) with onset of MdD. The majority of MdD episodes lasting longer than 3 days resolve in less than one year but the probability of resolution declines each year. Many patients experience multiple MdD episodes. Some patients develop spontaneous episodes after the initial motion-triggered episode with migraine being a risk factor.

Abstract

Affymetrix's SNP (single-nucleotide polymorphism) genotyping chips have increased the scope and decreased the cost of gene-mapping studies. Because each SNP is queried by multiple DNA probes, the chips present interesting challenges in genotype calling. Traditional clustering methods distinguish the three genotypes of an SNP fairly well given a large enough sample of unrelated individuals or a training sample of known genotypes. This article describes our attempt to improve genotype calling by constructing Gaussian mixture models with empirically derived priors. The priors stabilize parameter estimation and borrow information collectively gathered on tens of thousands of SNPs. When data from related family members are available, our models capture the correlations in signals between relatives. With these advantages in mind, we apply the models to Affymetrix probe intensity data on 10,000 SNPs gathered on 63 genotyped individuals spread over eight pedigrees. We integrate the genotype-calling model with pedigree analysis and examine a sequence of symmetry hypotheses involving the correlated probe signals. The symmetry hypotheses raise novel mathematical issues of parameterization. Using the Bayesian information criterion, we select the best combination of symmetry assumptions. Compared to Affymetrix's software, our model leads to a reduction in no-calls with little sacrifice in overall calling accuracy.

Abstract

We propose a new method for haplotyping, genotype calling, and association testing based on a dictionary model for haplotypes. In this framework, a haplotype arises as a concatenation of conserved haplotype segments, drawn from a predefined dictionary according to segment-specific probabilities. The observed data consist of unphased multimarker genotypes gathered on a random sample of unrelated individuals. These genotypes are subject to mutation, genotyping errors, and missing data. The true pair of haplotypes corresponding to a person's multimarker genotype is reconstructed using a Markov chain that visits haplotype pairs according to their posterior probabilities. Our implementation of the chain alternates Gibbs steps, which rearrange the phase of a single marker, and Metropolis steps, which swap maternal and paternal haplotypes from a given marker onward. Outputs of the chain include the most likely haplotype pairs, the most likely genotypes at each marker, and the expected number of occurrences of each haplotype segment. Reconstruction accuracy is comparable to that achieved by the best existing algorithms. More importantly, the dictionary model yields expected counts of conserved haplotype segments. These imputed counts can serve as genetic predictors in association studies, as we illustrate by examples on cystic fibrosis, Friedreich's ataxia, and angiotensin-I converting enzyme levels.

Abstract

Population isolates may be particularly useful for association studies of complex traits. This utility, however, largely depends on the transferability of tag SNPs chosen from reference samples, such as HapMap, to samples from such populations. Factors that characterize population isolates, such as widespread genetic drift, could impede such transferability. In this report, we show that tag SNPs chosen from HapMap perform well in several population isolates; this is true even for populations that differ substantially from the HapMap sample either in levels of linkage disequilibrium or in SNP allele frequency distributions.

Abstract

Tourette disorder (TD) is a neuropsychiatric disorder with a complex mode of inheritance and is characterized by multiple waxing and waning motor and phonic tics. This article reports the results of the largest genetic linkage study yet undertaken for TD. The sample analyzed includes 238 nuclear families yielding 304 "independent" sibling pairs and 18 separate multigenerational families, for a total of 2,040 individuals. A whole-genome screen with the use of 390 microsatellite markers was completed. Analyses were completed using two diagnostic classifications: (1) only individuals with TD were included as affected and (2) individuals with either TD or chronic-tic (CT) disorder were included as affected. Strong evidence of linkage was observed for a region on chromosome 2p (-log P = 4.42, P = 3.8 x 10^-5) in the analyses that included individuals with TD or CT disorder as affected. Results in several other regions also provide moderate evidence (-log P > 2.0) of additional susceptibility loci for TD.

Abstract

Coexistent migraine affects relevant clinical features of patients with Ménière's disease (MD). Epidemiological studies have shown an association between migraine and MD. We sought to determine whether the coexistence of migraine affects any clinical features in patients with MD. In this retrospective case-control study of University Neurotology Clinic patients, 50 patients meeting 1995 AAO-HNS criteria for definite MD were compared to 18 patients meeting the same criteria in addition to the 2004 IHS criteria for migraine (MMD). All had typical low frequency sensorineural hearing loss and episodes of rotational vertigo. Outcome measures included: sex, age of onset of episodic vertigo or fluctuating hearing loss, laterality of hearing loss, aural symptoms, caloric responses, severity of hearing loss, and family history of migraine, episodic vertigo or hearing loss. Age of onset of episodic vertigo or fluctuating hearing loss was significantly lower in patients with MMD (mean +/- 1.96*SE = 37.2 +/- 6.3 years) than in those with MD (mean +/- 1.96*SE = 49.3 +/- 4.4 years). Concurrent bilateral aural symptoms and hearing loss were seen in 56% of MMD and 4% of MD patients. A family history of episodic vertigo was seen in 39% of MMD and 2% of MD patients.

Abstract

We consider the problem of controlling false discoveries in association studies. We assume that the design of the study is adequate, so that any "false discoveries" are due only to random chance and not to confounding or other flaws. Under this premise, we review the statistical framework for hypothesis testing and correction for multiple comparisons. We consider in detail the currently accepted strategies in linkage analysis. We then examine the underlying similarities and differences between linkage and association studies and document some of the most recent methodological developments for association mapping.

Abstract

Defining measures of linkage disequilibrium (LD) that have good small sample properties and are applicable to multiallelic markers poses some challenges. The potential of volume measures in this context has been noted before, but their use has been hampered by computational challenges. We design a sequential importance sampling algorithm to evaluate volume measures on I x J tables. The algorithm is implemented in a C routine as a complement to exhaustive enumeration. We make the C code available as open source. We achieve fast and accurate evaluation of volume measures in two-dimensional tables. Applying our code to simulated and real datasets reinforces the belief that volume measures are a very useful tool for LD evaluation: they are not inflated in small samples, their definition encompasses multiallelic markers, and they can be computed with appreciable speed.

Abstract

Rare sequence variants may be important in understanding the biology of common diseases, but clearly establishing their association with disease is often difficult. Association studies of such variants are becoming increasingly common as large-scale sequence analysis of candidate genes has become feasible. A recent report suggested SLITRK1 (Slit and Trk-like 1) as a candidate gene for Tourette Syndrome (TS). The statistical evidence for this suggestion came from association analyses of a rare 3'-UTR variant, var321, which was observed in two patients but not observed in more than 2000 controls. We genotyped 307 Costa Rican and 515 Ashkenazi individuals (TS probands and their parents) and observed var321 in five independent Ashkenazi parents, two of whom did not transmit this variant to their affected child. Furthermore, we identified var321 in one subject from an Ashkenazi control sample. Our findings do not support the previously reported association and suggest that var321 is overrepresented among Ashkenazi Jews compared with other populations of European origin. The results further suggest that overrepresentation of rare variants in a specific ethnic group may complicate the interpretation of association analyses of such variants, highlighting the particular importance of precisely matching case and control populations for association analyses of rare variants.

Abstract

We performed a whole genome microsatellite marker scan in six multiplex families with bipolar (BP) mood disorder ascertained in Antioquia, a historically isolated population from North West Colombia. These families were characterized clinically using the approach employed in independent ongoing studies of BP in the closely related population of the Central Valley of Costa Rica. The most consistent linkage results from parametric and non-parametric analyses of the Colombian scan involved markers on 5q31-33, a region implicated by the previous studies of BP in Costa Rica. Because of these concordant results, a follow-up study with additional markers was undertaken in an expanded set of Colombian and Costa Rican families; this provided genome-wide significant evidence of linkage of BPI to a candidate region of approximately 10 cM in 5q31-33 (maximum non-parametric linkage score = 4.395, P < 0.00004). Interestingly, this region has been implicated in several previous genetic studies of schizophrenia and psychosis, including disease association with variants of the enthoprotin and gamma-aminobutyric acid receptor genes.

Abstract

We have ascertained in the Central Valley of Costa Rica a new kindred (CR201) segregating for severe bipolar disorder (BP-I). The family was identified by tracing genealogical connections among eight persons initially independently ascertained for a genome wide association study of BP-I. For the genome screen in CR201, we trimmed the family down to 168 persons (82 of whom are genotyped), containing 25 individuals with a best-estimate diagnosis of BP-I. A total of 4,690 SNP markers were genotyped. Analysis of the data was hampered by the size and complexity of the pedigree, which prohibited using exact multipoint methods on the entire kindred. Two-point parametric linkage analysis, using a conservative model of transmission, produced a maximum LOD score of 2.78 on chromosome 6, and a total of 39 loci with LOD scores >1.0. Multipoint parametric and non-parametric linkage analysis was performed separately on four sections of CR201, and interesting (nominal P-value from either analysis <0.01), although not statistically significant, regions were highlighted on chromosomes 1, 2, 3, 12, 16, 19, and 22, in at least one section of the pedigree, or when considering all sections together. The difficulties of analyzing genome wide SNP data for complex disorders in large, potentially informative, kindreds are discussed.

Abstract

The genome-wide distribution of linkage disequilibrium (LD) determines the strategy for selecting markers for association studies, but it varies between populations. We assayed LD in large samples (200 individuals) from each of 11 well-described population isolates and an outbred European-derived sample, using SNP markers spaced across chromosome 22. Most isolates show substantially higher levels of LD than the outbred sample and many fewer regions of very low LD (termed 'holes'). Young isolates known to have had relatively few founders show particularly extensive LD with very few holes; these populations offer substantial advantages for genome-wide association mapping.

Abstract

We propose a dictionary model for haplotypes. According to the model, a haplotype is constructed by randomly concatenating haplotype segments from a given dictionary of segments. A haplotype block is defined as a set of haplotype segments that begin and end with the same pair of markers. In this framework, haplotype blocks can overlap, and the model provides a setting for testing the accuracy of simpler models invoking only nonoverlapping blocks. Each haplotype segment in a dictionary has an assigned probability and alternate spellings that account for genotyping errors and mutation. The model also allows for missing data, unphased genotypes, and prior distribution of parameters. Likelihood evaluations rely on forward and backward recurrences similar to the ones encountered in hidden Markov models. Parameter estimation is carried out with an EM algorithm. The search for the optimal dictionary is particularly difficult because of the variable dimension of the model space. We define a minimum description length criterion to evaluate each dictionary and use a combination of greedy search and careful initialization to select a best dictionary for a given dataset. Application of the model to simulated data gives encouraging results. In a real dataset, we are able to reconstruct a parsimonious dictionary that captures patterns of linkage disequilibrium well.
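The forward recurrence mentioned in this abstract can be illustrated with a minimal sketch. It assumes a heavily simplified dictionary model in which a fully phased haplotype is an exact concatenation of independently drawn segments, so alternate spellings, missing data, and unphased genotypes are all ignored. The function name, the toy dictionary, and its probabilities are hypothetical and are not taken from the paper.

```python
def forward_likelihood(haplotype, dictionary):
    """Sum, over all segmentations, of the product of segment probabilities.

    dictionary maps a segment string to its probability (toy values).
    F[i] accumulates the probability of every way to spell haplotype[:i].
    """
    n = len(haplotype)
    F = [0.0] * (n + 1)
    F[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for i in range(1, n + 1):
        for segment, prob in dictionary.items():
            k = len(segment)
            # A segment contributes if it matches the last k alleles of the prefix.
            if k <= i and haplotype[i - k:i] == segment:
                F[i] += prob * F[i - k]
    return F[n]


# Hypothetical dictionary over alleles coded as "0" and "1".
toy_dictionary = {"0": 0.2, "1": 0.2, "01": 0.3, "110": 0.3}
likelihood = forward_likelihood("0110", toy_dictionary)
```

For the haplotype "0110" the recurrence sums three segmentations (0|1|1|0, 01|1|0, and 0|110), doing work proportional to haplotype length times dictionary size rather than enumerating segmentations explicitly.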

Abstract

In systems like Escherichia coli, the abundance of sequence information, gene expression array studies and small scale experiments allows one to reconstruct the regulatory network and to quantify the effects of transcription factors on gene expression. However, this goal can only be achieved if all information sources are used in concert. Our method integrates literature information, DNA sequences and expression arrays. A set of relevant transcription factors is defined on the basis of literature. Sequence data are used to identify potential target genes and the results are used to define a prior distribution on the topology of the regulatory network. A Bayesian hidden component model for the expression array data allows us to identify which of the potential binding sites are actually used by the regulatory proteins in the studied cell conditions, the strength of their control, and their activation profile in a series of experiments. We apply our methodology to 35 expression studies in E. coli with convincing results. www.genetics.ucla.edu/labs/sabatti/software.html Supplementary material is available at Bioinformatics online.

Abstract

Benign recurrent vertigo (BRV) is a common disorder affecting up to 2% of the adult population and may be etiologically related to migraine because of similarities in the clinical spectrum of the phenotypes and a high co-morbidity within families. Many families have multiple affected, genetically related individuals, suggesting familial transmission of the disorder with moderate to high penetrance. Although BRV is clinically similar to episodic ataxias, no genes contributing to BRV have been identified and no systematic linkage studies have been performed. In an initial effort to genetically define BRV, we have selected from our Neurology Clinic population a subset of 20 multigenerational families with apparent autosomal dominant transmission, and performed genetic linkage mapping using both parametric and non-parametric linkage (NPL) approaches. The Affymetrix 10K SNP Mapping Assay was used for the genotyping. Heterogeneity LOD (HLOD) analysis reveals evidence of genetic heterogeneity for BRV and evidence of linkage in a subset of the families to 22q12 (HLOD = 4.02). An additional region was identified by NPL analysis at 5p15 (LOD = 2.63). As migraine is observed substantially more commonly both within the BRV-affected individuals and the related family members, it is possible that a form of migraine is allelic to the BRV locus at 22q12. However, testing linkage of the chromosome 22q12 region to a broader migraine/vertigo phenotype by defining affectation status as either migrainous headaches or BRV greatly weakened the linkage signal, and no other significant peaks were detected. Thus, BRV and migraine do not appear to be allelic disorders within these families. We conclude that BRV is a heterogeneous genetic disorder, appears genetically distinct from migraine with aura and is linked to 22q12. Additional family and population-based linkage and association studies will be needed to determine the causative alleles.

Abstract

Analyze the information contained in homozygous haplotypes detected with high density genotyping. We analyze the genotypes of approximately 2,500 markers on chr 22 in 12 population samples, each including 200 individuals. We develop a measure of disequilibrium based on haplotype homozygosity and an algorithm to identify genomic segments characterized by non-random homozygosity (NRH), taking into account allele frequencies, missing data, genotyping error, and linkage disequilibrium. We show how our measure of linkage disequilibrium based on homozygosity leads to results comparable to those of R^2, as well as the importance of correcting for small sample variation when evaluating D'. We observe that the regions that harbor NRH segments tend to be consistent across populations, are gene rich, and are characterized by lower recombination. It is crucial to take into account LD patterns when interpreting long stretches of homozygous markers.
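For readers unfamiliar with the two reference measures named in this abstract, the sketch below computes the textbook D' and r^2 for a pair of biallelic markers from known haplotype and allele frequencies. It is not the homozygosity-based measure the abstract proposes, and it applies no small-sample correction. The function name and example frequencies are illustrative assumptions.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, signed D', r^2) for two biallelic markers.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: marginal allele frequencies, each strictly between 0 and 1
    """
    d = p_ab - p_a * p_b
    # D' rescales D by its largest attainable magnitude given the margins.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between the two allele indicators.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


d, d_prime, r2 = ld_measures(p_ab=0.4, p_a=0.5, p_b=0.5)
```

With these example frequencies D is 0.15, D' is 0.6, and r^2 is 0.36, which shows that the two measures can differ noticeably for the same pair of markers.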

Abstract

Late endosomes and lysosomes of mammalian cells in interphase tend to concentrate in the perinuclear region that harbors the microtubule-organizing center. We have previously reported abnormal distribution of these organelles - as judged by reduced percentages of cells displaying pronounced perinuclear accumulation - in mutant fibroblasts lacking BLOC-3 (for 'biogenesis of lysosome-related organelles complex 3'). BLOC-3 is a protein complex that contains the products of the genes mutated in Hermansky-Pudlak syndrome types 1 and 4. Here, we developed a method based on image analysis to estimate the extent of organelle clustering in the perinuclear region of cultured cells. Using this method, we corroborated that the perinuclear clustering of late endocytic organelles containing Lamp1 (for 'lysosome-associated membrane protein 1') is reduced in BLOC-3-deficient murine fibroblasts, and found that it is apparently normal in fibroblasts deficient in BLOC-1 or BLOC-2, which are two other protein complexes associated with Hermansky-Pudlak syndrome. Wild-type and mutant fibroblasts were transfected to express human LAMP1 fused at its cytoplasmic tail to green fluorescent protein (GFP). At low expression levels, LAMP1-GFP was targeted correctly to late endocytic organelles in both wild-type and mutant cells. High levels of LAMP1-GFP overexpression elicited aberrant aggregation of late endocytic organelles, a phenomenon that probably involved formation of anti-parallel dimers of LAMP1-GFP as it was not observed in cells expressing comparable levels of a non-dimerizing mutant variant, LAMP1-mGFP. To test whether BLOC-3 plays a role in the movement of late endocytic organelles, time-lapse fluorescence microscopy experiments were performed using live cells expressing low levels of LAMP1-GFP or LAMP1-mGFP.
Although active movement of late endocytic organelles was observed in both wild-type and mutant fibroblasts, quantitative analyses revealed a relatively lower frequency of microtubule-dependent movement events, either towards or away from the perinuclear region, within BLOC-3-deficient cells. By contrast, neither the duration nor the speed of these microtubule-dependent events seemed to be affected by the lack of BLOC-3 function. These results suggest that BLOC-3 function is required, directly or indirectly, for optimal attachment of late endocytic organelles to microtubule-dependent motors.

Abstract

The authors recently introduced a framework, named Network Component Analysis (NCA), for the reconstruction of the dynamics of transcriptional regulators' activities from gene expression assays. The original formulation had certain shortcomings that limited NCA's application to a wide class of network dynamics reconstruction problems, either because of limitations in the sample size or because of the stringent requirements imposed by the set of identifiability conditions. In addition, the performance characteristics of the method for various levels of data noise or in the presence of model inaccuracies were never investigated. In this article, the following aspects of NCA have been addressed, resulting in a set of extensions to the original framework: 1) The sufficient conditions on the a priori connectivity information (required for successful reconstructions via NCA) are made less stringent, allowing easier verification of whether a network topology is identifiable, as well as extending the class of identifiable systems. Such a result is accomplished by introducing a set of identifiability requirements that can be directly tested on the regulatory architecture, rather than on specific instances of the system matrix. 2) The two-stage least squares iterative procedure used in NCA is proven to identify stationary points of the likelihood function, under a Gaussian noise assumption, thus reinforcing the statistical foundations of the method. 3) A framework for the simultaneous reconstruction of multiple regulatory subnetworks is introduced, thus overcoming one of the critical limitations of the original formulation of the decomposition, for example, occurring for poorly sampled data (typical of microarray experiments).
A set of Monte Carlo simulations we conducted with synthetic data suggests that the approach is indeed capable of accurately reconstructing regulatory signals when these are the input of large-scale networks that satisfy the suggested identifiability criteria, even under fairly noisy conditions. The sensitivity of the reconstructed signals to inaccuracies in the hypothesized network topology is also investigated. We demonstrate the feasibility of our approach for the simultaneous reconstruction of multiple regulatory subnetworks from the same data set with a successful application of the technique to gene expression measurements of the bacterium Escherichia coli.

Abstract

Gene expression arrays enable measurements of transcription values for a large number of genes, or all genes, in the genome. In order to better interpret these results and to use them to reconstruct transcription networks, information on the location of binding sites for regulatory proteins in the entire genome is needed. In particular, this represents an open problem in Escherichia coli. We describe the first implementation of dictionary-style models to the study of transcription factor binding sites in an entire genome. Vocabulon's unique feature is that it can both reconstruct binding sites characterized by unknown motifs and impute locations of known binding sites in long sequences by simultaneous search. On one hand, the dictionary model specifies a probability for the entire sequence, taking simultaneously into account all the possible binding sites. This greatly reduces the number of false positives. On the other hand, the possibility of refining motif description, as an increasing number of binding sites are identified, augments the sensitivity of the method. We illustrate these properties with examples in E. coli. The results of gene expression arrays are used both to guide the search and corroborate it.

Abstract

Gene microarray technology is often used to compare the expression of thousands of genes in two different cell lines. Typically, one does not expect measurable changes in transcription amounts for a large number of genes; furthermore, the noise level of array experiments is rather high in relation to the available number of replicates. For the purpose of statistical analysis, inference on the "population" difference in expression for genes across the two cell lines is often cast in the framework of hypothesis testing, with the null hypothesis being no change in expression. Given that thousands of genes are investigated at the same time, this requires some multiple comparison correction procedure to be in place. We argue that hypothesis testing, with its emphasis on type I error and family analogues, may not address the exploratory nature of most microarray experiments. We instead propose viewing the problem as one of estimation of a vector known to have a large number of zero components. In a Bayesian framework, we describe the prior knowledge on expression changes using mixture priors that incorporate a mass at zero, and we choose a loss function that favors the selection of sparse solutions. We consider two different models applicable to the microarray problem, depending on the nature of replicates available, and show how to explore the posterior distributions of the parameters using MCMC. Simulations show an interesting connection between this Bayesian estimation framework and false discovery rate (FDR) control. Finally, two empirical examples illustrate the practical advantages of this Bayesian estimation paradigm.

Abstract

Of the more than 40 genetically defined dominantly inherited hearing loss syndromes, only a few are associated with bilateral vestibulopathy. No genetic mutations have been identified in families with bilateral vestibulopathy and normal hearing. To perform a genome-wide scan for linkage in four families with dominantly inherited bilateral vestibulopathy. Patients in four families reported brief episodes of vertigo followed by imbalance and oscillopsia. Bilateral vestibulopathy was documented with quantitative rotational testing. Most patients with bilateral vestibulopathy also had migraine. A 10 cM genome-wide screen was conducted using 423 microsatellite markers to identify linkage with vestibulopathy. The authors identified a 24 cM region on chromosome 6q suggestive of linkage to vestibulopathy in these four families (maximum lod score of 2.9 at marker D6S1556). A small fifth family with a different phenotype was not linked to this region on chromosome 6q. This is the first report of linkage in families with dominantly inherited vestibulopathy and normal hearing. Genetic heterogeneity is likely with inherited vestibulopathy.

Abstract

We describe a unique family in which several individuals are affected with episodes of ataxia that best fit the phenotype of episodic ataxia type 2 (EA2). All of the affected family members had episodes typically lasting for several hours, and none of them had muscle abnormalities including myokymia. Episodic ataxia type 1 (EA1) was not considered initially as a clinical diagnosis for the affected individuals in this family. However, by linkage mapping, sequencing and polymorphism analysis, all affected individuals were found to have a novel mutation in KCNA1. Numerous missense mutations have been described previously in KCNA1 that cause EA1. The mutation c.1025G>T replaces a highly conserved serine with isoleucine at position 342 (p.Ser342Ile) in the highly conserved fifth transmembrane domain of KCNA1. This mutation leads to a distinct clinical phenotype without myokymia, broadening the scope of clinical characteristics of EA1 and highlighting the heterogeneity of phenotypic effects from distinct missense mutations.

The use of pedigree, sib-pair and association studies of common diseases for genetic mapping and epidemiology. NATURE GENETICS. Freimer, N., Sabatti, C. 2004; 36 (10): 1045-1051

Abstract

Efforts to identify gene variants associated with susceptibility to common diseases use three approaches: pedigree and affected sib-pair linkage studies and association studies of population samples. The different aims of these study designs reflect their derivation from biological versus epidemiological traditions. Similar principles regarding determination of the evidence levels required to consider the results statistically significant apply to both linkage and association studies, however. Such determination requires explicit attention to the prior probability of particular findings, as well as appropriate correction for multiple comparisons. For most common diseases, increasing the sample size in a study is a crucial step in achieving statistically significant genetic mapping results. Recent studies suggest that the technology and statistical methodology will soon be available to make well-powered studies feasible using any of these approaches.

Abstract

Cells adjust gene expression profiles in response to environmental and physiological changes through a series of signal transduction pathways. Upon activation or deactivation, the terminal regulators bind to or dissociate from DNA, respectively, and modulate transcriptional activities on particular promoters. Traditionally, individual reporter genes have been used to detect the activity of the transcription factors. This approach works well for simple, non-overlapping transcription pathways. For complex transcriptional networks, more sophisticated tools are required to deconvolute the contribution of each regulator. Here, we demonstrate the utility of network component analysis in determining multiple transcription factor activities based on transcriptome profiles and available information regarding network connectivity. We used Escherichia coli carbon source transition from glucose to acetate as a model system. Key results from this analysis were either consistent with physiology or verified by using independent measurements.

Abstract

A semiblind deconvolution method of analysis for gene expression data was proposed recently in a series of articles that appeared in PNAS. We illustrate here how similar goals can be achieved in a Bayesian framework and how necessary information on the presence of binding sites can be obtained with Vocabulon, an algorithm based on a stochastic dictionary model.

Abstract

High-dimensional data sets generated by high-throughput technologies, such as DNA microarrays, are often the outputs of complex networked systems driven by hidden regulatory signals. Traditional statistical methods for computing low-dimensional or hidden representations of these data sets, such as principal component analysis and independent component analysis, ignore the underlying network structures and provide decompositions based purely on a priori statistical constraints on the computed component signals. The resulting decomposition thus provides a phenomenological model for the observed data and does not necessarily contain physically or biologically meaningful signals. Here, we develop a method, called network component analysis, for uncovering hidden regulatory signals from outputs of networked systems, when only a partial knowledge of the underlying network topology is available. The a priori network structure information is first tested for compliance with a set of identifiability criteria. For networks that satisfy the criteria, the signals from the regulatory nodes and their strengths of influence on each output node can be faithfully reconstructed. This method is first validated experimentally by using the absorbance spectra of a network of various hemoglobin species. The method is then applied to microarray data generated from the yeast Saccharomyces cerevisiae and the activities of various transcription factors during cell cycle are reconstructed by using recently discovered connectivity information for the underlying transcriptional regulatory networks.

Abstract

The genetic programs underlying neural stem cell (NSC) proliferation and pluripotentiality have only been partially elucidated. We compared the gene expression profile of proliferating neural stem cell cultures (NS) with cultures differentiated for 24 h (DC) to identify functionally coordinated alterations in gene expression associated with neural progenitor proliferation. The majority of differentially expressed genes (65%) were upregulated in NS relative to DC. Microarray analysis of this in vitro system was followed by high throughput screening in situ hybridization to identify genes enriched in the germinal neuroepithelium, so as to distinguish those expressed in neural progenitors from those expressed in more differentiated cells in vivo. NS cultures were characterized by the coordinate upregulation of genes involved in cell cycle progression, DNA synthesis, and metabolism, not simply related to general features of cell proliferation, since many of the genes identified were highly enriched in the CNS ventricular zones and not widely expressed in other proliferating tissues. Components of specific metabolic and signal transduction pathways, and several transcription factors, including Sox3, FoxM1, and PTTG1, were also enriched in neural progenitor cultures. We propose a putative network of gene expression linking cell cycle control to cell fate pathways, providing a framework for further investigations of neural stem cell proliferation and differentiation.

Abstract

We explore the implications of the false discovery rate (FDR) controlling procedure in disease gene mapping. With the aid of simulations, we show how, under models commonly used, the simple step-down procedure introduced by Benjamini and Hochberg controls the FDR for the dependent tests on which linkage and association genome screens are based. This adaptive multiple comparison procedure may offer an important tool for mapping susceptibility genes for complex diseases.
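As an illustration of the kind of procedure discussed in this abstract, the following is a minimal sketch of the Benjamini and Hochberg rule in its standard textbook formulation: sort the p-values, find the largest rank k whose p-value is at most k·q/m, and reject the k smallest. The function name `benjamini_hochberg`, the default level `q=0.05`, and the example p-values are illustrative assumptions and are not taken from the paper, which studies the behaviour of the procedure under dependent tests by simulation.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Standard Benjamini-Hochberg rule at target FDR level q (illustrative
    sketch; not the authors' code).
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the tests ordered by p-value, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= k * q / m (0 if none).
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Made-up p-values, for demonstration only.
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(example, q=0.05))
```

Because the cutoff is the largest qualifying rank, a p-value can be rejected even when some smaller p-value misses its own threshold; this is what makes the rule adaptive to the number of true signals.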

Abstract

The prediction of operons, the smallest unit of transcription in prokaryotes, is the first step towards reconstruction of a regulatory network at the whole genome level. Sequence information, in particular the distance between open reading frames, has been used to predict if adjacent Escherichia coli genes are in an operon. While appreciably successful, these predictions need to be validated and refined experimentally. As a growing number of gene expression array experiments on E. coli became available, we investigated to what extent they could be used to improve and validate these predictions. To this end, we examined a large collection of published microarray data. The correlation between expression ratios of adjacent genes was used in a Bayesian classification scheme to predict whether the genes are in an operon or not. We found that for the genes whose expression levels change significantly across the experiments in the data set, the currently available gene expression data allowed a significant refinement of the sequence-based predictions. We report these co-expression correlations in an E. coli genomic map. For a significant portion of gene pairs, however, the set of array experiments considered did not contain sufficient information to determine whether they are in the same transcriptional unit. This is not due to unreliability of the array data per se, but to the design of the experiments analyzed. In general, experiments that perturb a large number of genes offer more information for operon prediction than confined perturbations. These results provide a rationale for conducting expression studies comparing conditions that cause global changes in gene expression.

Abstract

We illustrate how homozygosity of haplotypes can be used to measure the level of disequilibrium between two or more markers. An excess of either homozygosity or heterozygosity signals a departure from the gametic phase equilibrium. We describe the specific form of dependence that is associated with high (low) homozygosity and derive various linkage disequilibrium measures. They feature a clear biological interpretation, can be used to construct tests, and are standardized to allow comparison across loci and populations. They are particularly advantageous for measuring linkage disequilibrium between highly polymorphic markers.
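The basic idea can be sketched as follows, under simplifying assumptions: for two markers with phased haplotypes, homozygosity is estimated as the sum of squared relative frequencies, and under gametic phase equilibrium the haplotype homozygosity equals the product of the two single-marker homozygosities. The function names and the toy haplotypes are hypothetical, the plug-in estimator is a simplification, and the raw difference returned here is not one of the standardized measures derived in the paper.

```python
from collections import Counter


def homozygosity(counts):
    """Plug-in homozygosity: sum of squared relative frequencies."""
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())


def homozygosity_excess(haplotypes):
    """Compare observed haplotype homozygosity with its equilibrium value.

    haplotypes: list of (allele_at_marker_1, allele_at_marker_2) tuples.
    Returns (observed, expected_under_equilibrium, observed - expected).
    Illustrative sketch only; unstandardized.
    """
    observed = homozygosity(Counter(haplotypes))
    # Under equilibrium, haplotype frequencies are products of allele
    # frequencies, so the homozygosities multiply.
    first = homozygosity(Counter(h[0] for h in haplotypes))
    second = homozygosity(Counter(h[1] for h in haplotypes))
    expected = first * second
    return observed, expected, observed - expected


if __name__ == "__main__":
    # Toy samples, made up for demonstration.
    in_equilibrium = [("A", "B"), ("A", "b"), ("a", "B"), ("a", "b")]
    fully_associated = [("A", "B"), ("A", "B"), ("a", "b"), ("a", "b")]
    print(homozygosity_excess(in_equilibrium))
    print(homozygosity_excess(fully_associated))
```

A positive difference indicates more haplotype sharing than independence between the markers would produce; the toy sample with perfectly associated alleles shows the largest excess these allele frequencies allow.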

Abstract

We consider array experiments that compare expression levels of a large number of genes in two cell lines with few repetitions and with no subject effect. We develop a statistical model that illustrates under which assumptions thresholding is optimal in the analysis of such microarray data. The results of our model explain the success of the empirical rule of two-fold change. We illustrate a thresholding procedure that is adaptive to the noise level of the experiment, the number of genes analyzed, and the number of genes that truly change expression level. This procedure, in a world of perfect knowledge on noise distribution, would allow reconstruction of a sparse signal, minimizing the false discovery rate. Given the amount of information actually available, the thresholding rule described provides a reasonable estimator for the change in expression of any gene in two compared cell lines.

Abstract

Archival formalin-fixed, paraffin-embedded and ethanol-fixed tissues represent a potentially invaluable resource for gene expression analysis, as they are the most widely available material for studies of human disease. Little data are available evaluating whether RNA obtained from fixed (archival) tissues could produce reliable and reproducible microarray expression data. Here we compare the use of RNA isolated from human archival tissues fixed in ethanol and formalin to frozen tissue in cDNA microarray experiments. Since an additional factor that can limit the utility of archival tissue is the often small quantities available, we also evaluate the use of the tyramide signal amplification method (TSA), which allows the use of small amounts of RNA. Detailed analysis indicates that TSA provides a consistent and reproducible signal amplification method for cDNA microarray analysis, across both arrays and the genes tested. Analysis of this method also highlights the importance of performing non-linear channel normalization and dye switching. Furthermore, archived, fixed specimens can perform well, but not surprisingly, produce more variable results than frozen tissues. Consistent results are more easily obtainable using ethanol-fixed tissues, whereas formalin-fixed tissue does not typically provide a useful substrate for cDNA synthesis and labeling.

Abstract

Compared to mixed populations, population isolates such as Finland show distinct differences in the prevalence of disease mutations. However, little information exists on differences in the prevalence of disease alleles among regional populations with different histories of multiple bottlenecks. We constructed a DNA-array and monitored the prevalence of 31 rare and common disease mutations underlying 27 clinical phenotypes in a large population-based study sample. Over 64 000 genotypes were assigned in 2151 samples from four geographical areas representing early and late settlement regions of Finland. Each sample was analyzed in duplicate and a total of 142 000 array-derived genotyping calls were made. On average one in three individuals was found to be a carrier of one of the 31 monitored mutations. This should remove fears of the stigmatizing effect of a carrier-screening program monitoring multiple diseases. Regional differences were found in the prevalence of mutations, providing molecular evidence for the deviating population histories of regional subisolates. The mutations introduced early into the population revealed relatively even distribution in different subregions. More recently introduced rare mutations showed local clustering of disease alleles, indicating the persistence of population subisolates and the effect of multiple bottlenecks in molding the population gene pool. Regional differences were also observed for common disease alleles. Such precise information on carrier frequencies could form the basis for targeted genetic screens in this population. Our approach describes a general paradigm for large-scale carrier-screening programs in other populations as well.

Abstract

Haplotype analysis of disease chromosomes can help identify probable historical recombination events and localize disease mutations. Most available analyses use only marginal and pairwise allele frequency information. We have developed a Bayesian framework that utilizes full haplotype information to overcome various complications such as multiple founders, unphased chromosomes, data contamination, and incomplete marker data. A stochastic model is used to describe the dependence structure among several variables characterizing the observed haplotypes, for example, the ancestral haplotypes and their ages, mutation rate, recombination events, and the location of the disease mutation. An efficient Markov chain Monte Carlo algorithm was developed for computing the estimates of the quantities of interest. The method is shown to perform well in both real data sets (cystic fibrosis data and Friedreich ataxia data) and simulated data sets. The program that implements the proposed method, BLADE, as well as the two real datasets, can be obtained from http://www.fas.harvard.edu/~junliu/TechRept/01folder/diseq_prog.tar.gz.

Abstract

To develop diagnostic testing guidelines for the DYT1 GAG deletion in the Ashkenazi Jewish (AJ) and non-Jewish (NJ) primary torsion dystonia (PTD) populations and to determine the range of dystonic features in affected DYT1 deletion carriers. The authors screened 267 individuals with PTD; 170 were clinically ascertained for diagnosis and treatment, 87 were affected family members ascertained for genetic studies, and 10 were clinically and genetically ascertained and included in both groups. We used published primers and PCR amplification across the critical DYT1 region to determine GAG deletion status. Features of dystonia in clinically ascertained (affected) DYT1 GAG deletion carriers and noncarriers were compared to determine a classification scheme that optimized prediction of carriers. The authors assessed the range of clinical features in the genetically ascertained (affected) DYT1 deletion carriers and tested for differences between AJ and NJ patients. The optimal algorithm for classification of clinically ascertained carriers was disease onset before age 24 years in a limb (misclassification, 16.5%; sensitivity, 95%; specificity, 80%). Although application of this classification scheme provided good separation in the AJ group (sensitivity, 96%; specificity, 88%), as well as in the group overall, it was less specific in discriminating NJ carriers from noncarriers (sensitivity, 94%; specificity, 69%). Using age 26 years as the cut-off and any site at onset gave a sensitivity of 100%, but specificity decreased to 54% (63% in AJ and 43% in NJ). Among genetically ascertained carriers, onset up to age 44 years occurred, although the great majority displayed early limb onset.
There were no significant differences between AJ and NJ genetically ascertained carriers, except that a higher proportion of NJ carriers had onset in a leg, rather than an arm, and widespread disease. Diagnostic DYT1 testing in conjunction with genetic counseling is recommended for patients with PTD with onset before age 26 years, as this single criterion detected 100% of clinically ascertained carriers, with specificities of 43% to 63%. Testing patients with onset after age 26 years also may be warranted in those having an affected relative with early onset, as the only carriers we observed with onset at age 26 or later were genetically ascertained relatives of individuals whose symptoms started before age 26 years.