1
Diversity Data at MaizeGDB Ethalinda KS Cannon 1, Carson M. Andorf 2, Bremen L. Braun 2, Darwin A. Campbell 2, Mary L. Schaeffer 3,4, Cheng-Ting Yeh 5, Patrick S. Schnable 5, Carolyn J Lawrence 1,2 1 Department of Genetics Development and Cell Biology, Iowa State University, Ames, IA 50011, USA 2 USDA-ARS Corn Insects and Crop Genetics Research Unit, Iowa State University, Ames, IA 50011, USA 3 USDA-ARS Plant Genetics Research Unit, University of Missouri, Columbia, MO 65211, USA 4 Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA 5 Department of Agronomy and Center for Plant Genomics Iowa State University, Ames, IA 50011, USA Introduction With the rapidly increasing quantity of diversity data, MaizeGDB is working with collaborators to lay the groundwork for storage, retrieval, visualization, and analysis of these datasets. There is need to handle not only Single Nucleotide Polymorphisms (SNPs) and Copy Number Variations + Presence/Absences variations (CNVs and PAVs), but also complex variation in which entire regions are rearranged along with transcripts that dont appear on the reference genome because they dont exist in B73, the sequenced reference genome. These data are useful for many research endeavors and would support Genome Wide Association Studies (GWAS), population studies, and to support high throughput phenotyping projects. Collaborators We are working with the Schnable lab, the Panzea project, iPlant Collaborative, the Genome Reference Consortium, and NCBI. Survey We created a survey to gather ideas from the community for what tools and views would be most helpful. Please help us by filling out the survey! http://survey.maizegdb.org/diversity/ Sample responses: o In the NAM diversity lines, I would like to find all the SNPs and PAVs present in a particular gene-sized region. I would like to be able to get the data in a table that I could sort- like excel. It would also be nice if I could see the data on a genome browser, but relative to just one genome. o I would like to develop genetic markers based on restriction site polymorphisms created by SNPs or PAV's (presence/absence/variations) o Retrieve all polymorphisms compared to a reference strain o Phenotype a ~500 line population for a trait of interest (disease resistance in my case) and be able to do high- resolution, genome-wide association analysis o Given a putative pedigree relationship between two lines, give me an estimate of whether the pedigree relationship is correct or not Single Nucleotide Variations (SNPs) Types of data We plan to accept: Associated phenotypes. (Also looking at NCBIs ClinVar.) Transcript assemblies and exome contigs. Complex alternative alleles. Assemblies and contigs that do not map to the reference genome. Genetic maps. We plan to provide access to data stored outside MaizeGDB for: Associated germplasm (note that Stock Center accessions will remain at MaizeGDB). SNP collections (iPlant and dbSNP, also individual projects). PAV & CNV collections in dbVar. (Also looking at NCBIs ClinVar.) Expression data, including RNA-seq and sequence- capture (PLEXdb, GEO, SRA, individual project sites, et cetera). Data access and visualization Downloads (whole and partial datasets, for example, by chromosome coordinates). Standard formats GBrowse tracks Analysis We are currently looking at tools like PLINK and TASSEL. Some analyses will be pre-computed and the results made available. We could provide analyses carried out by other labs. We wont provide computationally-heavy services but could potentially direct researchers to groups that might (for example, iPlant). Acknowledgements Major funding from NSF grant #1027527: Functional Structural Diversity among Maize Haplotypes and in-kind contributions from the USDA-ARS. We also thank Ed Buckler, Qi Sun, and Jeff Glaubitz at Panzea (NSF grant #0820619: Genetic Architecture of Maize and Teosinte); Eric Lyons and Nirav Merchant at iPlant (NSF grant #0735191: The iPlant Collaborative: A Cyberinfrastructure-Centered Community for a New Plant Biology). Presence/Absence (PAVs) and Copy Number Variations (CNVs) Complex Alleles We recommend that these be stored at NCBI in the dbSNP database. Each SNP and assay is given a unique ID, can be aligned to a reference assembly, and has associated population frequency data. In addition, if the data are at NCBI, MaizeGDB and other informatics projects all will be enabled to access and work with the data. http://www.ncbi.nlm.nih.gov/projects/SNP/ We recommend that these be stored at NCBI in the dbVar database, which was designed for larger variations than dbSNP. Variations are grouped by study. We plan to work with GenBank and early submittors as dbVar does not yet contain any plant data. http://www.ncbi.nlm.nih.gov/dbvar/ This is a form of variation in which an entire region can be duplicated, deleted, or rearranged. One maize example is R1. MaizeGDB contains 282 variants of this locus. We are talking with the Genome Reference Consortium (http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/) to learn how they are handling these variations for human, mouse, and zebrafish genome assemblies. Missing sequences As we all know, missing data can be due to a number of factors. For missing sequences, some may simply be present in the B73 genome but not yet included in the assembly, but others are entirely absent from B73. Instances of both will need to be recorded and made available, for example as FASTA sequence and in the MaizeGDB Genome Browser with each unaligned sequence as its own backbone. Variations Schematic Representation of the Sequence Relationship between Maize Inbreds B73 and Mo17 at Locus9002 on Chromosome 1L (Bin 1.08; Markers bz2, An1, and umc1446). The R1 locus Brunner S et al. Plant Cell 2005;17:343-360