Zea mays Assembly and Gene Annotation

About Zea mays

Zea mays (maize) has the highest worldwide production of all grain crops, yielding 875 million tonnes in 2012. Although maize is a food staple in many regions of the world, most of the crop is used for animal feed and ethanol fuel. Maize was domesticated from wild teosinte in Central America, and its cultivation was spread throughout the Americas by pre-Columbian civilisations. In addition to its economic value, maize is an important model organism for studies in plant genetics, physiology, and development. It has a large genome of about 2.4 gigabases with a haploid chromosome number of 10 (Schnable et al., 2009; Zhang et al., 2009). Maize is distinguished from other grasses in that its genome arose from an ancient tetraploidy event unique to its lineage.

Assembly

This entirely new assembly of the maize genome (B73 RefGen_v4) was constructed from PacBio Single Molecule Real-Time (SMRT) sequencing at approximately 60-fold coverage and scaffolded with the aid of high-resolution whole-genome restriction (optical) mapping. The pseudomolecules of maize B73 RefGen_v4 are assembled nearly end-to-end, representing a 52-fold improvement in average contig size relative to the previous reference (B73 RefGen_v3).

Annotation

Nomenclature of maize RefGen_v4 gene models

The gene models of maize B73 RefGen_v4 were named following the Maize Genetics Nomenclature standard. Previous identifiers (e.g. GRMZM) are retained as synonyms and can be searched.

Method

Gene annotation was performed in the laboratory of Doreen Ware (CSHL/USDA). Protein-coding genes were identified using MAKER-P software version 3.1 with the following transcript evidence: 111,151 PacBio Iso-Seq long reads from six tissues, 69,163 full-length cDNAs deposited in GenBank (Alexandrov et al., 2008; Soderlund et al., 2009), 1,574,442 Trinity-assembled transcripts from 94 B73 RNA-seq experiments, and 112,963 transcripts assembled from deep sequencing of a B73 seedling. Additional evidence included annotated proteins from Sorghum bicolor, Oryza sativa, Setaria italica, Brachypodium distachyon, and Arabidopsis thaliana downloaded from Ensembl Plants Release 29 (Oct-2015). Gene calling was assisted by Augustus and FGENESH trained on maize and monocots, respectively. Low-confidence gene calls were filtered on the basis of AED score and other criteria and are viewable as a separate track. The resulting higher-confidence set (the filtered gene set) contains 39,324 protein-coding genes. Gene annotations from B73 RefGen_v3 were mapped to the new assembly and are also available as a separate track. In addition, 2,532 long non-coding RNA (lncRNA) genes were mapped and annotated from prior studies (Li et al., 2014; Wang et al., 2016), 2,290 tRNA genes were identified using tRNAscan-SE, and 154 miRNA genes were mapped from miRBase.
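The AED-based filtering step can be illustrated with a short sketch. It assumes MAKER-style GFF3 input in which each mRNA feature carries an `_AED` attribute in column 9. The cutoff of 0.5 is a hypothetical value chosen for illustration, since the threshold actually used is not stated here, and the "other criteria" mentioned above are not modelled.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a key -> value dictionary."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def split_by_aed(gff_lines, max_aed=0.5):
    """Partition mRNA IDs into (kept, rejected) by their _AED score.

    max_aed is a hypothetical cutoff; lower AED means the model agrees
    better with the supporting evidence.
    """
    kept, rejected = [], []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # only transcript-level features are scored here
        attrs = parse_attributes(cols[8])
        if "_AED" not in attrs:
            continue  # no score available, leave the feature unclassified
        target = kept if float(attrs["_AED"]) <= max_aed else rejected
        target.append(attrs.get("ID", ""))
    return kept, rejected


# Made-up records for illustration; these are not real maize gene models.
example = [
    "##gff-version 3",
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=g1;_AED=0.12",
    "chr1\tmaker\tmRNA\t2000\t2600\t.\t-\t.\tID=tx2;Parent=g2;_AED=0.87",
]

if __name__ == "__main__":
    print(split_by_aed(example))
```

Keeping the rejected identifiers, rather than discarding them, mirrors the idea of exposing the low-confidence calls as a separate track.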

Regulation

DNA methylation

Genome-wide patterns of DNA methylation for two maize inbred lines, B73 and Mo17, are displayed on the maize genome browser. Cytosine methylation in symmetric contexts (CG and CHG, where H is A, C, or T) is associated with DNA replication and histone modification. CG (65%) and CHG (50%) methylation is also highest in transposons. Source: the maize methylome publication by Regulski et al. (2013).
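The sequence contexts named above can be made concrete with a small classifier. This is a minimal sketch, assuming a forward-strand DNA string and a 0-based position. The function names are illustrative, and the asymmetric CHH context is included for completeness even though only the symmetric contexts are discussed above.

```python
H_BASES = {"A", "C", "T"}  # IUPAC code H: any base except G


def cytosine_context(seq, pos):
    """Classify the cytosine at 0-based position pos as CG, CHG, or CHH.

    Returns None when the context cannot be determined, for example when
    the cytosine is too close to the end of the sequence or is followed
    by an ambiguous base such as N.
    """
    seq = seq.upper()
    if seq[pos] != "C":
        raise ValueError("position %d is not a cytosine" % pos)
    following = seq[pos + 1:pos + 3]
    if len(following) >= 1 and following[0] == "G":
        return "CG"
    if len(following) == 2 and following[0] in H_BASES:
        if following[1] == "G":
            return "CHG"
        if following[1] in H_BASES:
            return "CHH"
    return None


def is_symmetric(context):
    """CG and CHG also read as a methylatable site on the opposite strand."""
    return context in ("CG", "CHG")


if __name__ == "__main__":
    for s in ("ACGT", "ACAGT", "ACTTA"):
        print(s, cytosine_context(s, 1))
```

Cytosines on the reverse strand can be classified with the same function by passing the reverse complement of the sequence.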

Variation

HapMap2 dataset

This variation set comprises the maize HapMap2 data (Chia et al., 2012). It incorporates approximately 55 million SNPs and InDels identified in a collection of 103 pre-domesticated and domesticated Zea mays varieties, together with a representative of the sister genus Tripsacum (Tripsacum dactyloides, Eastern gamagrass). Each line was sequenced to an average of 4.5-fold coverage using the Illumina GAIIx platform. The reads can be accessed from the SRA under accession SRA051245. Reads were initially mapped to the B73 RefGen_v3 reference genome using a combination of Bowtie, Novoalign and SOAP, then remapped to the most recent B73 RefGen_v4 reference genome. The variations were scored by taking into account identity-by-descent blocks that are shared among the lines.

A second variation data set consists of 719,472 SNPs (excluding 332 SNPs that were removed because they mapped to scaffolds) typed in 16,718 maize and teosinte lines and grouped into 14 overlapping populations according to the germplasm set in the corresponding metadata table.

Summary

Gene counts

Coding genes (genes or transcripts that contain an open reading frame, ORF)

39,591

Non coding genes

6,812

Small non coding genes

4,221

Long non coding genes

2,591

Pseudogenes (genes that have homology to known protein-coding genes but contain a frameshift and/or stop codon(s) that disrupt the ORF; thought to have arisen through duplication followed by loss of function)

27

Gene transcripts (a transcript is the operational unit of a gene. In a genomic context, transcripts consist of one or more exons, with adjoining exons separated by introns. The exons and introns are transcribed and the introns are then spliced out. Transcripts may or may not encode a protein.)