Phaseolus vulgaris Assembly and Gene Annotation

About Phaseolus vulgaris

Legumes are the third largest family of angiosperms and include many populous species. Most legumes harbor symbiotic bacteria in root nodules; these bacteria mediate nitrogen fixation and give the plants an advantage over competitors. Legume seeds are rich in protein, and many species have therefore long been used for human or animal consumption. Legumes as a whole constitute the second largest class of crops, including peas, soybeans, peanuts, and beans. Common bean (Phaseolus vulgaris), a major source of protein that complements carbohydrate-rich rice, maize, and cassava, is fundamental to the nutrition of more than 500 million people in developing countries [1].

Assembly

The P. vulgaris Mesoamerican common bean BAT93 genome was assembled using a hybrid sequencing strategy involving 454 single reads and 8, 10, and 20 kb mate pair libraries; 3 and 5 kb SOLiD mate pair libraries; and Sanger bacterial artificial chromosome (BAC)-end and genomic read pairs. Data free of redundancies were used as input for a Newbler assembly, and Illumina reads (45x coverage) were used to correct homopolymer errors and to close or reduce gaps within scaffolds. Illumina genotyping-by-sequencing (GBS) data from a set of 60 F5 lines of a BAT93 x Jalo EEP558 advanced intercross (6.7x coverage per line on average), together with 827 public marker sequences, were used for assembly correction and scaffold anchoring. Discontinuous genotype profiles, observed in 48 cases, were corrected manually by breaking scaffolds at the mis-assembly points. Markers were aligned to the assembly, and the GBS profiles of the scaffolds they hit were used as seeds to place other scaffolds with the same or similar profiles onto chromosomes, followed by genetic map calculation. The final BAT93 genome sequence encompassed 549.6 Mbp, close to previous size estimates, with 81% of the assembly anchored to eleven linkage groups. The assembly included 97% of the conserved core eukaryotic genes, an indication of its completeness [1].
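To illustrate how a discontinuous genotype profile can point to a mis-assembly, the following minimal Python sketch flags positions along a scaffold where the GBS profile changes abruptly between adjacent markers. It is not the procedure used for BAT93, where the correction was manual; the function name `find_breakpoints`, the `A`/`B`/`-` genotype encoding, the 0.2 threshold, and the toy data are all assumptions made for illustration.

```python
from typing import List, Sequence


def find_breakpoints(profiles: Sequence[str], max_mismatch: float = 0.2) -> List[int]:
    """Return indices i where the profile changes abruptly between marker i and i+1.

    profiles[i] is the genotype string of marker i across all lines, one
    character per line ('A' or 'B' for the parental allele, '-' for missing).
    The encoding and the default threshold are illustrative assumptions.
    """
    breakpoints = []
    for i in range(len(profiles) - 1):
        left, right = profiles[i], profiles[i + 1]
        # Compare only the lines that were genotyped at both markers.
        pairs = [(a, b) for a, b in zip(left, right) if a != "-" and b != "-"]
        if not pairs:
            continue
        mismatch = sum(a != b for a, b in pairs) / len(pairs)
        # Physically adjacent markers should rarely differ; a high mismatch
        # rate suggests the two halves come from different map locations.
        if mismatch > max_mismatch:
            breakpoints.append(i)
    return breakpoints


# Toy scaffold: five markers in scaffold order, eight lines each (hypothetical data).
scaffold = [
    "AABBABAB",
    "AABBABAB",
    "AABBAB-B",
    "ABABABBA",
    "ABABABBA",
]
print(find_breakpoints(scaffold))  # -> [2]
```

In this toy example the profile is stable across the first three markers and across the last two, so the only candidate break lies between markers 2 and 3. In practice the threshold would have to reflect genotyping error and the expected recombination rate.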

Annotation

Transposable elements were identified by combining de novo and homology-based approaches, which showed that 35% of the P. vulgaris BAT93 genome assembly is covered by repeats, mostly long terminal repeats. To aid gene prediction and to obtain a global view of the transcriptome during development, 61 RNA samples from 34 different organs and/or developmental stages of healthy plants were sequenced with Illumina. In addition, two normalized libraries derived from 162 RNA samples from plants grown under optimal and stress conditions were used for 454 pyrosequencing. Illumina and 454 RNA-Seq reads, as well as public expressed sequence tags (EST) and cDNA sequences, were combined with ab initio predictions to produce an initial gene set. This set was filtered to remove genes lacking both similarity to other plant proteins and any evidence of expression, resulting in 30,491 protein-coding genes (PCGs), whose 66,634 transcripts encode 53,904 unique proteins. Using protein signatures and phylogeny-based transfer of functional annotations, it was possible to associate functions with 94% of the bean transcripts, with 76% of them specifically associated with Gene Ontology (GO) terms [2].
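The filtering rule described above removes a gene only when it lacks both kinds of support, so a gene is kept if it has either one. A minimal Python sketch of that rule follows; the `GeneModel` class, its field names, and the example identifiers are hypothetical and are not taken from the actual annotation pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneModel:
    gene_id: str             # hypothetical identifier
    has_plant_homolog: bool  # similarity to another plant protein
    has_expression: bool     # supported by RNA-Seq, EST, or cDNA evidence


def filter_gene_set(models: List[GeneModel]) -> List[GeneModel]:
    """Drop only the models that lack BOTH homology and expression support."""
    return [m for m in models if m.has_plant_homolog or m.has_expression]


# Hypothetical candidates covering all four combinations of evidence.
candidates = [
    GeneModel("gene_a", has_plant_homolog=True, has_expression=True),
    GeneModel("gene_b", has_plant_homolog=True, has_expression=False),
    GeneModel("gene_c", has_plant_homolog=False, has_expression=True),
    GeneModel("gene_d", has_plant_homolog=False, has_expression=False),
]
kept = filter_gene_set(candidates)
print([m.gene_id for m in kept])  # -> ['gene_a', 'gene_b', 'gene_c']
```

Only `gene_d`, which has neither a plant homolog nor expression evidence, is discarded. A model supported by a single line of evidence stays in the set.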

Gene counts

Coding genes

28,134

Non coding genes

1,190

Small non coding genes

1,185

Long non coding genes

5

Gene transcripts