This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Abiotic and biotic stresses lead to massive reprogramming of different life processes and are the major limiting factors hampering crop productivity. Omics-based research platforms allow for a holistic and comprehensive survey on crop stress responses and hence may bring forth better crop improvement strategies. Since high-throughput approaches generate considerable amounts of data, bioinformatics tools will play an essential role in storing, retrieving, sharing, processing, and analyzing them. Genomic and functional genomic studies in crops still lag far behind similar studies in humans and other animals. In this review, we summarize some useful genomics and bioinformatics resources available to crop scientists. In addition, we also discuss the major challenges and advancements in the “-omics” studies, with an emphasis on their possible impacts on crop stress research and crop improvement.

Keywords: bioinformatics; crops; genomics; stresses

1. Introduction

According to the Food and Agriculture Organization of the United Nations (FAO), food production must increase by 70% in the next 40 years to meet the growing global demand [1]. Abiotic and biotic stresses are major limiting factors hampering crop productivity. Therefore, understanding the stress responses of crops using genomic information is important for developing more effective crop improvement strategies.

The publication of the Arabidopsis thaliana genome in 2000 was a cornerstone of the plant genomics era [2]. Taking advantage of the high-throughput data acquisition platforms of next generation sequencing technology, additional crop genomes have subsequently been decoded. So far, the draft genomes of more than 40 plants have been completed, including those processed in the 1000 Plant and Animal Project [3]. Other “-omics” technologies such as transcriptomics, proteomics, metabolomics, and phenomics (Figure 1) have also undergone rapid development in recent years. Together, these technologies have produced a large volume of data, and hence data management and data mining have become a bottleneck for “-omics” research.

To convert this large amount of data into manageable information, it is essential to establish standard formats and methods for storing, retrieving, and sharing data. Algorithms based on mathematical and statistical models are needed to handle biological data. This review aims to provide a systematic summary of the currently available databases and bioinformatics resources and to highlight some challenges and advancements in the study of genomics and other “-omics”, with emphasis on their implications for crop stress research.

2. General Bioinformatics Resources

2.1. Databases

Various databases have been developed to accommodate the comprehensive -omics data, and some of them also provide onsite analytical tools (Table 1). The three commonly used sequence databases are GenBank in the USA, the European Nucleotide Archive (ENA) in Europe, and the DNA Data Bank of Japan (DDBJ). They are coordinated through the International Nucleotide Sequence Database Collaboration (INSDC), and the deposited data are frequently synchronized. There are also repositories designated specifically for plants, such as Phytozome, which holds the genomic information of more than 40 plant species, including all the sequenced crops. Besides basic genomic information, databases such as the Legume Information System (LIS) facilitate synteny analyses and comparative genomic studies between closely related crop plants.

Online resources for individual crops, together with massive datasets, have been developed (Table 2). They provide systematically integrated information, including genetic resources (genetic maps, molecular markers, and quantitative trait loci (QTL)); genomic resources (DNA sequences, gene models, and regulatory elements); gene expression data (ESTs, cDNA sequences, and transcriptomes); and functional units (proteomic and metabolomic data). Crops of higher economic value usually have more comprehensive databases. The genomic sequences of some economically less important crops, such as foxtail millet, sorghum, and barley, have been released recently [18–20], and their corresponding integrated databases are still under development.

Some data repositories also provide information related to abiotic and biotic stress responses. For example, in MaizeGDB, there are well documented records for tropical maize exhibiting tolerance to drought stress [21]. In SoyBase, genetic markers associated with salt tolerance, drought tolerance, and cyst nematode resistance are incorporated with genomic and expression information. Databases for individual crops could also facilitate the unveiling of the genetic basis of specific traits. For example, the tomato genome sequence helped identify the R-genes which were then incorporated in the Plant Resistance Genes database [22].

2.2. Biological Ontologies Related to Crop Stress Research

The standardization of ontology is important for the structuring of huge datasets, interconnection between databases, merging of resources, and curation of information. Each ontology term has its own name, identifier (also called ID or accession number), and definition. The identifier is usually made up of a prefix and a number. For example, the Gene Ontology term “lipid binding” has the accession number GO:0008289 and is defined as interacting selectively and non-covalently with a lipid.
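The prefix-plus-number structure of an accession can be illustrated with a minimal Python sketch. The regular expression and the dictionary layout below are illustrative assumptions, not a format prescribed by any ontology project.

```python
import re

# Assumed accession pattern: a prefix, a colon, and a number.
TERM_ID = re.compile(r"^([A-Za-z_]+):(\d+)$")


def parse_term_id(term_id):
    """Split an accession such as 'GO:0008289' into (prefix, number)."""
    match = TERM_ID.match(term_id.strip())
    if match is None:
        raise ValueError(f"not a prefix:number accession: {term_id!r}")
    return match.group(1), int(match.group(2))


# A term record holding the three fields named in the text.
lipid_binding = {
    "id": "GO:0008289",
    "name": "lipid binding",
    "definition": "Interacting selectively and non-covalently with a lipid.",
}

prefix, number = parse_term_id(lipid_binding["id"])
print(prefix, number)
```

Keeping the prefix separate makes it straightforward to tell terms from different ontologies apart when several resources are merged.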

The Gene Ontology (GO) project provides a well-established controlled vocabulary for describing the function of a gene and its gene product. The ontology covers three aspects: cellular component, molecular function, and biological process. GO is used in genome annotation to provide information on gene products. An evidence code (from the Evidence Code Ontology) describes the evidence that links the GO annotation with the gene product. The evidence code indicates whether an annotation has been made manually by a curator or by automated electronic annotation. For example, EXP refers to “inferred from experiment”; IBA refers to “inferred from biological aspect of ancestor”; and IEA refers to “inferred from electronic annotation”. All this information can be found on the Gene Ontology website [27].
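As a rough illustration of how evidence codes can be used in practice, the sketch below filters a list of annotations so that only those not assigned electronically remain. Only the three codes quoted above are included, and the gene names are hypothetical placeholders.

```python
# The three evidence codes quoted in the text, with their expansions.
EVIDENCE_CODES = {
    "EXP": "inferred from experiment",
    "IBA": "inferred from biological aspect of ancestor",
    "IEA": "inferred from electronic annotation",
}


def is_electronic(code):
    """Return True for annotations assigned automatically (only IEA here)."""
    return code == "IEA"


def curated_only(annotations):
    """Keep (gene, go_id, evidence_code) tuples that were not inferred electronically."""
    return [
        a for a in annotations
        if a[2] in EVIDENCE_CODES and not is_electronic(a[2])
    ]


# Hypothetical annotations; the gene names are placeholders, not real loci.
annotations = [
    ("geneA", "GO:0008289", "EXP"),
    ("geneB", "GO:0008289", "IEA"),
    ("geneC", "GO:0008289", "IBA"),
]
print(curated_only(annotations))
```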

Plant Trait Ontology (TO) is a controlled vocabulary for describing plant traits and phenotypes. In addition to anatomical and morphological traits, TO also includes a subset of controlled vocabularies for abiotic and biotic stress traits. For example, yellow dwarf disease resistance (TO:0000292) is a child term of resistance to disease by mycoplasma-like organism (TO:0000013) under the lineage of stress trait (TO:0000164).
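The parent-child relationship between ontology terms can be sketched as a simple lookup that walks from a term up to its ancestors. The example below uses only the three TO terms named above; the direct link from TO:0000013 to TO:0000164 is an assumption made for illustration, since intermediate terms may exist in the actual ontology.

```python
# Names of the three Trait Ontology terms mentioned in the text.
TERMS = {
    "TO:0000292": "yellow dwarf disease resistance",
    "TO:0000013": "resistance to disease by mycoplasma-like organism",
    "TO:0000164": "stress trait",
}

# Child -> ancestor links. The second link is an assumed shortcut:
# any intermediate terms in the real ontology are omitted here.
PARENT = {
    "TO:0000292": "TO:0000013",
    "TO:0000013": "TO:0000164",
}


def lineage(term_id):
    """Return the chain of terms from term_id up to the top-most known ancestor."""
    chain = [term_id]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def is_descendant(term_id, ancestor_id):
    """True if ancestor_id appears above term_id in the lineage."""
    return ancestor_id in lineage(term_id)[1:]


for tid in lineage("TO:0000292"):
    print(tid, TERMS[tid])
```

A query for all stress traits can then be answered by testing each annotated term with `is_descendant(term, "TO:0000164")`.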

There are many other biological ontology projects for different research fields. Ontologies listed in Table 3 contain the information related to crop stress responses.

3. Recent Advances and Challenges in Crop Genomics

3.1. Polyploidy as a Major Challenge in Crop Genome Assembly

Polyploidy is a major hindrance in crop genome assembly. One way to tackle highly polyploid genomes is to refer to the closely related, putative progenitor diploid genomes if they are available. The Catalogue of Life [32] and the Integrated Taxonomic Information System [33] may help to identify such related species. For example, the fiber-producing cotton (Gossypium hirsutum) is tetraploid, comprising an A-genome and a D-genome. To assist the assembly of the tetraploid genome, the diploid D-genome of G. raimondii was first sequenced and assembled [34]. A second example is strawberry (Fragaria × ananassa), with an estimated genome size of about 600 Mb. Although this is much smaller than other crop genomes, it is an octaploid (AAA′A′BBB′B′) [35]. Therefore, the genome sequence of the woodland strawberry (Fragaria vesca), a potential progenitor of Fragaria × ananassa, was completed in 2012 to provide the first diploid model for the genomes of Fragaria spp. [35,36]. Wheat is another example of polyploid crop genomes. The hexaploid bread wheat (Triticum aestivum) contains the A, B and D genomes, which probably originated from Triticum urartu (A genome), Aegilops tauschii (D genome), and an unknown species related to Aegilops speltoides (B genome). The genomic sequence information of T. aestivum, T. monococcum (a community standard line related to the A-genome donor), and Ae. tauschii, as well as the cDNA sequence information of T. aestivum and Ae. speltoides, were obtained [37]. With reference to the respective diploid genome information, over 90% of the wheat genes were successfully assembled into the A, B, or D genome with over 70% precision [37]. The drafted de novo genomes of T. urartu and Ae. tauschii were recently published, representing 94.3% and 97.0% of the predicted genome sizes respectively [38,39]. Although the lack of a good reference for the B genome is still an obstacle in building the T. aestivum genome, these pieces of work have built a good framework for the further whole genome assembly of bread wheat, and established a model for the study of other polyploid genomes.

3.2. Reduced Genetic Diversity of Modern Crops

Modern crops originated from a small number of plants. Bottleneck effects during domestication and prolonged human selection together have significantly reduced the genetic diversity of modern crops. Such a reduction in genetic diversity has been confirmed by several genomic studies (Supplementary Table S1). For example, whole-genome resequencing of 14 cultivated and 17 wild soybean genomes revealed that the wild soybeans have higher numbers of SNPs and genetic diversity compared to those of the cultivated ones [40]. The domesticated rice cultivars (Oryza sativa indica and Oryza sativa japonica) also show a lower genetic diversity than their wild relatives (O. rufipogon and O. nivara) in a study on 50 accessions of cultivated and wild rice [41]. More interestingly, even though both indica rice and japonica rice are cultivated, the japonica rice shows significantly lower genetic diversity than the indica rice, suggesting that the japonica rice has suffered from a stronger bottleneck effect under domestication [41]. On the other hand, although maize landraces and improved lines have retained a higher nucleotide diversity from their wild progenitor, as compared to other self-fertilizing crop species, a weak bottleneck effect can still be observed [42]. Reduced genomic diversity of major staple crops limits their adaptability to the changing environment and reduces the room for crop improvement. Therefore, crop improvement programs should turn their focus to the genetically compatible wild species, which have higher biodiversity and can serve as natural genetic reservoirs.

Sequence differences and structural variations in genomes are usually identified by comparing the genomes of wild species to their related landraces and modern cultivars, and also to other model plants. These differences can, on the one hand, provide information about genome evolution, and, on the other hand, serve as molecular markers for genetic mapping. Sequence differences and structural variations that affect gene structure, gene expression, and gene copy number are major determinants shaping the diversity among different varieties of the same species. For instance, wild soybeans and some rice accessions possess some presence/absence variations or unmapped contigs that contain bona fide genes annotated to be involved in abiotic and biotic stresses [40,41,43,44]. One specific example relating to biotic stresses is the enrichment and over-representation of LRR (leucine-rich repeat) and NB-ARC (nucleotide-binding adaptor shared by APAF-1, certain R gene products and CED-4) domain-containing genes in some crop genomes [19,45]. In plant genomes, disease resistance (R) genes are responsible for defense responses [46]. LRR and NB-ARC are two important domains found in the R proteins [46]. The LRR domain-containing proteins play important roles in pathogen-host interactions and the activation of defense responses [47,48]. On the other hand, the NB-ARC domain is responsible for the multimerization and autoactivation of the R proteins upon stimulus [49]. The LRR and NB-ARC-containing genes exhibit higher ratios of nonsynonymous-to-synonymous SNPs than the genome average in crops such as soybean [40], rice [41,50], and sorghum [51]. In maize, 101 out of 3490 large-effect SNPs detected are located on 49 LRR domain-containing genes [44]. LRR and NB-ARC domain-containing genes are important components in the plant defense response system [46,49,52], while the high nonsynonymous-to-synonymous SNP ratio of LRR or NB-ARC domain-containing genes suggests a dynamic evolution of these genes to combat pathogens.

In addition to disease resistance genes, some transcription factors are found to be over-retained after the whole-genome duplication in Musa α/β (banana) [53]. Some of these transcription factors, such as Myb, AP2/ERF, and WRKY, are known to be important regulators in abiotic stress responses [54]. On the other hand, compared to rice, sorghum, and maize, there are more genes encoding cytochrome P450, CCAAT-binding factor transcription factors, late-embryogenesis-abundant proteins, and osmoprotectant biosynthesis proteins in the Ae. tauschii genome (progenitor of the D genome of wheat) [38]. These genes are important for the adaptation to cold and physiological drought. Moreover, a significantly higher number of transmembrane ATPase subunits, which are probably involved in Na+ exclusion and mineral uptake, have been detected in Ae. tauschii than in wheat [37,38]. The extra genes in Ae. tauschii may be good candidates for wheat improvement.

3.4. Advances in Ultra-High-Density Genetic Mapping Using SNPs

Genetic mapping using genetic populations is one classical strategy to identify genes related to stress responses. Members of the mapping population can either be related (e.g., QTL mapping using bi-parental populations) or unrelated (e.g., genome-wide association study (GWAS) using germplasm collections) (for population structure, data characteristics and methods, see reviews [55,56]). There are some successful cases of identifying stress tolerance causal genes through mapping [57–59]. For example, a salt tolerance-conferring sodium transporter from rice was identified through QTL mapping [58]. The SKC1 locus corresponding to shoot K+ content was mapped with a BC2F2 population generated from a cross between a salt-tolerant indica variety and a susceptible japonica variety [58]. The SKC1 locus was further confined to a 7.4-kb stretch by the BC3F4 progeny testing of fixed recombinant plants. The locus contains only a single open reading frame, which encodes an HKT-type transporter. SKC1 near-isogenic lines accumulated less Na+ under salt treatment compared to the susceptible parent. Voltage-clamp analysis also supports the notion that the SKC1 protein functions as a Na+-selective transporter that probably regulates K+/Na+ homeostasis under salt stress [58].

Classical molecular markers for mapping such as AFLP, RFLP, and SSR markers are sparsely distributed in the genome, and hence limit the mapping resolution and pose difficulties in pinpointing the phenotype-causal genes. With the availability of genomic sequence data, SNP markers become more accessible for use in mapping, to help achieve a much better resolution. However, conventional PCR-based methods are laborious and time-consuming while the resolution of array-based methods is limited by the number of probes on the array.

High-resolution genotyping by whole-genome resequencing has been established [60,61], making the ultra-high-density genetic mapping more attainable. In principle, this method can achieve the highest resolution, provided that there are enough resources to capture all the SNPs in a population. In reality, polymorphic SNPs are usually captured by low-coverage sequencing (~1X for unrelated populations [58] and <0.1X for recombinant populations [60,62,63]).

In a QTL study of recombinant inbred populations originating from indica and japonica rice, SNPs between the parental reference genomes were first identified using DiffSeq in the EMBOSS package and cleaned by SSAHASNP in the ssaha2 package. Low-depth sequencing reads of recombinant inbred lines (RILs) were mapped to the parents’ pseudomolecules by using the SSAHA2 software [64] to determine the genotype of each RIL. SNPs were analyzed by a sliding window approach to determine the recombinant break points within the genome of every single line in the population to form a bin map [60]. This sliding window strategy can accommodate the high error rate of next generation sequencing and allow missing data resulting from low-coverage sequencing [60]. Each “bin” will serve as a “marker” in the subsequent linkage map construction using MAPMAKER/EXP and in QTL mapping using QTL Cartographer. In this study, using 150 rice RILs, the sequencing-based method increased the resolution by 35-fold and greatly reduced the time needed for genotyping, compared to the map generated from 287 PCR-based markers [57]. The power of this method was further illustrated in a study using 210 rice RILs to map the GS3 and GW5/qSW5 loci related to the grain length and grain width, respectively [62].
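The sliding window idea can be sketched in a few lines of Python. This is a simplified illustration rather than the published procedure: the window size, the allele-ratio threshold, and the labels `A`, `B`, and `H` (an ambiguous zone where the window straddles a breakpoint) are all assumptions, and the input is a hypothetical list of per-SNP parental calls ordered along one chromosome, with `None` marking SNPs not covered by any read.

```python
from itertools import groupby


def window_genotypes(calls, window=5, threshold=0.7):
    """Call one genotype per window of consecutive SNPs.

    calls: list of 'A', 'B', or None (missing), ordered along the chromosome.
    A window is called 'A' or 'B' when that parent's share of the
    non-missing calls reaches the threshold, otherwise 'H'.
    """
    genotypes = []
    for start in range(len(calls) - window + 1):
        chunk = calls[start:start + window]
        a = chunk.count("A")
        b = chunk.count("B")
        if a + b == 0:
            genotypes.append(None)  # no data in this window
            continue
        frac_a = a / (a + b)
        if frac_a >= threshold:
            genotypes.append("A")
        elif frac_a <= 1 - threshold:
            genotypes.append("B")
        else:
            genotypes.append("H")
    return genotypes


def collapse_runs(genotypes):
    """Collapse identical consecutive window calls into (first, last, genotype)."""
    runs = []
    index = 0
    for genotype, group in groupby(genotypes):
        n = len(list(group))
        runs.append((index, index + n - 1, genotype))
        index += n
    return runs


# Hypothetical line: parent A on the left, parent B on the right,
# with one miscalled SNP (index 3) and one missing SNP (index 6).
calls = list("AAABAA") + [None] + list("AAA") + list("BBBBBBBBBB")
print(collapse_runs(window_genotypes(calls)))
```

Because each call is based on several SNPs, the isolated miscall and the missing SNP do not create a spurious breakpoint; the only transition reported lies between the two parental blocks.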

Since missing genotypes in low-depth sequencing reduce the effectiveness of GWAS, data imputation can be used in addition to increasing the sequencing depth. After SNPs have been identified by mapping the sequencing reads, an in-house imputation algorithm based on the k-nearest neighbor (KNN) method can be adopted to reduce the missing genotypes [61]. GWAS has been conducted to map 14 agronomic traits, including drought tolerance, using 373 indica rice lines. One to seven loci have been mapped for each trait, and some of them overlap with the previously known loci/genes identified through bi-parental QTL mapping or mutant studies [61]. With the great reduction in sequencing cost (<US$0.1 per raw megabase in 2012) [65], we anticipate that mapping by sequencing will become a popular method to obtain high resolution maps for stress-related loci/genes.
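The general principle of KNN imputation can be illustrated as follows. This is a generic sketch and not the in-house algorithm used in the cited study: the distance measure (fraction of mismatches over shared sites), the value of k, and the toy genotype matrix are all assumptions.

```python
from collections import Counter


def distance(g1, g2):
    """Fraction of mismatching genotypes over sites called in both samples."""
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        return 1.0  # nothing to compare: treat as maximally distant
    return sum(a != b for a, b in shared) / len(shared)


def knn_impute(genotypes, k=2):
    """Fill missing calls (None) by majority vote among the k most similar samples."""
    imputed = [list(row) for row in genotypes]
    for i, row in enumerate(genotypes):
        # Rank the other samples by similarity to sample i, nearest first.
        neighbors = sorted(
            (j for j in range(len(genotypes)) if j != i),
            key=lambda j: distance(row, genotypes[j]),
        )
        for site, call in enumerate(row):
            if call is not None:
                continue
            # Use the k nearest samples that have a call at this site.
            votes = [
                genotypes[j][site] for j in neighbors
                if genotypes[j][site] is not None
            ][:k]
            if votes:
                imputed[i][site] = Counter(votes).most_common(1)[0][0]
    return imputed


# Hypothetical genotype matrix: rows are lines, columns are SNPs (0/1 alleles).
samples = [
    [0, 0, 1, 1, None],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, None, 0],
]
for row in knn_impute(samples):
    print(row)
```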

3.5. Genomic Selection

Genomic selection (GS) was introduced to evaluate the overall effects of all contributing loci genome-wide [66]. During the process of GS, a training population is used to train a computational model from which the genomic estimated breeding values (GEBVs) are obtained [67]. Complex traits such as drought tolerance are usually determined by multiple small-effect QTLs. The model associates markers with QTLs by regarding all the markers as variables contributing to the trait, and the effect of each marker allele towards the complex QTLs is quantified (it can be zero). The GEBV is the sum of the marker effects and thus indicates the breeding value of an individual; favorable individuals with high GEBVs from breeding populations will be selected for field application. Genotypic and phenotypic information of the breeding population can be used to further improve the computational model to form a training-breeding cycle [66]. Unlike GWAS and QTL studies, which are designed to reduce the breeding time by selecting plants with desired molecular markers at early growth stages instead of evaluating the actual phenotypes at a later stage, GEBVs serve only as selection criteria but do not lead to target markers or causal genes.
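The summation behind a GEBV can be shown with a minimal sketch. The marker effects, line names, and allele dosages below are hypothetical; in practice the effects would be estimated from the training population by one of the statistical models discussed in this section.

```python
def gebv(genotype, marker_effects):
    """Sum of marker effects weighted by allele dosage (0, 1, or 2 copies)."""
    return sum(dosage * effect for dosage, effect in zip(genotype, marker_effects))


# Hypothetical marker effects estimated from a training population.
# An effect of zero means the marker does not contribute to the trait.
marker_effects = [0.5, 0.0, -0.25, 1.0]

# Hypothetical breeding candidates: allele dosage at each marker.
candidates = {
    "line1": [2, 1, 0, 1],
    "line2": [0, 2, 2, 0],
    "line3": [1, 0, 1, 2],
}

# Rank candidates from the highest to the lowest GEBV.
ranked = sorted(
    candidates,
    key=lambda name: gebv(candidates[name], marker_effects),
    reverse=True,
)
for name in ranked:
    print(name, gebv(candidates[name], marker_effects))
```

Note that the ranking is driven by the combined contribution of all markers, so no single marker needs to pass a significance threshold.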

Although high-throughput genotyping and phenotyping have accelerated GS studies by increasing marker density and selection capacity, selection accuracy remains one of the major challenges of GS. Evaluations of GS accuracy have been performed in maize [68], wheat [69,70], barley [71], and cassava [72]. Several statistical models for GEBV calculations, including best linear unbiased prediction (BLUP) [67], Bayesian shrinkage regression (BayesA, BayesB, etc.) [73], and mixed models, have been employed. There is no agreement on which model is the most efficient, because many factors such as population size and genetic background may affect statistical power [71]. GS is believed to be a valuable approach for plant breeding [74]; however, it will take some time for this concept to develop into a practical tool [75]. A GS-based breeding scheme has already been proposed and is considered to be an important tool for developing durable stem rust-resistant wheat [76].

3.6. Identification of Stress-Related Gene Families

When properly annotated genomes are available, the genome-wide identification of all members of a gene family will become feasible. Since genome duplication (polyploidy or paleopolyploidy) and single gene duplication are common in crops [77], genes usually exist in multiple copies and/or in gene families. Identifying all members of a gene family may give a more comprehensive view on the possible functions of a group of evolutionarily related genes. Bioinformatics tools such as Fgenesh [78], GAZE [79], and JIGSAW [80] have been adopted for searching gene families in crops.

Two typical ways to identify members of gene families from within a genome are keyword search and pattern/homology search. Keyword search usually requires precise keywords including gene names and controlled vocabularies. The most commonly used controlled vocabularies are Gene Ontology, as mentioned in section 2.2, and the functional classification by Pfam, InterPro and KEGG [81–83].

A genome-wide pattern search usually begins with searching sequence databases using programs like BLASTP or TBLASTN [84]. Databases can either be online resources (Table 1) or in-house databases. The occurrence of the desired functional domains in the potential sequences can then be verified using the Pfam protein families database [81], SMART database [85], or HMMER [86]. When the BLAST results are associated with unannotated sequences, these will require further analyses to determine the putative gene structures. One example of applying the above strategy to identify stress-related genes is the analysis of AP2/EREBPs in the rice genome [87]. “AP2/EREBP” was used as the keyword in searching databases, including DRTF, MSU, NCBI, and KOME. Any non-redundant sequences obtained were then used as queries in the TBLASTN and BLASTP searches of the MSU and NCBI databases. Four genes were excluded after Pfam and SMART analyses because their AP2/ERF domain was incomplete and very small. A total of 163 genes were identified using this method, in contrast to the 139 genes suggested previously [87]. Expression studies revealed that a number of the members are responsive to abiotic or biotic stresses. A few of them can even be induced by multiple stresses, suggesting their possible involvement in stress responses [87].
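The two filtering steps described above, a homology search followed by verification of the functional domain, can be sketched as follows. The sketch assumes that BLAST results are available in the standard 12-column tabular format and that domain coordinates come from a separate Pfam, SMART, or HMMER run. The E-value cutoff, the coverage threshold, the model length, and all gene names are hypothetical.

```python
def parse_blast_tabular(lines, max_evalue=1e-5):
    """Return subject IDs from 12-column BLAST tabular output passing an E-value cutoff."""
    hits = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        subject, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue:
            hits.add(subject)
    return hits


def has_complete_domain(domain_hits, model_length, min_coverage=0.8):
    """domain_hits: (model_start, model_end) pairs for one protein.

    A domain counts as complete when a single hit covers at least
    min_coverage of the domain model.
    """
    return any(
        (end - start + 1) / model_length >= min_coverage
        for start, end in domain_hits
    )


# Hypothetical BLAST hits: qseqid, sseqid, pident, length, mismatch, gapopen,
# qstart, qend, sstart, send, evalue, bitscore.
example_lines = [
    "\t".join(["query1", "geneA", "85.0", "60", "9", "0", "1", "60", "10", "69", "1e-30", "120"]),
    "\t".join(["query1", "geneB", "40.0", "25", "15", "0", "1", "25", "5", "29", "0.5", "20"]),
    "\t".join(["query1", "geneC", "80.0", "58", "11", "0", "1", "58", "3", "60", "1e-25", "110"]),
]

# Hypothetical domain coordinates and model length.
domain_hits = {"geneA": [(1, 58)], "geneC": [(30, 55)]}
MODEL_LENGTH = 60

candidates = parse_blast_tabular(example_lines)
family = sorted(
    gene for gene in candidates
    if has_complete_domain(domain_hits.get(gene, []), MODEL_LENGTH)
)
print(family)
```

In this toy example, geneB fails the E-value cutoff and geneC is excluded because its domain hit is incomplete, mirroring the exclusion of genes with a truncated domain in the study described above.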

Supplementary Table S2 summarizes the strategies and tools used in recent literature on genome-wide analyses of gene families related to stress responses in major crops.

4. Functional Genomics

4.1. Transcriptome

There are two major technologies for obtaining the overall transcription map of specific plant tissues: hybridization-based microarray technology [88] and next generation RNA sequencing technology (RNA-seq) [89]. RNA-seq technology, in conjunction with efficient bioinformatics tools, is now more widely used to support predicted gene models, extract differentially expressed genes, and find novel transcripts in de novo assemblies. Public repositories such as ArrayExpress [90] are designed for the storage of expression data. Reporting standards, including Minimum Information About a Microarray Experiment (MIAME) and Minimum Information about a high-throughput Nucleotide Sequencing Experiment (MINSEQE), have been adopted to facilitate transcriptome data submission and downloading. Bioinformatics tools dealing with transcriptome alignment, splicing event prediction, and de novo assembly are also available (Table 4).

Crops such as maize [100] and soybean [101] have their own transcriptome atlases, compiled from sub-transcriptomes from multiple tissues and different developmental stages. For the transcriptome atlas of soybean, plant ontology (PO) was used to describe the developmental stage of each experimental tissue, providing a common ground for readers and users to discuss and perform further analyses. The cDNA short reads generated by Illumina Genome Analyzer were aligned to the soybean reference genome sequence assembly using GSNAP, released in 2005. The digital expression counts were determined using the R programming language and normalized using a variation of RPKM methods [101]. The global inventory of expressed transcripts of crops under stress is dynamic, both temporally and spatially. Time series sampling is a typical experimental design to trace the trajectory of such differentially expressed transcripts of crops under stress conditions. A typical example is the study of the soybean transcriptome under alkaline stress. Soybean plants were treated with NaHCO3 and transcriptomes were analyzed using microarrays [102]. GO terms were successfully assigned to the 1380 significantly changed probe sets that are related to metabolism, signal transduction, energy, transcription, secondary metabolism, transporter, as well as disease and defense. A time series study revealed the interplay of signal transduction and metabolism during the progression of the treatment. MapMan tools were used to visualize these changes [102]. Other time series studies include the studies of rice root under low potassium [103], cassava under cold stress [104], and soybean subjected to Pseudomonas syringae infection [105]. The other widely reported experimental design is the comparative transcriptome study performed among crop accessions with different degrees of stress tolerance, such as the study of soybean accessions exhibiting differential tolerance toward low potassium [106], rice cultivars with contrasting abilities to withstand drought [107] and chilling [108], wheat with differential drought tolerance [109], and Medicago [110] and foxtail millet [111] cultivars with differential salt tolerance.
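For readers unfamiliar with the RPKM normalization mentioned above, the standard definition (reads per kilobase of transcript per million mapped reads) can be written as a short function. The soybean atlas used a variation of this method, so the sketch below shows only the basic form; the numbers in the example are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2000-bp transcript in a library
# of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

Dividing by transcript length and library size makes expression values comparable between genes of different lengths and between libraries of different sequencing depths.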

Another strategy to associate transcript abundance with genomic variations is expression QTL (eQTL) mapping, which uses differentially expressed transcripts as the quantitative traits [112]. The eQTL maps of maize root [113] and rice shoots [114] have identified thousands of cis- and trans-regulatory factors by population transcriptome screening. The eQTLs co-localizing with traditional QTL regions could give supportive evidence explaining the genetic basis of the targeted phenotypic characters. One successful example is the eQTL study of the partial resistance toward Puccinia hordei in barley [115], in which some eQTLs were reported to co-localize with previously known rust resistance QTL regions.

4.2. Proteome

Due to the alternative splicing of RNA transcripts and post-translational modifications of the proteins themselves, the proteome within a cell can be much more complicated than the corresponding genome. The gel-based proteomics technology will soon be obsolete due to its limited sensitivity and semi-quantitative nature [116]. The rise of the next generation proteomics systems such as Orbitrap and QStar, together with the application of isotopic tag-based quantitative proteomics (ICAT [117], SILAC [118]), isobaric tag-based quantitative proteomics (iTRAQ [119]), and label-free quantitative proteomics (MaxQuant [120], Serac [121], SIEVE (Thermo Scientific, San Jose, CA, USA)), has expedited the development of high-throughput proteomic studies. Nevertheless, the pace of adopting these platforms in plant stress studies is far behind that of studies in humans.

Despite the advancement in the proteomics platforms, the application of de novo peptide sequencing is still limited. Protein identifications still largely rely on database searches in which experimental peptide mass spectra are compared with theoretical peptide mass spectra generated from existing sequence databases. Some commonly used databases and useful algorithms are summarized in Table 5. Since the genomes of many crops have not been completely sequenced, and some others are still unknown, proteins of species without a genome database are frequently identified by referring to cross-species databases. In these cases, it is not uncommon that molecular weights and isoelectric points (pI) of the identified proteins may deviate from the actual spot position on the 2D gel, despite the high protein scores.
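The principle of a database search can be illustrated with a deliberately simplified sketch. Real search engines compare full fragment spectra; here only the intact peptide mass is compared, which is an assumption made for brevity. The sketch digests database proteins in silico with trypsin-like rules (cleavage after K or R unless followed by P), computes theoretical monoisotopic masses, and reports peptides matching an observed mass within a tolerance. The protein sequences and the tolerance are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the terminal H and OH


def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Theoretical monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_mass(observed_mass, proteins, tolerance_ppm=10.0):
    """Return (protein_id, peptide) pairs whose mass matches within the tolerance."""
    matches = []
    for protein_id, sequence in proteins.items():
        for peptide in tryptic_digest(sequence):
            theoretical = peptide_mass(peptide)
            if abs(observed_mass - theoretical) / theoretical * 1e6 <= tolerance_ppm:
                matches.append((protein_id, peptide))
    return matches


# Hypothetical two-protein database and an observed mass close to peptide LLER.
proteins = {"protA": "GASPVKLLER", "protB": "MTWYK"}
observed = peptide_mass("LLER") + 0.0005
print(match_mass(observed, proteins))
```

The sketch also makes the cross-species limitation concrete: a single amino acid substitution in the crop protein shifts the peptide mass, so the peptide no longer matches the database entry from the related species.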

Comprehensive reviews summarizing plant proteomic studies from 2006 to 2008 are available [122,123]. We have also summarized the plant proteomic studies in 2012 (Supplementary Table S3). Recently, plant proteomic investigations have been subdivided into several areas, including subcellular proteomics and proteomics-related post-translational modifications. For example, 21 differentially expressed proteins were identified from salt-treated wheat chloroplasts [124], and 13 and 11 differentially expressed microsomal proteins, respectively, were identified from two distinct cadmium-accumulating soybean cultivars [125].

Stress-induced post-translational modifications of proteins are common. They are either the results of deleterious damage from the stress, or beneficial modifications that regulate the functions of the proteins in order to cope with the stress. To study post-translational modifications of proteins, special techniques within proteomics are used. Redox proteomics requires special labeling methods, including the reduction and subsequent labeling of the oxidized thiol groups with 5-iodoacetamidofluorescein (IAF) [126]. Twenty-two highly oxidized proteins involved in a wide range of biological processes were identified in ozone-treated rice using this method [127]. Phosphoproteome [128], glycoproteome, and secretome [129–132] are sub-categories of proteomics that require special staining and enrichment techniques. Post-translational modifications involved in the regulation of gene expression will be discussed in the Epigenome section below.

4.3. Interactome

Protein-protein interactions determine the contextual functions of a protein and hence play a crucial role in regulation and signal transduction [133]. There are several commonly used experimental systems to identify protein-protein interactions, including: (1) yeast two hybrid (Y2H) (reviewed in [134]); (2) bimolecular fluorescence complementation (BiFC) (reviewed in [134]); (3) affinity pull-down coupled with mass spectrometry (AP-MS) (reviewed in [134]); (4) blue native PAGE [135]; and (5) structural analysis of protein crystals [136,137]. In addition, literature curation involving tedious literature searches can be used to supplement the experimental efforts [138], and in silico prediction can be done by searching for orthologous pairs which interact in other systems, to identify possible interologs [134,139]. Multiple systems are generally adopted to authenticate the interactions.

The concept of the plant interactome was initiated years ago, and was based mainly on the information collected through literature curation [140]. Subsequently, an experimentally constructed interactome map of A. thaliana was established via intensive screening, recording a total of 6200 high-confidence interactions among 2700 proteins through the screening of proteins encoded by 8000 open reading frames in the Arabidopsis genome [141]. It is estimated that this screening only captured around 2% of the binary protein-protein interactome in A. thaliana [141]. Using the in silico interolog prediction method, more than 37,000 interactions among 4567 rice proteins were predicted, 168 of which have been experimentally confirmed [139]. In this work, the INPARANOID 3.0 program was used to predict high-confidence protein orthologs in 12 species including rice. With the assumption that protein-protein interactions are retained in evolutionarily conserved orthologous proteins, rice protein-protein interactions were compiled using the predicted orthologous proteins and the known interactions in interactome databases [139]. Only a few studies directly related to crop stress interactomes have been published (Table 6). In the search for rice stress-related interactomes, four stress proteins related to disease (XA21 and NH1) and flooding (SUB1A and SUB1C) were used as baits for the initial interactome screens by Y2H [142]. Preys identified from the initial screens were then used as baits for subsequent screens. Together with the information from literature curation, an interactome network consisting of 100 proteins was constructed. The interactomes of the two kinds of stresses were linked by proteins such as SNRK1A, which has been shown to be related to ABA, a positive regulator of abiotic stress responses and a negative regulator of biotic stress responses [142].
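The interolog assumption, that an interaction between two proteins is retained between their orthologs, can be expressed as a small transfer procedure. The sketch below is a bare-bones illustration: the ortholog table and the known interactions use hypothetical identifiers, and no confidence scoring is applied, unlike in a real prediction pipeline.

```python
def predict_interologs(orthologs, known_interactions):
    """Transfer known interactions from a reference species to a target species.

    orthologs: dict mapping a reference protein to a set of target-species orthologs.
    known_interactions: iterable of (ref_a, ref_b) interacting pairs.
    Returns a set of predicted target-species pairs, each sorted alphabetically.
    """
    predicted = set()
    for ref_a, ref_b in known_interactions:
        for a in orthologs.get(ref_a, ()):
            for b in orthologs.get(ref_b, ()):
                predicted.add(tuple(sorted((a, b))))
    return predicted


# Hypothetical ortholog assignments (reference protein -> rice proteins).
orthologs = {
    "refP1": {"riceA"},
    "refP2": {"riceB", "riceC"},
    "refP3": set(),  # no rice ortholog found
}

# Hypothetical interactions known in the reference species.
known_interactions = [("refP1", "refP2"), ("refP2", "refP3")]

for pair in sorted(predict_interologs(orthologs, known_interactions)):
    print(pair)
```

Interactions involving a reference protein without a rice ortholog produce no prediction, which is one reason why predicted interactomes remain incomplete.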

Online resources such as PRIN [143] can help to predict rice interactomes, while BioGRID [144], DIP [145], PlaPID [146], and IntAct [147] can be queried for some pre-determined interactomes in certain plant species. Recent large-scale stress interactome studies in crop plants are shown in Table 6.

4.4. Epigenome

In addition to the genetic information encoded by DNA, epigenetic modifications of DNA and histones provide another dimension of regulation that influences gene expression. Chromatin-associated proteins, including DNA methylases, histones, and histone-modifying enzymes, are cataloged in ChromDB [151]. Technological platforms for epigenomic research can be considered an extension of genomic and proteomic studies with modifications in the analysis protocols.

For example, cytosine DNA methylation, one of the epigenetic modifications, plays an important role in gene silencing and genomic imprinting [152,153]. The transcriptional levels of endogenous genes are highly correlated with the methylation status within their promoter or transcribed regions [154,155]. One way to detect DNA methylation is to capture and enrich the methylated DNA fragments by immunoprecipitation [156]. Bisulfite treatment is another way to distinguish between methylated and unmethylated DNA. The bisulfite treatment converts unmethylated (but not methylated) cytosines to uracils [157]. Both immunoprecipitation-enriched and bisulfite-treated DNA can be analyzed by microarray- or sequencing-based methods to the single-base level of resolution [157–159]. A number of bioinformatics tools are designed to handle the bisulfite sequencing data (Table 7).
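The logic of bisulfite-based methylation detection can be sketched in a few lines. This is a toy model under strong assumptions, not one of the tools listed in Table 7: all reads are taken to be full-length, already aligned to the same strand of a short reference, and free of sequencing errors. The sequence, the function names, and the methylated positions are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment of one DNA strand.

    Unmethylated cytosines are converted to uracil, which is read as T after
    amplification and sequencing; methylated cytosines remain C.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, reads):
    """Return, for each reference cytosine, the fraction of reads retaining C."""
    levels = {}
    for i, base in enumerate(reference):
        if base != "C":
            continue
        # Only C (protected, methylated) or T (converted) is informative here.
        observed = [read[i] for read in reads if read[i] in "CT"]
        if observed:
            levels[i] = observed.count("C") / len(observed)
    return levels


reference = "ACGTCCGA"  # cytosines at positions 1, 4 and 5 (0-based)

# Four simulated molecules: position 1 always methylated, position 4
# methylated in half of them, position 5 never methylated.
reads = [
    bisulfite_convert(reference, {1, 4}),
    bisulfite_convert(reference, {1}),
    bisulfite_convert(reference, {1, 4}),
    bisulfite_convert(reference, {1}),
]

levels = call_methylation(reference, reads)
print(levels)
```

Real bisulfite sequencing data additionally require a conversion-aware aligner, handling of both strands, and correction for incomplete conversion, which is why dedicated tools are needed.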

Both biotic and abiotic stresses can lead to massive changes in the DNA methylation status [160–162]. Some stress-induced DNA methylations can be inherited by the next generation. Transgenerational DNA methylation may be partially mediated by small RNAs [163]. It has been observed in some crops in response to stress [164,165], as a way of pre-acquiring immunity toward upcoming stresses via designed parental priming [164].

Histone proteins are responsible for the packing of DNA. The epigenetic modifications of core histones that affect the tightness of DNA packing are collectively called the histone code, which can relay important information to affect gene expression [177]. The Histone Sequence Database provides a comprehensive collection of histone sequences and structural information [178].

The addition of acetyl groups to histones neutralizes the positive charges and hence loosens the condensed DNA, leading to transcriptional activation [179], while the methylation of histones results in gene deactivation or repression [180]. The phosphorylation of histones causes the relaxation of chromatin and modulates histone acetylation and methylation [181].

Individual types of histone modifications on specific amino acid residues can be detected using specific antibodies or mass spectrometry, while genome-wide histone-DNA associations can be captured by chromatin immunoprecipitation (ChIP) and subsequently analyzed using either microarrays (ChIP-Chip) [182] or sequencing (ChIP-seq) [183].

Some histone-modifying enzymes are induced in crops under stress. For example, a trithorax-like H3K4 methyltransferase was found to be induced by drought in drought-tolerant barley cultivars [184], while a histone deacetylase was found to be induced by compatible infections and repressed by incompatible infections [185]. The methylation statuses of four transcription factors were affected by salt stress. The expression of three of these transcription factors was also found to be correlated with their H3 methylation and acetylation statuses [186]. A genome-wide study in rice identified 4837 genes that harbor differential H3K4me3 modification under drought stress, of which 609 genes showed expression that was significantly correlated with the H3K4me3 modification [187].

4.5. Phenome

Every observable biological characteristic beyond the genotype can be regarded as the phenotype. Phenotypes can be observed at the molecular, cellular, organismal, and population levels. Phenotypes also vary throughout the organism’s lifecycle, spanning different growth stages, and during different periods of stress. The environment can also exert significant influences on the phenotype. The total sum of phenotypes of an organism or a population constitutes the phenome.

As mentioned in Section 2.2, to make phenotypic data in public databases more searchable and accessible to users of bioinformatics tools, ontologies are used to describe the setup of the experiment and the phenotypic data. For example, one may study salt tolerance (TO:0006001) at the whole-plant flowering stage (PO:0007016) and the days to flower (TO:0000344) of Oryza sativa (GR_tax:013681), in a greenhouse study (EO:0007248) under a sodium chloride regimen (EO:0007048). These ontologies provide a common language to describe an experiment and render it understandable by both researchers and computational algorithms. For instance, some people may record certain phenotypes during the flowering stage. However, what is meant by “flowering stage”? Some people use “flowering stage” to refer to the time when the first flower opens. Others may use it to refer to the time when half of the individual plants have open flowers. The plant ontology removes this ambiguity by defining the flowering stages precisely. PO:0007026, PO:0007034, PO:0007053 and PO:0007052 refer to the stages at which the first flower, 1/4 of the flowers, 1/2 of the flowers, and 3/4 of the flowers are open, respectively. PO:0007024 marks the end of the flowering stage. The application of these ontologies can thus reduce the discrepancies in annotating the phenotypes and treatment conditions.
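As a rough illustration of how ontology terms make a phenotypic record machine-readable, the sketch below stores one observation using only the identifiers quoted above. The record layout, the class and function names, the identifier pattern (inferred solely from the examples in the text), and the measured value are assumptions made for illustration; they do not reflect the schema of any particular database.

```python
import re
from dataclasses import dataclass

# Identifier pattern inferred from the examples in the text (an assumption).
TERM_PATTERN = re.compile(r"(?:(?:TO|PO|EO):\d{7}|GR_tax:\d{6})")

# Flowering-stage terms exactly as listed in the text.
FLOWERING_STAGE_TERMS = {
    "PO:0007026": "first flower open",
    "PO:0007034": "1/4 of flowers open",
    "PO:0007053": "1/2 of flowers open",
    "PO:0007052": "3/4 of flowers open",
    "PO:0007024": "end of flowering stage",
}


@dataclass(frozen=True)
class Observation:
    taxon: str           # e.g. GR_tax:013681 (Oryza sativa)
    trait: str           # trait ontology (TO) term
    stage: str           # plant ontology (PO) growth-stage term
    environments: tuple  # environment ontology (EO) terms
    value: float

    def __post_init__(self):
        # Reject free-text annotations such as "flowering stage".
        for term in (self.taxon, self.trait, self.stage, *self.environments):
            if not TERM_PATTERN.fullmatch(term):
                raise ValueError(f"not a recognized ontology ID: {term!r}")


def describe_stage(term):
    """Return the definition of a flowering-stage term quoted in the text."""
    return FLOWERING_STAGE_TERMS.get(term, "not a flowering-stage term listed here")


# Days to flower of rice in a greenhouse under a sodium chloride regimen.
# The value 92.0 is made up for illustration.
obs = Observation(
    taxon="GR_tax:013681",
    trait="TO:0000344",
    stage="PO:0007016",
    environments=("EO:0007248", "EO:0007048"),
    value=92.0,
)
print(obs)
print(describe_stage("PO:0007053"))
```

Because every field holds a controlled identifier rather than free text, two laboratories recording "half of the flowers open" would both store PO:0007053, and a query for that term would retrieve both datasets.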

High-quality phenotypic information is essential for mapping, association studies, gene identifications, gene functional studies and genomic selections. To design experiments to collect phenotypic information, some critical parameters have to be considered, such as the sample/population sizes, experimental conditions, phenotypes to be assessed, and the data acquisition methods.

The size of a population can vary from a few plants for functional studies, through several hundred lines for mapping and GS, to as many as a thousand germplasms for GWAS. Some public collections of germplasms or populations are available upon request. The United States Department of Agriculture National Plant Germplasm System has a collection of over 500,000 germplasm accessions from 10,000 plant species including rice, soybean, tomato and many other staple crops. Table 8 summarizes some publicly available mutant and germplasm collections, some of which also provide phenotypic descriptions and photos of the mutants.

Since the phenome is the overall outcome of the interactions between the genotype and the environment, whether the phenotypic data are collected in a controlled environment or not can greatly affect the final interpretation of results. Field experiments can better mimic the actual conditions of crop production, but the consistency of the phenotype greatly depends on the location of the field, the soil composition, weather conditions, season, and so on. The interpretation of results can thus be complicated. For example, a change in the transpiration rate in some of the plants may not solely be the result of the stress treatment, but also the result of localized changes in light intensity and/or temperature in the field [188]. In general, a larger number of replications is required to compensate for the effects of environmental variations. A controlled environment such as a greenhouse or a growth chamber can minimize the effects of environmental fluctuations and hence emphasize the contribution of the genotype. However, data from controlled experiments are usually limited in scale and may overlook the fast-changing environment of the real production field.

Choosing the appropriate phenotypes to be assessed is also important. For example, stomatal conductance and pathogen titer are good indicators of osmotic stress tolerance and disease resistance, respectively. However, they are not readily applicable to large-scale experiments because of instrument limitations and laborious procedures. On the other hand, fresh weight and biomass directly reflect the productivity of the crops, but taking these measurements is destructive to the plant. For morphological and physiological phenotyping of crops under stress, a conversion of stress symptoms to parameters that can be captured and digitized is needed for high-throughput automation. Commonly used methods include: 2D or 3D visible light imaging [189,190], infrared thermography [188], near-infrared imaging, spectral reflectance [191], fluorescence analysis [191,192], stable isotope analysis [193] and X-ray imaging [194]. For example, a study of the wheat salt stress response suggested that the shoot area calculated from 3 digital images (2 side and 1 top images) showed a strong positive correlation with manually measured leaf area and shoot fresh weight, which commonly serve as indicators of salt tolerance in crops [195]. As a non-destructive method, the imaging system could continuously monitor the growth of the plant and distinguish its bi-phasic (osmotic stress phase and ionic stress phase) growth under salinity stress [195]. Another example is related to osmotic stresses (salinity and drought) that reduce stomatal conductance. Since the reduction in stomatal conductance will halt the cooling effect of transpiration, infrared thermal imaging can be used to monitor the degree of salinity and drought stress [188,196]. In the case of lesions on leaf surfaces caused by plant diseases, instead of measuring the lesion area on each leaf, determining the reduction in chlorophyll fluorescence is a possible alternative [197].
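The image-based example above can be sketched as follows. The text does not specify how the three images were combined in [195]; summing the plant pixel counts of the two side views and the top view is an assumption made here, and all pixel counts and fresh weights are invented for illustration. The sketch then checks how well the image-derived area tracks a destructive measurement using the Pearson correlation coefficient.

```python
from math import sqrt


def projected_shoot_area(side_a_px, side_b_px, top_px):
    """Combine plant pixel counts from two side images and one top image.

    A plain sum is assumed here; a real system would also calibrate
    pixels to physical units.
    """
    return side_a_px + side_b_px + top_px


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical plants: (side image 1, side image 2, top image) pixel counts
# and the shoot fresh weight (g) measured destructively afterwards.
images = [(1200, 1100, 900), (2000, 1900, 1500), (800, 750, 600), (2600, 2500, 2100)]
fresh_weights = [4.1, 6.8, 2.9, 9.3]

areas = [projected_shoot_area(*px) for px in images]
r = pearson_r(areas, fresh_weights)
print(areas)
print(round(r, 3))
```

Once such a correlation has been established on a calibration set, the image-derived area can stand in for fresh weight in later time points, so the same plant can be followed throughout the stress treatment.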

In addition to the physiological phenotypes, metabolite profiles in crops are also altered by both biotic and abiotic stresses [198,199]. Deleterious metabolites such as reactive oxygen species might be generated through the disruption of normal cellular processes while beneficial metabolites such as signaling molecules and osmoprotectants may be generated to alleviate the stress [200,201].

There are two major research strategies in metabolomic studies: metabolic fingerprinting and metabolic profiling. Metabolic fingerprinting uses the mass-to-charge ratio of mass spectrometry, the peak height and/or retention time of chromatography and the strength of the NMR signal as the metabolomic signature of specific samples, so that the identity of each metabolite need not be determined [231]. This helps to classify different samples into categories. For example, metabolic fingerprints have been used to differentiate between disease-resistant and susceptible varieties [217,227,228] or between salt-tolerant and sensitive varieties [232]. In one study, Fourier transform infrared (FT-IR) spectroscopy was used for the metabolic fingerprinting of salt-treated tomatoes [233]. A total of 882 FT-IR spectral variables were collected between wavenumbers 4000 and 600 cm−1 for each sample [233]. Through discriminant function analysis (DFA) of the spectral variables, salt-treated and control samples could be discriminated without knowing the identity and the quantity of each metabolite. Furthermore, key regions within the spectrum distinguishing the treated from the untreated samples were identified through genetic algorithms, and the major components were found to be amino radicals and nitrile-containing compounds [233]. Thus, disease resistance and stress tolerance of novel crop varieties can be assessed by comparing their metabolic fingerprints with those of well characterized varieties, facilitating the screening process.
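The principle of fingerprint-based classification, in which samples are grouped by their overall signal pattern without identifying any metabolite, can be sketched with a nearest-centroid classifier. This is deliberately a much simpler stand-in for the discriminant function analysis used in [233], not an implementation of it. The four-variable "spectra" below are invented; a real FT-IR fingerprint in that study had 882 variables.

```python
def centroid(spectra):
    """Mean spectrum of a group of equally long spectra."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]


def squared_distance(a, b):
    """Squared Euclidean distance between two spectra."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(spectrum, centroids):
    """Assign a spectrum to the class whose mean fingerprint is closest."""
    return min(centroids, key=lambda label: squared_distance(spectrum, centroids[label]))


# Hypothetical training fingerprints (absorbance at four wavenumbers).
training = {
    "control": [[0.10, 0.40, 0.30, 0.20], [0.12, 0.38, 0.33, 0.22]],
    "salt": [[0.11, 0.80, 0.31, 0.55], [0.09, 0.84, 0.29, 0.60]],
}
centroids = {label: centroid(spectra) for label, spectra in training.items()}

# Fingerprints of two uncharacterized samples.
unknown_a = [0.10, 0.78, 0.30, 0.50]
unknown_b = [0.11, 0.42, 0.32, 0.25]
print(classify(unknown_a, centroids))
print(classify(unknown_b, centroids))
```

Note that no metabolite names appear anywhere in the sketch: the second and fourth variables separate the two groups, but what compounds they correspond to would have to be worked out separately, as was done with genetic algorithms in the tomato study.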

On the other hand, metabolic profiling compares the metabolic compositions between samples and hence the quantitation and identification of the metabolites are required. Signal patterns must be matched to known standards or depositions in the databases in order to identify the actual compounds. For example, the accumulation of compatible solutes, such as proline, glycine-betaine, and their precursors, is usually observed in osmotically stressed crops, especially in tolerant varieties [216,220,221]. A specific example is the mitochondrial metabolic profile of flood-stressed soybean; metabolites were extracted from the roots and hypocotyls of soybean seedlings with or without submergence stress [226], and were then analyzed using capillary electrophoresis mass spectrometry. Eighty-one mitochondria-related metabolites were identified and quantified with reference to the commercially available standards [226]. There was an accumulation of TCA cycle-related metabolites, including citrate, succinate, and aconitate, but a reduction in ATP in flood-stressed plants, which can be explained by the arrest of aerobic respiration due to anoxia [222]. Following a similar logic, the accumulation of antimicrobial compounds, such as caffeic acid, phytoalexins, glycoalkaloids, and other polyphenolic compounds, is common in pathogen-infected crops compared to their uninfected counterparts [224–226,228]. Glucose oxidase secreted by a fungal pathogen, Botrytis cinerea, can also lead to the accumulation of gluconic acid in Vitis vinifera cv. Chardonnay berries [226].
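Matching signals to known standards, the step that separates profiling from fingerprinting, can be sketched as a mass lookup within a tolerance window. This is a simplified illustration and not the procedure of the cited studies: it matches on mass alone, whereas real identification also relies on retention or migration time and on authentic standards run on the same instrument. The reference values are monoisotopic masses of the neutral compounds calculated from their molecular formulas; the observed masses and the 5 ppm tolerance are hypothetical.

```python
# Monoisotopic masses (Da) of the neutral compounds, from their formulas.
STANDARDS = {
    "proline": 115.0633,          # C5H9NO2
    "glycine betaine": 117.0790,  # C5H11NO2
    "succinic acid": 118.0266,    # C4H6O4
    "citric acid": 192.0270,      # C6H8O7
}


def match_mass(observed_mass, standards, tolerance_ppm=5.0):
    """Return (name, error in ppm) for standards within the mass tolerance."""
    hits = []
    for name, reference_mass in standards.items():
        error_ppm = abs(observed_mass - reference_mass) / reference_mass * 1e6
        if error_ppm <= tolerance_ppm:
            hits.append((name, round(error_ppm, 2)))
    # Best match (smallest mass error) first.
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical neutral masses derived from measured signals.
for observed in (115.0635, 192.0275, 150.0000):
    print(observed, match_mass(observed, STANDARDS))
```

A signal with no hit, such as the last one, remains an unknown. It can still contribute to a fingerprint, but it cannot enter a profile until a matching standard or database entry is found.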

5. Future Perspectives

Sequencing throughput is no longer the major limiting factor in genomics and transcriptomics studies. The next generation sequencing platforms can generate enough depth for genome assembly in one or several runs [234]. However, sequence assembly and annotation for complex genomes remain challenging. The data acquisition platforms for other “-omics”, on the other hand, are under rapid development to catch up with the pace of genomic research. While data generation is no longer the rate-determining step, data integration and interpretation have become the bottleneck in the research pipeline. One obstacle hindering cross-platform analyses of different datasets is the variation in experimental designs, treatment conditions, and data formats. Drawing meaningful conclusions may sometimes be difficult when there are discrepancies between two germplasms. For example, the transcriptomic data of one germplasm may not be used effectively to explain the proteomic data of another germplasm. Researchers should therefore strategically design experiments to generate interrelated “-omics” data using carefully selected germplasms. The standardization of data acquisition and storage formats using strictly controlled vocabulary is also important.

With the advance of computer technology and high-throughput analysis platforms, life processes can now be captured, digitized, and stored on the hard disk of a computer. Yet, no matter how perfectly a genome is sequenced and assembled, biological data from experiments are still essential to connect the genotypes and the phenotypes. A few software packages and platforms have been developed to integrate the interactions of cellular components into networks [235,236]. For example, VirtualPlant has been developed as a software platform for the integration and analysis of different levels of data [237]. It provides large datasets of Arabidopsis gene annotation, gene functional categories, microarray data, biochemical pathways, interaction information, and microRNA:mRNA interaction information. Users can also upload their own gene lists and microarray data for analysis, and identify coexpressed genes, interacting proteins, and metabolites associated with their genes of interest. Building a similar platform for crop plants could be extremely useful, but it requires a well-coordinated effort among different research centers and groups.

Acknowledgments

This work is supported by the Hong Kong RGC Collaborative Research Fund (CUHK3/CRF/11G), the Hong Kong RGC General Research Fund (468610), and funding from the Lo Kwee-Seong Biomedical Research Fund and Lee Hysan Foundation. Jee Yan Chu copy-edited this manuscript.