Bottom Line:
Pathway enrichment analysis represents a key technique for analyzing high-throughput omic data, and it can help to link individual genes or proteins found to be differentially expressed under specific conditions to well-understood biological pathways. We present here a computational tool, SEAS, for pathway enrichment analysis over a given set of genes in a specified organism against the pathways (or subsystems) in the SEED database, a popular pathway database for bacteria. Our evaluation of SEAS indicates that the program provides highly reliable pathway mapping results and identifies more organism-specific pathways than similar existing programs.

Affiliation: Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, Georgia, United States of America.

ABSTRACT: Pathway enrichment analysis represents a key technique for analyzing high-throughput omic data, and it can help to link individual genes or proteins found to be differentially expressed under specific conditions to well-understood biological pathways. We present here a computational tool, SEAS, for pathway enrichment analysis over a given set of genes in a specified organism against the pathways (or subsystems) in the SEED database, a popular pathway database for bacteria. SEAS maps a given set of genes of a bacterium to pathway genes covered by SEED through gene ID and/or orthology mapping, and then calculates the statistical significance of the enrichment of each relevant SEED pathway by the mapped genes. Our evaluation of SEAS indicates that the program provides highly reliable pathway mapping results and identifies more organism-specific pathways than similar existing programs. SEAS is publicly released under the GPL license agreement and freely available at http://csbl.bmb.uga.edu/~xizeng/research/seas/.
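The significance step mentioned in the abstract can be illustrated with a one-sided hypergeometric test, a common choice for pathway enrichment. This is only a sketch: the text here does not state which statistical test SEAS uses, so the test itself, the function name `hypergeom_enrichment_p`, and its parameter names are all assumptions.

```python
from math import comb


def hypergeom_enrichment_p(genome_size: int, pathway_size: int,
                           query_size: int, overlap: int) -> float:
    """P(X >= overlap) for X ~ Hypergeometric (assumed test, not confirmed for SEAS).

    genome_size  : number of genes in the organism (population)
    pathway_size : number of genes in the pathway
    query_size   : number of genes in the user's gene set
    overlap      : number of query genes mapped to the pathway
    """
    upper = min(pathway_size, query_size)
    total = comb(genome_size, query_size)
    # Sum the probability of seeing `overlap` or more pathway genes by chance.
    tail = sum(
        comb(pathway_size, i) * comb(genome_size - pathway_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 3 of 3 query genes fall in a 3-gene pathway of a 10-gene genome.
p_value = hypergeom_enrichment_p(10, 3, 3, 3)
```

A small p-value suggests the pathway is over-represented in the query set. In practice a multiple-testing correction across pathways would normally follow.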

pone-0022556-g003: SEAS-based re-annotation of B. subtilis pathways using 11 reference genomes. (A) Taxonomic distance between reference genomes and B. subtilis. The first column represents the reference genomes, which are used in the x-axis in (B)–(D); (B) Re-annotation of B. subtilis pathways using the single genome strategy; (C) Re-annotation of B. subtilis pathways using the multiple genome strategy #1; (D) Re-annotation of B. subtilis pathways using the multiple genome strategy #2. L. sphaericus is very low in panel B at position 4 on the x-axis as it has no pathway annotation information.

Mentions:
The first two strategies have been well evaluated in the original papers on SEED [24], RAST [26] and P-MAP [27], so we focus on the assessment of the third strategy. Specifically, we re-annotate the pathways of E. coli and B. subtilis (already in SEED) based on SEED pathways encoded by other genomes (as references). The annotation is quite time-consuming if all genomes in SEED are used as references, but the coverage could be low if only one is used, because the reference genome may not be evolutionarily close enough to contribute useful annotation templates. To balance annotation performance and coverage, our idea is to combine some representative genomes for each group of reference genomes having similar evolutionary distances to the query genome. To assess this idea, we have evaluated different combinations of reference genomes in an iterative manner (Figures 2 and 3) based on the taxonomic distance, defined as the number of nodes in the path from the query organism to its closest common ancestor with its reference organism in the taxonomy tree defined in the KEGG Genome database (see Figures 2A and 3A). Based on the taxonomic distance, we have designed the following three strategies: the single genome strategy, which selects only one reference genome from SEED each time, but with a different distance each time (see Figures 2B and 3B); multiple genome strategy #1, which starts with the genome in SEED having the smallest taxonomic distance to the query genome and iteratively adds the next closest genome until K genomes have been selected, for a user-selected K > 0 (see Figures 2C and 3C); and multiple genome strategy #2, which starts from the genome in SEED farthest from the query genome and iteratively adds the next farthest genome until K genomes have been selected, trying to cover the best-studied genomes as references, which could be close or distant.
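The taxonomic distance and the three reference-selection strategies can be sketched as follows. This is a minimal illustration, not the SEAS implementation: the lineages are hypothetical root-to-organism paths with placeholder names, all function names are invented for the example, and the distance is counted as the number of steps from the query up to the closest common ancestor, which is one possible reading of "number of nodes in the path".

```python
from typing import Dict, List


def taxonomic_distance(query_lineage: List[str], ref_lineage: List[str]) -> int:
    """Steps from the query organism up to its closest common ancestor
    with the reference (assumed counting convention)."""
    shared = 0
    for q_node, r_node in zip(query_lineage, ref_lineage):
        if q_node != r_node:
            break
        shared += 1
    return len(query_lineage) - shared


def rank_references(query: str, lineages: Dict[str, List[str]]) -> List[str]:
    """Reference genomes sorted from closest to farthest (ties broken by name)."""
    dists = {
        name: taxonomic_distance(lineages[query], lineage)
        for name, lineage in lineages.items()
        if name != query
    }
    return sorted(dists, key=lambda name: (dists[name], name))


def single_genome(ranked: List[str], i: int) -> List[str]:
    """Single genome strategy: one reference at a time, at the i-th distance rank."""
    return [ranked[i]]


def closest_first(ranked: List[str], k: int) -> List[str]:
    """Multiple genome strategy #1: the K closest references."""
    return ranked[:k]


def farthest_first(ranked: List[str], k: int) -> List[str]:
    """Multiple genome strategy #2: the K farthest references."""
    return ranked[::-1][:k]


# Hypothetical toy taxonomy; node names are placeholders, not KEGG entries.
lineages = {
    "query": ["root", "p1", "c1", "g1", "query"],
    "refA": ["root", "p1", "c1", "g1", "refA"],
    "refB": ["root", "p1", "c2", "g2", "refB"],
    "refC": ["root", "p2", "c3", "g3", "refC"],
}
ranked = rank_references("query", lineages)
```

With this toy tree, `closest_first(ranked, 2)` picks the two nearest references and `farthest_first(ranked, 2)` the two most distant ones; evaluating each strategy for increasing K mirrors the iterative assessment described above.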
We compared the SEAS-based re-annotation results against the original pathway annotation of the two organisms in SEED using measures based on the following counts: TP (true positive) is the number of genes for which the SEAS-based annotation is the same as the original SEED annotation, FP (false positive) is the number of genes for which the SEAS-based annotation is different from the original SEED annotation, and FN (false negative) is the number of genes in the genome with SEED annotations but no SEAS annotations.
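The counting of TP, FP and FN can be sketched as below. The formulas for the measures themselves are not reproduced in this text, so the `sensitivity` and `precision` functions are assumptions (standard definitions built from these counts), and the data structures, gene IDs, and function names are hypothetical. Each gene is simplified to a single pathway label.

```python
from typing import Dict, Tuple


def annotation_counts(seed: Dict[str, str], seas: Dict[str, str]) -> Tuple[int, int, int]:
    """Count TP, FP, FN over genes that have an original SEED annotation."""
    tp = fp = fn = 0
    for gene, seed_pathway in seed.items():
        seas_pathway = seas.get(gene)
        if seas_pathway is None:
            fn += 1  # SEED annotation exists, but SEAS gave none
        elif seas_pathway == seed_pathway:
            tp += 1  # SEAS agrees with SEED
        else:
            fp += 1  # SEAS disagrees with SEED
    return tp, fp, fn


def sensitivity(tp: int, fn: int) -> float:
    """TP / (TP + FN); assumed measure, not confirmed by the text."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); assumed measure, not confirmed by the text."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical toy annotations (gene ID -> pathway label).
seed_annotation = {"g1": "A", "g2": "B", "g3": "C", "g4": "D"}
seas_annotation = {"g1": "A", "g2": "X", "g3": "C"}
tp, fp, fn = annotation_counts(seed_annotation, seas_annotation)
```

Genes annotated by SEAS but absent from SEED are ignored here, since the definitions above only cover genes with an original SEED annotation.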

