Bottom Line:
Identifying the biochemical basis of microbial phenotypes is a main objective of comparative genomics.Here we present a novel method using multivariate machine learning techniques for comparing automatically derived metabolic reconstructions of sequenced genomes on a large scale.Applying our method to 266 genomes directly led to testable hypotheses such as the link between the potential of microorganisms to cause periodontal disease and their ability to degrade histidine, a link also supported by clinical studies.

ABSTRACTIdentifying the biochemical basis of microbial phenotypes is a main objective of comparative genomics. Here we present a novel method using multivariate machine learning techniques for comparing automatically derived metabolic reconstructions of sequenced genomes on a large scale. Applying our method to 266 genomes directly led to testable hypotheses such as the link between the potential of microorganisms to cause periodontal disease and their ability to degrade histidine, a link also supported by clinical studies.

Figure 6: Classification quality for the phenotype periodontal disease causing. Left: classification of all genomes (266) into genomes related and not related to periodontal disease using the nearest neighbor classifier (IB1). Right: classification of oral genomes (15) into genomes related and not related to periodontal disease using the nearest neighbor classifier (IB1). Compared to classification based on all pathways (marked by a horizontal line) and based on randomly picked pathways (red), the classification based on the most relevant pathways yields better separation of periodontal species and other species in both genome datasets.

Mentions:
Analogous to the previous example of methanogenesis, we applied our method to the complete set of pathway profiles (266 species) as well as to the reduced set of 15 oral genomes to focus on periodontal-related rather than oral cavity-related biochemical features. Figure 6 shows the resulting classification qualities achieved with the nearest neighbor classifier. According to the cross-check, the phenotype 'periodontal disease causing' is reflected by the identified relevant pathways. In contrast to the phenotype 'methanogenesis', more highly ranking pathways must be considered for classification to reach the maximum classification quality. Therefore, we focus on the ten most relevant pathways in the following. Using these pathways, we obtained 0.75 as the maximum classification quality value in both genome sets compared to a maximum of 0.50 for all pathways and maximums of 0.08 and 0.29, respectively, for randomly chosen pathways (Table 4).

Figure 6: Classification quality for the phenotype periodontal disease causing. Left: classification of all genomes (266) into genomes related and not related to periodontal disease using the nearest neighbor classifier (IB1). Right: classification of oral genomes (15) into genomes related and not related to periodontal disease using the nearest neighbor classifier (IB1). Compared to classification based on all pathways (marked by a horizontal line) and based on randomly picked pathways (red), the classification based on the most relevant pathways yields better separation of periodontal species and other species in both genome datasets.

Mentions:
Analogous to the previous example of methanogenesis, we applied our method to the complete set of pathway profiles (266 species) as well as to the reduced set of 15 oral genomes to focus on periodontal-related rather than oral cavity-related biochemical features. Figure 6 shows the resulting classification qualities achieved with the nearest neighbor classifier. According to the cross-check, the phenotype 'periodontal disease causing' is reflected by the identified relevant pathways. In contrast to the phenotype 'methanogenesis', more highly ranking pathways must be considered for classification to reach the maximum classification quality. Therefore, we focus on the ten most relevant pathways in the following. Using these pathways, we obtained 0.75 as the maximum classification quality value in both genome sets compared to a maximum of 0.50 for all pathways and maximums of 0.08 and 0.29, respectively, for randomly chosen pathways (Table 4).

Bottom Line:
Identifying the biochemical basis of microbial phenotypes is a main objective of comparative genomics.Here we present a novel method using multivariate machine learning techniques for comparing automatically derived metabolic reconstructions of sequenced genomes on a large scale.Applying our method to 266 genomes directly led to testable hypotheses such as the link between the potential of microorganisms to cause periodontal disease and their ability to degrade histidine, a link also supported by clinical studies.

ABSTRACTIdentifying the biochemical basis of microbial phenotypes is a main objective of comparative genomics. Here we present a novel method using multivariate machine learning techniques for comparing automatically derived metabolic reconstructions of sequenced genomes on a large scale. Applying our method to 266 genomes directly led to testable hypotheses such as the link between the potential of microorganisms to cause periodontal disease and their ability to degrade histidine, a link also supported by clinical studies.