Mining Metabolic and Enzyme Databases for the Composition of Non-canonical Pathways

This project will develop and implement a set of algorithms for the discovery of complex metabolic pathways using information on metabolic pathways from different species available in databases such as KEGG and MetaCyc, and enzyme information available in databases such as BRENDA. The increasing availability of experimentally elucidated metabolic pathways, together with the availability of enzyme functional data, has ignited research in metabolic engineering and metabolomics. Recently, metabolic engineering has facilitated the production of complex pharmaceutical molecules by combining and manipulating metabolic pathways from multiple organisms. Metabolomic studies have revealed underlying causes of complex diseases and have helped diagnostic, prognostic, and treatment regimens as these evolve toward more individualized protocols. These spectacular advances have mostly relied upon laborious manual processes that are infeasible as more and more metabolic data are generated. The ability of the algorithms that will be developed herein to discover known pathways will be investigated thoroughly and will include wetlab experimentation with two model studies. The studies involve niacin and shikonin, two therapeutic agents. Computational and wetlab experimentation will be conducted in parallel, with the computation feeding the experiment and the experiment providing valuable feedback for the tuning of the parameters of the corresponding algorithms. A web server will be built to facilitate the use of our tools and to widely disseminate our work in the scientific community.
The research is timely and directed toward an important area of broad scientific interest with implications extending to fields such as health and energy. The project will provide training opportunities for graduate and undergraduate students. Participation in this type of research will provide graduate students with interdisciplinary training in metabolic engineering, metabolomics, systems biology, biochemistry, and molecular biology, as well as in mathematical modeling, algorithms, and computer science. The project will provide training opportunities and mentoring for women and minority graduate, undergraduate, and high school students.

BibTeX

@article{litsa2018atom-mapping,
abstract = {Atom mapping of a chemical reaction is a mapping between the atoms in the reactant
molecules and the atoms in the product molecules. It encodes the underlying reaction
mechanism and, as such, constitutes essential information in computational studies in
metabolic engineering. Various techniques have been investigated for the automatic
computation of the atom mapping of a chemical reaction, approaching the problem as a graph
matching problem. The graph abstraction of the chemical problem, though, eliminates crucial
chemical information. There have been efforts for enhancing the graph representation by
introducing the bond stabilities as edge weights, as they are estimated based on
experimental evidence. Here, we present a fully automated optimization-based approach, named
AMLGAM (Automated Machine Learning Guided Atom Mapping), that uses machine learning
techniques for the estimation of the bond stabilities based on the chemical environment of
each bond. The optimization method finds the reaction mechanism which favors the
breakage/formation of the less stable bonds. We evaluated our method on a manually curated
data set of 382 chemical reactions and ran our method on a much larger and diverse data set
of 7400 chemical reactions. We show that the proposed method improves the accuracy over
existing techniques based on results published by earlier studies on a common data set and
is capable of handling unbalanced reactions.},
author = {Litsa, Eleni E. and Pe\~{n}a, Matthew I. and Moll, Mark and Giannakopoulos, George and Bennett, George N. and Kavraki, Lydia E.},
title = {Machine Learning Guided Atom Mapping of Metabolic Reactions},
journal = {Journal of Chemical Information and Modeling},
year = {2018},
doi = {10.1021/acs.jcim.8b00434},
keywords = {metabolic networks},
note = {(To appear)}
}

Abstract

Atom mapping of a chemical reaction is a mapping between the atoms in the reactant
molecules and the atoms in the product molecules. It encodes the underlying reaction
mechanism and, as such, constitutes essential information in computational studies in
metabolic engineering. Various techniques have been investigated for the automatic
computation of the atom mapping of a chemical reaction, approaching the problem as a graph
matching problem. The graph abstraction of the chemical problem, though, eliminates crucial
chemical information. There have been efforts for enhancing the graph representation by
introducing the bond stabilities as edge weights, as they are estimated based on
experimental evidence. Here, we present a fully automated optimization-based approach, named
AMLGAM (Automated Machine Learning Guided Atom Mapping), that uses machine learning
techniques for the estimation of the bond stabilities based on the chemical environment of
each bond. The optimization method finds the reaction mechanism which favors the
breakage/formation of the less stable bonds. We evaluated our method on a manually curated
data set of 382 chemical reactions and ran our method on a much larger and diverse data set
of 7400 chemical reactions. We show that the proposed method improves the accuracy over
existing techniques based on results published by earlier studies on a common data set and
is capable of handling unbalanced reactions.

BibTeX

@article{kim2017_pathfinding_review,
abstract = {Recent developments in metabolic engineering have led to the successful biosynthesis
of valuable products, such as the precursor of the antimalarial compound, artemisinin, and
opioid precursor, thebaine. Synthesizing these traditionally plant-derived compounds in
genetically modified yeast cells introduces the possibility of significantly reducing the
total time and resources required for their production, and in turn, allows these valuable
compounds to become cheaper and more readily available. Most biosynthesis pathways used in
metabolic engineering applications have been discovered manually, requiring a tedious search
of existing literature and metabolic databases. However, the recent rapid development of
available metabolic information has enabled the development of automated approaches for
identifying novel pathways. Computer-assisted pathfinding has the potential to save
biochemists time in the initial discovery steps of metabolic engineering. In this paper, we
review the parameters and heuristics used to guide the search in recent pathfinding
algorithms. These parameters and heuristics capture information on the metabolic network
structure, compound structures, reaction features, and organism-specificity of pathways. No
one metabolic pathfinding algorithm or search parameter stands out as the best to use
broadly for solving the pathfinding problem, as each method and parameter has its own
strengths and shortcomings. As assisted pathfinding approaches continue to become more
sophisticated, the development of better methods for visualizing pathway results and
integrating these results into existing metabolic engineering practices is also important
for encouraging wider use of these pathfinding methods.},
keywords = {metabolic networks},
author = {Kim, Sarah M. and Pe\~{n}a, Matthew I. and Moll, Mark and Bennett, George N. and Kavraki, Lydia E.},
title = {A Review of Parameters and Heuristics for Guiding Metabolic Pathfinding},
journal = {Journal of Cheminformatics},
year = {2017},
volume = {9},
number = {1},
pages = {51},
month = sep,
doi = {10.1186/s13321-017-0239-6},
pmcid = {PMC5602787},
pmid = {29086092}
}

Abstract

Recent developments in metabolic engineering have led to the successful biosynthesis
of valuable products, such as the precursor of the antimalarial compound, artemisinin, and
opioid precursor, thebaine. Synthesizing these traditionally plant-derived compounds in
genetically modified yeast cells introduces the possibility of significantly reducing the
total time and resources required for their production, and in turn, allows these valuable
compounds to become cheaper and more readily available. Most biosynthesis pathways used in
metabolic engineering applications have been discovered manually, requiring a tedious search
of existing literature and metabolic databases. However, the recent rapid development of
available metabolic information has enabled the development of automated approaches for
identifying novel pathways. Computer-assisted pathfinding has the potential to save
biochemists time in the initial discovery steps of metabolic engineering. In this paper, we
review the parameters and heuristics used to guide the search in recent pathfinding
algorithms. These parameters and heuristics capture information on the metabolic network
structure, compound structures, reaction features, and organism-specificity of pathways. No
one metabolic pathfinding algorithm or search parameter stands out as the best to use
broadly for solving the pathfinding problem, as each method and parameter has its own
strengths and shortcomings. As assisted pathfinding approaches continue to become more
sophisticated, the development of better methods for visualizing pathway results and
integrating these results into existing metabolic engineering practices is also important
for encouraging wider use of these pathfinding methods.
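
To make the notion of a network-structure heuristic concrete, here is a minimal sketch of one such search parameter: a shortest-path search in which entering a compound costs as much as its number of connections, so that routes through highly connected hub compounds are penalized. It is one simple example of the kind of heuristic the review covers, not a method taken from it; the compound names, the toy network, and the function `find_pathway` are hypothetical.

```python
import heapq


def find_pathway(graph, source, target):
    """Dijkstra search where the cost of entering a compound is its degree."""
    degree = {compound: len(nbrs) for compound, nbrs in graph.items()}
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in graph[node]:
            nd = d + degree[nxt]
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return None
    # Walk predecessor links back from the target to rebuild the route.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Toy undirected compound network; "ATP" plays the role of a hub compound.
graph = {
    "A": ["ATP", "B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["ATP", "C"],
    "E": ["ATP"],
    "F": ["ATP"],
    "G": ["ATP"],
    "ATP": ["A", "D", "E", "F", "G"],
}

pathway = find_pathway(graph, "A", "D")
print(pathway)
```

In this toy network the two-step route through the hub is more expensive than the three-step route through B and C, so the longer route is returned. A plain hop-count search would pick the hub shortcut instead.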

BibTeX

@inproceedings{kim-pena2016an-evaluation-of-different-clustering-methods,
abstract = {Large-scale annotated metabolic databases, such as KEGG and MetaCyc, provide a
wealth of information to researchers designing novel biosynthetic pathways. However, many
metabolic pathfinding tools that assist in identifying possible solution pathways fail to
facilitate the grouping and interpretation of these pathway results. Clustering possible
solution pathways can help users of pathfinding tools quickly identify major patterns and
unique pathways without having to sift through individual results one by one. In this paper,
we assess the ability of three separate clustering methods (hierarchical, k-means, and
k-medoids) along with three pair-wise distance measures (Levenshtein, Jaccard, and n-gram)
to expertly group lysine, isoleucine, and 3-hydroxypropanoic acid (3-HP) biosynthesis
pathways. The quality of the resulting clusters was quantitatively evaluated against
expected pathway groupings taken from the literature. Hierarchical clustering and
Levenshtein distance seemed to best match external pathway labels across the three
biosynthesis pathways. The lysine biosynthesis pathways, which had the most distinct
separation of pathways, had better quality clusters than isoleucine and 3-HP, suggesting
that grouping pathways with more complex underlying topologies may require more tailored
clustering methods.},
author = {Kim, Sarah M. and Pe{\~n}a, Matthew I. and Moll, Mark and Giannakopoulos, George and Bennett, George N. and Kavraki, Lydia E.},
booktitle = {2016 International Conference on Bioinformatics and Computational Biology. ISCA},
keywords = {metabolic networks},
title = {An Evaluation of Different Clustering Methods and Distance Measures Used for Grouping
Metabolic Pathways},
year = {2016},
pages = {115--122}
}

Abstract

Large-scale annotated metabolic databases, such as KEGG and MetaCyc, provide a
wealth of information to researchers designing novel biosynthetic pathways. However, many
metabolic pathfinding tools that assist in identifying possible solution pathways fail to
facilitate the grouping and interpretation of these pathway results. Clustering possible
solution pathways can help users of pathfinding tools quickly identify major patterns and
unique pathways without having to sift through individual results one by one. In this paper,
we assess the ability of three separate clustering methods (hierarchical, k-means, and
k-medoids) along with three pair-wise distance measures (Levenshtein, Jaccard, and n-gram)
to expertly group lysine, isoleucine, and 3-hydroxypropanoic acid (3-HP) biosynthesis
pathways. The quality of the resulting clusters was quantitatively evaluated against
expected pathway groupings taken from the literature. Hierarchical clustering and
Levenshtein distance seemed to best match external pathway labels across the three
biosynthesis pathways. The lysine biosynthesis pathways, which had the most distinct
separation of pathways, had better quality clusters than isoleucine and 3-HP, suggesting
that grouping pathways with more complex underlying topologies may require more tailored
clustering methods.
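
The three pair-wise distance measures named in the abstract above can be sketched for pathways represented as ordered lists of compound identifiers. This is a minimal sketch under stated assumptions: the pathway encoding, the example pathways, and the choice to define the n-gram distance as a Jaccard distance over sets of n-grams are assumptions made here for illustration, and the paper may define these measures differently.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (unit-cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| over the sets of items; order is ignored."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def ngrams(seq, n=2):
    """Set of contiguous length-n windows of a sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def ngram_distance(a, b, n=2):
    """Jaccard distance over n-gram sets (one possible n-gram measure)."""
    return jaccard_distance(ngrams(a, n), ngrams(b, n))


# Hypothetical pathways written as sequences of placeholder compound IDs.
p1 = ["A", "B", "C", "D"]
p2 = ["A", "B", "X", "D"]
p3 = ["A", "Y", "Z", "D"]

for name, dist in [("Levenshtein", levenshtein),
                   ("Jaccard", jaccard_distance),
                   ("2-gram", ngram_distance)]:
    print(name, dist(p1, p2), dist(p1, p3))
```

A matrix of such pair-wise distances is the usual input to hierarchical or k-medoids clustering. Levenshtein and n-gram distances are sensitive to the order of steps in a pathway, whereas the Jaccard distance only compares which compounds appear.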

BibTeX

@article{antunes-15-eodd,
author = {Antunes, Dinler A. and Devaurs, Didier and Kavraki, Lydia E.},
title = {Understanding the challenges of protein flexibility in drug design},
journal = {Expert Opinion on Drug Discovery},
year = {2015},
abstract = {Protein-ligand interactions play key roles in various metabolic pathways, and the
proteins involved in these interactions represent major targets for drug discovery.
Molecular docking is widely used to predict the structure of protein-ligand complexes, and
protein flexibility stands out as one of the most important and challenging issues for
binding mode prediction. Various docking methods accounting for protein flexibility have
been proposed, tackling problems of ever-increasing dimensionality. This paper presents an
overview of conformational sampling methods treating target flexibility during molecular
docking. Special attention is given to approaches considering full protein flexibility.
Contrary to what is frequently done, this review does not rely on classical biomolecular
recognition models to classify existing docking methods. Instead, it applies algorithmic
considerations, focusing on the level of flexibility accounted for. This review also
discusses the diversity of docking applications, from virtual screening of small drug-like
compounds to geometry prediction of protein-peptide complexes. Considering the diversity of
docking methods presented here, deciding which one is the best at treating protein
flexibility depends on the system under study and the research application. In virtual
screening experiments, ensemble docking can be used to implicitly account for large-scale
conformational changes, and selective docking can additionally consider local binding-site
rearrangements. In other cases, on-the-fly exploration of the whole protein-ligand complex
might be needed for accurate geometry prediction of the binding mode. Among other things,
future methods are expected to provide alternative binding modes, which will better reflect
the dynamic nature of protein-ligand interactions.},
volume = {10},
number = {12},
pages = {1301--1313},
keywords = {proteins and drugs},
doi = {10.1517/17460441.2015.1094458},
pmid = {26414598}
}

Abstract

Protein-ligand interactions play key roles in various metabolic pathways, and the
proteins involved in these interactions represent major targets for drug discovery.
Molecular docking is widely used to predict the structure of protein-ligand complexes, and
protein flexibility stands out as one of the most important and challenging issues for
binding mode prediction. Various docking methods accounting for protein flexibility have
been proposed, tackling problems of ever-increasing dimensionality. This paper presents an
overview of conformational sampling methods treating target flexibility during molecular
docking. Special attention is given to approaches considering full protein flexibility.
Contrary to what is frequently done, this review does not rely on classical biomolecular
recognition models to classify existing docking methods. Instead, it applies algorithmic
considerations, focusing on the level of flexibility accounted for. This review also
discusses the diversity of docking applications, from virtual screening of small drug-like
compounds to geometry prediction of protein-peptide complexes. Considering the diversity of
docking methods presented here, deciding which one is the best at treating protein
flexibility depends on the system under study and the research application. In virtual
screening experiments, ensemble docking can be used to implicitly account for large-scale
conformational changes, and selective docking can additionally consider local binding-site
rearrangements. In other cases, on-the-fly exploration of the whole protein-ligand complex
might be needed for accurate geometry prediction of the binding mode. Among other things,
future methods are expected to provide alternative binding modes, which will better reflect
the dynamic nature of protein-ligand interactions.