A Framework for Protein Structure Prediction on the Grid
Eduardo HUEDO¹, Ugo BASTOLLA¹, Rubén S. MONTERO² and Ignacio M. LLORENTE²,¹
¹ Centro de Astrobiología (CSIC-INTA). 28850 Torrejón de Ardoz, Spain.
² Dpto. de Arquitectura de Computadores y Automática. Universidad Complutense, 28040 Madrid, Spain.
{huedoce,bastollau}@inta.es, {rubensm,llorente}@dacya.ucm.es
Received 15 November 2003
Abstract
The large number of protein sequences, provided by genomic projects at an increasing pace, constitutes a challenge for large scale computational studies of protein structure and thermodynamics. Grid technology is very suitable to face this challenge, since it provides a way to access the resources needed in compute and data intensive applications. In this paper, we show the procedure to adapt to the Grid an algorithm for the prediction of protein thermodynamics, using the GridWay tool. GridWay allows the resolution of large computational experiments by reacting to events dynamically generated by both the Grid and the application.
Keywords
Bioinformatics, Grid Technology, Adaptive Scheduling and Execution.
§1 Introduction
Bioinformatics, which deals with the management and analysis of huge amounts of biological data, could benefit enormously from the suitability of the Grid for executing high-throughput applications. It is foreseeable that the Grid will soon be adopted, because biological data is growing very fast, due to the proliferation of automated high-throughput experimental techniques and organizations dedicated to Biotechnology. Therefore, the resources required to manage and analyze this data will only be accessible through the Grid.

One of the main challenges in Computational Biology concerns the analysis of the huge amount of protein sequences provided by genomic projects at an ever increasing pace. The structure of a protein is encoded in its amino acid sequence, but deciphering it has turned out to be a very difficult problem, which is still waiting for a complete solution. Nevertheless, in several cases, particularly when homologous proteins are known, computational methods can be quite reliable. At a higher level of complexity, a very significant effort is being dedicated to mapping protein interactions, which ultimately determine many of the response properties of the cell. For this task too, intensive computational methods are needed to complement the different experimental approaches and analyze their results.

The aim of this paper is to present some experiences obtained in applying Grid technology to Bioinformatics. In particular, we will consider an algorithm to predict the structure and thermodynamic properties of proteins, which could be applied to several kinds of large scale studies, to demonstrate the usefulness of the Grid to build sequence-structure alignments for a large set of sequences. The main characteristics of the structure prediction algorithm are briefly described in Section 2. Then, in Section 3, we present the GridWay framework to deal with the complexity of the Grid, and we enumerate the steps needed to adapt the application to take advantage of the GridWay features. In Section 4, we show the biological problems for which experimental computational results are presented. Finally, we give some conclusions in Section 5.
§2 Prediction of Protein Structure and Thermodynamics
In the past decades, a great effort has been dedicated to the prediction of the native structure of proteins from the knowledge of their amino acid sequence. Despite promising recent progress, the accepted principle that the native state is the thermodynamic state of minimal free energy of the protein plus solvent system still does not allow the prediction of protein structure on purely physical grounds. The most successful methods are based on the biological principle that protein structure is highly conserved during evolution. Inspired by this principle, homology modelling aims at detecting an evolutionary relationship between the target sequence and the sequence of a protein with known structure, in order to infer the target structure by analogy. A third class, known as threading methods, combines both the evolutionary and the physical approach. Based on the observation that protein sequences can diverge through evolution to the point that their similarity is undetectable, while conserving roughly the same structure, these methods try to fit all known protein structures to the target sequences, scoring the match in terms of both sequence similarity and some simplified free energy function. Methods of this class can in principle identify even distant homologous proteins sharing the same fold as the query protein. Here we use such methods to obtain estimates of protein thermodynamic functions.

In this work we will consider an effective free energy function able to assign to the experimentally known native structure the lowest energy of the whole set of candidate structures obtained by aligning without gaps the target sequence with structures in the Protein Data Bank (PDB) 5, 2). This procedure for generating candidate structures is called gapless threading. In this way, the correct structure is recognized for most of the sequences in the PDB. Exceptions are proteins with large cofactors (i.e. non-protein molecules needed for the functioning of the protein, like the heme group in hemoglobin), which are not included in the effective energy function, small fragments, and multimeric proteins with strong inter-chain interactions. The effective energy function is able to estimate to a satisfactory accuracy the folding free energy (difference in free energy between the native state and the almost random unfolded state) of proteins whose structure is known.

We have applied the effective energy function to estimate the normalized energy gap 12, 4), a parameter involved in folding efficiency, for sets of orthologous proteins performing the same function in different organisms. This study showed that proteins of intracellular bacteria have smaller folding efficiency than the corresponding proteins of free living bacteria 20). This result was expected from the argument that intracellular bacteria live in small populations, and natural selection is less effective in maintaining the properties of their macromolecules.

In order to use the effective energy function described above for protein structure prediction, we have to apply it to gapped alignments between the query sequences and the candidate structures. Gaps in the alignment represent residues that are deleted either from the sequence or from the structure in order to fit them together. This is motivated by the fact that during evolution amino acids are inserted in or deleted from protein sequences, thus spoiling the perfect gapless alignment that two sequences had when they originated from a common ancestor. Introducing gaps enormously increases the space of candidate structures for protein structure prediction. In order to eliminate spurious matches obtained by placing a large number of gaps, one has to penalize the introduction of gaps. We therefore score an alignment $\mathrm{ali}(A,B)$ from each residue in the protein A to the corresponding residue in the protein B with the expression:
\[
\mathrm{Energy}\bigl(\mathrm{Seq}(A), \mathrm{Str}(B), \mathrm{ali}(A,B)\bigr) + G_0 \cdot N_{\mathrm{gap}} + G_1 \cdot L_{\mathrm{gap}},
\]
where $N_{\mathrm{gap}}$ and $L_{\mathrm{gap}}$ are respectively the number and total length of gaps, and $G_0$ and $G_1$ are two parameters that have to be set by trial and error. Details on the implementation of the scoring function and its optimization will be given elsewhere.

For each structure in the PDB, our algorithm builds the gapped alignment between the target sequence and the structure which maximizes the above score. The method has been tested in the 5th round of Critical Assessment of techniques for protein Structure Prediction (CASP5)∗1 1). Although it is less efficient than homology based methods in recognizing distantly related proteins, when a close relative of the target structure is present in the PDB, even with very low sequence similarity, the algorithm recognizes it and produces a good alignment between sequence and structure. In such cases, the algorithm can be used to estimate thermodynamic parameters of the target sequence, such as the folding free energy and the normalized energy gap, and as such it has been used to confirm our previous results on the folding efficiency of proteins of different bacteria 3).
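As an illustration of the gap-penalized score defined in this section, the following Python sketch counts the gaps in a given alignment and adds the two penalty terms to an energy value. It is only a sketch under stated assumptions: the alignment is assumed to be given as two equal-length strings with '-' marking a gap, the names `gap_stats` and `gapped_score` are hypothetical, and the effective energy is passed in as a plain number because its functional form is not given in the text. The values used for $G_0$ and $G_1$ are arbitrary; neither their fitted values nor their sign convention is stated here.

```python
from itertools import groupby

GAP = "-"  # assumed gap character; the paper does not specify a representation


def gap_stats(aligned_seq: str, aligned_str: str) -> tuple[int, int]:
    """Return (N_gap, L_gap): the number of gaps and their total length.

    A gap is counted here as a maximal run of consecutive alignment columns
    in which the same row holds the gap character (an assumption).
    """
    if len(aligned_seq) != len(aligned_str):
        raise ValueError("aligned rows must have the same length")
    labels = []
    for a, b in zip(aligned_seq, aligned_str):
        if a == GAP and b == GAP:
            raise ValueError("a column cannot have gaps in both rows")
        # Mark which row (if any) has a gap in this column.
        labels.append("seq" if a == GAP else "str" if b == GAP else None)
    n_gap = 0
    l_gap = 0
    for label, run in groupby(labels):
        if label is not None:
            n_gap += 1
            l_gap += len(list(run))
    return n_gap, l_gap


def gapped_score(energy: float, n_gap: int, l_gap: int,
                 g0: float, g1: float) -> float:
    """Energy + G0 * N_gap + G1 * L_gap, as in the expression above."""
    return energy + g0 * n_gap + g1 * l_gap


# Toy alignment: three gaps covering four columns in total.
n_gap, l_gap = gap_stats("AC--GT-A", "ACDEG-TA")
print(n_gap, l_gap)                                        # 3 4
print(gapped_score(-10.0, n_gap, l_gap, g0=2.0, g1=0.5))   # -2.0
```

Counting a gap as a maximal run on one side is only one plausible reading of "number and total length of gaps"; a different convention would change $N_{\mathrm{gap}}$ but not $L_{\mathrm{gap}}$.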
§3 The GridWay Framework
The Globus toolkit has become a de facto standard in Grid computing 11). Globus services allow secure and transparent access to resources across multiple administrative domains, and serve as building blocks to implement the stages of Grid scheduling 19): resource discovery and selection, and job preparation, submission, monitoring, migration and termination. However, the user is responsible for manually performing all the scheduling steps in order to achieve any functionality. Moreover, the Globus toolkit does not provide support for adaptive execution, required in dynamic Grid environments. In fact, one of the most challenging problems that the Grid computing community has to deal with is the fact that Grids present a high fault rate and unpredictably changing conditions (dynamic resource availability, load and cost).

∗1 http://PredictionCenter.llnl.gov/casp5/

To overcome these limitations, we have recently developed the GridWay experimental framework∗2. The core of the GridWay framework 15) is a personal submission agent that performs all scheduling stages and watches over the correct and efficient execution of jobs. Adaptation to changing conditions is achieved by dynamic rescheduling of jobs, which can lead to a job migration if it is considered feasible and worthwhile 17), when one of the following events is detected:
• A “better” resource is discovered (opportunistic migration) 17).
• The remote resource or its network connection fails.
• The submitted job is cancelled or suspended.
• Performance degradation is detected.
• The resource demands of the application change (self-migration).

The architecture of the GridWay framework is depicted in Figure 1. The user interacts with the framework through a programming or command line interface, which forwards client requests (submit, kill, stop, resume...) to the dispatch manager. The dispatch manager periodically wakes up at each scheduling interval, and tries to submit pending and rescheduled jobs to Grid resources. Once a job is allocated to a resource, a submission manager and a performance monitor are started to watch over its correct and efficient execution 15).
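The behaviour of the dispatch manager described above can be sketched as a small in-memory simulation. This is not GridWay code: the class and method names (`DispatchManager`, `wake_up`, `Job`, `JobState`) are hypothetical, resources are plain strings, and resource selection is reduced to taking the first free host, which is an assumption made only to keep the example short.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class JobState(Enum):
    PENDING = auto()
    RESCHEDULED = auto()
    RUNNING = auto()


@dataclass
class Job:
    name: str
    state: JobState = JobState.PENDING
    resource: Optional[str] = None


@dataclass
class DispatchManager:
    """Hypothetical stand-in for the dispatch manager (illustration only)."""
    free_resources: list = field(default_factory=list)
    jobs: list = field(default_factory=list)

    def submit(self, job: Job) -> None:
        """Client request: queue a job; it is placed at a later wake-up."""
        self.jobs.append(job)

    def wake_up(self) -> list:
        """One scheduling interval: place pending and rescheduled jobs."""
        started = []
        for job in self.jobs:
            waiting = job.state in (JobState.PENDING, JobState.RESCHEDULED)
            if waiting and self.free_resources:
                job.resource = self.free_resources.pop(0)
                job.state = JobState.RUNNING
                # In the framework described in the text, a submission manager
                # and a performance monitor would be started for the job here.
                started.append(job.name)
        return started


dm = DispatchManager(free_resources=["hostA", "hostB"])
for name in ("job0", "job1", "job2"):
    dm.submit(Job(name))
print(dm.wake_up())   # ['job0', 'job1']: only two hosts are free
dm.free_resources.append("hostC")
print(dm.wake_up())   # ['job2']: placed at the next scheduling interval
```

Calling `wake_up` explicitly instead of sleeping for a scheduling interval keeps the sketch deterministic; a real agent would run this step on a timer.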
[Figure 1: block diagram. Recoverable labels: User & Application Programming Interface, Job/Array Structure, Dispatch Manager, Resource Selector, Submission Manager, Performance Monitor, Grid Information Service, GRAM, GridFTP, GASS, File Transfer & Submission, Dynamic File Access, Remote Host.]
Fig. 1 Architecture of the GridWay framework.
∗2 http://asds.dacya.ucm.es
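The rescheduling events listed earlier in this section can be captured in a small enumeration. The decision rule that follows is purely an assumption made for illustration, since the text only states that a migration must be "feasible and worthwhile": the function `worth_migrating`, its parameters and its time-based criterion are hypothetical and should not be read as the GridWay policy.

```python
from enum import Enum, auto


class ReschedulingEvent(Enum):
    """The five events named in the text as triggers for rescheduling."""
    BETTER_RESOURCE = auto()              # opportunistic migration
    RESOURCE_OR_NETWORK_FAILURE = auto()
    JOB_CANCELLED_OR_SUSPENDED = auto()
    PERFORMANCE_DEGRADATION = auto()
    DEMANDS_CHANGED = auto()              # self-migration


def worth_migrating(event: ReschedulingEvent,
                    remaining_here: float,
                    remaining_there: float,
                    migration_overhead: float) -> bool:
    """Hypothetical 'feasible and worthwhile' test (an assumption).

    All times are estimates in seconds. After a failure the job cannot
    continue where it is, so this sketch always moves it; otherwise the
    job moves only if its estimated finish time improves.
    """
    if event is ReschedulingEvent.RESOURCE_OR_NETWORK_FAILURE:
        return True
    return remaining_there + migration_overhead < remaining_here


# A better host would finish in 600 s instead of 1000 s; moving costs 100 s.
print(worth_migrating(ReschedulingEvent.BETTER_RESOURCE, 1000.0, 600.0, 100.0))  # True
# Same hosts, but moving costs 500 s, so staying is cheaper.
print(worth_migrating(ReschedulingEvent.BETTER_RESOURCE, 1000.0, 600.0, 500.0))  # False
```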
