SCHEDULE: NOV 13-19, 2010

Separating Workflow Management and Data Staging to Improve the Performance of Scientific Workflows

SESSION: Student Poster Reception

EVENT TYPE: Poster, ACM Student Poster

TIME: 5:15PM - 7:00PM

AUTHOR(S): Muhammad Ali Amer

ROOM: Main Lobby

ABSTRACT: Scientists in a variety of application domains use workflow technologies to manage the complexity of sophisticated analyses and simulations. These analyses may be composed of thousands or millions of interdependent tasks that must be executed in a specified order, and each task may require the management of significant amounts of data. Data management operations performed by the workflow system include staging input data to the computational node where a task will be executed; staging results generated by a task to more permanent storage, if necessary; and removing or “cleaning up” data products that are no longer needed from the execution sites.
Our work focuses on improving the performance of these data management operations: stage-in of data to enable the start of computation, stage-out of data products to permanent storage, and cleanup of files that are no longer needed. We introduce a Data Placement Service (DPS) that is responsible for performing these operations, thus freeing the workflow management system to perform other tasks, such as managing dependencies among computational tasks. We integrate the DPS with the Pegasus workflow management system. As part of workflow execution, Pegasus relies on the DPS to handle data staging and cleanup operations.
The Pegasus Workflow Management System is a workflow-mapping and execution engine that is used to map complex, large-scale scientific workflows with thousands of tasks processing terabytes of data onto distributed resources. We modified Pegasus to call the DPS for staging and cleanup management jobs. The modified Pegasus creates requests for the DPS using a list of dependencies that Pegasus generates for each compute job; this list includes the stage-in, transfer and stage-out jobs necessary for the successful completion of a compute job. Pegasus submits the stage-in, stage-out and cleanup requests to the DPS using non-blocking calls. The DPS server then performs the staging and cleanup operations.
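The request flow described above can be illustrated with a minimal Python sketch. It is not the Pegasus or DPS code: the names DataRequest, build_requests and submit_nonblocking, the (kind, path) shape of the dependency list, and the use of a local thread pool in place of a remote DPS server are all assumptions made for illustration. It shows only the general idea of turning a per-job dependency list into requests and handing them off without waiting for them to finish.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRequest:
    """Hypothetical request record; not the actual DPS message format."""
    kind: str         # "stage-in", "stage-out", or "cleanup"
    compute_job: str  # compute job the request belongs to
    path: str         # file to transfer or delete


ALLOWED_KINDS = {"stage-in", "stage-out", "cleanup"}


def build_requests(compute_job, dependencies):
    """Turn a per-job list of (kind, path) pairs into request records."""
    requests = []
    for kind, path in dependencies:
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown operation: {kind}")
        requests.append(DataRequest(kind, compute_job, path))
    return requests


def submit_nonblocking(executor, handler, requests):
    """Submit every request and return at once with one future per request.

    The caller is free to continue with other work (for example, managing
    task dependencies) and may inspect the futures later.
    """
    return [executor.submit(handler, request) for request in requests]
```

In this sketch the caller would keep the returned futures and check them only when a compute job actually needs its inputs, which is one way to model a non-blocking call.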
Pegasus works with Condor-G, which provides a means for submitting jobs to compute clusters remotely. A Corral server reserves job space on the Linux cluster using placeholder jobs called glideins. Condor then submits jobs directly to the glideins without having to go through the local scheduler’s queue. Data staging and cleanup jobs are submitted directly to the DPS, which executes those jobs independently of Condor.
The DPS consists of four major components: the listening interface, the cleanup component, the staging component, and the task thread-pool. DPS receives incoming requests from Pegasus on a listening interface. On receiving a call, DPS invokes either the staging or the cleanup component. For both types of calls, the DPS manages one central thread-pool for transfers and deletes.
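The four components listed above can be pictured with the following Python sketch. It is an illustrative stand-in under stated assumptions, not the DPS implementation: the class DataPlacementSketch and its method names are hypothetical, the listening interface is modeled as an in-process method call in place of a network endpoint, and transfers and deletes are only recorded, not performed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DataPlacementSketch:
    """Hypothetical model of a listener, two components, and one shared pool."""

    def __init__(self, workers=4):
        # One central thread pool serves both transfers and deletes.
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.log = []
        self._lock = threading.Lock()

    def _record(self, action, path):
        # Placeholder for a real transfer or delete; only logs the action.
        with self._lock:
            self.log.append((action, path))
        return (action, path)

    def staging_component(self, paths):
        # Queue one transfer task per file on the shared pool.
        return [self.pool.submit(self._record, "transfer", p) for p in paths]

    def cleanup_component(self, paths):
        # Queue one delete task per file on the same shared pool.
        return [self.pool.submit(self._record, "delete", p) for p in paths]

    def listen(self, call):
        """Listening interface: dispatch one (kind, paths) call."""
        kind, paths = call
        if kind == "staging":
            return self.staging_component(paths)
        if kind == "cleanup":
            return self.cleanup_component(paths)
        raise ValueError(f"unknown call type: {kind}")

    def shutdown(self):
        # Wait for queued tasks to finish, then release the worker threads.
        self.pool.shutdown(wait=True)
```

Sharing a single pool between the two components, as the text describes, means the number of worker threads caps the combined number of concurrent transfers and deletes.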
We test the framework on three different workflows: Montage, Epigenomics and Broadband. The Montage astronomy application is used to construct large image mosaics of the sky. The Epigenomics workflow is used by scientists to draw conclusions from the gene density of DNA on a reference genome. Finally, the Broadband workflow provides earthquake scientists with a platform to combine long-period deterministic seismograms with deterministic low-frequency and stochastic high-frequency simulations.
We measure the overall reduction in the execution time of each workflow to determine how our framework performs depending on the size and type of data placement involved in a workflow. Owing to limited space, we present results from only one workflow (Montage) in the supplemental file attached to this submission. In the poster presentation, we will present the overall architecture of the framework, the implementation of the DPS, experimental results for all three workflows, conclusions and a summary of future work.