Pegasus 4.4.1 Released

We are happy to announce the release of Pegasus 4.4.1. Pegasus 4.4.1 is a minor release that contains small enhancements and bug fixes for the Pegasus 4.4.0 release.

Enhancements:

Leaf cleanup job failures don’t trigger workflow failures

Finer grained capturing of GridFTP errors

Pegasus now ignores only the common failures of GridFTP removals, instead of ignoring all errors.

pegasus-transfer threading enhancements

Allow two retries with threading before falling back on single-threaded transfers. This prevents pegasus-transfer from overwhelming remote file servers when failures happen.

Support for MPI Jobs when submitting using Glite to PBS

For user-specified MPI jobs in the DAX, the only way to ensure that the MPI job launches in the right directory through GLITE and blahp is to wrap the user's MPI job in a wrapper script and refer to that wrapper in the transformation catalog. The wrapper should cd into the directory that Pegasus sets in the job's environment via the _PEGASUS_SCRATCH_DIR environment variable.
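A minimal sketch of such a wrapper follows. The script name and the mpiexec invocation are illustrative assumptions; only the _PEGASUS_SCRATCH_DIR variable comes from Pegasus. For illustration, the sketch simulates the environment Pegasus would provide:

```shell
#!/bin/bash
# Sketch of a wrapper script (hypothetical name: mpi-job-wrapper.sh) that
# would be referenced in the transformation catalog in place of the bare
# MPI executable. Pegasus sets _PEGASUS_SCRATCH_DIR in the job's
# environment; for standalone illustration we fall back to a temp dir.
export _PEGASUS_SCRATCH_DIR="${_PEGASUS_SCRATCH_DIR:-$(mktemp -d)}"

# Change into the scratch directory Pegasus chose for the job.
cd "$_PEGASUS_SCRATCH_DIR" || exit 1
echo "launching MPI job in $PWD"

# In a real wrapper, the last line would exec the user's MPI job, e.g.:
#   exec mpiexec -n "$NODES" /path/to/user-mpi-binary "$@"
```

Because the wrapper uses exec in the final step, the MPI job replaces the wrapper process and inherits the working directory.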

Updated quoting support for glite jobs

Quoting in the blahp layer in Condor for glite jobs is broken. Fixes were made to the planner and to the pbs_local_submit_attributes.sh file so that environment variable values can contain spaces or double quotes.

The fix relies on users copying the pbs_local_submit_attributes.sh file from the Pegasus distribution into the Condor glite bin directory. More details at https://jira.isi.edu/browse/PM-802
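The installation step can be sketched as below; both paths are assumptions and vary by installation, so adjust them for your Pegasus and HTCondor installs:

```shell
# Hypothetical paths -- locate the script shipped with your Pegasus
# distribution and your Condor glite bin directory before copying.
cp /usr/share/pegasus/htcondor/glite/pbs_local_submit_attributes.sh \
   /usr/lib/condor/glite/bin/pbs_local_submit_attributes.sh
```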

pegasus-s3 now has support for copying objects larger than 5GB

The pegasus-tc-converter code was cleaned up, and support for the database-backed transformation catalog was dropped.

The planner now complains about deep LFNs when using Condor file transfers.

Earlier, pegasus-monitord had a race condition: it tried to parse the .out and .err files when a JOB_FAILURE or JOB_SUCCESS event happened, instead of waiting for the POST_SCRIPT_SUCCESS or POST_SCRIPT_FAILURE message when a postscript was associated with the job. As a result, it could encounter empty kickstart output files, since the postscript might have moved them before monitord opened a file handle. The fix changed the monitord logic to parse the files on JOB_FAILURE or JOB_SUCCESS only if no postscript is associated with the job.

For aborted jobs that failed with a signal, monitord did not parse the job status. Because of that, no corresponding JOB_FAILURE was recorded, and hence the exitcode for the inv.end event was not recorded.

A set of portability fixes from the Debian packaging were incorporated into pegasus builds.

Clusters of size 1 should be allowed when using PMC

An earlier fix for 4.4.0 allowed single jobs to be clustered using PMC. However, this resulted in regular MPI jobs that should not be clustered also being clustered using PMC. The logic was updated to wrap a single job with PMC only if label-based clustering is turned on and the job is associated with a label.

With certain user configurations, the leaf cleanup jobs could try to delete the submit directory for the workflow

A user can configure a workflow such that the workflow submit directory and the workflow scratch directory are the same on the local site. This can result in stuck workflows if the leaf cleanup jobs are enabled. The planner now throws an error during planning if it detects that the directories are the same.

For hierarchical workflows, the jobs that make up the workflow referred to by a subdax job may run in a child directory of the scratch directory in which the jobs of the top-level workflow are running. With leaf cleanup enabled, the parent scratch directory may be cleaned before the subdax job has completed. The fix adds explicit dependencies between the leaf cleanup jobs and the subdax jobs.

pegasus-analyzer did not show planner prescript log for failed subdax jobs

For prescript failures of subdax jobs (i.e. the failure of the planning operation on the sub-workflow), pegasus-analyzer never showed the content of the log; it only pointed to the location of the log in the submit directory. This is now fixed. More details at https://jira.isi.edu/browse/PM-808

pegasus-analyzer shows job stderr for failed pegasus-lite jobs

When a Pegasus Lite job fails, pegasus-analyzer showed stderr from both the Kickstart record and the job stderr. This was confusing, as the stderr for those jobs is used to log all kinds of PegasusLite information and usually has nothing to do with the failure. To make these jobs easier to debug, we added logic to show only the Kickstart stderr in these cases.

The planner did not validate the pegasus.data.configuration value.
As a result, a typo in the properties file caused the planner to fail with a NullPointerException.
More details at https://jira.isi.edu/browse/PM-799

pegasus-statistics output padding

Value padding is done only for text output files so that they are human readable. However, due to a bug, the value padding computation was at one point also being done for CSV files. This caused an exception when the output file type for job statistics was CSV.

The Pegasus project is supported by the National Science Foundation under the OAC SI2-SSI program, grant #1664162. Pegasus also receives support from the Department of Energy, the National Institutes of Health, Defense Advanced Research Projects Agency, and the USC Information Sciences Institute.