8.5 Stable Release Series 6.6

This is a stable release series of Condor.
It is based on the 6.5 development series.
All new features added or bugs fixed in the 6.5 series are available
in the 6.6 series.
The details of each version are described below.

8.5.1 Version 6.6.12

Release Notes:

Contains only a couple of bug fixes.

Bugs fixed that are included in version 6.7.19:

None.

Bug fixes irrelevant to the 6.7 series:

Fixed a bug which caused the condor_collector to incorrectly
handle Collector ads in which the Machine attribute is
missing, or Storage ads in which the Name attribute is missing. In
these cases, a condor_collector running on some platforms
(notably, Solaris) could crash.

Known Bugs:

None.

Version 6.6.11

Release Notes:

A security team at UW-Madison is conducting an ongoing security
audit of the Condor system and has identified a few important
vulnerabilities.
Condor versions 6.6.11 and 6.7.18 fix these security problems and
other bugs.
There have been no reported exploits, but all sites are urged to
upgrade immediately.

The Condor Team will publish detailed reports of these vulnerabilities
on 2006-04-24, 4 weeks from the date when the fixes were first
released (2006-03-27).
This will allow all sites time to upgrade before enough information to
exploit these bugs is widely available.

Security Bugs Fixed:

Bugs in previous versions of Condor could allow any user who can
submit jobs on a machine to gain access to the ``condor'' account
(or whatever non-privileged user the Condor daemons are running as).
This bug cannot be exploited remotely, only by users already logged
onto a submit machine in the Condor pool.

The security of the ``condor_config_val -set'' feature was
found to be insufficient, so this feature is now disabled by default.
There are new configuration settings to enable this feature in a
secure manner.
Please read the descriptions of ENABLE_RUNTIME_CONFIG,
ENABLE_PERSISTENT_CONFIG and PERSISTENT_CONFIG_DIR
in the example configuration file shipped with the latest Condor
releases, or in section 3.3.5.

Other bugs fixed that are included in version 6.7.18:

Fixed a bug which could cause the condor_collector to crash
when it receives certain types of malformed ads.

Fixed a bug which caused the condor_collector to incorrectly
handle ads in which the UpdateInterval attribute is set.
In particular, previous versions of the condor_collector
used the UpdateInterval value as the maximum lifetime
of the ad when aging the ads, which could cause it to remove the ad
prematurely.
The condor_collector now looks at the ClassAdLifetime
attribute, and uses its value (if set).
NOTE: No current Condor daemons publish either of these
attributes, but they may do so in the future.

Bugs fixed that are included in version 6.7.14:

Fixed a rare problem in the condor_negotiator where a poorly
formed ClassAd from a single condor_schedd could halt negotiation
for the entire pool.
This poorly formed ad could only happen in extremely rare
circumstances, but it was possible.
Now, the condor_negotiator will simply ignore poorly formed
ClassAds and continue to negotiate with any other condor_schedd in
the system that has idle jobs.
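
The new behavior amounts to skipping a bad ad instead of aborting the whole cycle. A minimal Python sketch, assuming ads are plain dictionaries and that "poorly formed" means a missing Name or a non-integer IdleJobs value; the representation, the validity check, and the function name are illustrative assumptions, not the condor_negotiator's actual code:

```python
def negotiate(schedd_ads):
    """Return the names of the schedds that were negotiated with."""
    served = []
    for ad in schedd_ads:
        try:
            name = ad["Name"]
            idle = int(ad["IdleJobs"])
        except (KeyError, TypeError, ValueError):
            # Poorly formed ad: ignore it and keep going with the others.
            continue
        if idle > 0:
            served.append(name)
    return served
```

One bad ad in the middle of the list no longer prevents the schedds after it from being served.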

Fixed a bug which caused log messages which should contain
``PRIV_USER_FINAL'' to be ``PRIV_USER_FINALPRIV_FILE_OWNER''.
It's also possible that this same bug could cause crashes if any
daemon attempts to log a message which would refer to
``PRIV_FILE_OWNER''.

Fixed a bug which caused the condor_starter to exit with an
error when the sum total of the file transfer size exceeded 2 GB.
This, in turn, caused a ``shadow exception'', and the job would
fail.

Bugs fixed that are included in version 6.7.11:

In very rare cases, the condor_startd could get into an
infinite loop if a job it was managing was suspended and then there
were fatal errors trying to send commands to evict the corresponding
condor_starter.
This bug has been fixed, and the condor_startd will now correctly
recover (and clean up all processes) if it fails to send commands to a
starter managing a suspended job.

Condor on Solaris has been patched to work around a Solaris stdio
limitation of 255 maximum file descriptors. Before this patch, heavily
loaded Condor daemons running on Solaris, particularly the condor_schedd,
could exit complaining about lack of file descriptors for dprintf.

Fixed a bug where the condor_starter would follow symbolic links to
directories when calculating job disk usage. This could cause an incorrect
job disk usage calculation, or hang the starter upon encountering an infinite
directory loop. This bug only affected Unix platforms.
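
As an illustration of the corrected approach, the following Python sketch sums file sizes without ever descending into a symbolic link, so a link that points back at its own parent cannot cause a hang. The function name is hypothetical and this is not the condor_starter's implementation:

```python
import os

def disk_usage(top):
    """Sum the sizes of files under top without following symbolic links."""
    total = 0
    # followlinks=False keeps os.walk out of symlinked directories,
    # so a link pointing back up the tree cannot create an endless walk.
    for dirpath, dirnames, filenames in os.walk(top, followlinks=False):
        for name in filenames:
            # lstat measures a link itself, never the file it points to.
            total += os.lstat(os.path.join(dirpath, name)).st_size
    return total
```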

For Globus jobs, the Rematch expression is now evaluated when a
submit fails (in addition to when a submit commit times out).

Fixed a bug that caused the condor_gridmanager to go into an
infinite loop if an entry in the job's environment string was missing
an equals sign.
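
A parser avoids this class of infinite loop by always advancing past the current entry, whether or not it is well formed. The Python sketch below assumes a semicolon-delimited string of NAME=VALUE entries and assumes that malformed entries are simply skipped; neither the delimiter nor the skipping policy is stated in these notes, and the function name is hypothetical:

```python
def parse_environment(env_string, delimiter=";"):
    """Parse 'A=1;B=2' into a dict, skipping entries without '='."""
    env = {}
    pos = 0
    while pos < len(env_string):
        end = env_string.find(delimiter, pos)
        if end == -1:
            end = len(env_string)
        entry = env_string[pos:end]
        # Always step past the entry, even a malformed one, so the loop
        # cannot spin forever on an entry that lacks an equals sign.
        pos = end + 1
        name, sep, value = entry.partition("=")
        if sep and name:
            env[name] = value
    return env
```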

Bugs fixed that are included in version 6.7.9:

Fixed a bug where the condor_startd would erroneously compute the
console idle time using the file /proc/interrupts on Unix machines
that were not running Linux.

Fixed a bug where the condor_negotiator might dump core if it was
reconfigured in the middle of a negotiation cycle.

Fixed a bug where the condor_negotiator might dump core if a startd
had a name longer than 63 bytes.

Fixed a bug that could cause condor_userprio to crash if the
data it gets back from the condor_negotiator is invalid.

Fixed a bug where
DEFAULT_PRIO_FACTOR was ignored if
ACCOUNTANT_LOCAL_DOMAIN was not defined.

Bug fixes irrelevant to the 6.7 series:

Added the -NoEventChecks and the -AllowLogError
command-line flags to condor_submit_dag and the condor_submit_dag
man page (they were already in condor_dagman).
Added -r and -debug to the condor_submit_dag
man page (they were already in condor_submit_dag, just not
documented).

Made command-line arguments case insensitive in the Windows
version of condor_submit_dag; also fixed log file checks in
that version.

Known Bugs:

A bug has been found which can cause a condor_collector to
crash on some platforms (notably, Solaris). This can happen if the
condor_collector receives a Collector ad in which the
Machine attribute is missing, or a Storage ad in which the
Name is missing. There is no security threat involved in
either case.

Version 6.6.10

Release Notes:

Most of the fixes included in this release were also included in
version 6.7.7 (see below).

The QUEUE_CLEAN_INTERVAL timer is reset during a
condor_schedd reconfig only if this timer value has been changed.
Previously, the timer was reset during all condor_schedd reconfigs, which
could prevent the job_queue.log file from being cleaned. Note that
this timer is always reset upon a condor_schedd startup. See the
related change for truncating the job_queue.log below, for this same
release.

Previously, the condor_schedd would overreact and exit if it
tried to send a user email and SMTP_SERVER was undefined;
now it simply prints an error in the SchedLog and moves on.

Bugs fixed that are included in version 6.7.7:

Fixed a bug that could cause the file job_queue.log in
the Condor SPOOL directory to grow unnecessarily large, thereby
slowing down the startup and/or shutdown times for the condor_schedd
daemon.

Fixed a critical bug where the console idle time for PS/2 keyboards
and mice was not being updated correctly.

Fixed a bug in the condor_collector that could cause it to
crash when parsing certain types of invalid ClassAds. In particular, if
a Machine, Schedd or License ClassAd sent to the condor_collector had
an IP address field which was empty (which should never happen), the
condor_collector would crash.

Fixed some bugs in how the condor_schedd handles a graceful
shutdown (either because of a condor_off or a SIGTERM on
UNIX):

There was a minor bug if JOB_START_DELAY was set to
0 that would prevent the condor_schedd from correctly cleaning
up during graceful shutdown.
Now, the condor_schedd will properly shut down, even if
JOB_START_DELAY is set to 0.

Fixed a bug when there are scheduler universe jobs that were
recently submitted to the queue.
Previously, the shutdown code would not evict scheduler universe
jobs that had been submitted since the last
SCHEDD_INTERVAL (which defaults to 5 minutes).
So, if a user submitted a scheduler universe job and then someone
shut down Condor on that machine, the condor_schedd would wait
until the next SCHEDD_INTERVAL had elapsed before
evicting the job.
Now, the schedd will always attempt to evict scheduler universe
jobs during a shutdown, without waiting for this interval to pass.

A number of Windows-specific bugs were fixed:

It was possible under certain circumstances for execute
directories to not be cleaned up properly. This has been fixed.

Certain Asian locales would cause the condor_starter to crash
due to character translation problems. This has been fixed.

Condor will now properly report memory sizes that exceed 2 GB.

The condor_starter would be unable to run jobs if the LOG
path had a period (.) in it. This has been fixed.

The condor_startd would leak memory, especially on SMP
machines. This has been fixed.

The condor_master would crash immediately on Windows 2003
Server if the firewall was enabled. This has been fixed.

Fixed a bug in condor_dagman that could cause condor_dagman
to fail an assertion if PRE or POST scripts are throttled with the
-maxpre or -maxpost condor_submit_dag command-line flags.

Bugs fixed that are NOT included in version 6.7.7:

Fixed a bug where enabling the grid_monitor for any Globus
job handled by something other than a hard-coded list of jobmanager names
would cause the job to stay idle forever. The hard-coded list of
jobmanager names was: condor, fork, lsf, pbs, and remote. A jobmanager
by any other name (e.g. condor_rh9, or lcgpbs) would cause the problem.
This bug was originally fixed in internal releases of 6.7.0, but it was
reintroduced by mistake in all public releases.

Fixed the way condor_version handles command-line arguments
(there were a number of problems and inconsistencies) and added a
-help option and usage message.

Fixed some memory leaks in the condor_startd that would be
induced by calling condor_reconfig or condor_status -d.

By design, Condor daemons will exit if their parent process
exits. On Windows, a bug introduced in the 6.5.x series broke this
behavior. This is now fixed.

On Windows, users would often observe the condor_master failing to
add exceptions for the Condor daemons to the Windows Firewall on Windows
XP SP2 or Windows 2003 Server SP1. The condor_master will
now retry for a longer period of time to add these exceptions,
and the number of retries has now been made configurable. See
section 3.3.9 for more information.

Known Bugs:

None.

Version 6.6.9

Release Notes:

Most of the fixes included in this release were also included in
version 6.7.5.
However, at the end of this section, a few fixes that were added to
6.6.9 after 6.7.5 was released are mentioned separately.

Bugs fixed that are included in version 6.7.5:

Fixed a security bug in the condor_schedd that could enable a
maliciously modified condor_submit tool to overwrite files in the Condor
SPOOL subdirectory, including the job queue.

Fixed a bug where, under very pathological file permission failure
conditions with a standard universe job, the user log would show a
cycle of an execute event followed by a termination event even though
the job had not actually run.

Bugs fixed that are NOT included in version 6.7.5:

Fixed a memory management bug introduced in version 6.6.8 that
could result in deallocated memory being referenced after a child
process forked from a Condor daemon exits.

Fixed bugs in some Condor tools that failed to locate
condor_startd daemons that contained multiple @ signs in
their Name attribute.
For example, a virtual machine from a multiple-CPU condor_startd
spawned using glidein would have the name:
vm1@[pid]@[hostname].
All Condor tools that need to communicate with a condor_startd
like this will now succeed.
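
One way to handle such names is to split at the last @ instead of the first, so that everything before it is treated as a prefix. The Python sketch below assumes the hostname is always the final component; the function name is hypothetical and the Condor tools may resolve names differently:

```python
def split_startd_name(name):
    """Split a startd Name into (prefix, hostname) at the last '@'."""
    prefix, sep, host = name.rpartition("@")
    if not sep:
        # Plain hostname with no virtual machine prefix.
        return "", name
    return prefix, host
```

For a name of the form vm1@[pid]@[hostname], the prefix keeps both the virtual machine id and the pid.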

Removed a fixed-length buffer in the code that handled the
SUBSYS_EXPRS config file setting.
Previously, if any attributes referred to were larger than
approximately 1000 bytes, Condor daemons would crash.
Now, there is no limit to the size of the attributes listed in
SUBSYS_EXPRS.
For more information about this setting, see
section 3.3.5.

Fixed a bug which would cause Condor to fail to cache user GID
information and potentially overwhelm NIS servers.

Fixed another bug which could cause UDP machine updates to be
dropped by the condor_collector.

Known Bugs:

If a DAG node has both retries and a POST script, and the
actual Condor job for the node fails, the POST script is not
run except after the last retry of the job (or if the job
succeeds). (The POST script should be run each time the node
job is run, whether the job succeeds or not.)

Occasionally, Condor generates both a terminated event and
an aborted event for a job that is aborted. If this happens for a
DAG node job, condor_dagman considers this an error
and aborts the DAG. If you run into this problem, you can avoid
the abort by adding the -NoEventChecks flag to the argument list
in the condor_dagman submit file generated by condor_submit_dag
(you have to run condor_submit_dag -no_submit and hand-edit
the resulting submit file). However, if you get the
double events on a node that has retries, condor_dagman will fail an assertion.
The only fix for this is to upgrade to a 6.7.5 or newer condor_dagman.
You can do this by simply installing a newer condor_dagman executable,
without any other changes to your Condor installation. It is fine to
run a 6.7 condor_dagman on a 6.6 Condor installation.

In a DAG, if a node job generates an executable error event,
the DAG is aborted. This can be worked around by adding the
-NoEventChecks flag to the argument list in the condor_dagman
submit file generated by condor_submit_dag (you have to run
condor_submit_dag -no_submit and hand-edit the resulting
submit file).

Version 6.6.8

Release Notes:

Most of the fixes included in this release were also included in
version 6.7.3.
However, at the end of this section, a few fixes that were added to
6.6.8 after 6.7.3 was released are mentioned separately.

New Features:

None.

Bugs Fixed:

In version 6.6.7, we fixed bugs related to the
-format option to various Condor tools.
However, some sites were using -format in ways we did not
expect, by not specifying any ``%'' conversion in the format string at all.
This used to work, given the old buggy code that handled
-format, but the changes in version 6.6.7 broke it for
format strings without a ``%''.
Now, if the format string does not contain a ``%'', the attribute name which follows it is once again ignored, and the
format string is printed directly without any modification.
For example, to print out the machine's Name (always defined)
and the RemoteUser (only defined if the machine is claimed),
and always print a newline (to keep the formatting legible), this
command will now work:

Fixed a bug that would cause Condor to fail to gracefully
shut down user jobs that are console applications (including batch
scripts).

Fixed an issue that would cause condor_store_cred to fail
if the user did not have NETWORK logon rights.

The condor_store_cred query command would appear to succeed,
even if the stored credential was invalid (e.g. the password was changed
but the password stash was not updated). This has been fixed.

Fixed a bug that would cause the condor_startd to crash under
certain conditions during job eviction. This bug was introduced in Condor
version 6.6.6.

Fixed a bug that would cause condor_dagman to crash if it was
submitted as a non-Administrator user.

Fixed a bug that would cause Condor to occasionally kill processes
that didn't belong to it during job eviction or daemon restarts.

On startup, the condor_master would occasionally fail to add the
daemons to the Windows XP firewall exception list because of a race with
the Windows SharedAccess service. This bug has been fixed.

If a user submitted a job with an invalid executable, the starter
would often wedge until the job was preempted. Now, the starter attempts
to detect invalid executables and prevent wedging.

Fixed issues that would cause condor_startd to ``disappear''
from the pool because of dropped machine ad updates. This fix applies
to all platforms, but the symptoms were exhibited predominantly on
Windows machines.

Fixed a bug that could cause HIGHPORT and
LOWPORT parameters to be ignored if a Windows machine ran for
several weeks without being rebooted.

Starting with RedHat 9, newer versions of Linux began to produce
core files named core.<pid>.
This broke functionality in Condor that managed and transferred back
any core file created by the job, since the condor_starter was
unable to locate the proper file.
Now, Condor will correctly transfer back core files, even if they
are created as core.<pid>.
This functionality works in all universes, and is independent of
Condor's file transfer mechanism.
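
Locating the file comes down to accepting both naming schemes. A small Python sketch, assuming that a core file is either named core or core followed by a dot and a numeric pid; the pattern and function name are illustrative, not the condor_starter's actual logic:

```python
import os
import re

# Matches "core" and "core.<pid>", but not names like "core.txt" or "score".
CORE_NAME = re.compile(r"^core(\.\d+)?$")

def find_core_files(directory):
    """Return the sorted names in directory that look like core files."""
    return sorted(n for n in os.listdir(directory) if CORE_NAME.match(n))
```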

Fixed a bug that was causing condor_startd to consume large
amounts of memory over long periods of time.

Fixed a bug that was causing condor_startd to fail to start up
with the message, "caInsert: Can't insert CpuBusy into target ClassAd."

Fixed a long-standing bug in Condor regarding the configuration
settings LOWPORT and HIGHPORT.
When these were enabled (to restrict Condor's port usage to a
specified range), Condor would fail to set the
SO_KEEPALIVE option on sockets it created.
This meant that in the case of a hard machine failure (such as a
sudden power outage) on one machine, Condor daemons
communicating with that machine would never notice it had died.
Now, the SO_KEEPALIVE option is properly set on all
sockets, even with LOWPORT and HIGHPORT
defined.
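
The essence of the fix is that the keepalive option is set on every socket, including those bound through the restricted port range. The Python sketch below shows that idea with the standard socket module; the function name and the loopback default are assumptions for illustration, and Condor's own socket code is not shown in these notes:

```python
import socket

def bound_socket(low_port, high_port, host="127.0.0.1"):
    """Create a TCP socket bound within [low_port, high_port], keepalive on."""
    for port in range(low_port, high_port + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Enable keepalive no matter how the port is chosen, so a peer that
        # dies without closing the connection is eventually detected.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        try:
            sock.bind((host, port))
            return sock
        except OSError:
            sock.close()  # port already in use; try the next one
    raise OSError("no free port in the requested range")
```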

Fixed a bug that caused condor_rm -forcex to not remove
jobs that make use of leave_in_queue.
If invoked using a cluster id, username, or constraint expression,
condor_rm would report success but the jobs would remain in the queue.
Now, the jobs will leave the queue.

When a held job is released, job ad attributes HoldReasonCode and
HoldReasonSubCode are now properly moved to LastHoldReasonCode and
LastHoldReasonSubCode.

Fixed a bug that would cause the RemoveReason attribute
for a job
to be set incorrectly in some circumstances.
Specifically, this was when a job
was not running and a periodic_remove expression
caused the job to be cancelled.

Fixed condor_submit so that submit description file
commands written in either ThisStyle or
this_style syntax will work.

Fixed a very rare but serious bug in Condor that was originally
introduced in version 6.3.0.
Under exceptional circumstances (a very heavily loaded machine where
a huge number of processes are being spawned all the time, and where
the condor_schedd is managing many thousands of jobs in the
queue), it was possible for the condor_schedd to run a job twice.
We have fixed the underlying problem that led to the
condor_schedd making this mistake, rendering this error
impossible.

Fixed a bug that occurred when submitting Condor-G jobs while
using the grid monitor. If the grid job monitor returned a FAILED
status for a job while the jobmanager was asleep, the condor_gridmanager
could sometimes end up in a loop, continuously restarting the remote
Globus jobmanager and then putting it back to sleep.

Known Bugs:

None.

Bugs fixed that are not included in version 6.7.3:

Fixed a discrepancy in the SUBSYS_ADDRESS_FILE
setting.
Previously, this setting did not work for SUBSYS values of
COLLECTOR or NEGOTIATOR (for example, defining
COLLECTOR_ADDRESS_FILE had no effect).
Now, if either of these is defined in the configuration file,
the corresponding Condor daemon will write out the address
and port it is using to the specified file.
Normally, the condor_collector and condor_negotiator listen on a
well-known, fixed port.
However, on single-machine, Personal Condor installations,
these address files allow all of the Condor daemons and tools to locate
the condor_collector and condor_negotiator, even if they are
using a dynamically assigned port.
For more information about the SUBSYS_ADDRESS_FILE
setting, please see the description in
section 3.3.5.
For more information about using non-standard ports for the
condor_collector and condor_negotiator, please see the
description of ``Non Standard Ports for Central Managers'' in
section 3.7.1.

Version 6.6.7

Release Notes:

None.

New Features:

Added a feature to the condor_master which automatically adds
the Condor daemons to the Windows Firewall exception list. This only
applies to machines running Windows XP SP2.

Bugs Fixed:

Fixed a Windows-specific race condition that could, in rare cases,
cause Condor to fail to properly signal the job to
suspend, continue, or preempt.

When Condor transfers the job executable using the file transfer
mechanism, it used to leave the binary sitting as a world-writable
file inside the execute directory on UNIX.
Now, executable files transferred by Condor have the proper
permissions (mode 0755).
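
Mode 0755 means the owner may read, write, and execute, while group and others may only read and execute. A minimal Python sketch of applying and verifying that mode on a UNIX file; the function name is hypothetical and the call is shown only to make the permission bits concrete:

```python
import os
import stat

def fix_executable_mode(path):
    """Set rwxr-xr-x on a transferred executable and return the new mode."""
    os.chmod(path, 0o755)
    # S_IMODE strips the file-type bits, leaving only the permission bits.
    return stat.S_IMODE(os.stat(path).st_mode)
```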

Fixed an important bug in the low-level code that Condor uses to
transfer files across a network.
There were certain temporary failure cases that were being treated
as permanent, fatal errors.
This resulted in file transfers that aborted prematurely, causing
jobs to needlessly re-run.
The code now gracefully recovers from these temporary errors.
This should significantly help throughput for some sites,
particularly ones that transfer very large files as output from
their jobs.

Fixed a bug in the file transfer mechanism which caused
segmentation faults when very long input/output/intermediate file
lists were used.

Fixed a number of bugs in the -format option to condor_q
and condor_status.
Now, these tools will properly handle printing boolean expressions
in all cases.
Previously, depending on how the boolean evaluated, either the
expression was printed, or the tool could crash.
Furthermore, the tools do a better job of handling the different
types of format conversion strings and printing out the appropriate
value.
For example, if a user tries to print out a boolean attribute with
condor_status -format "%d\n" HasFileTransfer, the
condor_status tool will evaluate HasFileTransfer and print
either a 0 or a 1 (FALSE or TRUE).
If, on the other hand, a user tries to print out a boolean attribute
with condor_status -format "%s\n" HasFileTransfer, the
condor_status tool will print out the string ``FALSE'' or ``TRUE''
as appropriate.
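
The two cases above can be modeled with a short Python sketch. It covers only the %d and %s conversions applied to a boolean value, uses Python's printf-style operator as a stand-in for the tools' formatting code, and the function name is hypothetical:

```python
def format_attr(fmt, value):
    """Render one attribute value with a printf-style format string."""
    if isinstance(value, bool):
        if "%d" in fmt:
            return fmt % int(value)  # boolean printed as 0 or 1
        if "%s" in fmt:
            return fmt % ("TRUE" if value else "FALSE")
    return fmt % value
```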

condor_dagman now generates a fatal error if any node submit
files are missing the log file attribute. This behavior can be
overridden with the -AllowLogError command-line option.

condor_dagman now does better checking for inconsistent events
(such as getting multiple terminate events for a single job). This
checking can be disabled with the -NoEventChecks command-line
option.

Under Tru64, Condor would sometimes fail to start a job while
setting the resource limits on behalf of the job.
This error appears to be the result of a kernel issue.
A workaround has been implemented which will leave the limits
of the job unmodified and run the job when this specific error
situation arises.

On Windows, occasionally Condor would exhibit erratic behavior
when a machine resumes from sleeping. This has been fixed.

On Windows, occasionally Condor would fail to bind to any available
interfaces due to a mishandling of a function return value. This has
been fixed.

Known Bugs:

None.

Version 6.6.6

Release Notes:

A condor_dagman job will fail and report a cycle in the DAG
when XML logs are used in a single or multiple log format. The POST
script completion event is not converted to XML, so condor_dagman
never sees the scripts complete or fail.

New Features:

The checkpoint server has moved from contrib module status to being
a normal part of Condor.

When they first start running, all Condor daemons will now try to
print to their log file the full path to the binary they are
executing.
Unfortunately, we can only reliably get this information on Linux,
Solaris, Mac OS X, and Windows platforms.
On other platforms, this information will only be printed to the log
file in certain cases that depend on how the daemon was invoked.
This new feature was added to aid in debugging problems where sites
were not running the version of the Condor daemons they thought they
were due to problems in custom-built startup scripts.

condor_wait is now available in the Windows port.

Added a fix to the accountant that allows users to specify user
priorities with condor_userprio before any jobs have been submitted.

Added support for running batch files under Windows when using the
STARTD_CRON or USER_JOB_WRAPPER attributes.

Moved from Globus 2.2.2 to Globus 2.2.4 for Condor-G, except for
the DUX 4.0f platform.

Bugs Fixed:

Windows bug fixes:

Fixed a bug which could cause Condor to kill processes that
aren't related to Condor or the job it was running at the time.

Fixed a problem that could cause daemons or tools to crash
when they looked up information about processes running on the
system.

Fixed a problem with the collector dropping TCP updates with
pools larger than roughly 20 machines. This issue only occurs with
UPDATE_COLLECTOR_WITH_TCP enabled.

Fixed an issue with condor_store_cred reporting success when
the store command had in fact failed under certain circumstances.

Removed condor_kbdd_dll. It is no longer used.

Fixed an issue with condor_birdwatcher that caused it to
leak resource handles.

Fixed an issue with the Windows port of condor_dagman that
would cause it to crash when POST scripts were used.

Fixed a bug where the environment of jobs in any universe could
be corrupted.

The condor_startd now properly cleans up execute directories on
root-squashed NFS mounts.

Fixed a problem where the condor_starter could crash if the
job it was running used Condor's file transfer mechanism and the
full path names to the job's files became longer than a few hundred
characters.

The image_size attribute of a job on Mac OS X is much
closer to the values that ps returns.
Previously it would be highly inflated.

Fixed a memory leak in the condor_gridmanager.

Added the -Storklog argument to condor_submit_dag to make it
compatible with the older Perl script of the same name.

Removed support for the -libc option for condor_version.

condor_compile now emits a warning if our internal ld is
somehow not invoked while linking a standard universe executable.

Fixed a minor bug in the file transfer mechanism. Specifically,
if a VANILLA job had when_to_transfer_output set to
ON_EXIT_OR_EVICT, wrote more than one output file, and was
actually evicted, the condor_shadow would have a fatal
run-time error (shadow exception) and the job would be rerun.

DAGMan bug fixes:

If submit files for individual nodes referred to the same log
file with different paths, condor_dagman would read log events
incorrectly and the DAG would fail.
condor_dagman is now able to recognize that the different paths
actually refer to the same log file.
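
Recognizing that two spellings name one file generally means comparing canonical paths instead of the raw strings. A Python sketch of that comparison, with a hypothetical function name; how condor_dagman actually performs the check is not described in these notes:

```python
import os

def same_log_file(path_a, path_b):
    """Return True if both paths resolve to the same file."""
    # realpath collapses "." and ".." components and resolves symbolic links.
    return os.path.realpath(path_a) == os.path.realpath(path_b)
```

With this check, a log given as foo in one submit file and as ./foo in another is treated as a single log.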

Fixed a bug where DAGMan failed to monitor Stork job logs.

If a node submit file doesn't specify a log file, the warning
message now gets printed out in the DAGMan log file.

Fixed a bug that caused condor_dagman to fail if the first node
submit file has a line continuation in its log file line.

Bugs related to configuration:

Fixed a bug where Condor daemons could crash if
COLLECTOR_HOST or NEGOTIATOR_HOST was defined to
be something bogus.

Fixed a potential crash in the condor_collector when
COLLECTOR_NAME was too long.

The default setting for POOL_HISTORY_DIR is no
longer SPOOL.
Using the spool directory would result in history files being
obliterated by condor_preen.

Fixed a bug which could result in a daemon crashing while it was
writing to its logfile.

Fixed a signal handling bug in the checkpoint server which could
cause the daemon to hang sometimes.

The Kerberos map file now tolerates spaces on either side of the
equals sign instead of generating a parse error.

The -analyze option to condor_q is only meaningful for certain
universes. condor_q now warns if the output might not be meaningful.

Java universe: when jar files are transferred to the execute
machine (with should_transfer_files or
transfer_input_files) the condor_starter will use the
local path (in the execute directory) for the jar files, instead of
the original path specified in the submit file.

Previously, if a scheduler universe job died with a signal, the
condor_schedd would write multiple (conflicting) events into the
UserLog file: a terminate event and an abort event.
Now, only the terminate event is written, not the abort event.

Fixed a minor bug where if the condor_schedd crashed or was
killed at just the wrong moment while a job was being removed
because the periodic_remove expression had evaluated to
TRUE, the job might have been successfully removed but the
RemoveReason attribute could have been lost.
Now, both actions are taken together atomically.
If a job is successfully removed, it will always have a
RemoveReason attribute.

Fixed a memory leak in the condor_collector.

Known Bugs:

None.

Version 6.6.5

Release Notes:

None.

New Features:

None.

Bugs Fixed:

Fixed a bug introduced in Condor version 6.6.2 that could cause
condor_dagman to segfault while parsing some DAG files, or
fail to recognize already-completed nodes in a rescue DAG.

Fixed a bug in condor_dagman, whereby it could fail to
automatically discover a Condor job's userlog file if the job's
submit file did not have whitespace surrounding the equal sign
on the log file line.

Fixed a bug in condor_submit that appears to have
affected only Mac OS X machines.
Previously, submit files that only defined a single job and used
queue without any numerical modifiers would result in an
error like this:

Now, condor_submit will properly process and submit the job from
job description files that contain a single queue statement
with no modifiers.

Fixed a bug in the AIX condor_starter that was causing the
starter to sometimes kill itself when the job completed. Because this
happened before the condor_starter reported the job completion back
to the condor_shadow, such a job would be restarted.

Fixed a few memory and registry handle leaks in the condor_schedd
and condor_startd. These leaks particularly affected Windows systems.

On Windows, Condor was known to have trouble accessing config files
with UNC paths (with appropriate permissions set). This has been fixed.

On Windows, condor_store_cred would fail if the account did not
have Log on Locally privileges, even if the account was allowed
to log in interactively. This has been fixed.

Fixed a bug on Windows that would cause the condor_schedd to
crash if D_FULLDEBUG was turned on, and the submitting user
account did not have Administrator access rights.

Known Bugs:

condor_dagman can fail to detect a job's progress if another
job in the DAG specifies the same underlying userlog file using
a different path or filename (e.g., log=foo and log=./foo) in
its submit file.

Version 6.6.4

Release Notes:

This version only contains platform-specific bug fixes.
Therefore, it was only released for the two affected platforms.

Bugs Fixed:

Fixed a major bug in the Windows NT/2000 port that caused the
Condor daemons to crash when attempting to authenticate.

Fixed the bug in Condor's file transfer mechanism for Mac OS X
that was introduced in version 6.6.3.

Known Bugs:

None.

Version 6.6.3

Release Notes:

The Globus universe support for versions of Globus prior to 2.2 (specifically, those using GRAM 1.5 or earlier) has been removed.

The negotiator no longer crashes when a grid site ClassAd sets WantAdRevaluate but does not contain an UpdateSequenceNumber.

Globus universe jobs were failing to go on hold when a $$() expression
could not be expanded.

On Windows, the system-wide TEMP variable is included in the
execute environment if it is not specified in the submit file.

Fixed a rarely occurring bug in which the child process forked by the
schedd would get stuck in an infinite loop when the user ran
``condor_submit -s''. This should also fix cases in which the child
process forked by the collector would sometimes get stuck in an
infinite loop when COLLECTOR_QUERY_WORKERS > 0 in the config file.

Version 6.6.2

Release Notes:

There will be another release, 6.6.3, within a few weeks. We decided to
release this version now because it adds the AIX platform and has some bug
fixes which we thought important enough for a release. However, if you are
not affected by the bugs fixed (see below) you may wish to wait for 6.6.3.

New Features:

Clipped support for AIX 5.2.
This means VANILLA universe only - no checkpointing or STANDARD universe.

The setting GRIDMANAGER_GLOBUS_COMMIT_TIMEOUT allows
configuring the two phase commit timeout in Globus. This maps to the
two_phase setting in the Globus RSL.

Added a new configuration variable,
DAGMAN_MAX_SUBMIT_ATTEMPTS, that controls how many
times in a row condor_ dagman will attempt to execute
condor_ submit for a given job before giving up. It cannot be
set to less than 1 attempt, or more than 10; if left undefined,
it defaults to 6.
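
As a rough illustration of these limits, here is a minimal Python sketch. It assumes that out-of-range values are clamped into the range 1 to 10, which the text above does not state; the function names are hypothetical and this is not condor_ dagman's implementation.

```python
DEFAULT_ATTEMPTS = 6          # default named in the text above
MIN_ATTEMPTS, MAX_ATTEMPTS = 1, 10

def effective_max_attempts(configured=None):
    """Return the attempt limit to use (assumption: out-of-range values are clamped)."""
    if configured is None:
        return DEFAULT_ATTEMPTS
    return max(MIN_ATTEMPTS, min(MAX_ATTEMPTS, configured))

def submit_with_retries(try_submit, max_attempts):
    """Call try_submit() up to max_attempts times in a row.

    Returns the number of the attempt that succeeded, or None if all failed.
    """
    for attempt in range(1, max_attempts + 1):
        if try_submit():
            return attempt
    return None
```

Here try_submit stands in for one invocation of condor_ submit and is expected to return True on success.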

Added a new tool condor_ updates_stats to dump out the update
statistics information from ClassAds in a human readable format.
Condor 6.6.1, by default, publishes ``update statistics'' into the
ClassAds as published by the condor_ collector. This program parses
this output and displays it to the user in a readable format.

Changed the default condor_ dagman behavior so that it doesn't
check for cycles at startup, only at runtime, since the former
could be expensive for large DAGs. Added a boolean
DAGMAN_STARTUP_CYCLE_DETECT config attribute to
re-enable cycle-detection at startup.

condor_ dagman now offers a configuration variable,
DAGMAN_MAX_SUBMITS_PER_INTERVAL, which controls how
many individual jobs condor_ dagman will submit in a row before
servicing other requests (such as a condor_ rm).
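
The idea of submitting in bounded batches can be sketched in Python as follows. This is an assumption-laden illustration, not condor_ dagman's code: submit_in_intervals and removal_requested are hypothetical names, and "servicing other requests" is reduced to checking a single removal flag between batches.

```python
def submit_in_intervals(ready_jobs, max_per_interval, removal_requested):
    """Submit jobs in batches, checking for a removal request between batches.

    removal_requested is a callable returning True once a removal
    (such as a condor_rm of the DAG) has been requested.
    Returns the list of jobs that were submitted.
    """
    submitted = []
    queue = list(ready_jobs)
    while queue:
        if removal_requested():
            break  # stop before submitting jobs that would only be removed
        batch, queue = queue[:max_per_interval], queue[max_per_interval:]
        submitted.extend(batch)  # stands in for running condor_submit per job
    return submitted
```

A smaller batch size lets a removal request be noticed sooner, at the cost of more frequent checks.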

The grid_monitor now automatically detects jobmanager scripts on the
remote gatekeeper. Previously it was limited to supporting the condor,
fork, lsf, pbs, and remote jobmanager scripts.

A new parameter, SEC_DEBUG_PRINT_KEYS, controls whether or not
the keys used for encryption get printed into the log.
The default is false.

Bugs Fixed:

Jobs that make use of Condor's file transfer mechanism were not
automatically authorized to read/write input/output files when
flocking to machines that did not happen to be in the
HOSTALLOW_WRITE list. This bug has existed since 6.3.

Eliminated a small chance that a grid_monitor log file or state file
might be reused. The unique identifying numbers are now unique across
the entire gridmanager, not each Globus resource.

Eliminated a race condition which might cause the grid monitor to
erroneously decide that the status file was broken when in fact it
was being uploaded and was empty.

The grid monitor now attempts to restart transfers in the event of
globus-url-copy hanging.

Removed some settings from the default configuration files
shipped with Condor that are no longer used in the code.

Fixed bugs in condor_ dagman parsing of submit files (to determine
node log files). Previously, a submit file line beginning with
"log" (e.g., "LogLock = True") would be interpreted as a log file
line. Also, if "log" was defined twice in the submit file,
condor_ dagman would incorrectly use the first definition, rather than
the last.
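
The corrected behavior can be sketched in Python. This is a simplified illustration under stated assumptions (keywords are matched case-insensitively and the value is everything after the equals sign); find_log_file is a hypothetical name and not the parser condor_ dagman uses.

```python
import re

# Match only the "log" keyword itself: "log", optional spaces, then "=".
# A line such as "LogLock = True" does not match because "Lock" follows "Log".
_LOG_LINE = re.compile(r"^\s*log\s*=\s*(.*?)\s*$", re.IGNORECASE)

def find_log_file(submit_text):
    """Return the log file named by the last "log = ..." line, or None."""
    log_file = None
    for line in submit_text.splitlines():
        match = _LOG_LINE.match(line)
        if match:
            log_file = match.group(1)  # later definitions override earlier ones
    return log_file

example = """executable = a.out
log = first.log
LogLock = True
log = second.log
queue
"""
print(find_log_file(example))
```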

Re-added PVM support for IRIX 6.5.

Fixed an indirect bug whereby condor_ dagman could fail with an
assertion error if it encounters both a terminate and an abort event in
the userlog for the same job; this can happen due to a bug in the
condor_ schedd, which is not yet fixed.

condor_ dagman now works correctly with nodes that have an initialdir
specified in the node submit file. (Previously, specifying
an initialdir only worked if the log file path was absolute.)

condor_ dagman now responds more quickly to a request to be
removed from the queue (via condor_ rm), even if it is in the
midst of submitting jobs. Previously, condor_ dagman would
finish submitting all ready jobs before responding to a removal
request, which could take a long time, and forced it to
immediately remove all the jobs it had just submitted
unnecessarily.

If a scheduler universe job terminates via a signal, the
condor_ schedd logs both a terminate event and an abort event
to the userlog.

Keyboard activity is not reported for pseudo-ttys on Mac OS X, only
for the physically connected keyboard.

Version 6.6.1

Release Notes:

condor_ analyze is not included in the downloads of Version 6.6.1.
The existing binary from Version 6.6.0 is likely to work on all platforms
for which it was released.

New Features:

Added full support (including standard universe jobs with
checkpointing and remote system calls) for Linux i386 RedHat 9
(using gcc/g++ version 3.2.2 and glibc version 2.3.2).

Added full support (including standard universe jobs with
checkpointing and remote system calls) for Linux i386 RedHat 8
(using gcc/g++ version 3.2 and glibc version 2.2.93).

The time it takes condor_ dagman to submit jobs has been
reduced slightly to improve the startup time of large DAGs.

In order to help reduce load on the condor_ schedd when
condor_ dagman is submitting jobs, there is a new config
variable, DAGMAN_SUBMIT_DELAY, to specify the number
of seconds condor_ dagman will sleep before submitting each
job.

Enabled the ``update statistics'' in the condor_ collector by
default in both the executable and in the default configuration.

Command-line arguments to condor_ dagman are now handled
case-insensitively.

Added support for Condor-G and strong authentication to Condor
for IRIX 6.5, but removed support for checkpointing and remote
system calls.
We plan to add support in Condor for IRIX's kernel-level
checkpointing in a future release.

Added a -p option to condor_ store_cred so that users
can now specify the password on the command line instead of getting
prompted for it.

The gahp_server helper process for Condor-G includes patches from
the LHC Computing Grid Project to increase data transfer performance of
the Condor-G client. Previous versions of Condor-G could bog down in
accepting new transfer requests, producing a variety of errors.

Added a new configuration setting,
SUBMIT_SEND_RESCHEDULE, which controls whether or not
condor_ submit should automatically send a condor_ reschedule
command when it is done.
Previously, condor_ submit would always send this reschedule so
that the condor_ schedd knew to start trying to find matches for
the new jobs.
However, for submit machines that are managing a huge number of jobs
(thousands or tens of thousands), this step would hurt performance
in such a way that it became an obstacle to scalability.
In this case, an administrator can set
SUBMIT_SEND_RESCHEDULE to FALSE so that this extra
step is not performed; the condor_ schedd will then try to find
matches whenever the periodic timer in the condor_ negotiator
(NEGOTIATOR_INTERVAL) goes off.

Pool administrators can now specify the length of time before
the condor_ starter sends its initial update to the
condor_ shadow by defining
STARTER_INITIAL_UPDATE_INTERVAL.
The default is 8 seconds.
This setting would not normally need changing except to fine-tune a
heavily loaded system.

Administrators can now specify the default session duration for
each Condor subsystem.
This allows for fine tuning the image size of running Condor daemons
if the memory footprint is a concern.
The default for tools is 1 minute, the default for condor_ submit
is one hour, and the default for daemons is 100 days.
This does not mean that tools cannot run more than one minute or
submit cannot run for more than an hour; it only affects memory
usage.

Added new configuration setting
GRID_MONITOR_HEARTBEAT_TIMEOUT.
If this many
seconds pass without hearing from the grid_monitor, it is
assumed to be dead. Defaults to 300 (5 minutes). Increasing
this number will improve the ability of the grid_monitor to
survive in the face of transient problems but will also
increase the time before Condor notices a problem. Prior to
this change the gridmanager always waited 5 minutes, and the user
could not change the setting.

Added new configuration setting
GRID_MONITOR_RETRY_DURATION.
If something goes wrong
with the grid_monitor at a particular site (like
GRID_MONITOR_HEARTBEAT_TIMEOUT expiring), it will be retried
for this many seconds. Defaults to 900 (15 minutes). If we
can't successfully get it going again the grid monitor will be
disabled for that site until 60 minutes have passed. Prior to
this change the condor_gridmanager waited 60 minutes after any
failure.

Bugs Fixed:

Fixed bugs related to network communication and timeouts that
impact scalability in Condor:

Fixed a bug inside Condor's network communication layer that
could result in Condor daemons blocking trying to read more data
after a socket had already been closed.

Fixed a condor_ negotiator bug that could, in certain rare
circumstances, cause a condor_ schedd to hang for five minutes
while trying to communicate with it.

Fixed a bug in which TCP connections would re-authenticate
needlessly when Condor's strong authentication was enabled.
This was not harmful but incurred a bit of overhead, especially
when using Kerberos authentication.

Fixed bugs related to network security sessions which were
getting cleared out.
If the timing was unfortunate, this could cause some jobs to fail
immediately after completion.
So, Condor no longer clears out security sessions periodically (it
used to happen every 8 hours) nor does it do so when a daemon
receives a condor_ reconfig command.

Fixed a bug in the standard universe where C++ code that threw an
exception would cause the executable to abort instead of having the
exception delivered. This bug affects Condor version 6.6.0 for
Redhat 7.x.

Fixed a condor_ shadow bug that could result in a fatal error
if the following 3 conditions were met: (1) the job enables Condor's
file transfer mechanism, (2) the job wants Condor to automatically
figure out what files to transfer back (the default), and (3) the
job does not specify a userlog.

Fixed bug whereby condor_ dagman, if removed from the queue via
condor_ rm, could fail to remove all of its submitted jobs if
any of their submit events had not yet appeared in the userlog.

Fixed a few bugs in condor_ preen:

It will no longer potentially remove files related to a valid
Computing on Demand (COD) claim on an otherwise idle machine.

condor_ preen will no longer keep reporting that it had
successfully removed a directory which was in fact failing to be
removed.

Fixed the faulty argument parsing in condor_ rm,
condor_ release, and condor_ hold.
Previously, you could accidentally type condor_rm -analyze, and it
would remove all of your jobs.
Now it gives an error.

On Windows, when you type a command like
condor_reconfig.exe instead of condor_reconfig, you no
longer get an error.

Fixed a bug on Windows that would cause ``GetCursorPos() failed''
to appear repeatedly in the StartLog. The startd now uses a different
function to track mouse activity that does not have a tendency to fail.

Fixed a bug on Windows that would prevent some condor_ shadow
daemons from obtaining a lock to their log file under heavy load, and
thus causing them to EXCEPT().

Fixed a bug on Windows where file transfers would incorrectly fail
because of bad permissions when using domain accounts with nested groups,
or when UNC paths were used.

Fixed the bug where the condor_ starter would fail to transfer
back core files created by Vanilla, Java and MPI universe jobs.
This bug was introduced in Condor version 6.5.2.
Now, Condor correctly transfers back any core files created by
faulty user jobs in any job universe.

In some circumstances, condor_ history would fail to read
information about some jobs, and would report errors. In particular,
when jobs had large environments, it would fail. This has been
corrected.

Fixed a rare bug affecting condor_ dagman when job-throttling
was enabled: if condor_ dagman was removed from the queue
together with some of its own jobs (e.g., via condor_rm -a),
it would quickly submit new jobs to replace them before
recognizing that it needs to exit. It now shuts down
immediately without submitting and then removing these
unnecessary jobs.

Fixed a potential security problem that was introduced in Condor
version 6.5.5 when the REQUIRE_LOCAL_CONFIG_FILE
configuration setting was added.
This setting used to default to FALSE if it was not defined in the
configuration files.
It now defaults to TRUE.
If administrators define local configuration files for the machines
in their pool, it should be a fatal error if those files don't exist
unless the administrators actively disable this check by defining
REQUIRE_LOCAL_CONFIG_FILE to be FALSE.

Fixed a bug on Windows that would cause the condor_ startd to
EXCEPT() if the condor_ starter exited and left orphaned processes to
be cleaned up. This bug first appeared in 6.5.0.

Fixed a bug on Windows that would cause graceful shutdowns
(such as when condor_vacate is called) to fail to
complete.

The gahp_server helper program, which provides Globus services
to Condor-G, was always dynamically linked, even in statically-linked
releases.
The statically linked distributions of Condor now include a static
gahp_server.

Fixed the messages written to the Condor daemon log files in
various error conditions to be more informative and clear:

The error message in the SchedLog that indicates that swap
space has been depleted has been rephrased so it appears to be
significant.

Certain serious error messages are now being written to the
D_ ALWAYS debug level that used to only appear if other debug
levels were enabled.

Clarified log messages related to errors looking up user
information in the passwd database on UNIX and for creating
dynamic users on Windows.

Log messages related to keep-alives sent between the
condor_ schedd and condor_ startd (written to D_ PROTOCOL)
now include the ClaimId on both sides, so that it is easier
to find potential problems and figure out which keep-alive
messages correspond to what resources.

Internal timeouts in the grid_monitor have been increased,
increasing robustness during transient errors.

Known Bugs:

Submission of MPI jobs from a Unix machine to run on Windows
machines (or vice versa) fails for machine_count > 1. This is
not a new bug. Cross-platform submission of MPI jobs between
Unix and Windows has always had this problem.

A multiple install of Condor's standard universe support libraries
onto an NFS server for the purposes of having a heterogeneous mix of Linux
distribution revisions all being able to utilize the same condor_ compile
does not function correctly if Redhat 9 is one of the distributions.

Version 6.6.0

New Features:

The condor_ dagman debugging log now reports the total number
of ``Un-Ready'' Nodes (i.e. those waiting for unfinished
dependencies) in its periodic summaries. In the past, the
omission of this state led to confusion because the total of all
reported job states didn't always match the total number of jobs
in the DAG.

Most Condor commands (condor_ on, condor_ off,
condor_ restart, condor_ reconfig, condor_ vacate,
condor_ checkpoint, condor_ reschedule) now support a -all
command-line option to specify which daemons to act on.
This is more efficient and much easier to use than previous methods
for accomplishing the same effect.
Using -all with condor_ off correctly leaves the existing
condor_ master processes running on each host, so that a subsequent
condor_ on would work.
See section 3.10.1 for more details on
proper use of -all with condor_ off and condor_ on.

Bugs Fixed:

Fixed a bug under Solaris 8 with Update 6+, and Solaris 9 where
Condor would incorrectly report the console and mouse idle times as zero.

The standard-universe fetch_files feature was not cleaning up
temporary files on the execution machine.

In rare circumstances, a Linux kernel bug results in conflicting
information about system boot time (/proc/stat and
/proc/uptime).
Specifically, the "btime" field in /proc/stat suddenly jumps to
the present moment and then stays at that value. This
was resulting in incorrect estimation of process ages, which caused
Condor's estimation of CondorLoadAvg to be completely wrong. A more
robust heuristic is now being used.

A long configuration line with continuation lines could cause the
config file parser to not properly skip the leading whitespace from
the continued lines. This has been corrected.
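
The intended handling of continuation lines can be sketched in Python. This is a simplified illustration, assuming a trailing backslash marks a continued line; join_continuations is a hypothetical name and the sketch is not the actual config file parser.

```python
def join_continuations(lines):
    """Join backslash-continued lines into single logical lines.

    Leading whitespace on each continued line is skipped before it is
    appended to the text collected so far.
    """
    logical = []
    pending = None  # text collected so far for an unfinished logical line
    for raw in lines:
        line = raw.rstrip("\n")
        if pending is not None:
            line = pending + line.lstrip()
            pending = None
        if line.endswith("\\"):
            pending = line[:-1]  # drop the backslash and keep collecting
        else:
            logical.append(line)
    if pending is not None:
        logical.append(pending)  # file ended on a continuation
    return logical

config = ["A = one, \\", "    two, \\", "\tthree", "B = x"]
print(join_continuations(config))
```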

The Grid Monitor now will automatically probe for and work with
``unknown'' batch systems.

Fixed a bug where under certain circumstances condor_ dagman
would fail to detect an unsuccessful invocation of
condor_ submit, and would instead report the job as
successfully submitted with job id 0.0.

Fixed a bug which was causing problems when a periodic_remove
expression for a scheduler universe job evaluates to true. Under
these conditions, the schedd did not log the job termination to the
job log. Additionally, the schedd would exit with an error status.

Fixed a recently-introduced condor_ dagman bug where the number
of node retries (specified with the RETRY keyword) wasn't being
updated after some failures; instead, the node would be allowed
to retry indefinitely if it kept failing.

Fixed a recently-introduced bug where shutting down the
condor_ schedd caused condor_ dagman to remove all its jobs
from the queue and write a rescue file, rather than simply
exiting so that it could recover automatically upon restart.

Whenever condor_ reconfig was used to re-configure multiple
daemons which included the condor_ collector for a pool, the
command would start to fail after the condor_ collector was
reconfigured due to problems with security sessions in Condor's
strong authentication code.
This situation no longer causes problems for the condor_ reconfig
tool, and it can properly re-configure multiple daemons at once,
even if one of them is the condor_ collector for a pool.

Most Condor commands (condor_ on, condor_ off,
condor_ restart, condor_ reconfig, condor_ vacate,
condor_ checkpoint, condor_ reschedule) now check to make sure
they are not sending a duplicate command if the user specifies the
same target machine or daemon twice. For example:

condor_reconfig hostname1 hostname2 hostname1

will only send a single reconfig command to hostname1.
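
A minimal Python sketch of this kind of duplicate check follows. It assumes targets are compared as exact strings, which may differ from how the Condor tools compare hostnames; unique_targets is a hypothetical name.

```python
def unique_targets(targets):
    """Return the targets with duplicates removed, keeping first-seen order."""
    # dict preserves insertion order, so each target is kept once, in order.
    return list(dict.fromkeys(targets))

print(unique_targets(["hostname1", "hostname2", "hostname1"]))
```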

Fixed a bug in the HPUX version of Condor which was causing the
startd to occasionally abort operation. This has been in Condor since
version 6.1.1.

The Condor daemons will no longer overwhelm NIS servers
when large numbers of daemons are running. Condor now caches
uid and group information internally, and refreshes the
cache entries on a specified interval (which defaults to 5
minutes). See section 3.3.3 for more details.
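
The general technique, caching lookups and refreshing entries after a fixed interval, can be sketched in Python. This is an illustration of the idea only: the class name TTLCache and its parameters are hypothetical, and Condor's internal cache is not necessarily structured this way.

```python
import time

class TTLCache:
    """Cache lookup results and refresh each entry after ttl_seconds."""

    def __init__(self, lookup, ttl_seconds=300, clock=time.monotonic):
        self._lookup = lookup      # expensive call, e.g. a query to a name service
        self._ttl = ttl_seconds    # 300 seconds matches the 5 minute default above
        self._clock = clock        # injectable so the cache can be tested
        self._entries = {}         # key -> (value, time stored)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]        # still fresh: no lookup needed
        value = self._lookup(key)  # missing or stale: look up again
        self._entries[key] = (value, now)
        return value
```

With such a cache, many requests for the same user within the interval result in a single lookup against the name service.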

Known Bugs:

The condor_ preen program does not know about Computing on
Demand (COD) claims.
If there are no regular Condor jobs on a given machine, but there
are COD claims, and condor_ preen is spawned, it will remove files
related to the COD claims.
In version 6.6.0, sites using COD are encouraged to disable
condor_ preen by commenting out the PREEN setting in the
config files.
This bug has been fixed in Condor version 6.6.1.

Normally, if a user's job crashes and creates a core file on a
remote execution machine, the condor_ starter will automatically
transfer the core file back to the submit machine.
However, beginning in Condor version 6.5.2, if a vanilla, Java, or
MPI universe job creates a core file, the condor_ starter will fail
to transfer it back.
This bug will be fixed in version 6.6.1.

There are a few bugs related to Condor tools failing to
correctly locate the condor_ negotiator daemon.
These bugs usually show up if a site is using non-standard ports for
the central manager daemon.
However, some of the bugs show up regardless of whether the
negotiator is listening on the standard port.

condor_config_val -negotiator queries the
condor_ collector, instead of querying the
condor_ negotiator like it should.

Using the -pool option to condor_q -analyze
will not work.
The tool will fail to find and query the condor_ negotiator
for user priorities which it needs to determine why jobs may
not be running.

The Condor tools that support either the -negotiator
or -collector options do not work when a user also
specifies the -pool option to define a remote pool to
communicate with.
The tools print a somewhat confusing message in this case.

Most Condor tools that support -pool hostname will
also recognize -pool hostname:port if the remote
condor_ collector is listening on a non-standard port.
However, the condor_ findhost tool does not work if given a
-pool option that includes a port.
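
For illustration, splitting a hostname:port value can be sketched in Python as below. The function name parse_pool is hypothetical, the default port of 9618 is an assumption about the collector's standard port, and the sketch does not handle IPv6 literals; it is not the parsing code used by the Condor tools.

```python
def parse_pool(value, default_port=9618):
    """Split a -pool argument into (hostname, port).

    default_port is an assumed standard collector port, used when the
    argument does not include one.
    """
    host, sep, port = value.rpartition(":")
    if not sep:
        return value, default_port  # no port given
    return host, int(port)

print(parse_pool("cm.example.org"))
print(parse_pool("cm.example.org:9700"))
```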