8.4 Stable Release Series 6.8
This is a stable release series of Condor.
It is based on the 6.7 development series.
All new features added or bugs fixed in the 6.7 series are available
in the 6.8 series.
As usual, only bug fixes (and potentially, ports to new platforms)
will be provided in future 6.8.x releases.
New features will be added in the forthcoming 6.9.x development series.
The 6.8.x series supports a different set of platforms than 6.6.x.
Please see the updated table of available platforms in
section 1.5.
The details of each version are described below.
Version 6.8.7
Release Notes:
New Features:
Bugs Fixed:
Known Bugs:
Additions and Changes to the Manual:
Version 6.8.6
Release Notes:
- Condor is now officially supported on Microsoft Vista.
- Condor is now officially supported on MacOS running natively on Intel CPUs, and Condor binaries for Intel MacOS are now available for download.
- Condor now uses Globus 4.0.5 for GSI, pre-WS GRAM, and GridFTP support.
New Features:
- On all UNIX ports of Condor except MacOSX, AIX, and Tru64, separate
debug symbol files are now supported. This allows meaningful debugging of
core files in addition to attaching to stripped executables during runtime.
- condor_dagman now prints reports of pending nodes to the
dagman.out file if it has been waiting more than
DAGMAN_PENDING_REPORT_INTERVAL seconds without seeing
any node job events. (This is to help diagnose the problem if
condor_dagman gets "stuck".)
- Optimized the submission of grid-type gt4 grid universe jobs to the
remote resource. Submission now takes one operation instead of three.
- The condor_shadow will obtain a session key to the condor_schedd
at the start of the job instead of potentially waiting until the job
completes. This reduces the chances of re-running already completed jobs
in the event of authentication failures (for instance, if a Kerberos KDC is
down or overloaded).
Bugs Fixed:
Known Bugs:
- Grid universe type GT4 (web services GRAM) does not work properly on Itanium-based machines, because it requires Java 1.5, which is not available on the Itanium (ia_64).
Additions and Changes to the Manual:
- Several updates to the DAGMan documentation (section 2.11).
Version 6.8.5
Release Notes:
- This release is not fully compatible with the 6.6 series (or
anything earlier than that). Specifically, a 6.6 schedd will be rejected when
it tries to contact a 6.8.5 startd to make use of a claim.
- The Globus libraries used by Condor now include the following advisory
packages:
- globus_gss_assist-3.23
- globus_xio-0.35
- globus_gram_protocol-6.5
- globus_gass_transfer-2.12
See http://www.globus.org/toolkit/advisories.html
for details on the
bugs fixed by these updated packages.
The patch given in Globus Bugzilla 5091
(http://bugzilla.mcs.anl.gov/globus/show_bug.cgi?id=5091) is also
included.
New Features:
- A clipped port to x86 Debian 4.0 has been added.
- The functionality embodied in condor_q -better-analyze is now
available for X86_64 native ports of Condor.
- We now supply distinct, native ports for Mac OS X 10.3 and 10.4.
- There is a new configuration macro
COLLECTOR_REQUIREMENTS that may be used to filter out
unwanted ClassAd updates. For more information, see
section 3.3.17.
- Added a -f option to condor_store_cred, which generates
a pool password file that can be used for the PASSWORD authentication
method on Unix Condor installations.
Bugs Fixed:
- The configuration file entry HOSTALLOW_DAEMON is now consulted in
addition to ALLOW_DAEMON.
- Fixed a bug where, under certain conditions, Condor's file logging code
would cause a segmentation fault.
- Removed periodic re-indexing of the quill history_vertical table.
This should not be needed with the current schema, and it should speed
up database re-indexing operations.
- Fixed a bug that would cause the dedicated scheduler to
crash if the condor_schedd was suspended or blocked
for more than approximately 10 minutes.
The most likely cause of a suspension
is a condor_schedd executable
mounted from a remote NFS file system.
- Fixed a bug where if -lc was specified multiple times for
the compiler when using condor_compile (some tools like pgf90
do this), condor_compile would fail to link the application and emit
a multiply defined symbol error for many symbols.
- Fixed a bug where Condor erroneously indicated that a scheduler
universe job's executable was missing or not executable.
This occurred if the scheduler
universe job had been submitted with
CopyToSpool = false in the submit
description file, and the user had a umask which prevented the user
named condor from following the search path to the
user-owned executable.
- Fixed a bug that could cause the condor_schedd to crash if it
received too many matches in one negotiation cycle (more than 1000 on a
Linux platform).
- Fixed a bug in which condor_history did not honor the -format flag
properly when Quill is in use.
- Fixed a bug in which a java property that includes
surrounding double quote marks
caused the detection of a java virtual machine to go awry.
The fix, which may change in the future, changes any extra double quotes
within a property value to single quotes.
- Fixed a bug in which the condor_quill daemon
crashed occasionally when the Postgres database
server was unavailable.
- The Solaris 9 Condor package can be used under Solaris 10 again.
Changes in 6.7.20 broke this compatibility.
- condor_dagman now does a better job, especially in recovery mode,
of detecting potentially incorrect submit events,
that is, events whose Condor IDs do not match what is expected.
- condor_dagman now truncates existing node job user log files
to zero length, rather than deleting the log files. This prevents breaking the
link if a user log file is set up as a link.
- When starting a GridFTP server to handle file transfers for gt4
grid jobs, the condor_gridmanager now properly sets the
GLOBUS_TCP_PORT_RANGE and GLOBUS_TCP_SOURCE_RANGE environment
variables if appropriate.
- Fixed a bug that could cause a security session to get deleted
by the server (for example, the condor_schedd) before the client
(for example, the condor_shadow) was done using it.
This bug could be observed as a
communication failure the next time the client tried to connect to
the server. In some cases, this caused jobs to be re-queued to be run
again, because the final update of the job queue failed.
- If a grid job becomes held while it's still submitted to the remote
resource and is then removed, the condor_gridmanager will now attempt
to remove the job from the remote resource before letting it leave the
local job queue.
- Fixed a bug in the condor_c-gahp that caused it to not use the
user's credential for authentication with the remote schedd on some
connections.
- The condor_c-gahp now properly lists all of the commands it
supports in response to the COMMANDS command.
- Fixed a bug in how the condor_c-gahp updates configuration parameter
GSI_DAEMON_NAME to include the job's credential if it has one.
- Removed the 5096-character restriction on the length of DAG
macro values (and names) in condor_dagman.
- Condor-G will now notice when jobs are missing from the status
reports sent by the Grid Monitor.
Jobs can disappear for short periods of time under normal circumstances,
but a prolonged absence is often a sign of problems on the remote machine.
The amount of time that a job can go missing from the Grid Monitor
status reports before the condor_gridmanager reacts can be set by the
new configuration parameter GRID_MONITOR_NO_STATUS_TIMEOUT.
The default is 15 minutes.
- condor_q -analyze will now print a warning if a job being analyzed
is already completed or if a grid universe job being analyzed has already
been matched.
- In condor_shadow, when forwarding an updated X509 proxy to an
executing job, the logic for whether to delegate or copy the proxy
(determined by configuration parameter
DELEGATE_JOB_GSI_CREDENTIALS) was reversed.
The authentication logic for this operation was also incorrect, causing the
operation to fail in many instances.
- Made a small improvement to the reliability of Condor's process
ancestry tracking under Linux. However, jobs that create children
with more than 4096 bytes of environment are still problematic, due to
a Linux kernel limitation that prevents reading more than 4k from
/proc/<pid>/environ. The only truly reliable way to ensure that
Condor is aware of all processes spawned by a Unix job is to use
VMx_USER.
- condor_glidein option -run_here no longer fails when the
current working directory is not in PATH.
- condor_glidein option -runtime would cause runtime errors
at startup under some batch systems. The problematic parentheses characters
are no longer generated as part of the environment value that is set by
this option.
- On rare occasions, the condor_startd would compute a negative MIPS
rating when performing benchmarks on the machine. This caused the
Mips attribute to disappear from the machine ad. Now, the
condor_startd ignores these bogus results. The cause of the negative MIPS
ratings is still unknown.
- Fixed a bug that caused condor_dagman to hang if it processed,
in recovery mode, a node for which all submit attempts failed and a
POST script was run.
- Fixed a bug that would cause the condor_negotiator's memory
usage to grow over time when job or machine ClassAds made use of
ClassAd functions that do regular expression matching operations.
- Fixed a bug that was preventing Condor daemons from caching DNS
information for hosts authenticated via HOSTALLOW settings (i.e. no
strong authentication). The collector, in particular, should spend
much less time on IP to hostname lookups.
- When a job has an X509 proxy file (as indicated by the
X509UserProxy attribute in the job ad), the condor_starter
now always sets X509_USER_PROXY in the job's environment
to point to a copy of that proxy file.
- Fixed several bugs that could cause the condor_c-gahp to time out
when talking to the condor_schedd and falsely report that commands
completed successfully. A common result is grid type condor grid universe
jobs being placed on hold because the condor_gridmanager mistakenly
thinks they disappeared from the remote condor_schedd's queue.
- Fixed a bug in Stork which was causing it to write the output and
error log files as the wrong user, and read the input file as the wrong
user.
- Fixed a bug in Stork which was causing it to kill hung jobs
as the wrong user.
- Fixed some possible static buffer overflows related to the
transferring of a job's data files.
- Jobs with standard output and error going to the same file should
not lose data in the common case.
- Heavily loaded Condor daemons (e.g. condor_schedd) had a
problem when they got behind processing the exit status of child
processes (e.g. condor_shadow). The problem was that the daemon would
continue to expect status updates from its child, even after the child
had exited, and when the daemon decided that the lack of status
updates meant that the child was hung, the daemon would try to kill
any process that happened to have the same pid as the child which had
already exited. In the case of the schedd, this would also result in
the job run attempt being marked as a failure and the job would remain
in the queue to run again. Condor no longer activates the ``hung child''
procedure for jobs which have exited but which have not yet had their
exit status processed internally by the daemon.
- For grid-type condor jobs, made the condor_gridmanager more
tolerant of unexpected responses from the remote condor_schedd.
- On HPUX and AIX, fixed a bug that could cause Condor's process
family tracking logic to lose track of processes.
- Fixed a memory error that would cause condor_q to sometimes
crash when using Quill.
- Fixed a problem where the Windows condor_credd would be
inaccessible to other Condor components if CREDD_HOST were
set to a DNS alias and not the canonical DNS name.
- Fixed a bug in the condor_shadow on Windows where it would
fail to correctly perform the PASSWORD authentication method.
- The Windows condor_credd now uses the configuration parameter
CREDD_HOST, if defined, to set its name when advertising itself
to the condor_collector. Thus, if CREDD_HOST is set to something
other than the condor_credd's hostname, clients can still locate the
daemon.
- Fixed a bug in the condor_c-gahp that could cause it to not
perform hold, release, or remove commands on jobs in the remote
condor_schedd.
- Fixed the default value of configuration parameter
STARTD_AD_REEVAL_EXPR.
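The log truncation change listed above (condor_dagman now empties existing node job user log files instead of deleting them) can be illustrated with a small sketch. This is not condor_dagman's code; it only shows, assuming the link in question is a symbolic link and using hypothetical file names, why truncating in place keeps the link intact while deleting removes it.

```python
import os
import tempfile


def truncate_log(path):
    """Empty a log file in place; a symbolic link at `path` stays intact."""
    # Opening for writing follows the link and truncates its target.
    with open(path, "w"):
        pass


def delete_log(path):
    """Remove the directory entry; if `path` is a link, the link is lost."""
    os.remove(path)


def demo():
    """Return (link_kept_after_truncate, link_gone_after_delete)."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "real.log")  # hypothetical names
        link = os.path.join(tmp, "node.log")
        with open(target, "w") as f:
            f.write("old events\n")
        os.symlink(target, link)

        truncate_log(link)
        kept = os.path.islink(link) and os.path.getsize(target) == 0

        delete_log(link)
        gone = not os.path.lexists(link) and os.path.exists(target)
        return kept, gone
```

After deletion, a new log written at the link's old path would be an ordinary file, and the location the user originally linked to would no longer receive events.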
Known Bugs:
- condor_dagman incorrectly parses DAG file VARS lines specifying
more than one macroname/value pair. You can work around this problem
by specifying each macroname/value pair on a separate line. (This bug
was introduced in version 6.8.5.)
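The workaround for the VARS parsing bug can be applied mechanically. The following is a minimal sketch, not a Condor tool: the function name is hypothetical, and it assumes VARS lines have the form VARS nodename macroname="value" ..., with the keyword matched case-insensitively and values optionally containing backslash-escaped characters.

```python
import re

# One macroname="value" pair; the value may contain backslash-escaped characters.
PAIR = re.compile(r'(\w+)\s*=\s*"((?:[^"\\]|\\.)*)"')


def split_vars_line(line):
    """Return one VARS line per macroname/value pair.

    `line` is a single DAG file line without its trailing newline.
    Lines that are not VARS lines, or that hold at most one pair,
    are returned unchanged.
    """
    parts = line.split(None, 2)
    if len(parts) < 3 or parts[0].upper() != "VARS":
        return [line]
    node, rest = parts[1], parts[2]
    pairs = PAIR.findall(rest)
    if len(pairs) <= 1:
        return [line]
    return ['VARS %s %s="%s"' % (node, name, value) for name, value in pairs]
```

For example, a line assigning both state and city to NodeA becomes two VARS lines for NodeA, one per macro.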
Version 6.8.4
Release Notes:
New Features:
Bugs Fixed:
- Fixed a bug in condor_q that only happened when running
with a Quill database and using the long (-l) option. The bug was
introduced in 6.8.3. The bug truncated the output of condor_q, and
only displayed some of the job attributes.
- Fixed a bug in condor_submit that caused standard universe jobs
to be unable to open their standard output or standard error, if
should_transfer_files is YES or
IF_NEEDED in the submit description file.
- Fixed a bug in condor_glidein that could cause it to request the
queue unknown when submitting its setup job to GRAM, leading to
failures.
- The OnExitRemove expression generated for DAGMan by
condor_submit_dag evaluated to UNDEFINED for some values of
ExitCode, causing condor_dagman to go on hold.
- Fixed a bug in which garbage values (random bits from memory)
were sometimes
written to the pool history file in the field representing the
backfill state.
- condor_submit_dag now generates a submit file
(.condor.sub) for condor_dagman that sends stdout and
stderr to separate files. This has always been recommended,
and recent versions of Condor cause stdout and stderr to
overwrite each other if they are directed to the same file.
- Fixed several bugs for grid type nordugrid jobs.
The condor_gridmanager would create an invalid RSL for these jobs
and save their output to the wrong location in some cases.
- condor_glidein now properly escapes glidein tarball URLs that
contain characters that have special meaning to GRAM RSL. It also turns
on TCP updates to the condor_collector,
if they are enabled on the submit machine.
- When using the submit file option getenv=true,
environment
variables containing a newline in their value are no longer inserted
into the job's environment. The condor_schedd daemon
does not allow newlines
within ClassAd values, so the attempt to insert such values resulted
in failure of job submission and caused the condor_schedd daemon
to abort.
- Fixed a bug that caused condor_dagman to hang if a node
with a POST script and retries initially runs but fails, and then
has all condor_submit attempts fail on the retry.
- Fixed a problem in the Windows installer where the
DAEMON_LIST parameter would be incorrectly set if the ``Join
an existing Condor pool'' option was selected or the ``Submit jobs to
Condor pool'' option was unchecked. In the first case, a
condor_collector and condor_negotiator would incorrectly be run on
the machine. In the second case, a condor_schedd would incorrectly
be run. The problem exists in all previous 6.8 and 6.9 series
releases.
- Fixed a bug in the handling of local universe jobs
for a very busy condor_schedd daemon.
When a local universe job completed, the condor_starter might not
be able to connect to the condor_schedd daemon to update final information
about the job, such as the exit status.
Under this circumstance,
the condor_starter would hang indefinitely.
The bug is fixed by having the condor_starter attempt
to retry a few times (with a delay in between each attempt) before
exiting with a fatal error.
The fatal error causes the job to restart.
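The getenv=true change in the list above amounts to skipping any environment variable whose value contains a newline. A minimal sketch of that rule follows; the function name and return shape are illustrative assumptions, not taken from condor_submit.

```python
def filter_environment(env):
    """Split `env` into (kept, skipped_names).

    Variables whose value contains a newline are skipped, mirroring the
    rule described above; everything else is kept unchanged.
    """
    kept = {}
    skipped = []
    for name, value in env.items():
        if "\n" in value:
            skipped.append(name)
        else:
            kept[name] = value
    return kept, skipped
```

Calling filter_environment(dict(os.environ)) before building a job's environment would show which variables such a rule drops.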
Known Bugs:
- Setting DAGMAN_DELETE_OLD_LOGS to false can cause
condor_dagman to have problems (including hanging), especially
when running a rescue DAG. If you want to keep your old user log
files, the best thing to do is to rename them before each
condor_dagman run. If you do run with
DAGMAN_DELETE_OLD_LOGS set to false, check your
dagman.out file for error messages about submit event
Condor IDs not matching the expected value. If you get such an
error, you will probably have to condor_rm the condor_dagman
job, remove or rename the old user log file(s) and run the rescue DAG.
(Note: this bug also applies to earlier versions of condor_dagman.)
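The advice to rename old user log files before each run can be scripted. The helper below is a hypothetical sketch (its name and the timestamp suffix format are assumptions); it moves an existing log aside so that the next condor_dagman run starts with no old log in place.

```python
import os
import time


def preserve_old_log(log_path, suffix=None):
    """Rename log_path to log_path.<suffix>.

    Returns the new name, or None if log_path does not exist.
    The suffix defaults to a timestamp such as 20070101-120000.
    """
    if not os.path.exists(log_path):
        return None
    if suffix is None:
        suffix = time.strftime("%Y%m%d-%H%M%S")
    new_path = "%s.%s" % (log_path, suffix)
    # os.replace silently overwrites an existing file with the new name.
    os.replace(log_path, new_path)
    return new_path
```

Run this for each node job's user log before invoking condor_submit_dag.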
Version 6.8.3
Release Notes:
- In this release,
the command condor_q -long does not work when querying
the Quill database.
Instead, use the command
condor_q -direct quilld -long,
or use a previous version of condor_q.
- Performed a security audit of all places where Condor opens files,
to make certain files are opened with a reasonable permission mode
and with the
O_EXCL flag whenever possible.
New Features:
- Added the JOB_INHERITS_STARTER_ENVIRONMENT configuration
macro. When set to True, jobs inherit all environment variables from
the condor_starter. This is useful for glidein jobs that need to access
environment variables from the batch system running the glidein daemons.
The default for this configuration macro is False, so existing behavior
is unchanged. This feature does not apply to standard and pvm universe
jobs.
- Changed the default UDP receive buffer for the
condor_collector from 1M to 10M. This value can be configured with
the (existing) COLLECTOR_SOCKET_BUFSIZE macro.
NOTE: For some Linux distributions, it may be necessary to raise the
kernel's limit on socket receive buffer size above its default; this limit is
/proc/sys/net/core/rmem_max. You can see the values that the
condor_collector actually used by enabling D_FULLDEBUG for the
condor_collector and looking at the log line that looks like this:
Reset OS socket buffer size to 2048k (UDP), 255k (TCP).
- Added a new configuration macro to control the size of the
TCP send buffers for the condor_collector. This size used to
be set by COLLECTOR_SOCKET_BUFSIZE. The new macro is
COLLECTOR_TCP_SOCKET_BUFSIZE, and it defaults to 128K.
- Added a clipped port for SuSE Linux Enterprise Server 9 running on the
PowerPC architecture. Note the known bug below.
- The condor_schedd now maintains a birth date for the job queue.
Nothing in Condor currently uses this feature, but future versions of condor_quill may require it.
- There is a new configuration file macro
RANDOM_INTEGER(min,max[,step]). It produces a
pseudo-random integer between
min and max,
inclusive, at configuration time.
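The behavior of the RANDOM_INTEGER macro listed above can be modeled in a few lines. This is a sketch only: the entry above does not define step, so the sketch assumes it is a stride starting at min (candidates min, min+step, ... up to max) with a default of 1. The function name is hypothetical.

```python
import random


def random_integer(lo, hi, step=1, rng=random):
    """Pick a pseudo-random integer from lo, lo+step, ... not exceeding hi.

    Assumes `step` is a stride counted from `lo`; both ends are inclusive.
    """
    if step < 1 or hi < lo:
        raise ValueError("need step >= 1 and lo <= hi")
    # randrange excludes its stop value, so use hi + 1 to include hi.
    return rng.randrange(lo, hi + 1, step)
```

Under this reading, random_integer(0, 8, 2) yields one of 0, 2, 4, 6, or 8. Because the macro is evaluated at configuration time, one use is to give each machine a slightly different value for a periodic setting so that machines do not all act at the same moment.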
Bugs Fixed:
- Fixed a deadlock situation between the condor_schedd and
the condor_startd that can
significantly impact the condor_schedd's performance. The likelihood of the
deadlock increased based upon the number of VMs advertised by the
condor_startd.
- Fixed a bug reading the user job log on Windows that caused
occasional DAGMan confusion.
Thanks to Fairview Software, Inc. for
both finding the bug and writing a patch.
- Fixed a denial of service problem: Condor daemons no longer freeze
for 20 seconds when a client connects to them and then sends no data.
This behavior is common with port scanners.
- Fixed a race condition with condor_quill caused by
PostgreSQL's default transaction isolation level being ``read
committed''.
This bug would cause truncated condor_q reads when using Quill.
- Fixed a bug where the condor_ckpt_server would segfault when
turned off with condor_off -fast.
- Fixed a bug in the condor_startd where it could die with
SIGABRT when a condor_starter exited under certain rare
circumstances.
The bug seems to have been most likely to appear on x86_64 Linux
machines, but could potentially affect all platforms.
- Fixed a problem with condor_history when running with Quill enabled,
which caused it to allocate an unbounded amount of memory.
- Fixed a problem with condor_q when running with Quill, which caused
it to silently truncate the printing of the job queue.
- Fixed a bug in the condor_gridmanager that caused the following
configuration file parameters to be ignored for grid type condor and
nordugrid jobs: GRIDMANAGER_RESOURCE_PROBE_INTERVAL,
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE, and
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE.
- Fixed a bug in condor_run that caused it to abort on non-fatal
warnings from condor_submit and print incorrect error messages.
- Fixed a bug in the condor_gridmanager dealing with grid type gt4
grid universe jobs. If the job's standard output or error was not specified
in the job ClassAd, the condor_gridmanager would create an improper GRAM
RSL string, causing the job to fail.
- Fixed a bug in the condor_gridmanager that could cause it to
delegate the wrong credential when refreshing the credentials for a
grid type gt4 grid universe job.
- The condor_gridmanager could get into a state where it would no
longer start up Globus jobmanagers for grid type gt2 grid universe jobs,
if previous requests failed due to connection errors. This bug has been
fixed.
- The condor_c-gahp now properly exits when the pipe to its parent
goes away. Before, it would fill its log with large amounts of useless
messages, before exiting several minutes later.
- Fixed a bug where, if there was a problem opening standard input, output,
or error, the standard universe might generate an incorrect warning in the
condor_shadow's log.
- The condor_gridmanager now recovers properly when a proxy refresh
fails for a gt2 grid universe job in the stage-out state. Before, the job
would become held with a hold reason of ``Globus error 3: an I/O operation
failed''.
- A number of fixes to minor typos and incorrect formatting in
Condor's log files.
- When REQUEST_CLAIM_TIMEOUT was reached and the
condor_schedd
failed to contact the condor_startd to release the claim, the
condor_schedd would
periodically try releasing the claim indefinitely, possibly resulting in
a lengthy communication delay each time.
- Under Windows, Condor daemons such as the condor_schedd were sometimes
limiting their use of pending connect operations more than they should
have. This would result in the message, ``file descriptor safety level
exceeded''.
- condor_fetchlog no longer allows or documents the -dagman option.
The option's appearance was an error. The option never worked.
- The condor_schedd ensures that the initial job queue log file
contains a sequence number for use by Quill. This fixes a case in
which no sequence number was inserted, because the initial rotation of
this (empty) file failed. Quill also now reports exactly what the
problem is if it reads a job queue log in this state, rather than
simply crashing. This problem has so far only been observed under
Windows.
- Fixed a problem on Windows where, when submitting a job with a
sandbox (for example, using the -s or -r option to
condor_submit), an erroneous file permissions check in the
condor_schedd would result in a failed submission.
- The condor_startd would crash shortly after start up if the
RANK expression contained any use of the unary minus
operator. This patch should also fix any other cases where Condor
daemons crashed due to the use of the unary minus operator in ClassAd
expressions.
- Stork now writes a terminated event to the user log when it removes
a transfer job from its queue because of failures to invoke a transfer
module. Without this event, DAGMan would not notice that these jobs had
left the queue.
- Fixed a problem where the condor_schedd on Windows would
incorrectly reject a job if the client provided an Owner
attribute that was correct but differed in case from the authenticated
name. This bug was thought to have been fixed in Condor 6.8.0.
- Fixed problems with condor_store_cred behaving strangely when
storing or removing a user name that is some initial substring of
``condor_pool''. Specifying such a user name would be incorrectly
interpreted as equivalent to specifying the -c option.
- Fixed a problem with condor_glidein spewing lots of text to
the screen when checking the status of a job it submitted.
- A new version of the GT4 GAHP is included, with the following changes:
- A new axis.jar from Globus fixes a thread safety bug that
can cause lockups in subscriptions for WS notifications. See Globus
Bugzilla 4858
(http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=4858).
- Fixed bugs that caused memory related to destroyed jobs to not
be reclaimed in both the client and the server.
- Removed redundant usage of Secure Message, Secure Conversation,
and Transport Security when talking to a WS GRAM service. Now, only
Transport Security is used.
- Fixed memory leaks in condor_quill.
- Fixed a bug that might have caused condor_startd problems
launching the condor_starter for the standard universe on 64-bit systems.
- Improved Condor's file transfer. If you request that Condor
automatically transfer back your output, it now detects changes better.
Previously, it would only transfer back files that had a more recent timestamp
than the spool date. Now, it will transfer back any file that has changed
in date (including being dated in the past) or changed in size.
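The file transfer improvement in the list above replaces a "newer than the spool date" test with a "differs in date or size" test. The sketch below contrasts the two rules. The record type and function names are hypothetical, and timestamps and sizes are plain numbers; this is not Condor's implementation.

```python
from collections import namedtuple

# Modification time (seconds) and size (bytes) of one file.
FileInfo = namedtuple("FileInfo", ["mtime", "size"])


def changed_old_rule(current, spool_time):
    """Earlier behavior: transfer only files newer than the spool date."""
    return current.mtime > spool_time


def changed_new_rule(current, original):
    """Newer behavior: transfer files whose date or size differs at all."""
    return current.mtime != original.mtime or current.size != original.size
```

A file that a job rewrites and then backdates, for example by restoring an archived timestamp, is missed by the old rule but caught by the new one.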
Known Bugs:
Version 6.8.2
Release Notes:
- Condor now uses Globus 4.0.3 for GSI, GRAM, and GridFTP support.
This includes a patch for the OpenSSL vulnerability detailed in
CVE-2006-4339 and http://www.openssl.org/news/secadv_20060905.txt.
It also includes fixes for Globus Bugzilla 4689
(http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=4689) and a
bug that can cause duplicate UUIDs to be generated for WS GRAM jobs.
- The condor_schedd daemon no longer forks separate processes to
change ownership of job directories in the spool.
Previously on Unix-like systems, this would create a
new process before a job started running and after it finished running. Some
sites with very busy condor_schedd daemons were encountering scaling problems.
New Features:
- Because, by default, the condor_startd daemon references the job
ClassAd attribute NumCkpts, Condor's default configuration
will now round up the value of NumCkpts, in order to improve
matchmaking performance. See the entry on SCHEDD_ROUND_ATTR
in section 3.3.11.
- Enhanced the RHEL3 x86_64 port of Condor to include the standard
universe.
- condor_submit_dag -f no longer deletes the
dagman.out file. condor_submit_dag without the -f
option will now submit a DAGMan run even if the dagman.out
file exists. In this case, the file will be appended to.
- Added a property to the Windows installer program to determine
whether the Condor service will be started after installation. The
property name is STARTSERVICE, and the default value is ``Y''.
Bugs Fixed:
- A bug caused the condor_master daemon to kill
only immediate children within the process tree,
upon an abnormal exit of the condor_master daemon.
The condor_master daemon now kills all descendant processes.
- Fixed a bug where if the file system was full, the debugging log
files (for example SchedLog) would silently lose messages. Now,
if the disk is full, the Condor daemons will
exit.
- Fixed a bug in the condor_schedd daemon that caused it to stop
negotiating for grid universe jobs in the case that it decided
it could not spawn any new condor_shadow processes.
- Added the ProcessId class (which more uniquely identifies a
process than a PID does) to the condor_dagman abort duplicate
runs feature. This makes it less likely that a given instance of
condor_dagman will mistakenly conclude that another instance of
condor_dagman is already running on the same DAG. Also fixed an
unrelated bug in the abort duplicate runs feature that could cause
a condor_dagman to not abort itself when it should.
- Condor daemons leaked memory (consuming more and more memory over time)
when parsing ClassAds that use functions with arguments.
- Fixed a bug in the condor_starter daemon,
which caused it to look in the
wrong place for the job's executable, if TransferExecutable was set
to True in the job ClassAd.
- condor_history no longer crashes if HISTORY is not defined
in the Condor configuration file.
- Fixed an unintentional change to the value of -Condorlog
in a condor_dagman submit description file: it is once again the log file of
the first node job.
- Fixed a bug in condor_q that would cause condor_q -hold or
condor_q -run to exit with an error on some platforms.
- Fixed a bug on Unix platforms, in which a misconfiguration of
MAIL would cause the condor_master daemon to restart
all of its child
daemons whenever it tried (and failed) to send e-mail to the
administrator.
- Network related error messages have been improved to make debugging
easier. For example, when timing out on a read or write operation, the
peer's address is now included in the error message.
- An invalid value for UPDATE_INTERVAL now causes
the condor_startd daemon to abort. Previously, it would continue running,
but some invalid values (for example, 0) could cause it to stop sending
periodic ClassAd updates to the condor_collector, even after being
reconfigured with a valid value. Only a complete restart of
the condor_startd daemon was sufficient to get it out of this state.
- Fixed a bug that caused X.509 limited proxies to be delegated as
impersonation (i.e. non-limited) proxies. Any authentication attempted
with the resulting proxies would fail.
- Fixed a couple of bugs that would cause Condor to lose track of
some Condor-related processes and subsequently fail to clean up (kill)
these processes.
- Fixed a bug that would cause condor_history to crash when
dealing with rotated history files. Note that history file rotation is
turned on by default. (See
Section 3.3.3 for descriptions of
ENABLE_HISTORY_ROTATION and
MAX_HISTORY_ROTATIONS .)
Known Bugs:
Version 6.8.1
Release Notes:
New Features:
- Added an optional argument to the condor_dagman ABORT-DAG-ON
command that allows the DAGMan exit code to be specified separately
from the node value that causes the abort; also, a DAG can now be
aborted on a zero exit code from a node.
- Added the ALLOW_FORCE_RM configuration variable.
If this expression evaluates to True,
then a condor_rm -f attempt is allowed. If it evaluates to False,
the attempt is disallowed.
The expression is evaluated in the context of the job ClassAd.
If not defined, the value defaults to True, matching the behavior of
previous Condor releases.
- condor_dagman will now reject DAGs for which any of the nodes'
user job log files are on NFS (because of the unreliability of NFS
file locking, this can cause DAGs to fail). This feature can be
turned off by setting the DAGMAN_LOG_ON_NFS_IS_ERROR
configuration macro to False (the default is True).
- condor_submit can now be configured to reject jobs for which
the log file is on NFS.
To do this, set the LOG_ON_NFS_IS_ERROR
configuration macro to True.
The default is that condor_submit will issue a warning
for a log file on NFS.
- Added the DAGMAN_ABORT_DUPLICATES configuration macro,
which causes
condor_dagman to attempt to detect at startup whether another
condor_dagman is already running on the same DAG; if so, the second
condor_dagman will abort itself.
- The new configuration variable
NETWORK_MAX_PENDING_CONNECTS may be used to limit the
maximum number of simultaneous network connection attempts. This is
primarily relevant to the condor_schedd daemon, which may try to connect to
large numbers of condor_startd daemons when claiming them.
The condor_negotiator may also
connect to large numbers of condor_startd daemons when initiating
security sessions
used for sending MATCH messages. On Unix, the default is to allow up to
eighty percent of the process file descriptor limit. On Windows, the
default is 1600.
- Added some more debug output to condor_ dagman to clarify
fatal errors.
- The -format argument to condor_ q and condor_ status can now take an expression in addition to a simple attribute name.
- DRMAA is now available on most Linux platforms, Windows and PPC MacOS.
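The default described above for NETWORK_MAX_PENDING_CONNECTS can be
illustrated with a small calculation. This is only a sketch: the function
name is hypothetical, and rounding down to a whole number of connections is
an assumption, since the notes say only "up to eighty percent" of the
process file descriptor limit.

```python
def default_max_pending_connects(fd_limit: int, is_windows: bool = False) -> int:
    """Hypothetical helper mirroring the documented defaults.

    Not Condor code; the rounding rule is an assumption.
    """
    if is_windows:
        # Documented fixed default on Windows.
        return 1600
    # Eighty percent of the descriptor limit, rounded down (assumed).
    # Integer arithmetic avoids floating-point rounding surprises.
    return (fd_limit * 80) // 100


if __name__ == "__main__":
    for limit in (256, 1024, 4096):
        print(limit, default_max_pending_connects(limit))
```

Under these assumptions, a process limited to 1024 file descriptors would
allow at most 819 simultaneous connection attempts.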
Bugs Fixed:
- Fixed a bug in which, when a large number of jobs (roughly 200 or more)
were running from a single condor_ schedd daemon, and those jobs were
using job leases (the default in 6.8), the condor_ schedd daemon could
enter a state where it crashed on startup until all of
the job leases expired.
- Fixed a bug in which Condor jobs submitted with the NiceUser priority
were not being matched if NEGOTIATOR_MATCHLIST_CACHING
was set to True (the default).
- Fixed a Quill bug that prevented it from running on Windows. The
symptom was errors in the QuillLog such as
POLLING RESULT: ERROR
- Fixed a bug in Quill where it would cause errors such as
duplicate key violates unique constraint "history_vertical_pkey"
in the QuillLog and the PostgreSQL log file. These errors
triggered
a significant slowdown in the performance of Quill and the database. This
would only happen when a job attribute changed type from a string
type to a numeric type, or vice versa.
- In those unusual cases where Condor is unable to create a new process,
it now shuts down cleanly, eliminating a small possibility of data corruption.
- Fixed a bug with the gt4 and nordugrid grid universe jobs that
caused the stdout and stderr of a job to not be
transferred correctly, if the given file names had absolute paths.
- condor_ dagman now echoes warnings from condor_ submit and
stork_ submit to the dagman.out file.
- Fixed a bug introduced in 6.7.20, causing the condor_ ckpt_server
to exit immediately after starting up, unless Condor's security
negotiation was disabled.
- MAX_<SUBSYS>_LOG now defaults to one megabyte, even if the
setting is missing from the configuration. Previously, the default was
64 kilobytes.
- Fixed a bug related to non-blocking connect that could occasionally
cause Condor daemons to crash.
- Fixed a rare bug where an exceptionally large query to the
condor_ collector could cause it to crash. The most common cause was a single
condor_ schedd daemon restarting,
and trying to recover a large number of job leases at once.
More than approximately 250 running jobs on a single condor_ schedd daemon
would be necessary to trigger this bug.
- When using the JOB_PROXY_OVERRIDE_FILE configuration
parameter, the X.509 proxy will now be properly forwarded for Condor-C jobs.
- Greatly reduced the chance that a Condor-C job in the REMOVED state
will be HELD due to an expired proxy or failure to talk to the remote
condor_ schedd.
- Fixed error and debug messages added in Condor version 6.7.20 that
incorrectly reported IP and port numbers. These messages were
intended to report the peer's address, but they were instead reporting the
local address of the network socket.
- Fixed a bug introduced in Condor version 6.7.20
which could cause Condor daemons to
die with the message
PANIC -- OUT OF FILE DESCRIPTORS
The conditions
causing this related to failed attempts to send updated status
to the condor_ collector daemon,
with both non-blocking updates and security negotiation
enabled (the defaults).
- Also fixed a bug in the negotiator with the same effect as
above, except it only happened with the configuration setting
NEGOTIATOR_USE_NONBLOCKING_STARTD_CONTACT=False.
- Fixed a bug in condor_ schedd under Solaris that could also
cause file descriptors to become exhausted over time when many
machines (e.g. over 100) were claimed in a short span of time and the
condor_ schedd process file descriptor limit was near 256.
- Fixed a bug in condor_ schedd under Windows that could cause
network sockets to be allocated and never released back to the system.
The circumstances that could cause this were very rare. The error
message in the logs indicating that this problem was happening is
ERROR: DuplicateHandle() failed in Sock::set_inheritable
In cases where this error message is displayed, the network socket
is now closed.
- Under some conditions, when making TCP connections, Condor was
still trying to connect for the full duration of the operation timeout
(often 10 or 20 seconds), even if the connection attempt was refused
(for example, because the port being accessed is not accepting connections).
Now, the connect operation finishes immediately after the first such
failure, allowing the Condor process to continue with other tasks.
- Fixed problems relating to the credential cache in the Kerberos
authentication mechanism. The current version of Kerberos is 1.4.3.
- Fixed bugs in the SSL authentication mechanism that caused the
condor_ schedd to crash when submitting a job (on Unix) and caused
all tools and daemons to crash on Windows when using SSL.
- Some of the binaries required to use Condor-C on Windows were
mistakenly not included in previous releases of Condor. This has been
fixed.
- Fixed a problem on Windows where the condor_ startd could fail to
include some attributes in its ClassAd. This would result in some jobs
incorrectly not being matched to that machine. This only happened if
CREDD_HOST was defined and Condor daemons on the execute
machine were unable to authenticate with the condor_ credd.
- Fixed a condor_ dagman bug which had prevented the
$(DAGManJobId) attribute from being expanded in job submit files
(for example,
when used as the value to define the Priority command).
- Fixed a bug in condor_ submit that caused parallel universe jobs
submitted via Condor-C to become mpi universe jobs.
- Fixed a bug which could cause Condor daemons to hang if they try
to write to the standard error stream (stderr) on some platforms. In
general, this should never happen, but can, due to third party
libraries (beyond our control) trying to write error or other messages.
- Fixed condor_ status to report error messages.
- Fixed a bug in which setting the configuration variable
NEGOTIATOR_CONSIDER_PREEMPTION = False
caused an incorrect calculation.
The fraction of the pool already being claimed by a user was
calculated using the wrong total number of condor_ startd daemons.
This could cause some condor_ startd daemons to remain unclaimed,
even when there were jobs available to run on them.
- Fixed a security vulnerability in Condor's FS and FS_REMOTE
authentication methods. The vulnerability allowed an attacker to impersonate
another user on the system, potentially allowing submission of jobs as a
different user. This may allow escalation to root privilege if the Condor
binaries and configuration files have improper permissions. The fix is not
backwards compatible, which means all daemons and tools using FS authentication
must be running Condor 6.8.1 or greater. The same applies to FS_REMOTE: all
daemons and tools using FS_REMOTE must be using Condor 6.8.1 or greater. In
practice, this means that for FS, all Condor binaries on one host must be
version 6.8.1 or greater, but versions can be different from host to host. For
FS_REMOTE it means all binaries across all hosts must be 6.8.1 or greater.
- Fixed a couple of race conditions in stork and the credd where credential
files were possibly created with improper permissions before being set to owner
permissions.
- Fixed a bug in the condor_ gridmanager that caused it to delegate
12-hour proxies for grid-type gt4 jobs and then not refresh them.
- Fixed a bug in the condor_ gridmanager that caused a directory
needed for staging-in of grid-type gt4 job files to be removed when
the condor_ gridmanager exited, causing the stage-in to fail.
- Fixed a bug that caused the checkpoint server to restart
because of (ostensibly) getting an unexpected errno from select().
- Fixed a bug on Windows where setting output or
error to a relative or absolute path (as opposed to a
simple file name without path information) would not work properly.
- History file rotation did not previously work on Windows because
the names of rotated files contained an ISO 8601 extended format
timestamp, which contains colon characters. The naming convention for
rotated files has been modified to use ISO 8601 basic format, avoiding
this problem.
- The CLAIMTOBE authentication method (which is inherently
insecure and should only be used for testing or other special
circumstances) previously would authenticate without providing the
"domain" portion of the user name. As an example, a user would be
authenticated as simply "user" rather than
"user@cs.wisc.edu". This problem has been fixed, but the new
protocol is not backwards compatible so the fix is turned off by
default. Correct behavior can be enabled by setting the
SEC_CLAIMTOBE_INCLUDE_DOMAIN parameter to True.
- Fixed a bug with the NEGOTIATOR_MATCHLIST_CACHING that
would cause very low-priority jobs (like jobs submitted with
nice_user=True) to not match even if resources were available.
- Fixed a buffer overflow that could crash the condor_ negotiator.
- SCHEDD_ROUND_ATTR_<xxxx> now preserves the value being
rounded up when it is already a multiple of the power of 10 specified for
rounding. Previously, the value would be incremented; now it remains
the same. For example, if SCHEDD_ROUND_ATTR_<xxxx>=2 and the value
being rounded up is 100, it now remains 100, rather than being
incremented to 200.
- Fixed condor_ updates_stats to report its version number
correctly.
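The SCHEDD_ROUND_ATTR_<xxxx> fix listed above can be sketched with a few
lines of arithmetic. This is an illustration, not Condor's implementation:
both function names are hypothetical, values are assumed to be non-negative
integers, and the "old" function is only a guess that reproduces the
documented example (100 becoming 200).

```python
def round_up_fixed(value: int, digits: int) -> int:
    """Round up to the next multiple of 10**digits.

    A value that is already an exact multiple is left unchanged,
    which is the behavior described for 6.8.1.
    """
    step = 10 ** digits
    # Ceiling division written with floor division on the negated value.
    return -(-value // step) * step


def round_up_old(value: int, digits: int) -> int:
    """Assumed pre-6.8.1 behavior: always move to the next multiple."""
    step = 10 ** digits
    return (value // step + 1) * step


if __name__ == "__main__":
    for v in (99, 100, 101):
        print(v, round_up_old(v, 2), round_up_fixed(v, 2))
```

The two functions differ only when the input is an exact multiple of the
step, which is the case the fix addresses.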
Known Bugs:
- The -completedsince option to condor_ history works only
when Quill is enabled. The behavior of condor_ history
-completedsince is undefined when Quill is not
enabled.
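The history file rotation fix listed under Bugs Fixed for this version
depends on the difference between the ISO 8601 extended and basic formats.
The sketch below shows that difference. The function name and the
"history.<timestamp>" naming pattern are assumptions made for illustration;
the notes do not give the exact file name layout.

```python
from datetime import datetime


def rotated_name(base: str, when: datetime, basic: bool = True) -> str:
    """Hypothetical helper: append an ISO 8601 timestamp to a file name."""
    # Basic format drops the '-' and ':' separators of the extended format.
    fmt = "%Y%m%dT%H%M%S" if basic else "%Y-%m-%dT%H:%M:%S"
    return f"{base}.{when.strftime(fmt)}"


if __name__ == "__main__":
    when = datetime(2006, 8, 1, 12, 34, 56)
    print(rotated_name("history", when, basic=False))  # history.2006-08-01T12:34:56
    print(rotated_name("history", when))               # history.20060801T123456
```

The colon is not a valid character in Windows file names, so only the
second form can be used as a file name there.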
Version 6.8.0
Release Notes:
- The default configuration for Condor now requires that
HOSTALLOW_WRITE be explicitly set. Condor will refuse
to start if the default configuration is used unmodified.
Existing installations should not need to change anything. Those
who want the earlier default can set it to "*", but
note that this is potentially a security hole, allowing anyone to
submit jobs or add machines to your pool.
- Most Linux distributions are now supported using dynamically
linked binaries built on a RedHat Enterprise Linux 3 machine.
Recent security patches to a number of Linux distributions have
rendered the binaries built on RedHat 9 machines ineffective.
The download pages have been changed to reflect this, but Linux users
should be aware of this change.
The recommended download for most x86 Linux users is now:
condor-6.8.0-linux-x86-rhel3-dynamic.tar.gz.
- Some log messages have been clarified or moved to different
debugging levels.
For example, certain messages that looked like errors were printed
to D_ALWAYS, even though nothing was wrong and the system was
behaving as expected.
- The new features and bugs fixed in the rest of this section only
refer to changes made since the 6.7.20 release, not the last stable
release (6.6.11).
For a complete list of changes since 6.6.11, read the 6.7 version
history in section 8.5.
New Features:
- Version 1.4 of the Condor DRMAA libraries is now included
with the Condor release.
For more information about DRMAA, see section 4.4.2.
- Version 1.0.15 of the Condor GAHP is now used for Condor-G and
Condor-C.
- Added the -outfile_dir command-line argument to
condor_ submit_dag. This allows you to change the directory in which
condor_ dagman writes the dagman.out file.
- Added a new -summary (also -s) option to the
condor_ updates_stats tool. If enabled, this prevents it from
displaying the entire history for each machine and only displays the
summary info.
Bugs Fixed:
- Fixed a number of potential static buffer overflows in various
Condor daemons and libraries.
- Fixed some small memory leaks in the condor_ startd and
condor_ schedd, and a potential leak that affected all Condor
daemons.
- Fixed a bug in Quill which caused it to crash when certain
long attributes appeared in a job ad.
- Fixed a bug that caused the startd to crash after a reconfig if the
address of a collector had not been resolved since the previous reconfig
(e.g. because DNS was down during that time).
- Previously, once a Condor daemon failed to look up the IP address of the
collector (e.g. because DNS was down), it would fail to contact the
collector from that time until the next reconfig. Now, each time Condor
tries to contact the collector, it generates a fresh DNS query if the
previous attempt failed.
- Fixed a bug in which, when using Condor-C or the -s or -r command-line
options to condor_ submit, the job's standard output and error were placed
in the job's initial working directory, even if the job ad said to
place them in a different directory.
- Greatly sped up the parsing of large DAGs (by a factor of 50
or so) by using a hash table instead of linear search to find DAG nodes.
- Fixed a bug in condor_ dagman that caused an EXECUTABLE_ERROR
event from a node job to abort the DAG instead of just marking the
relevant node as failed.
- Fixed a bug in condor_ collector that caused it to discard
machine ads that don't have an IP address field (either StartdIpAddr
or STARTD_IP_ADDR). The condor_ startd will always produce a
StartdIpAddr field, but machine ads published through
condor_ advertise may not.
- When using BIND_ALL_INTERFACES on a dual-homed
machine, a bug introduced in 6.7.18 was causing Condor daemons to
sometimes incorrectly report their IP addresses, which could cause
jobs to fail to start running.
- Made the event checking in condor_ dagman less strict:
added the new "allow duplicate events" value to the
DAGMAN_ALLOW_EVENTS macro (this value is part of the
default); the value 16 now also allows a terminate event before a
submit event; and "allow all events" has been changed to
"allow almost all events" (all except "run after terminal event"),
which makes it more useful.
- condor_ dagman and condor_ submit_dag now report
-NoEventChecks as ignored rather than deprecated.
- Fixed a bug in the condor_ dagman -maxidle feature:
a shadow exception event now puts the corresponding job into the
idle state in condor_ dagman's internal count.
- Fixed a problem on Windows where daemons would sometimes crash
when dealing with UNC path names.
- Fixed a problem where the condor_ schedd on Windows would
incorrectly reject a job if the client provided an Owner
attribute that was correct but differed in case from the authenticated
name.
- Fixed a condor_ startd crash introduced in version 6.7.20. This
crash would appear if an execute machine was matched for preemption
but then not claimed in time by the appropriate condor_ schedd.
- Resolved an issue where the condor_ startd was unable to clean
up jobs' execute directories on Windows when the condor_ master was
started from the command line rather than as a service.
- Added more patches to Condor's DRMAA interface to make it more
compatible with Sun Grid Engine's DRMAA interface.
- Removed the unused D_UPDOWN debug level and added the
D_CONFIG debug level.
- Fixed a bug that caused condor_ q with the -l or -xml
arguments to print out duplicate attributes when using Quill.
- Fixed a bug that prevented Condor-C jobs (universe grid jobs of type condor)
from submitting correctly if QUEUE_ALL_USERS_TRUSTED is set to
True.
- Fixed a bug that could cause the condor_ negotiator to crash if the
pool contains several different versions of the condor_ schedd and in the
config file NEGOTIATOR_MATCHLIST_CACHING is set to True.
- Changed the default value for config file entry
NEGOTIATOR_MATCHLIST_CACHING from False to True. When set to
True, this will instruct the negotiator to safely cache data in order to
improve matchmaking performance.
- The condor_ master now recognizes condor_ quill as a valid
Condor daemon without any manual configuration on the part of site
administrators.
This simplifies the configuration changes required to enable Quill.
- Fixed a rare bug in the condor_ starter where a failure
transferring job output files back to the submitting host
could cause it to hang indefinitely, with the job appearing to
continue running.
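The DAG parsing speedup listed above comes from replacing a linear search
over nodes with a hash table lookup. The sketch below illustrates the
general idea only; it is not condor_ dagman's code, and the node
representation and function names are hypothetical.

```python
def find_node_linear(nodes, name):
    """Scan every node until the name matches: O(n) per lookup."""
    for node in nodes:
        if node["name"] == name:
            return node
    return None


def build_index(nodes):
    """Build a hash table from node name to node, once, in O(n)."""
    return {node["name"]: node for node in nodes}


if __name__ == "__main__":
    # A toy DAG with hypothetical node names and submit file names.
    nodes = [{"name": f"Node{i}", "submit": f"job{i}.sub"} for i in range(10000)]
    index = build_index(nodes)

    # Both approaches find the same node; the dictionary lookup takes
    # constant time on average instead of scanning the whole list.
    assert find_node_linear(nodes, "Node9999") is index.get("Node9999")
    print(index["Node9999"]["submit"])
```

When a parser must resolve a node name for every dependency line in a
large DAG file, the per-lookup cost of the linear scan dominates, which is
consistent with the large speedup reported.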
Known Bugs:
- The -completedsince option to condor_ history works only
when Quill is enabled. The behavior of condor_ history
-completedsince is undefined when Quill is not
enabled.
condor-admin@cs.wisc.edu