8.7 Stable Release Series 6.8
This is a stable release series of Condor.
It is based on the 6.7 development series.
All new features added or bugs fixed in the 6.7 series are available
in the 6.8 series.
As usual, only bug fixes (and potentially, ports to new platforms)
will be provided in future 6.8.x releases.
New features will be added in the forthcoming 6.9.x development series.
The 6.8.x series supports a different set of platforms than 6.6.x.
Please see the updated table of available platforms in
section 1.5.
The details of each version are described below.
Version 6.8.8
Release Notes:
- This release fixes a security vulnerability that affects
those who rely upon Condor's network message integrity checking
(where the configuration is set to
SEC_DEFAULT_INTEGRITY = REQUIRED). Not all of
Condor's network communications are vulnerable to the integrity
checking bug, so based on the scope of the affected parts, we consider
the level of threat to be modest. A denial of service attack could be
launched against Condor by an attacker who tampers with Condor's
network communications. All previous releases of Condor are affected
by this bug. For users of the 6.9 development series, a
fix for this problem will be released as part of the new 7.0.0 stable
series release, which is planned to happen near the end of 2007.
New Features:
Bugs Fixed:
- Fixed a named pipe collision on Windows that prevented streaming of
standard output and error from working on more than one slot
(called a virtual machine, or VM, in Condor 6.8.8 terminology) at a time.
- Fixed a bug in Condor's network message integrity checking.
- Fixed a forward-compatibility problem when a 6.8 condor_startd
runs jobs for a 6.9 or later condor_schedd and the communication
between them is configured to use integrity checking or encryption.
The problem caused the condor_startd to crash.
- Fixed a problem that sometimes caused corruption of ClassAd data
that is forwarded from one condor_collector daemon to another via
CONDOR_VIEW_HOST.
Known Bugs:
Additions and Changes to the Manual:
Version 6.8.7
Release Notes:
New Features:
Bugs Fixed:
- On Windows, fixed a problem that could cause spurious failures with Condor-C
or with streaming a job's standard output or error.
- A claim in the state Claimed/Idle could not be preempted until
it transitioned into Busy or went away of its own accord. This bug
was introduced in 6.7.1.
- The user-based authorization parameters in the configuration file
(for example, ALLOW_READ) now properly recognize values where the
user name contains a wild card
(for example, *@cs.wisc.edu/bird.cs.wisc.edu).
- A rare threading problem in the Windows version of Condor has
been fixed. The problem could cause memory corruption in the
condor_starter while receiving input files and in the
condor_schedd while transferring input/output files for a remotely
submitted job or a spooled job.
- Increased the verbosity of some error messages (related to
reading log files) in condor_dagman.
- Fixed a bug in condor_dagman that would cause it to hang if it
was unable to successfully spawn a PRE or POST script. This case is now
dealt with as a PRE or POST script failure.
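The wild card fix for user-based authorization entries listed above can be pictured with a small Python sketch. It is not Condor's implementation: the helper name entry_matches, the split at the first '/', and the use of shell-style matching are all assumptions made for illustration. Only the example entry *@cs.wisc.edu/bird.cs.wisc.edu comes from the notes.

```python
from fnmatch import fnmatchcase


def entry_matches(entry, user, host):
    """Return True if a 'user/host' authorization entry matches.

    Hypothetical helper: the entry is split at the first '/', and each
    half is compared using shell-style wild cards. Condor's real
    matching rules may differ.
    """
    user_pattern, _, host_pattern = entry.partition("/")
    if not host_pattern:
        host_pattern = "*"  # assumption: no host part means any host
    return fnmatchcase(user, user_pattern) and fnmatchcase(host, host_pattern)


entry = "*@cs.wisc.edu/bird.cs.wisc.edu"
print(entry_matches(entry, "alice@cs.wisc.edu", "bird.cs.wisc.edu"))
print(entry_matches(entry, "alice@example.org", "bird.cs.wisc.edu"))
```

The first call matches because the wild card covers the user name before the domain; the second does not, because the domain differs. The user names here are invented for the example.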
Known Bugs:
Additions and Changes to the Manual:
Version 6.8.6
Release Notes:
- Condor is now officially supported on Microsoft Vista.
- Condor is now officially supported on MacOS running natively on Intel CPUs, and Condor binaries for Intel MacOS are now available for download.
- Condor now uses Globus 4.0.5 for GSI, pre-WS GRAM, and GridFTP support.
New Features:
- On all Unix ports of Condor except MacOSX, AIX, and Tru64, separate
debug symbol files are now supported. This allows meaningful debugging of
core files in addition to attaching to stripped executables during runtime.
- condor_dagman now prints reports of pending nodes to the
dagman.out file if it has been waiting more than
DAGMAN_PENDING_REPORT_INTERVAL seconds without seeing
any node job events. This is to help diagnose the problem if
condor_dagman gets "stuck".
- Optimized the submission of grid-type gt4 grid universe jobs to the
remote resource. Submission now takes one operation instead of three.
- The condor_shadow will obtain a session key to the condor_schedd
at the start of the job instead of potentially waiting until the job
completes. This reduces the chances of re-running already completed jobs
in the event of authentication failures (for instance, if a Kerberos KDC is
down or overloaded).
Bugs Fixed:
Known Bugs:
- Grid universe type GT4 (web services GRAM) does not work properly on Itanium-based machines, because it requires Java 1.5, which is not available on the Itanium (ia_64).
Additions and Changes to the Manual:
- Several updates to the DAGMan documentation (section 2.10).
- Improved the group quota documentation.
Version 6.8.5
Release Notes:
- This release is not fully compatible with the 6.6 series (or
anything earlier than that). Specifically, a 6.6 schedd will be rejected when
it tries to contact a 6.8.5 startd to make use of a claim.
- The Globus libraries used by Condor now include the following advisory
packages:
- globus_gss_assist-3.23
- globus_xio-0.35
- globus_gram_protocol-6.5
- globus_gass_transfer-2.12
See http://www.globus.org/toolkit/advisories.html
for details on the
bugs fixed by these updated packages.
The patch given in Globus Bugzilla 5091
(http://bugzilla.mcs.anl.gov/globus/show_bug.cgi?id=5091) is also
included.
New Features:
- A clipped port to x86 Debian 4.0 has been added.
- The functionality embodied in condor_q -better-analyze is now
available for X86_64 native ports of Condor.
- We now supply distinct, native ports for Mac OS X 10.3 and 10.4.
- There is a new configuration macro
COLLECTOR_REQUIREMENTS that may be used to filter out
unwanted ClassAd updates. For more information, see
section 3.3.16.
- Added a -f option to condor_store_cred, which generates
a pool password file that can be used for the PASSWORD authentication
method on Unix Condor installations.
Bugs Fixed:
- The configuration file entry HOSTALLOW_DAEMON is now consulted in
addition to ALLOW_DAEMON.
- Fixed a bug where, under certain conditions, Condor's file logging code
would cause a segmentation fault.
- Removed periodic re-indexing of the quill history_vertical table.
This should not be needed with the current schema, and it should speed
up database re-indexing operations.
- Fixed a bug that would cause the dedicated scheduler to
crash, if the condor_schedd was suspended or blocked
for more than approximately 10 minutes.
The most likely cause of a suspension
is a condor_schedd executable
mounted from a remote NFS file system.
- Fixed a bug where if -lc was specified multiple times for
the compiler when using condor_compile (some tools like pgf90
do this), condor_compile would fail to link the application and emit
a multiply defined symbol error for many symbols.
- Fixed a bug where Condor erroneously indicated that a scheduler
universe job's executable was missing or not executable.
This occurred if the scheduler
universe job had been submitted with
CopyToSpool = false in the submit
description file, and the user had a umask which prevented the user
named condor from following the search path to the
user-owned executable.
- Fixed a bug that could cause the condor_schedd to crash if it
received too many matches in one negotiation cycle (more than 1000 on a
Linux platform).
- Fixed a bug in which condor_history did not honor the -format flag
properly when Quill is in use.
- Fixed a bug in which a java property that includes
surrounding double quote marks
caused the detection of a java virtual machine to go awry.
The fix, which may change in the future, changes any extra double quotes
within a property value to single quotes.
- Fixed a bug in which the condor_quill daemon
crashed occasionally when the Postgres database
server was unavailable.
- The Solaris 9 Condor package can be used under Solaris 10 again.
Changes in 6.7.20 broke this compatibility.
- condor_dagman now does a better job, especially in recovery mode,
of detecting potentially incorrect submit events.
These are events whose Condor IDs do not match the expected values.
- condor_dagman now truncates existing node job user log files
to zero length, rather than deleting the log files. This prevents breaking the
link if a user log file is set up as a link.
- When starting a GridFTP server to handle file transfers for gt4
grid jobs, the condor_gridmanager now properly sets the
GLOBUS_TCP_PORT_RANGE and GLOBUS_TCP_SOURCE_RANGE environment
variables if appropriate.
- Fixed a bug that could cause a security session to get deleted
by the server (for example, the condor_schedd) before the client
(for example, the condor_shadow) was done using it.
This bug appeared as a
communication failure the next time the client tried to connect to
the server. In some cases, this caused jobs to be re-queued to be run
again, because the final update of the job queue failed.
- If a grid job becomes held while it's still submitted to the remote
resource and is then removed, the condor_gridmanager will now attempt
to remove the job from the remote resource before letting it leave the
local job queue.
- Fixed a bug in the condor_c-gahp that caused it to not use the
user's credential for authentication with the remote schedd on some
connections.
- The condor_c-gahp now properly lists all of the commands it
supports in response to the COMMANDS command.
- Fixed a bug in how the condor_c-gahp updates configuration parameter
GSI_DAEMON_NAME to include the job's credential if it has one.
- Removed the 5096-character restriction on the length of DAG
macro values (and names) in condor_dagman.
- Condor-G will now notice when jobs are missing from the status
reports sent by the Grid Monitor.
Jobs can disappear for short periods of time under normal circumstances,
but a prolonged absence is often a sign of problems on the remote machine.
The amount of time that a job can go missing from the Grid Monitor
status reports before the condor_gridmanager reacts can be set by the
new configuration parameter GRID_MONITOR_NO_STATUS_TIMEOUT.
The default is 15 minutes.
- condor_q -analyze will now print a warning if a job being analyzed
is already completed or if a grid universe job being analyzed has already
been matched.
- In condor_shadow, when forwarding an updated X509 proxy to an
executing job, the logic for whether to delegate or copy the proxy
(determined by configuration parameter
DELEGATE_JOB_GSI_CREDENTIALS) was reversed.
The authentication logic for this operation was also incorrect, causing the
operation to fail in many instances.
- Made a small improvement to the reliability of Condor's process
ancestry tracking under Linux. However, jobs that create children
with more than 4096 bytes of environment are still problematic, due to
a Linux kernel limitation that prevents reading more than 4k from
/proc/<pid>/environ. The only truly reliable way to ensure that
Condor is aware of all processes spawned by a Unix job is to use
VMx_USER.
- condor_glidein option -run_here no longer fails when the
current working directory is not in PATH.
- condor_glidein option -runtime would cause runtime errors
at startup under some batch systems. The problematic parentheses characters
are no longer generated as part of the environment value that is set by
this option.
- On rare occasions, the condor_startd will compute a negative MIPS
rating when performing benchmarks on the machine. This caused the
Mips attribute to disappear from the machine ad. Now, the
condor_startd ignores these bogus results. The cause of the negative MIPS
ratings is still unknown.
- Fixed a bug that caused condor_dagman to hang if it processed,
in recovery mode, a node for which all submit attempts failed and a
POST script was run.
- Fixed a bug that would cause the condor_negotiator's memory
usage to grow over time when job or machine ClassAds made use of
ClassAd functions that do regular expression matching operations.
- Fixed a bug that was preventing Condor daemons from caching DNS
information for hosts authenticated via HOSTALLOW settings (i.e. no
strong authentication). The collector, in particular, should spend
much less time on IP to host name lookups.
- When a job has an X509 proxy file (as indicated by the
X509UserProxy attribute in the job ad), the condor_starter
now always sets X509_USER_PROXY in the job's environment
to point to a copy of that proxy file.
- Fixed several bugs that could cause the condor_c-gahp to time out
when talking to the condor_schedd and falsely report that commands
completed successfully. A common result is grid type condor grid universe
jobs being placed on hold because the condor_gridmanager mistakenly
thinks they disappeared from the remote condor_schedd's queue.
- Fixed a bug in Stork which was causing it to write the output and
error log files as the wrong user, and read the input file as the wrong
user.
- Fixed a bug in Stork which was causing it to kill hung jobs
as the wrong user.
- Fixed some possible static buffer overflows related to the
transferring of a job's data files.
- Jobs with standard output and error going to the same file should
not lose data in the common case.
- Heavily loaded Condor daemons (for example, the condor_schedd) had a
problem when they fell behind in processing the exit status of child
processes (for example, the condor_shadow). The daemon would
continue to expect status updates from its child, even after the child
had exited. When the daemon decided that the lack of status
updates meant that the child was hung, it would try to kill
any process that happened to have the same pid as the child that had
already exited. In the case of the condor_schedd, this would also result in
the job run attempt being marked as a failure, and the job would remain
in the queue to run again. Condor no longer activates the ``hung child''
procedure for jobs that have exited but have not yet had their
exit status processed internally by the daemon.
- For grid-type condor jobs, made the condor_gridmanager more
tolerant of unexpected responses from the remote condor_schedd.
- On HPUX and AIX, fixed a bug that could cause Condor's process
family tracking logic to lose track of processes.
- Fixed a memory error that would cause condor_q to sometimes
crash when using Quill.
- Fixed a problem where the Windows condor_credd would be
inaccessible to other Condor components if CREDD_HOST were
set to a DNS alias and not the canonical DNS name.
- Fixed a bug in the condor_shadow on Windows where it would
fail to correctly perform the PASSWORD authentication method.
- The Windows condor_credd now uses the configuration parameter
CREDD_HOST, if defined, to set its name when advertising itself
to the condor_collector. Thus, if CREDD_HOST is set to something
other than the condor_credd's host name, clients can still locate the
daemon.
- Fixed a bug in the condor_c-gahp that could cause it to not
perform hold, release, or remove commands on jobs in the remote
condor_schedd.
- Fixed the default value of configuration parameter
STARTD_AD_REEVAL_EXPR.
Known Bugs:
- condor_dagman incorrectly parses DAG file VARS lines specifying
more than one macroname/value pair. You can work around this problem
by specifying each macroname/value pair on a separate line. (This bug
was introduced in version 6.8.5.)
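The VARS workaround can be applied mechanically. The following Python sketch rewrites a VARS line that carries several macroname/value pairs into one line per pair. It is only an illustration: the function name split_vars_line, the node name NodeA, and the macro names are hypothetical, and the regular expression assumes values are double-quoted strings with optional backslash escapes.

```python
import re
from typing import List

# One macroname="value" pair; the value may contain backslash-escaped characters.
PAIR = re.compile(r'(\w+)\s*=\s*"((?:[^"\\]|\\.)*)"')


def split_vars_line(line: str) -> List[str]:
    """Split a multi-pair VARS line into one VARS line per pair.

    Lines that are not VARS lines, or that already carry a single
    pair, are returned unchanged.
    """
    parts = line.split(None, 2)
    if len(parts) < 3 or parts[0].upper() != "VARS":
        return [line]
    _, node, rest = parts
    pairs = PAIR.findall(rest)
    if len(pairs) <= 1:
        return [line]
    return ['VARS {} {}="{}"'.format(node, name, value) for name, value in pairs]


for out in split_vars_line('VARS NodeA first="Alpha" second="Beta"'):
    print(out)
```

Running a DAG file through such a filter before submission would leave every other line untouched, so it can be used without knowing which nodes are affected.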
Version 6.8.4
Release Notes:
New Features:
Bugs Fixed:
- Fixed a bug in condor_q that only happened when running
with a Quill database and using the long (-l) option. The bug was
introduced in 6.8.3. The bug truncated the output of condor_q, and
only displayed some of the job attributes.
- Fixed a bug in condor_submit that caused standard universe jobs
to be unable to open their standard output or standard error, if
should_transfer_files is YES or
IF_NEEDED in the submit description file.
- Fixed a bug in condor_glidein that could cause it to request the
queue ``unknown'' when submitting its setup job to GRAM, leading to
failures.
- The OnExitRemove expression generated for DAGMan by
condor_submit_dag evaluated to UNDEFINED for some values of
ExitCode, causing condor_dagman to go on hold.
- Fixed a bug in which garbage values (random bits from memory)
were sometimes
written to the pool history file in the field representing the
backfill state.
- condor_submit_dag now generates a submit file
(.condor.sub) for condor_dagman that sends stdout and
stderr to separate files. This has always been recommended,
and recent versions of Condor cause stdout and stderr to
overwrite each other if they are directed to the same file.
- Fixed several bugs for grid type nordugrid jobs.
The condor_gridmanager would create an invalid RSL for these jobs
and save their output to the wrong location in some cases.
- condor_glidein now properly escapes glidein tarball URLs that
contain characters that have special meaning to GRAM RSL. It also turns
on TCP updates to the condor_collector,
if they are enabled on the submit machine.
- When using the submit file option getenv=true,
environment
variables containing a newline in their value are no longer inserted
into the job's environment. The condor_schedd daemon
does not allow newlines
within ClassAd values, so the attempt to insert such values resulted
in failure of job submission and caused the condor_schedd daemon
to abort.
- Fixed a bug that caused condor_dagman to hang if a node
with a POST script and retries initially runs but fails, and then
has all condor_submit attempts fail on the retry.
- Fixed a problem in the Windows installer where the
DAEMON_LIST parameter would be incorrectly set if the ``Join
an existing Condor pool'' option was selected or the ``Submit jobs to
Condor pool'' option was unchecked. In the first case, a
condor_collector and condor_negotiator would incorrectly be run on
the machine. In the second case, a condor_schedd would incorrectly
be run. The problem exists in all previous 6.8 and 6.9 series
releases.
- Fixed a bug in the handling of local universe jobs
for a very busy condor_schedd daemon.
When a local universe job completed, the condor_starter might not
be able to connect to the condor_schedd daemon to update final information
about the job, such as the exit status.
Under this circumstance,
the condor_starter would hang indefinitely.
The bug is fixed by having the condor_starter attempt
to retry a few times (with a delay in between each attempt) before
exiting with a fatal error.
The fatal error causes the job to restart.
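The getenv=true fix in the list above amounts to filtering the submitter's environment before it is placed in the job. A minimal Python sketch of that rule follows; the function name filter_environment and the sample variables are hypothetical, and the sketch does not reproduce how condor_submit actually builds the job's environment.

```python
def filter_environment(env):
    """Return a copy of env without variables whose value contains a newline.

    Mirrors the rule described above: such values cannot be stored in
    a ClassAd, so they are skipped rather than passed on.
    """
    return {name: value for name, value in env.items() if "\n" not in value}


sample = {
    "HOME": "/home/alice",               # kept
    "BANNER": "line one\nline two",      # dropped: value contains a newline
}
print(sorted(filter_environment(sample)))
```

In practice the input would be the submitting process's environment (for example, os.environ in Python) rather than a hand-built dictionary.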
Known Bugs:
- Setting DAGMAN_DELETE_OLD_LOGS to false can cause
condor_dagman to have problems (including hanging), especially
when running a rescue DAG. If you want to keep your old user log
files, the best thing to do is to rename them before each
condor_dagman run. If you do run with
DAGMAN_DELETE_OLD_LOGS set to false, check your
dagman.out file for error messages about submit event
Condor IDs not matching the expected value. If you get such an
error, you will probably have to condor_rm the condor_dagman
job, remove or rename the old user log file(s) and run the rescue DAG.
(Note: this bug also applies to earlier versions of condor_dagman.)
Version 6.8.3
Release Notes:
- In this release,
the command condor_q -long does not work when querying
the Quill database.
Instead, use the command
condor_q -direct quilld -long,
or use a previous version of condor_q.
- Performed a security audit of all places where Condor opens files,
to make certain files are opened with a reasonable permission mode
and with the
O_EXCL flag whenever possible.
New Features:
- Added the JOB_INHERITS_STARTER_ENVIRONMENT configuration
macro. When set
to True, jobs inherit all environment variables from
the condor_starter. This is useful for glidein jobs that need to access
environment variables from the batch system running the glidein daemons.
The default for this configuration macro is False, so existing behavior
is unchanged. This feature does not apply to standard and pvm universe
jobs.
- Changed the default UDP receive buffer for the
condor_collector from 1M to 10M. This value can be configured with
the (existing) COLLECTOR_SOCKET_BUFSIZE macro.
NOTE: For some Linux distributions, it may be necessary to configure
a larger operating system limit than the default; the relevant kernel
parameter is /proc/sys/net/core/rmem_max. You can see the values that the
condor_collector actually used by enabling D_FULLDEBUG for the
condor_collector and looking at the log line that looks like this:
Reset OS socket buffer size to 2048k (UDP), 255k (TCP).
- Added a new configuration macro to control the size of the
TCP send buffers for the condor_collector. Previously, this size was
taken from COLLECTOR_SOCKET_BUFSIZE. The new macro is
COLLECTOR_TCP_SOCKET_BUFSIZE, and it defaults to 128K.
- Added a clipped port for SuSE Linux Enterprise Server 9 running on the
PowerPC architecture. Note the known bug below.
- The condor_schedd now maintains a birth date for the job queue.
Nothing in Condor currently uses this feature, but future versions of condor_quill may require it.
- There is a new configuration file macro
RANDOM_INTEGER(min,max[,step]). At configuration time, it produces a
pseudo-random integer in the range
min to max,
inclusive.
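The behavior of RANDOM_INTEGER(min,max[,step]) can be approximated in Python as follows. This is a sketch, not Condor's code: the notes above state only that the result lies between min and max inclusive, so treating step as a stride that starts at min and defaults to 1 is an assumption, as is the function name random_integer.

```python
import random


def random_integer(minimum, maximum, step=1):
    """Pick a pseudo-random integer from minimum, minimum+step, ... <= maximum.

    Assumption: step is a stride starting at minimum and defaults to 1.
    """
    if step < 1 or maximum < minimum:
        raise ValueError("need step >= 1 and minimum <= maximum")
    # randrange excludes its stop value, so add 1 to make maximum inclusive.
    return random.randrange(minimum, maximum + 1, step)


# With these arguments the only possible results are 10, 15, and 20.
print(random_integer(10, 20, 5))
```

Because the value is chosen once, at configuration time, each daemon that reads the configuration may end up with a different value, which is useful for spreading out periodic activity across machines.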
Bugs Fixed:
- Fixed a deadlock situation between the condor_schedd and
the condor_startd that can
significantly impact the condor_schedd's performance. The likelihood of the
deadlock increased based upon the number of VMs advertised by the
condor_startd.
- Fixed a bug reading the user job log on Windows that caused
occasional DAGMan confusion.
Thanks to Fairview Software, Inc. for
both finding the bug and writing a patch.
- Fixed a denial of service problem: Condor daemons no longer freeze
for 20 seconds when a client connects to them and then sends no data.
This behavior is common with port scanners.
- Fixed a race condition with condor_quill caused by
PostgreSQL's default transaction isolation level being ``read
committed''.
This bug would cause truncated condor_q reads when using Quill.
- Fixed a bug where the condor_ckpt_server would segfault when
turned off with condor_off -fast.
- Fixed a bug in the condor_startd where it could die with
SIGABRT when a condor_starter exited under certain rare
circumstances.
The bug seems to have been most likely to appear on x86_64 Linux
machines, but could potentially affect all platforms.
- Fixed a problem with condor_history when running with Quill enabled,
which caused it to allocate an unbounded amount of memory.
- Fixed a problem with condor_q when running with Quill, which caused
it to silently truncate the printing of the job queue.
- Fixed a bug in the condor_gridmanager that caused the following
configuration file parameters to be ignored for grid type condor and
nordugrid jobs: GRIDMANAGER_RESOURCE_PROBE_INTERVAL,
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE, and
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE.
- Fixed a bug in condor_run that caused it to abort on non-fatal
warnings from condor_submit and print incorrect error messages.
- Fixed a bug in the condor_gridmanager dealing with grid type gt4
grid universe jobs. If the job's standard output or error was not specified
in the job ClassAd, the condor_gridmanager would create an improper GRAM
RSL string, causing the job to fail.
- Fixed a bug in the condor_gridmanager that could cause it to
delegate the wrong credential when refreshing the credentials for a
grid type gt4 grid universe job.
- The condor_gridmanager could get into a state where it would no
longer start up Globus jobmanagers for grid type gt2 grid universe jobs,
if previous requests failed due to connection errors. This bug has been
fixed.
- The condor_c-gahp now properly exits when the pipe to its parent
goes away. Before, it would fill its log with large amounts of useless
messages, before exiting several minutes later.
- Fixed a bug where, given a problem opening standard input, output, or error,
the standard universe might generate an incorrect warning in the
condor_shadow's log.
- The condor_gridmanager now recovers properly when a proxy refresh
fails for a gt2 grid universe job in the stage-out state. Before, the job
would become held with a hold reason of ``Globus error 3: an I/O operation
failed''.
- A number of fixes to minor typos and incorrect formatting in
Condor's log files.
- When REQUEST_CLAIM_TIMEOUT was reached and the
condor_schedd
failed to contact the condor_startd to release the claim, the
condor_schedd would
periodically try releasing the claim indefinitely, possibly resulting in
a lengthy communication delay each time.
- Under Windows, Condor daemons such as the condor_schedd were sometimes
limiting their use of pending connect operations more than they should
have. This would result in the message, ``file descriptor safety level
exceeded''.
- condor_fetchlog no longer allows or documents the -dagman option.
The option's appearance was an error. The option never worked.
- The condor_schedd ensures that the initial job queue log file
contains a sequence number for use by Quill. This fixes a case in
which no sequence number was inserted, because the initial rotation of
this (empty) file failed. Quill also now reports exactly what the
problem is if it reads a job queue log in this state, rather than
simply crashing. This problem has so far only been observed under
Windows.
- Fixed a problem on Windows where, when submitting a job with a
sandbox (for example, using the -s or -r option to
condor_submit), an erroneous file permissions check in the
condor_schedd would result in a failed submission.
- The condor_startd would crash shortly after start up if the
RANK expression contained any use of the unary minus
operator. This patch should also fix any other cases where Condor
daemons crashed due to the use of the unary minus operator in ClassAd
expressions.
- Stork now writes a terminated event to the user log when it removes
a transfer job from its queue because of failures to invoke a transfer
module. Without this event, DAGMan would not notice that these jobs had
left the queue.
- Fixed a problem where the condor_schedd on Windows would
incorrectly reject a job if the client provided an Owner
attribute that was correct but differed in case from the authenticated
name. This bug was thought to have been fixed in Condor 6.8.0.
- Fixed problems with condor_store_cred behaving strangely when
storing or removing a user name that is some initial substring of
``condor_pool''. Specifying such a user name would be incorrectly
interpreted as equivalent to specifying the -c option.
- Fixed a problem with condor_glidein spewing lots of text to
the screen when checking the status of a job it submitted.
- A new version of the GT4 GAHP is included, with the following changes:
- A new axis.jar from Globus fixes a thread safety bug that
can cause lockups in subscriptions for WS notifications. See Globus
Bugzilla 4858
(http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=4858).
- Fixed bugs that caused memory related to destroyed jobs to not
be reclaimed in both the client and the server.
- Removed redundant usage of Secure Message, Secure Conversation,
and Transport Security when talking to a WS GRAM service. Now, only
Transport Security is used.
- Fixed memory leaks in condor_quill.
- Fixed a bug that might have caused condor_startd problems
launching the condor_starter for the standard universe on 64-bit systems.
- Improved Condor's file transfer. If you request that Condor
automatically transfer back your output, it now detects changes better.
Previously, it would only transfer back files that had a more recent timestamp
than the spool date. Now, it will transfer back any file that has changed
in date (including being dated in the past) or changed in size.
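The change to output file transfer described in the last item can be summarized by comparing the old and new decision rules. The Python sketch below is illustrative only: the FileState type and both function names are hypothetical, and it assumes that Condor records each file's modification time and size when the job starts, which the notes above do not spell out.

```python
from typing import NamedTuple


class FileState(NamedTuple):
    mtime: float  # modification time, seconds since the epoch
    size: int     # size in bytes


def old_rule(current: FileState, spool_time: float) -> bool:
    """Previous behavior: transfer only files newer than the spool date."""
    return current.mtime > spool_time


def new_rule(current: FileState, original: FileState) -> bool:
    """New behavior: transfer any file whose date or size has changed."""
    return current.mtime != original.mtime or current.size != original.size


# A file rewritten by the job but stamped with an older date.
original = FileState(mtime=1000.0, size=10)
rewritten = FileState(mtime=900.0, size=10)
print(old_rule(rewritten, spool_time=1000.0))  # False: missed by the old rule
print(new_rule(rewritten, original))           # True: caught by the new rule
```

The example shows the case called out in the notes: a file dated in the past is skipped by the timestamp-only comparison but detected once any change in date or size counts.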
Known Bugs:
Version 6.8.2
Release Notes:
- Condor now uses Globus 4.0.3 for GSI, GRAM, and GridFTP support.
This includes a patch for the OpenSSL vulnerability detailed in
CVE-2006-4339 and http://www.openssl.org/news/secadv_20060905.txt.
It also includes fixes for Globus Bugzilla 4689
(http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=4689) and a
bug that can cause duplicate UUIDs to be generated for WS GRAM jobs.
- The condor_schedd daemon no longer forks separate processes to
change ownership of job directories in the spool.
Previously on Unix-like systems, this would create a
new process before a job started running and after it finished running. Some
sites with very busy condor_schedd daemons were encountering scaling problems.
New Features:
- Because, by default, the condor_startd daemon references the job
ClassAd attribute NumCkpts, Condor's default configuration
will now round up the value of NumCkpts, in order to improve
matchmaking performance. See the entry on SCHEDD_ROUND_ATTR
in section 3.3.11.
- Enhanced the RHEL3 x86_64 port of Condor to include the standard
universe.
- condor_submit_dag -f no longer deletes the
dagman.out file. condor_submit_dag without the -f
option will now submit a DAGMan run even if the dagman.out
file exists. In this case, the file will be appended to.
- Added a property to the Windows installer program to determine
whether the Condor service will be started after installation. The
property name is STARTSERVICE, and the default value is ``Y''.
Bugs Fixed:
- A bug caused the condor_master daemon to kill
only immediate children within the process tree,
upon an abnormal exit of the condor_master daemon.
The condor_master daemon now kills all descendant processes.
- Fixed a bug where if the file system was full, the debugging log
files (for example SchedLog) would silently lose messages. Now,
if the disk is full, the Condor daemons will
exit.
- Fixed a bug in the condor_schedd daemon that caused it to stop
negotiating for grid universe jobs in the case that it decided
it could not spawn any new condor_shadow processes.
- Added the ProcessId class (which more uniquely identifies a
process than a PID does) to the condor_dagman abort duplicate
runs feature. This makes it less likely that a given instance of
condor_dagman will mistakenly conclude that another instance of
condor_dagman is already running on the same DAG. Also fixed an
unrelated bug in the abort duplicate runs feature that could cause
a condor_dagman to not abort itself when it should.
- Condor daemons leaked memory (consuming more and more memory over time)
when parsing ClassAds that use functions with arguments.
- Fixed a bug in the condor_starter daemon,
which caused it to look in the
wrong place for the job's executable, if TransferExecutable was set
to True in the job ClassAd.
- condor_history no longer crashes if HISTORY is not defined
in the Condor configuration file.
- Fixed an unintentional change to the value of -Condorlog
in a condor_dagman submit description file: it is once again the log file of
the first node job.
- Fixed a bug in condor_q that would cause condor_q -hold or
condor_q -run to exit with an error on some platforms.
- Fixed a bug on Unix platforms, in which a misconfiguration of
MAIL would cause the condor_master daemon to restart
all of its child
daemons whenever it tried (and failed) to send e-mail to the
administrator.
- Network related error messages have been improved to make debugging
easier. For example, when timing out on a read or write operation, the
peer's address is now included in the error message.
- An invalid value for UPDATE_INTERVAL now causes
the condor_startd daemon to abort. Previously, it would continue running,
but some invalid values (for example, 0) could cause it to stop sending
periodic ClassAd updates to the condor_collector, even after being
reconfigured with a valid value. Only a complete restart of
the condor_startd daemon was sufficient to get it out of this state.
- Fixed a bug that caused X.509 limited proxies to be delegated as
impersonation (i.e. non-limited) proxies. Any authentication attempted
with the resulting proxies would fail.
- Fixed a couple of bugs that would cause Condor to lose track of
some Condor-related processes and subsequently fail to clean up (kill)
these processes.
- Fixed a bug that would cause condor_history to crash when
dealing with rotated history files. Note that history file rotation is
turned on by default. (See
Section 3.3.3 for descriptions of
ENABLE_HISTORY_ROTATION and
MAX_HISTORY_ROTATIONS.)
Known Bugs:
Version 6.8.1
Release Notes:
New Features:
- Added an optional argument to the condor_dagman ABORT-DAG-ON
command that allows the DAGMan exit code to be specified separately
from the node value that causes the abort; also, a DAG can now be
aborted on a zero exit code from a node.
- Added the ALLOW_FORCE_RM configuration variable.
If this expression evaluates to True,
then a condor_rm -f attempt is allowed. If it evaluates to False,
the attempt is disallowed.
The expression is evaluated in the context of the job ClassAd.
If not defined, the value defaults to True, matching the behavior of
previous Condor releases.
- condor_dagman will now reject DAGs for which any of the nodes'
user job log files are on NFS (because of the unreliability of NFS
file locking, this can cause DAGs to fail). This feature can be
turned off by setting the DAGMAN_LOG_ON_NFS_IS_ERROR
configuration macro to False (the default is True).
- condor_submit can now be configured to reject jobs for which
the log file is on NFS.
To do this, set the LOG_ON_NFS_IS_ERROR
configuration macro to True.
The default is that condor_submit will issue a warning
for a log file on NFS.
- Added the DAGMAN_ABORT_DUPLICATES configuration macro,
which causes
condor_dagman to attempt to detect at startup whether another
condor_dagman is already running on the same DAG; if so, the second
condor_dagman will abort itself.
- The new configuration variable
NETWORK_MAX_PENDING_CONNECTS may be used to limit the
maximum number of simultaneous network connection attempts. This is
primarily relevant to the condor_schedd daemon, which may try to connect to
large numbers of condor_startd daemons when claiming them.
The condor_negotiator may also
connect to large numbers of condor_startd daemons when initiating
security sessions
used for sending MATCH messages. On Unix, the default is to allow up to
eighty percent of the process file descriptor limit. On Windows, the
default is 1600.
- Added some more debug output to condor_dagman to clarify
fatal errors.
- The -format argument to condor_q and condor_status can now take an expression in addition to a simple attribute name.
- DRMAA is now available on most Linux platforms, Windows and PPC MacOS.
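The default for NETWORK_MAX_PENDING_CONNECTS described in the list above can be sketched as follows. This is a minimal illustration, not Condor's implementation: the function name default_max_pending_connects is hypothetical, and rounding the eighty percent figure down to a whole number is an assumption.

```python
def default_max_pending_connects(fd_limit, on_windows=False):
    """Hypothetical helper: default cap on simultaneous connection attempts.

    fd_limit is the per-process file descriptor limit (Unix only).
    """
    if on_windows:
        # The release notes give a fixed default of 1600 on Windows.
        return 1600
    # Eighty percent of the descriptor limit, using integer arithmetic.
    # Rounding down is an assumption made for this sketch.
    return fd_limit * 8 // 10


print(default_max_pending_connects(1024))                   # 819
print(default_max_pending_connects(1024, on_windows=True))  # 1600
```

With a descriptor limit near 256, as mentioned for Solaris elsewhere in these notes, the same calculation gives 204 pending connection attempts.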
Bugs Fixed:
- Fixed a bug that occurred when a large number of jobs (roughly 200
or more) were running from a single condor_schedd daemon, and those
jobs were using job leases (the default in 6.8). It was
possible for the condor_schedd daemon to enter a state
where it crashed on startup until all of
the job leases expired.
- Condor jobs submitted with the NiceUser priority were
not being matched if the NEGOTIATOR_MATCHLIST_CACHING
setting was True (the default).
- Fixed a Quill bug that prevented it from running on Windows. The
symptom was errors in the QuillLog such as
POLLING RESULT: ERROR
- Fixed a bug in Quill where it would cause errors such as
duplicate key violates unique constraint "history_vertical_pkey"
in the QuillLog and the PostgreSQL log file. These errors
triggered
a significant slowdown in the performance of Quill and the database. This
would only happen when a job attribute changed type from a string
type to a numeric type, or vice versa.
- In those unusual cases where Condor is unable to create a new process,
it shuts down cleanly, eliminating a small possibility of data corruption.
- Fixed a bug with the gt4 and nordugrid grid universe jobs that
caused the stdout and stderr of a job to not be
transferred correctly, if the given file names had absolute paths.
- condor_dagman now echoes warnings from condor_submit and
stork_submit to the dagman.out file.
- Fixed a bug introduced in 6.7.20, causing the condor_ckpt_server
to exit immediately after starting up, unless Condor's security
negotiation was disabled.
- MAX_<SUBSYS>_LOG now defaults to one Megabyte, even if the
setting is missing from the configuration. Previously, the default was 64 Kilobytes.
- Fixed a bug related to non-blocking connect that could occasionally
cause Condor daemons to crash.
- Fixed a rare bug where an exceptionally large query to the
condor_collector could cause it to crash. The most common cause was a single
condor_schedd daemon restarting,
and trying to recover a large number of job leases at once.
More than approximately 250 running jobs on a single condor_schedd daemon
would be necessary to trigger this bug.
- When using the JOB_PROXY_OVERRIDE_FILE configuration
parameter, the X.509 proxy will now be properly forwarded for Condor-C jobs.
- Greatly reduced the chance that a Condor-C job in the REMOVED state
will be HELD due to an expired proxy or failure to talk to the remote
condor_schedd.
- Fixed error and debug messages added in Condor version 6.7.20 that
incorrectly reported IP and port numbers. These messages were
intended to report the peer's address, but they were instead reporting the
local address of the network socket.
- Fixed a bug introduced in Condor version 6.7.20
which could cause Condor daemons to
die with the message
PANIC -- OUT OF FILE DESCRIPTORS
The conditions
causing this related to failed attempts to send updated status
to the condor_collector daemon,
with both non-blocking updates and security negotiation
enabled (the defaults).
- Also fixed a bug in the negotiator with the same effect as
above, except it only happened with the configuration setting
NEGOTIATOR_USE_NONBLOCKING_STARTD_CONTACT=False.
- Fixed a bug in condor_schedd under Solaris that could also
cause file descriptors to become exhausted over time when many
machines (e.g. over 100) were claimed in a short span of time and the
condor_schedd process file descriptor limit was near 256.
- Fixed a bug in condor_schedd under Windows that could cause
network sockets to be allocated and never released back to the system.
The circumstances that could cause this were very rare. The error
message in the logs indicating that this problem was happening is
ERROR: DuplicateHandle() failed in Sock::set_inheritable
In cases where this error message is displayed, the network socket
is closed.
- Under some conditions, when making TCP connections, Condor was
still trying to connect for the full duration of the operation timeout
(often 10 or 20 seconds), even if the connection attempt was refused
(for example, because the port being accessed is not accepting connections).
Now, the connect operation finishes immediately after the first such
failure, allowing the Condor process to continue with other tasks.
- Fixed problems relating to the credential cache in the Kerberos
authentication mechanism. The current version of Kerberos is 1.4.3.
- Fixed bugs in the SSL authentication mechanism that caused the
condor_schedd to crash when submitting a job (on Unix) and caused
all tools and daemons to crash on Windows when using SSL.
- Some of the binaries required to use Condor-C on Windows were
mistakenly not included in previous releases of Condor. This has been
fixed.
- Fixed a problem on Windows where the condor_startd could fail to
include some attributes in its ClassAd. This would result in some jobs
incorrectly not being matched to that machine. This only happened if
CREDD_HOST was defined and Condor daemons on the execute
machine were unable to authenticate with the condor_credd.
- Fixed a condor_dagman bug which had prevented the
$(DAGManJobId) attribute from being expanded in job submit files
(for example,
when used as the value to define the Priority command).
- Fixed a bug in condor_submit that caused parallel universe jobs
submitted via Condor-C to become mpi universe jobs.
- Fixed a bug which could cause Condor daemons to hang if they try
to write to the standard error stream (stderr) on some platforms. In
general, this should never happen, but can, due to third party
libraries (beyond our control) trying to write error or other messages.
- Fixed condor_status to report error messages.
- Fixed a bug in which setting the configuration variable
NEGOTIATOR_CONSIDER_PREEMPTION = False
caused an incorrect calculation.
The fraction of the pool already being claimed by a user was
calculated using the wrong total number of condor_startd daemons.
This could cause some condor_startd daemons to remain unclaimed,
even when there were jobs available to run on them.
- Fixed a security vulnerability in Condor's FS and FS_REMOTE
authentication methods. The vulnerability allowed an attacker to impersonate
another user on the system, potentially allowing submission of jobs as a
different user. This may allow escalation to root privilege if the Condor
binaries and configuration files have improper permissions. The fix is not
backwards compatible, which means all daemons and tools using FS authentication
must be running Condor 6.8.1 or greater. The same applies to FS_REMOTE: all
daemons and tools using FS_REMOTE must be using Condor 6.8.1 or greater. In
practice, this means that for FS, all Condor binaries on one host must be
version 6.8.1 or greater, but versions can be different from host to host. For
FS_REMOTE it means all binaries across all hosts must be 6.8.1 or greater.
- Fixed a couple of race conditions in stork and the credd where credential
files were possibly created with improper permissions before being set to owner
permissions.
- Fixed a bug in the condor_gridmanager that caused it to delegate
12-hour proxies for grid-type gt4 jobs and then not refresh them.
- Fixed a bug in the condor_gridmanager that caused a directory
needed for staging-in of grid-type gt4 job files to be removed when
the condor_gridmanager exited, causing the stage-in to fail.
- Fixed a bug that caused the checkpoint server to restart
because of (ostensibly) getting an unexpected errno from select().
- Fixed a bug on Windows where setting output or
error to a relative or absolute path (as opposed to a
simple file name without path information) would not work properly.
- History file rotation did not previously work on Windows because
the name of a rotated file would contain an ISO 8601 extended format
timestamp, which contains colon characters. The naming convention for
rotated files has been modified to use ISO 8601 basic format, avoiding
this problem.
- The CLAIMTOBE authentication method (which is inherently
insecure and should only be used for testing or other special
circumstances) previously would authenticate without providing the
"domain" portion of the user name. As an example, a user would be
authenticated as simply "user" rather than
"user@cs.wisc.edu". This problem has been fixed, but the new
protocol is not backwards compatible so the fix is turned off by
default. Correct behavior can be enabled by setting the
SEC_CLAIMTOBE_INCLUDE_DOMAIN parameter to True.
- Fixed a bug with the NEGOTIATOR_MATCHLIST_CACHING that
would cause very low-priority jobs (like jobs submitted with
nice_user=True) to not match even if resources were available.
- Fixed a buffer overflow that could crash the condor_negotiator.
- SCHEDD_ROUND_ATTR_<xxxx> preserves the value being
rounded up when it is a multiple of the power of 10 specified for
rounding. Previously, the value would be incremented; now it remains
the same. For example, if SCHEDD_ROUND_ATTR_<xxxx>=2 and the value
being rounded up is 100, it now remains 100, rather than being
incremented to 200.
- Fixed condor_updates_stats to report its version number
correctly.
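The SCHEDD_ROUND_ATTR_<xxxx> change in the list above can be illustrated with a short sketch. The function names are hypothetical, and the "previous" variant only reproduces the behavior the release notes describe; it is an assumption about how the old result arose, not Condor's actual code.

```python
def round_up_previous(value, digits):
    """Hypothetical sketch of the old behavior: always move to the next multiple."""
    unit = 10 ** digits
    return (value // unit + 1) * unit


def round_up_fixed(value, digits):
    """Hypothetical sketch of the fixed behavior: exact multiples are preserved."""
    unit = 10 ** digits
    # Ceiling division: -(-a // b) rounds a / b up to the next integer.
    return -(-value // unit) * unit


# SCHEDD_ROUND_ATTR_<xxxx>=2 means rounding up to a multiple of 10**2.
print(round_up_previous(100, 2))  # 200
print(round_up_fixed(100, 2))     # 100
print(round_up_fixed(101, 2))     # 200
```

The two variants differ only when the value is already a multiple of the chosen power of 10.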
Known Bugs:
- The -completedsince option to condor_history works
when Quill is enabled. The behavior of condor_history
-completedsince is undefined when Quill is not
enabled.
Version 6.8.0
Release Notes:
- The default configuration for Condor now requires that
HOSTALLOW_WRITE be explicitly set. Condor will refuse
to start if the default configuration is used unmodified.
Existing installations should not need to change anything. Those
who desire the earlier default can set it to "*", but
note that this is potentially a security hole allowing anyone to
submit jobs or machines to your pool.
- Most Linux distributions are now supported using dynamically
linked binaries built on a RedHat Enterprise Linux 3 machine.
Recent security patches to a number of Linux distributions have
rendered the binaries built on RedHat 9 machines ineffective.
The download pages have been changed to reflect this, but Linux users
should be aware of this change.
The recommended download for most x86 Linux users is now:
condor-6.8.0-linux-x86-rhel3-dynamic.tar.gz.
- Some log messages have been clarified or moved to different
debugging levels.
For example, certain messages that looked like errors were printed
to D_ALWAYS, even though nothing was wrong and the system was
behaving as expected.
- The new features and bugs fixed in the rest of this section only
refer to changes made since the 6.7.20 release, not the last stable
release (6.6.11).
For a complete list of changes since 6.6.11, read the 6.7 version
history.
New Features:
- Version 1.4 of the Condor DRMAA libraries is now included
with the Condor release.
For more information about DRMAA, see section 4.5.2.
- Version 1.0.15 of the Condor GAHP is now used for Condor-G and
Condor-C.
- Added the -outfile_dir command-line argument to
condor_submit_dag. This allows you to change the directory in which
condor_dagman writes the dagman.out file.
- Added a new -summary (also -s) option to the
condor_updates_stats tool. If enabled, this prevents it from
displaying the entire history for each machine and only displays the
summary information.
Bugs Fixed:
- Fixed a number of potential static buffer overflows in various
Condor daemons and libraries.
- Fixed some small memory leaks in the condor_startd and
condor_schedd, and a potential leak that affected all Condor
daemons.
- Fixed a bug in Quill which caused it to crash when certain
long attributes appeared in a job ad.
- The startd would crash after a reconfig if the address of a
collector had not been resolved since the previous reconfig
(e.g. because DNS was down during that time).
- Once a Condor daemon failed to look up the IP address of the
collector (e.g. because DNS was down), it would fail to contact the
collector from that time until the next reconfig. Now, each time Condor
tries to contact the collector, it generates a fresh DNS query if the
previous attempt failed.
- When using Condor-C or the -s or -r command-line options to
condor_submit, the job's standard output and error would be placed
in the job's initial working directory, even if the job ad said to
place them in a different directory.
- Greatly sped up the parsing of large DAGs (by a factor of 50
or so) by using a hash table instead of linear search to find DAG nodes.
- Fixed a bug in condor_dagman that caused an EXECUTABLE_ERROR
event from a node job to abort the DAG instead of just marking the
relevant node as failed.
- Fixed a bug in condor_collector that caused it to discard
machine ads that don't have an IP address field (either StartdIpAddr
or STARTD_IP_ADDR). The condor_startd will always produce a
StartdIpAddr field, but machine ads published through
condor_advertise may not.
- When using BIND_ALL_INTERFACES on a dual-homed
machine, a bug introduced in 6.7.18 was causing Condor daemons to
sometimes incorrectly report their IP addresses, which could cause
jobs to fail to start running.
- Made the event checking in condor_dagman less strict:
added the new "allow duplicate events" value to the
DAGMAN_ALLOW_EVENTS macro (this value is part of the
default); the value 16 now also allows a terminate event before a submit event;
changed "allow all events" to "allow almost all events"
(all except "run after terminal event"), so it is more useful.
- condor_dagman and condor_submit_dag now report
-NoEventChecks as ignored rather than deprecated.
- Fixed a bug in the condor_dagman -maxidle feature:
a shadow exception event now puts the corresponding job into the
idle state in condor_dagman's internal count.
- Fixed a problem on Windows where daemons would sometimes crash
when dealing with UNC path names.
- Fixed a problem where the condor_schedd on Windows would
incorrectly reject a job if the client provided an Owner
attribute that was correct but differed in case from the authenticated
name.
- Fixed a condor_startd crash introduced in version 6.7.20. This
crash would appear if an execute machine was matched for preemption
but then not claimed in time by the appropriate condor_schedd.
- Resolved an issue where the condor_startd was unable to clean
up jobs' execute directories on Windows when the condor_master was
started from the command line rather than as a service.
- Added more patches to Condor's DRMAA interface to make it more
compatible with Sun Grid Engine's DRMAA interface.
- Removed the unused D_UPDOWN debug level and added the
D_CONFIG debug level.
- Fixed a bug that caused condor_q with the -l or -xml
arguments to print out duplicate attributes when using Quill.
- Fixed a bug that prevented Condor-C jobs (universe grid jobs of type condor)
from submitting correctly if QUEUE_ALL_USERS_TRUSTED is set to
True.
- Fixed a bug that could cause the condor_negotiator to crash if the
pool contains several different versions of the condor_schedd and in the
config file NEGOTIATOR_MATCHLIST_CACHING is set to True.
- Changed the default value for config file entry
NEGOTIATOR_MATCHLIST_CACHING from False to True. When set to
True, this will instruct the negotiator to safely cache data in order to
improve matchmaking performance.
- The condor_master now recognizes condor_quill as a valid
Condor daemon without any manual configuration on the part of site
administrators.
This simplifies the configuration changes required to enable Quill.
- Fixed a rare bug in the condor_starter where if there was a
failure transferring job output files back to the submitting host,
it could hang indefinitely, and the job appeared as if it was
continuing to run.
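The DAG parsing speedup in the list above comes from replacing a linear search with a hash table when finding DAG nodes by name. A minimal sketch of the two lookup strategies follows; the node representation and function names are hypothetical, and this is not condor_dagman's actual code.

```python
def find_node_linear(nodes, name):
    """Linear search: cost grows with the number of nodes for every lookup."""
    for node in nodes:
        if node["name"] == name:
            return node
    return None


def build_node_index(nodes):
    """Hash table keyed by node name: each lookup takes roughly constant time."""
    return {node["name"]: node for node in nodes}


# Hypothetical DAG with many nodes, each described by a small dictionary.
nodes = [{"name": f"Node{i}", "submit_file": f"job{i}.sub"} for i in range(10000)]
index = build_node_index(nodes)

# Both strategies return the same node object.
assert index.get("Node9999") is find_node_linear(nodes, "Node9999")
```

When a parser resolves a parent or child name for every dependency in a large DAG, the total work with the linear search grows roughly with the square of the DAG size, while the indexed version grows roughly linearly.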
Known Bugs:
- The -completedsince option to condor_history works
when Quill is enabled. The behavior of condor_history
-completedsince is undefined when Quill is not
enabled.