9.5 Development Release Series 7.5
This is the development release series of Condor.
The details of each version are described below.
Version 7.5.6
Release Notes:
- Condor version 7.5.6 released on March 21, 2011.
- What used to be known as the condor_startd and condor_schedd cron
mechanisms are now collectively called Daemon ClassAd Hooks.
The significant changes in this Condor version 7.5.6 release are
given in the New Features section.
- In the release directory, the subdirectory lib/glite/ has
been moved to libexec/glite/.
- This development series of Condor is no longer officially released
for the platforms PowerPC AIX, PowerPC-64 SLES 9, PowerPC MacOS 10.4,
Solaris 5.9 on all architectures,
Solaris 5.10 on all architectures,
Itanium IA64 RHEL 3, PS3 (PowerPC) YDL 5.0, and x86 Debian 4.
- Support for GCB has been removed.
- The default Unix Sys-V init script has been completely reworked.
In addition to new features, this changes the following:
- The default location of the Condor configuration file is now
/etc/condor/condor_config. This location can be changed by
editing the sysconfig file or the init script itself.
- The default location of the Condor installation is now
/usr/, with binaries in /usr/bin and /usr/sbin.
These locations can also be changed by editing the sysconfig file
or the init script itself.
New Features:
- Condor no longer relies on DNS to determine its IP address.
Instead, it examines the list of system network devices.
- condor_dagman now gives a warning if a node category has no
nodes assigned to it or no throttle set.
- condor_dagman now has a $MAX_RETRIES macro for PRE and
POST script arguments.
Also, condor_dagman now prints a warning if an unrecognized macro is
used for a PRE or POST script argument.
- The condor_schedd is now more efficient in handling the exit of
condor_shadow processes, when there are large numbers of
condor_shadow processes.
- Condor's Chirp protocol has been updated with new commands.
The Chirp C++ client
and condor_chirp command are updated to use the new commands.
See section 10 for details on the new commands.
- The Daemon ClassAd Hooks mechanism is described in
section 4.4.3,
with configuration variables defined in section 3.3.36.
The mechanism has the following new features:
- The condor_startd's benchmarks are no longer hard coded into
the condor_startd. Instead, the benchmarks are now implemented
via the Daemon ClassAd Hooks mechanism. Two new programs are
shipped with Condor version 7.5.6:
condor_mips and condor_kflops.
These programs are in the libexec directory.
They implement the original mips and kflops benchmarks for this
new implementation.
Additional benchmarks can now easily be implemented;
the list of benchmarks is controlled
via the new BENCHMARKS_JOBLIST configuration variable.
- Several fixes to the mips and kflops benchmarks should
increase the reproducibility of their results.
- Two new job types have been implemented in the Daemon ClassAd
Hooks mechanism. They are called OneShot and OnDemand.
Currently, OnDemand is used only by the new BENCHMARKS
mechanism.
- condor_dagman now prints out all boolean configuration
variable values as True or False,
instead of 1 or 0 within the dagman.out file.
- Because of the new DAGMAN_VERBOSITY configuration setting
(see section 3.3.25),
the -debug flag is no longer propagated from a top-level DAG to a
sub-DAG; furthermore, -debug is no longer set in a
.condor.sub file unless it is set on the condor_submit_dag
command line.
- When job ClassAd attributes are modified via condor_qedit,
the changes are now propagated to the condor_shadow and condor_gridmanager.
This allows a user's changes to the job ClassAd to affect the job policy
expressions while the job is managed by these daemons.
- Several improvements for CREAM grid jobs:
- CREAM commands are retried if the server closes the connection
prematurely.
- All jobs going to a CREAM server share the same lease handle.
- Multiple CREAM status requests for single jobs are now batched
into a single command to the server.
- When there are too many commands to be issued to a CREAM server
simultaneously, new job submissions have lower priority than commands
operating on existing jobs.
- The new script condor_gather_info, located in bin/,
creates reports with
information from a Condor pool about a specific job ID.
It also collects general information about the pool in which it runs.
- Added support for hierarchical accounting groups and group quotas.
- condor_q -better-analyze now identifies jobs that have not yet been
considered by matchmaking, instead of characterizing them as not
matching for unknown reasons.
- The default Unix Sys-V init script has been completely reworked.
The new version should now work on all Unix and Linux systems.
Major features and changes in the new script:
- Supports the use of a Linux-style sysconfig file
- Supports the use of a Linux-style PID file
- Supports the following commands:
- start
- stop
- restart
- try-restart
- reload
- force-reload
- status
- The default location of the Condor configuration file is now
/etc/condor/condor_config. This location can be changed by
editing the sysconfig file or the init script itself.
- The default location of the Condor installation is now
/usr/, with binaries in /usr/bin and /usr/sbin.
These locations can be changed by editing the sysconfig file
or the init script itself.
Configuration Variable and ClassAd Attribute Additions and Changes:
- The default value of configuration variable GLITE_LOCATION
has changed to $(LIBEXEC)/glite.
This reflects the change made in the
layout of the Condor release files.
- Values for configuration variables NETWORK_INTERFACE and
PRIVATE_NETWORK_INTERFACE may now specify a network
device name or an IP address. The asterisk character (*)
may be used as a wild card in either a name or IP address.
This makes it easier to apply the same
configuration to a large number of machines, because the IP address
does not have to be customized for each host.
- The new configuration variable
DELEGATE_JOB_GSI_CREDENTIALS_LIFETIME specifies the
maximum number of seconds for which delegated job proxies should be
valid. The default is one day. A value of 0 indicates that the
delegated proxy should be valid for as long as allowed by the
credential used to create the proxy; this was the behavior in
previous releases of Condor. This configuration variable currently
only applies to proxies delegated for non-grid jobs and Condor-C
jobs. It does not currently apply to globus grid jobs. The job may
override this configuration variable by using the
delegate_job_GSI_credentials_lifetime submit description file
command.
- The new configuration variable
DELEGATE_JOB_GSI_CREDENTIALS_REFRESH specifies a
floating point number between 0 and 1 that indicates when
delegated credentials with limited lifetime should be renewed, as
a fraction of the delegated lifetime. The default is 0.25.
- The new configuration variable
SHADOW_CHECKPROXY_INTERVAL specifies the number of
seconds between tests to see if the job proxy has been updated or
should be refreshed. The default is 600 (10 minutes). Previously,
the condor_shadow checked for proxy updates once per minute.
- Daemon ClassAd Hooks no longer support what was identified as
the old syntax.
Due to this, variables
STARTD_CRON_JOBS and HAWKEYE_JOBS no longer exist.
In previous versions of Condor, the condor_startd would issue a
warning if this syntax was found, but, starting with 7.5.6, any use
of these macros will be ignored.
- New configuration variables DAGMAN_VERBOSITY,
DAGMAN_MAX_PRE_SCRIPTS, DAGMAN_MAX_POST_SCRIPTS,
and DAGMAN_ALLOW_LOG_ERROR
are defined in section 3.3.25.
- The new configuration variable
STARTD_PUBLISH_WINREG can contain a list of Windows
registry key names,
whose values will be published in the condor_startd daemon's ClassAd.
- The new configuration variable
CONDOR_VIEW_CLASSAD_TYPES is a string list that specifies
the types of the ClassAds that will be forwarded to
the location defined by CONDOR_VIEW_HOST.
See the definition at section 3.3.16.
- Added a -local-name command line option to
condor_config_val to inspect the values of attributes that use
local names.
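The wild card matching described above for NETWORK_INTERFACE and PRIVATE_NETWORK_INTERFACE can be pictured with a small sketch. This is only an illustration of the idea, not Condor's implementation: the function name pick_interface, the sample device list, and the use of Python's fnmatch module are all assumptions made for the example. Note that fnmatch also treats ? and [ ] as special characters, which the text above does not claim for Condor.

```python
import fnmatch

def pick_interface(pattern, devices):
    """Return the first (name, ip) pair whose device name or IP address
    matches the pattern, where * acts as a wild card.
    Hypothetical helper for illustration only."""
    for name, ip in devices:
        if fnmatch.fnmatchcase(name, pattern) or fnmatch.fnmatchcase(ip, pattern):
            return name, ip
    return None

# Hypothetical list of system network devices on one host.
devices = [
    ("lo", "127.0.0.1"),
    ("eth0", "192.168.1.17"),
    ("eth1", "10.0.0.5"),
]

# The same pattern can be shared by many hosts on the 192.168.x.x network.
print(pick_interface("192.168.*", devices))
# A device name works as well as an address.
print(pick_interface("eth1", devices))
```

Because the pattern does not contain a host-specific address, the same configuration value can be distributed unchanged to every machine in the pool.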
Bugs Fixed:
Known Bugs:
- If a cycle exists in the set of jobs to be removed defined by
the job ClassAd attribute OtherJobRemoveRequirements,
removing any of the jobs in the set will cause the
condor_schedd to go into an infinite loop.
OtherJobRemoveRequirements is defined elsewhere in this manual.
- In a condor_dagman workflow, if a splice contains nothing but
another splice, parsing the DAG will fail. This can be worked around
by putting any non-splice job, including a DAG-level NOOP job,
into the offending splice.
This bug has apparently existed since the splice
feature was introduced in condor_dagman.
- If an individual Daemon ClassAd Hook manager is not named,
the jobs under it will attempt to use incorrectly named configuration
variables.
For example, the following correct configuration will not work,
because the Daemon ClassAd Hook manager will fail to look up the job's
executable variable, given the error in configuration variable naming:
STARTD_CRON_JOBLIST = TEST
...
STARTD_CRON_TEST_MODE = periodic
STARTD_CRON_TEST_EXECUTABLE = $(LIBEXEC)/test
...
Condor version 7.5.6 and all previous 7.x Condor versions will incorrectly
name the variables from this example STARTD_TEST_MODE and
STARTD_TEST_EXECUTABLE instead.
If, instead, the Daemon ClassAd Hook manager is named
using the no-longer-supported STARTD_CRON_NAME,
the configuration works as expected. For example:
STARTD_CRON_NAME = HAWKEYE
HAWKEYE_JOBLIST = TEST
...
HAWKEYE_TEST_MODE = periodic
HAWKEYE_TEST_EXECUTABLE = $(LIBEXEC)/test
...
This old behavior is, as of Condor version 7.5.6,
documented as unsupported and is going away,
primarily because it is confusing.
But, for this release, it still works.
It is believed that this same behavior exists in all 7.x releases of Condor,
but because the naming feature was used, the incorrect behavior went undetected.
This affects the STARTD_CRON and SCHEDD_CRON
Daemon ClassAd Hook managers, and will be fixed in Condor version 7.6.0.
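The variable naming mismatch in the Daemon ClassAd Hooks known bug above can be made concrete with a short sketch. It only reproduces the names quoted in the text. The two function names are hypothetical, and the rule used in buggy_variable (dropping the _CRON part of an unnamed manager's prefix) is an assumption chosen to match the example, not a description of Condor's code.

```python
def documented_variable(manager, job, attribute):
    """Name an administrator would write, e.g. STARTD_CRON_TEST_MODE."""
    return f"{manager}_{job}_{attribute}"

def buggy_variable(manager, job, attribute):
    """Name that is looked up according to the known bug.
    Assumption for illustration: the _CRON suffix of the prefix is lost."""
    suffix = "_CRON"
    prefix = manager[:-len(suffix)] if manager.endswith(suffix) else manager
    return f"{prefix}_{job}_{attribute}"

# Unnamed manager: the lookup misses the variables that were configured.
config = {
    "STARTD_CRON_TEST_MODE": "periodic",
    "STARTD_CRON_TEST_EXECUTABLE": "$(LIBEXEC)/test",
}
for attribute in ("MODE", "EXECUTABLE"):
    wanted = buggy_variable("STARTD_CRON", "TEST", attribute)
    print(wanted, "found" if wanted in config else "missing")

# Named manager (HAWKEYE): both names agree, so the lookup succeeds.
print(buggy_variable("HAWKEYE", "TEST", "MODE")
      == documented_variable("HAWKEYE", "TEST", "MODE"))
```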
Additions and Changes to the Manual:
Version 7.5.5
Release Notes:
- Condor version 7.5.5 released on January 26, 2011.
- This version of Condor uses a different layout in the spool
directory for storing files belonging to jobs that are in the queue.
Conversion of the spool directory is automatic when upgrading, but
be aware that downgrading to a previous version of Condor
requires extra effort. The procedure for downgrading is either
to drain all jobs with spooled files from the queue, or to manually
convert the spool back to the older format. To manually convert
back to the older format, stop Condor and back up the spool directory
in case of problems. Then move all subdirectories matching the form
$(SPOOL)/<#>/<#>/cluster<#>.proc<#>.subproc<#> into $(SPOOL).
Also do this for any files of the form
$(SPOOL)/<#>/cluster<#>.ickpt.subproc<#>.
Edit $(SPOOL)/job_queue.log with a text editor, and change all
references to the moved items from their old paths to their new paths.
Then, remove $(SPOOL)/spool_version. Finally, start up Condor.
- For those who compile Condor from the source code rather than
using packages of pre-built executables, be aware that in this
release Condor is built using cmake instead of imake.
See the README.building file for the new instructions on how
to build Condor.
- As a reminder, as of Condor version 7.5.1,
the RPMs come with native packaging.
Therefore, items are installed in FHS locations,
such as /usr/bin, /usr/sbin, /etc, and /var/log.
Please see section 3.2.6 for installation documentation.
- Quill is now available only within the source code distribution
of Condor.
It is no longer included in the builds of Condor provided by UW,
but it is available as a feature that can be enabled by those who compile
Condor from the source code.
Find the code within the condor_contrib directory, in the
directories condor_tt and condor_dbmsd.
- The AIX 5.2 packages in this release have been found to be
incompatible with AIX 5.3.
- We are planning to drop support for AIX. Please contact us if
this is a problem for you.
- The directory structure within the Unix tar file package of Condor
has changed. Previously, the tar file contained a top level
directory named condor-<version>. The top level
directory is now the same as the tar file name, but without the
.tar.gz extension.
- On Unix platforms, the following executables used to be located in both
the sbin and bin directories,
but are now only located in the bin
directory: condor, condor_checkpoint, condor_reschedule, and
condor_vacate.
- The size of the Condor installation has increased by as much as
60% compared to Condor version 7.5.4. We hope to eliminate most of this
increase in Condor version 7.5.6.
- Previously, packages containing debug symbols were available for
many Unix platforms. In this release, the debug packages contain
full, unstripped executables instead of just the debug symbols.
- The contents of the Windows zip and MSI packages of Condor have
changed. The lib and libexec folders no longer exist,
and all contents previously within them are now in bin.
condor_setup and condor_set_acls have been moved from the top
level directory into bin.
- The Windows MSI installer for Condor version 7.5.5 requires that
all previous
MSI installations of Condor be uninstalled. Before uninstalling
previous versions, make backup copies of configuration files.
Any settings that
need to be preserved must be reapplied to the configuration of the
new installation.
- The following list itemizes changes included in this Condor version
7.5.5 release that belong to Condor version 7.4.5. That stable series
version had not yet been released when this development version
was released.
- condor_dagman now prints a message in the dagman.out file
whenever it truncates a node job user log file.
condor_dagman now prints additional diagnostic information in the
case of certain log file errors.
- Fixed a bug in which
a network disconnect between the submit machine and execute
machine during the transfer of output files caused the
condor_starter daemon to immediately give up, rather than waiting
for the condor_shadow to reconnect. This problem was introduced
in Condor version 7.4.4.
- Fixed a bug in which
an attempt by condor_ssh_to_job to connect to a job while the
job's input files were being transferred caused the file
transfer to fail, which resulted in the job returning to the idle
state in the queue.
- In privsep mode, the transfer of output failed if a job's execute
directory contained symbolic links to non-existent paths.
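The file-moving step of the manual spool downgrade procedure described earlier in these release notes could be scripted along the following lines. This is a minimal sketch under stated assumptions, not a supported tool: the function name flatten_spool is hypothetical, each <#> in the path forms is assumed to be a run of decimal digits, and the spool directory is passed in explicitly. The sketch does not stop Condor, make the backup, edit job_queue.log, or remove spool_version; those steps remain manual.

```python
import re
import shutil
from pathlib import Path

# Assumption: each <#> in the documented path forms is a decimal number.
JOB_DIR = re.compile(r"^cluster\d+\.proc\d+\.subproc\d+$")
ICKPT_FILE = re.compile(r"^cluster\d+\.ickpt\.subproc\d+$")

def flatten_spool(spool):
    """Move per-job directories and ickpt files from the hierarchical
    layout to the top level of the spool directory.
    Returns a list of (old_path, new_path) pairs for the moved items."""
    spool = Path(spool)
    moved = []

    # $(SPOOL)/<#>/<#>/cluster<#>.proc<#>.subproc<#>  ->  $(SPOOL)
    for entry in sorted(spool.glob("*/*/*")):
        levels_ok = entry.parent.name.isdigit() and entry.parent.parent.name.isdigit()
        if entry.is_dir() and levels_ok and JOB_DIR.match(entry.name):
            target = spool / entry.name
            shutil.move(str(entry), str(target))
            moved.append((entry, target))

    # $(SPOOL)/<#>/cluster<#>.ickpt.subproc<#>  ->  $(SPOOL)
    for entry in sorted(spool.glob("*/*")):
        if entry.is_file() and entry.parent.name.isdigit() and ICKPT_FILE.match(entry.name):
            target = spool / entry.name
            shutil.move(str(entry), str(target))
            moved.append((entry, target))

    return moved
```

The returned list of old and new paths may be useful as a checklist when editing the references in job_queue.log by hand.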
New Features:
- Negotiation is now handled asynchronously in the condor_schedd daemon.
This means that the condor_schedd remains responsive during
negotiation and is less prone to falling behind on communication
with condor_shadow processes.
- Improved monitoring and avoidance of a lock convoy problem
observed when there were more than 30,000 condor_shadow processes.
At this scale, locking the condor_shadow daemon's log on each write
to the log file has been observed on Linux platforms to sometimes
leave the system doing very little productive work, because it is
instead consumed by rapid context switching between the
condor_shadow daemons that are waiting for the lock.
- On Linux platforms, if the condor_schedd daemon's spool directory is
on an ext3 file system, Condor can now scale to a larger number
of spooled jobs. Previously, Condor created two subdirectories
within the spool directory for each spooled job and for each running
job. The ext3 file system only supports 31,997 subdirectories. This
effectively limited the number of spooled jobs to less than 16,000.
Now, Condor creates a hierarchy of subdirectories within
the spool directory, to increase the limit on the number of spooled jobs
in ext3 to 320,000,000, which is likely to be larger than other limits
on the size of the job queue, such as memory.
- The condor_shadow daemon uses less memory than it has since
Condor version 7.5.0.
Memory usage should now be similar to the 7.4 series.
- The condor_dagman and condor_submit_dag command-line flag
-DumpRescue causes the dump of an incomplete Rescue DAG,
when the parsing of the DAG input file fails.
This may help in figuring out what went wrong.
See section 2.10.8 for complete details on Rescue DAGs.
- condor_dagman now has the capability to create the
jobstate.log file needed for the Pegasus workflow manager.
See section 2.10.12 for details.
- condor_wait can now work on jobs with logs that are only
readable by the user running condor_wait. Previously, write
access to the job's user log was required.
- Added a new value for the job ClassAd attribute JobStatus.
The TRANSFERRING_OUTPUT status is used
when transferring a job's output files after the job has finished running.
Jobs with this status will have their JobStatus attribute set to 6.
The standard condor_q display will show the job's status as >.
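The bound of fewer than 16,000 spooled jobs quoted above for the old ext3 spool layout follows from simple arithmetic, shown here as a short calculation. The constant names are chosen for the example; the figures themselves are the ones given in the text.

```python
EXT3_MAX_SUBDIRS = 31_997   # subdirectory limit of ext3, as quoted above
DIRS_PER_JOB = 2            # old layout: two subdirectories per spooled or running job

# Every job consumed two entries directly under the spool directory.
old_limit = EXT3_MAX_SUBDIRS // DIRS_PER_JOB
print(old_limit)  # 15998
```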
Configuration Variable and ClassAd Attribute Additions and Changes:
- The new configuration variable LOCK_DEBUG_LOG_TO_APPEND
controls whether a daemon's debug lock is used when appending to the log.
With the default value of False,
the debug lock is only used when rotating the log file.
When True, the debug lock is used when writing to
the log as well as when rotating the log file.
See section 3.3.4 for the complete definition.
- The new configuration variable
LOCAL_CONFIG_DIR_EXCLUDE_REGEXP may be set to a regular
expression that specifies file names to ignore when looking for
configuration files within the directories specified via
LOCAL_CONFIG_DIR.
See section 3.3.3 for the
complete definition.
Bugs Fixed:
- In previous versions of Condor, the condor_starter could not
write the .machine.ad and .job.ad files to the execute
directory when PrivSep was enabled. This has now been fixed, and these files
are correctly emitted in all cases.
- Since Condor version 7.5.2, condor_q was slower than in
earlier 7.5 and 7.4 releases,
especially when retrieving large numbers of jobs.
Viewing 100K jobs took about four times longer.
This release fixes the performance,
making it about the same as before Condor version 7.5.2.
- A bug introduced in Condor version 7.5.4 prevented parallel
universe jobs with multiple queue statements in
the submit description file from working with condor_dagman.
This is now fixed.
- Improved the way Condor daemons send heartbeat messages to their parent
process. This resolves a problem observed on busy submit machines using the
condor_shared_port daemon. The condor_master daemon sometimes incorrectly
determined that the condor_schedd was hung, and would kill and restart it.
- When the configuration variable NOT_RESPONDING_WANT_CORE
is True,
the condor_master daemon now follows up with a SIGKILL,
if the child process does not exit within ten minutes of receiving
SIGABRT.
This addresses observed cases in
which the child process hangs while writing a core file.
- Host name-based authorization failed in Condor version 7.5.4
under Mac OS X 10.4,
because lookups of the host name for incoming connections failed.
- A bug introduced in Condor version 7.5.0 caused
the attributes MyType and TargetType
in offline ClassAds to get set to "(unknown type)"
when the offline ClassAd was matched to a job.
- condor_dagman now excepts in the case of certain log file errors,
when continuing would be likely to put condor_dagman into an incorrect
internal state.
- Fixed a bug that caused DAG node jobs to have their coredumpsize
limit set according to the CREATE_CORE_FILES configuration
variable, rather than the user's coredumpsize limit.
- Fixed a case introduced in Condor version 7.5.4 on Windows platforms,
in which the following spurious log message was produced:
SharedPortEndpoint: Destructor: Problem in thread shutdown notification: 0
- Since Condor version 7.4.1,
Condor-C jobs submitted without file transfer enabled could
fail with the following error in the condor_starter log:
FileTransfer: DownloadFiles called on server side
- Fixed a memory leak caused by use of the ClassAd eval()
function. This problem was introduced in Condor version 7.5.2.
- Fixed a bug that could cause the condor_negotiator daemon to
crash when groups are configured with
GROUP_QUOTA_DYNAMIC_<group_name>, or when
GROUP_QUOTA_<group_name> is not defined to be something
greater than 0.
- Fixed a bug that caused random characters to appear for the
value of AuthMethods when logging with D_FULLDEBUG
and D_SECURITY enabled.
This problem was introduced in Condor version 7.5.3.
- Fixed a memory leak in the condor_schedd
introduced in Condor version 7.5.4.
- Fixed a problem introduced in Condor version 7.5.4 that could cause the
condor_schedd daemon to enter an infinite loop while in the
process of shutting down. For the problem to happen, it was
necessary for flocking to have been enabled.
- Configuration variable SCHEDD_QUERY_WORKERS was effectively
ignored when condor_q authenticated itself to the condor_schedd.
The query was always
processed in the main condor_schedd process rather than in a sub-process.
This problem has existed since before Condor version 7.0.0.
- Fixed a problem affecting jobs that store their output in the
condor_schedd's spool directory. The problem affected jobs that
include an empty directory in their list of output files to
transfer. This problem was introduced in Condor version 7.5.4,
when support for the transfer of directories was added.
- Fixed a problem affecting the condor_master daemon since
Condor version 7.5.3.
The condor_master daemon would crash if it was instructed
to shut down a daemon that was not currently running,
but which was waiting to be restarted.
- Fixed a bug in condor_submit that prevented the submission of
multiple vm universe jobs in a single submit file.
- Fixed a bug in the condor_schedd that could cause it to temporarily
undercount the number of running local and scheduler universe jobs.
In Condor version 7.5.4,
this undercounting could cause the daemon to crash.
- Fixed a bug that could cause the condor_gridmanager to crash if
a GAHP server did not behave as expected on start up.
- Improved the hold reason reported in several failure cases for
CREAM grid jobs.
- The KFlops attribute reported by
condor_status -run -total
could erroneously be reported as negative. This has been fixed.
- Since Condor version 7.5.4, the refreshing of the proxy for the job in the
remote queue did not work in Condor-C. Therefore, if the original job proxy
expired, the job was halted and put on hold, even if the proxy had
been renewed on the submit machine.
Known Bugs:
- In Condor version 7.5.5, when a running job is put on hold, the job
is removed from the job queue.
Additions and Changes to the Manual:
Version 7.5.4
Release Notes:
- Condor version 7.5.4 released on October 20, 2010.
- All of the bug fixes and features which are in
Condor version 7.4.4 are in this 7.5.4 release.
- The release now contains all header files necessary to compile
code that uses the job log reading and writing utilities contained
in libcondorapi. Some headers were missing starting in Condor 7.5.1.
New Features:
- Concurrency limits now work with parallel universe jobs
scheduled by the dedicated scheduler.
- Transfer of directories is now supported by
transfer_input_files and
transfer_output_files for non-grid universes and
Condor-C. The auto-selection of output files, however, remains the
same: new directories in the job's output sandbox are not
automatically selected as outputs to be transferred.
- Paths other than simple file names with no directory information
in transfer_output_files previously did not have well
defined behavior. Now, paths are supported for non-grid universes
and Condor-C. When a path to an output file or directory is
specified, this specifies the path to the file on the execute side.
On the submit side, the file is placed in the job's initial working
directory and it is named using the base name of the original path.
For example, path/to/output_file becomes output_file
in the job's initial working directory. The name and path of the
file that is written on the submit side may be modified by using
transfer_output_remaps.
- The condor_shared_port daemon is now supported on Windows platforms.
- Jobs can now be submitted to multiple EC2 servers via the amazon
grid type. The server's URL must be specified via the grid_resource
submit description file command for each job.
See section 5.3.8 for details.
- The grid universe's amazon grid type can now be used to submit
virtual machine jobs to Eucalyptus systems via the EC2 interface.
- condor_q now uses the queue-management API's projection feature when
used with -run, -hold, -goodput, -cputime,
-currentrun, and -io options when called with no display options
or with -format.
- Decreased the CPU utilization of condor_dagman when it is
submitting ready jobs into Condor.
- condor_dagman now logs the number of queued jobs in the DAG
that are on hold,
as part of the DAG status message in the dagman.out file.
- condor_dagman now logs a note in the dagman.out file
when the condor_submit_dag and condor_dagman versions differ,
even if the difference is permissible.
- Added the capability for condor_dagman to create and periodically
rewrite a file that lists the status of all nodes within a DAG.
Alternatively, the file may be continually updated as the DAG runs.
See section 2.10.11 for details.
- The condor_schedd daemon now uses a better algorithm for
determining which flocking level is being negotiated. No special
configuration is required for the new algorithm to work. In the
past, the algorithm depended on DNS and the
configuration variables NEGOTIATOR_HOST and
FLOCK_NEGOTIATOR_HOSTS. In some networking environments,
such as that of a multi-homed central manager, it was difficult to
configure things correctly. When wrongly configured, negotiation
would be aborted with the message, Unknown negotiator. The new
algorithm is only used when the condor_negotiator is version 7.5.4 or
newer. Of course, the condor_schedd daemon must still be configured to
authorize the condor_negotiator daemon at the NEGOTIATOR
authorization level.
- condor_advertise has a new option, -multiple, which
allows multiple ClassAds to be published. This is more efficient than
publishing each ClassAd in a separate invocation of condor_advertise.
- The condor_job_router is no longer restricted to routing only vanilla
universe jobs. It also now automatically avoids recursively routing jobs.
- The condor_schedd now writes the submit event to the user job log.
Previously, condor_submit wrote the event.
- The condor_schedd daemon now scales better when there are many
job auto clusters.
- The condor_q command with option -run, -hold,
-goodput, -cputime, -currentrun or -io
is now much more efficient in its communication with the condor_schedd.
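The naming rule described above for paths in transfer_output_files, where the file is placed in the job's initial working directory under the base name of the execute-side path, can be sketched as follows. The function name submit_side_name is hypothetical, and the sketch ignores transfer_output_remaps, which may change the resulting name and path.

```python
import posixpath

def submit_side_name(execute_side_path):
    """Return the name the output file or directory receives in the job's
    initial working directory: the base name of the execute-side path.
    Hypothetical helper; transfer_output_remaps is not modeled."""
    return posixpath.basename(execute_side_path)

print(submit_side_name("path/to/output_file"))  # output_file
```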
Configuration Variable and ClassAd Attribute Additions and Changes:
- The new configuration variable SOAP_SSL_SKIP_HOST_CHECK
can be used to disable the standard check that a SOAP server's host name
matches the host name in its X.509 certificate. This is useful when submitting
grid type amazon jobs to Eucalyptus servers, which often have certificates
with a host name of localhost.
- Added default values for <SUBSYS>_LOG configuration variables.
If a <SUBSYS>_LOG configuration variable is not set in
files condor_config or condor_config.local,
it will default to $(LOG)/<SUBSYS>LOG.
- The new job ClassAd attribute CommittedSuspensionTime
is a running total of the number of seconds the job has spent in
suspension during time in which the job was not evicted without a
checkpoint. This complements the existing attribute
CumulativeSuspensionTime, which includes all time spent in
suspension, regardless of job eviction.
- The new job ClassAd attributes CommittedSlotTime and
CumulativeSlotTime are just like CommittedTime and
RemoteWallClockTime respectively, except the new attributes
are weighted by the SlotWeight of the machine(s) that ran
the job.
- The new configuration variable
SYSTEM_JOB_MACHINE_ATTRS specifies a list of machine
attributes that should be recorded in the job ClassAd. The default
attributes are Cpus and SlotWeight. When there are
multiple run attempts, history of machine attributes from previous
run attempts may be kept. The number of run attempts to store is
specified by the new configuration variable
SYSTEM_JOB_MACHINE_ATTRS_HISTORY_LENGTH , which defaults
to 1. A machine attribute named X will be inserted into the
job ClassAd as an attribute named MachineAttrX0. The previous
value of this attribute will be named MachineAttrX1, the
previous to that will be named MachineAttrX2, and so on, up to
the specified history length. Additional attributes to record may be
specified on a per-job basis by using the new job_machine_attrs
submit file command. The history length may also be extended on a
per-job basis by using the new submit file command
job_machine_attrs_history_length.
- The new configuration variable
NEGOTIATION_CYCLE_STATS_LENGTH specifies how many
recent negotiation cycles should be included in the history that is
published in the condor_negotiator's ClassAd. The default is 3. The
definition of this configuration variable and the list of attributes
that are published are given elsewhere in this manual.
- The configuration variable FLOCK_NEGOTIATOR_HOSTS is now
optional. Previously, the condor_schedd daemon refused to flock without
this setting. When this is not set, the addresses of the flocked
condor_negotiator daemons are found by querying the flocked
condor_collector daemons.
Of course, the condor_schedd daemon must still be configured to
authorize the condor_negotiator daemon at the NEGOTIATOR
authorization level. Therefore, when using host-based security,
FLOCK_NEGOTIATOR_HOSTS may still be useful as a macro for inserting
the negotiator hosts into the relevant authorization lists.
- The configuration variable FLOCK_HOSTS is no longer used.
For backward compatibility, this setting used to be treated as a default
for FLOCK_COLLECTOR_HOSTS and FLOCK_NEGOTIATOR_HOSTS.
- The configuration variable AMAZON_EC2_URL is now only used
for previously-submitted jobs when upgrading Condor to version 7.5.4 or
beyond. New grid type amazon jobs must specify which EC2 service to use
by setting the grid_resource submit description file command.
- The new job ClassAd attribute NumPids is the total number of
child processes a running job has.
- The new configuration variable DAGMAN_MAX_JOB_HOLDS
specifies the maximum number of times a DAG node job is allowed to go
on hold. See section 3.3.25 for details.
- The configuration variable STARTD_SENDS_ALIVES now only
needs to be set for the condor_schedd daemon. Also, the default value has
changed to True.
- The job ClassAd attributes amazon_user_data and
amazon_user_data_file can now both be used for the same
job. When both are provided, the two blocks of data are concatenated,
with the value of the one specified by amazon_user_data
occurring first.
- The new configuration variable GRAM_VERSION_DETECTION
can be used to disable Condor's attempts to distinguish between gt2
(GRAM2) and gt5 (GRAM5) servers.
The default value is True.
If set to False, Condor trusts the gt2 or gt5 value
provided in the job's grid_resource attribute.
- The new job ClassAd attribute ResidentSetSize is an integer
measuring the amount of physical memory in use by the job on the execute
machine in kilobytes.
- The new job ClassAd attribute X509UserProxyExpiration is an
integer representing when the job's X.509 proxy credential will expire,
measured in the
number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).
- The new configuration variable SCHEDD_CLUSTER_MAXIMUM_VALUE
is an upper bound on assigned job cluster ids. If set to
value M, the maximum job cluster id assigned to any job will be M-1.
When the maximum id is reached, assignment of cluster ids will wrap around
back to SCHEDD_CLUSTER_INITIAL_VALUE. The default value is zero,
which does not set a maximum cluster id.
- The default value of configuration variable
MAX_ACCEPTS_PER_CYCLE has been changed from 1 to 4.
- The configuration variable NEW_LOCKING, introduced in
Condor version 7.5.2, has been changed to
CREATE_LOCKS_ON_LOCAL_DISK and now defaults to True.
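The wrap-around of cluster ids described for SCHEDD_CLUSTER_MAXIMUM_VALUE can be illustrated with a short Python sketch. This is not Condor's implementation: the function name next_cluster_id and its parameters are hypothetical, the initial value of 1 is an assumption made only for the example, and the sketch does not check whether an id is still in use by a queued job.

```python
def next_cluster_id(current, initial_value=1, maximum_value=0):
    """Return the cluster id that would follow `current`.

    maximum_value mirrors SCHEDD_CLUSTER_MAXIMUM_VALUE: 0 means no upper
    bound; otherwise the largest id ever handed out is maximum_value - 1.
    initial_value mirrors SCHEDD_CLUSTER_INITIAL_VALUE (1 is assumed here).
    """
    candidate = current + 1
    # Wrap around once the candidate would reach the configured bound.
    if maximum_value > 0 and candidate >= maximum_value:
        return initial_value
    return candidate
```

With a maximum of 100, the id after 98 is 99, and the id after 99 wraps back to the initial value.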
Bugs Fixed:
- Fixed a bug that occurred with x64 flavors of the Windows operating system.
Condor was setting the default value of Arch to INTEL when it
should have been X86_64. This was a consequence of the fact that
Condor runs in the WOW64 sandbox on 64-bit Windows. This was fixed so that
Arch would contain the value for the native architecture rather than
the WOW64 sandbox architecture.
- Fixed a bug in the user privilege switching code in Windows that
caused the condor_shadow daemon to except when the condor_schedd
daemon attempted to re-use it.
- Fixed the output in the condor_master daemon log file to be
clearer when an authorized user tries to use condor_config_val
-set and ENABLE_PERSISTENT_CONFIG is False.
The previous
output implied that the operation succeeded when, in fact, it did not.
- Since Condor version 7.5.2,
the following condor_job_router features were
effectively non-functional: UseSharedX509UserProxy,
JobShouldBeSandboxed, and JobFailureTest.
- The submit description file command copy_to_spool
did not work properly in Condor version 7.5.3.
When sending the executable to the execute machine, it was
transferred from the original path rather than from the spooled copy
of the file.
- When output files were auto-selected and spooled, Condor-C and
condor_transfer_data would copy back both the output files and
all other contents of the job's spool directory, which typically
included the spooled input and the user log.
Now, only the output files are retrieved.
To adjust which files are retrieved, the job
attribute SpooledOutputFiles can be manipulated, but this
typically should be managed by Condor.
- The condor_master daemon now invalidates its ClassAd,
as represented in the condor_collector daemon, before it shuts down.
- Fixed a bug that caused vm universe jobs to not run
if the VMware .vmx file contained a space.
- Fixed a bug introduced in Condor version 7.5.1 that caused integers
in ClassAd expressions that had leading zeros to be read as octal (base eight).
- Fixed a bug introduced in Condor version 7.5.1 that did not recognize
a semicolon as a separator of function arguments in ClassAds.
- Fixed a bug that caused integers larger than in a ClassAd
expression to be parsed incorrectly. Now, when these integers are
encountered, the largest 32-bit integer (with matching sign) is used.
- Fixed a bug that caused the condor_gridmanager to exit when
receiving badly-formatted error messages from the nordugrid_gahp.
- Fixed a problem affecting the use of version 7.5.3 condor_startd and
condor_master daemons in a pool with a condor_collector from before
version 7.5.2. On shutdown, the condor_startd and the condor_master
caused all condor_startd and condor_master ClassAds, respectively,
to be removed from the condor_collector.
- Fixed a bug that caused delegation of an X.509 RFC proxy between
two Condor processes to fail.
- Fixed a bug in condor_submit that would cause failures if a file
name containing a space was used with the submit description file commands
append_files, jar_files or
vmware_dir.
- Fixed a bug that could cause the condor_gridmanager to lock up if
a GAHP server it was using wrote a large amount of data to its stderr.
- Fixed a bug that could cause the condor_gridmanager to wrongly
conclude that a gt2 (that is, GRAM2) server was a gt5
(that is, GRAM5) server.
Such a conclusion can be disastrous, as Condor's mechanisms to
prevent overloading a gt2 server are then disabled. The new
configuration variable GRAM_VERSION_DETECTION can be used
to disable Condor's attempts to distinguish between the two.
- Fixed a bug introduced in Condor version 7.5.3.
When file transfer failed for a grid universe job of grid type
cream,
Condor would write a hold event to the job log,
but not actually put the job on hold.
- Fixed a bug in the condor_gridmanager that could cause it to crash
while handling cream grid type jobs destined for different resources.
- Fixed a bug that prevented the condor_shadow from managing
additional jobs after its first job completed when
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION was set to True.
- The timestamps in the log defined by PROCD_LOG
now print the real time.
- Fixed how some daemons advertise themselves to the condor_collector.
Now, all daemons set the attribute MyType to indicate what
type of daemon they are.
- condor_chirp no longer crashes on a put operation,
if the remote file name is omitted.
- Fixed the packaging of Hadoop File System support in Condor. This includes
updating to HDFS 0.20.2 and making the HDFS web interface work properly.
- Condor no longer tries to invoke glexec if the job's X.509 proxy
is expired.
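The leading-zero bug listed above comes down to which base is used to read the digits of an integer literal. The following Python sketch only illustrates that difference; it is not Condor's ClassAd parser, and the helper name parse_integer_literal and its flag are hypothetical.

```python
def parse_integer_literal(text, treat_leading_zero_as_octal=False):
    """Parse an unsigned integer literal given as a string of digits.

    With treat_leading_zero_as_octal=True, a literal such as "010" is read
    in base eight, which mimics the behavior described as a bug. With the
    default, the literal is always read in base ten.
    """
    if treat_leading_zero_as_octal and len(text) > 1 and text.startswith("0"):
        return int(text, 8)
    return int(text, 10)
```

Reading "010" in base eight yields 8, while reading it in base ten yields 10, so an expression written with zero-padded numbers could silently change value.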
Known Bugs:
- Using host names for host-based authentication,
such as in the definitions of configuration variables
ALLOW_* and DENY_*,
does not work on Mac OS X 10.4.
Later versions of the OS are not affected.
As a workaround, IP addresses can be used instead of host names.
Additions and Changes to the Manual:
Version 7.5.3
Release Notes:
- Condor version 7.5.3 released on June 29, 2010.
New Features:
- condor_q -analyze now notices the -l option, and if both
are given, then the analysis prints out the list of machines
in each analysis category.
- The behavior of macro expansion in the configuration file has
changed. Previously, most macros were effectively treated as
undefined unless explicitly assigned a value in the configuration
file. Only a small number of special macros had pre-defined values
that could be referred to via macro expansion. Examples include
FULL_HOSTNAME and DETECTED_MEMORY. Now, most
configuration settings that have default values can be referred to
via macro expansion. There are a small number of exceptions where
the default value is too complex to represent in the current
implementation of the configuration table. Examples include the
security authorization settings. All such configuration settings
will also be reported as undefined by condor_config_val unless
they are explicitly set in the configuration file.
- Unauthenticated connections are now identified as
unauthenticated@unmapped. Previously, unauthenticated
connections were not assigned a name, so some authorization policies
that needed to distinguish between authenticated and unauthenticated
connections were not expressible. Connections that are
authenticated but not mapped to a name by the mapfile used to be
given the name auth-method@unmappeduser, where
auth-method is the authentication method that was used. Such
connections are now given the name auth-method@unmapped.
Connections that match *@unmapped are now forbidden from
doing operations that require a user id, regardless of configuration
settings. Such operations include job submission, job removal, and
any other job management commands that modify jobs.
- There has been a change of behavior when authentication fails.
Previously, authentication failure always resulted in the command
being rejected, regardless of whether the ALLOW/DENY settings
permitted unauthenticated access or not. This is still true if either
the client or server specifies that authentication is required.
However, if both sides specify that authentication is not required
(i.e. preferred or optional), then authentication failure only results
in the command being rejected if the ALLOW/DENY settings reject
unauthenticated access. This change makes it possible to have some
commands accept unauthenticated users from some network addresses
while only allowing authenticated users from others.
- Improved log messages when failing to authenticate requests. At
least the IP address of the requester is identified in all cases.
- The new submit file command job_ad_information_attrs
may be used to specify attributes from the job ad that should be saved
in the user log whenever a new event is being written. See
page for details.
- Administrative commands now support the -constraint option, which
accepts a ClassAd expression. This applies to condor_checkpoint,
condor_off, condor_on, condor_reconfig, condor_reschedule,
condor_restart, condor_set_shutdown, and condor_vacate.
- File transfer plugins can be used for vm universe jobs. Notably,
file:// URLs can be used to allow VM image files to be pre-staged
on the execute machine. The submit description file command
vmware_dir is now optional.
If it is not given, then all relevant VMware image files
must be listed in transfer_input_files, possibly as URLs.
- File transfers for CREAM grid universe jobs are now initiated by
the condor_gridmanager. This removes the need for a GridFTP server
on the client machine.
- Improved the parallelism of file transfers for nordugrid jobs.
- Removed the distinction between regular and full reconfiguration
of Condor daemons. Now, all reconfigurations are full and require the
WRITE authorization level. condor_reconfig accepts but ignores the
-full command-line option.
- The batch_gahp, used for pbs and lsf grid universe jobs, has been
updated from version 1.12.2 to 1.16.0.
- condor_dagman now prints a message to the dagman.out file
when it truncates a node job user log file.
- condor_dagman now allows node categories to include
nodes from different splices. See section 2.10.7
for details.
- condor_dagman now allows category throttles in splices to
be overridden by higher levels in the DAG splicing structure.
See section 2.10.7 for details.
- Daemon logs can now be rotated several times instead of only once
into a single .old file. In order to do so, the newly introduced
configuration variable MAX_NUM_<SUBSYS>_LOG needs to be set
to a value greater than 1. The file endings will be ISO timestamps, and
the oldest rotated file will still have the ending .old.
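The changed behavior on authentication failure, described in the list above, can be summarized as a small decision function. The Python sketch below is an illustration under stated assumptions, not Condor's code: the function name and its three boolean parameters are hypothetical, and the ALLOW/DENY evaluation is reduced to a single flag.

```python
def command_rejected_after_auth_failure(client_requires_auth,
                                        server_requires_auth,
                                        unauthenticated_access_allowed):
    """Decide whether a command is rejected once authentication has failed.

    unauthenticated_access_allowed stands in for the outcome of the
    ALLOW/DENY settings when applied to an unauthenticated connection.
    """
    # If either side requires authentication, failure always rejects.
    if client_requires_auth or server_requires_auth:
        return True
    # Otherwise the ALLOW/DENY settings alone decide.
    return not unauthenticated_access_allowed
```

When neither side requires authentication, the outcome depends only on whether the ALLOW/DENY settings admit the unauthenticated connection, which is what allows different policies for different network addresses.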
Configuration Variable and ClassAd Attribute Additions and Changes:
- The new configuration variable JOB_ROUTER_LOCK specifies a
lock file used to
ensure that multiple instances of the condor_job_router never run
with the same values of JOB_ROUTER_NAME.
Multiple instances running
with the same name could lead to mismanagement of routed jobs.
- The new configuration variable ROOSTER_MAX_UNHIBERNATE
is an integer
specifying the maximum number of machines to wake up per cycle.
The default value of 0 means no limit.
- The new configuration variable ROOSTER_UNHIBERNATE_RANK
is a ClassAd
expression specifying which machines should be woken up first in a
given cycle. Higher ranked machines are woken first.
If the number of machines to be woken up is limited by
ROOSTER_MAX_UNHIBERNATE, the rank may be used for
determining which machines are woken before reaching the limit.
- The new configuration variable CLASSAD_USER_LIBS
is a list of libraries
containing additional ClassAd functions to be used during ClassAd
evaluation.
- The new configuration variable SHADOW_WORKLIFE
specifies the number of seconds after which the condor_shadow will exit,
when the current job finishes, instead of fetching a new job to
manage. Having the condor_shadow continue managing jobs helps
reduce overhead and can allow the condor_schedd to achieve higher
job completion rates. The default is 3600, one hour. The value 0
causes condor_shadow to exit after running a single job.
- The new configuration variable MAX_NUM_<SUBSYS>_LOG
specifies how many rotated copies of the SUBSYS daemon log are kept.
The default value is 1 which leads to the old behavior of a single
rotation into a .old file.
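The interaction between ROOSTER_UNHIBERNATE_RANK and ROOSTER_MAX_UNHIBERNATE can be pictured as a sort followed by a cut-off. The Python sketch below is only a model of that description: the function name, the (name, rank) tuple layout, and the precomputed rank values are assumptions for illustration, and real ranks would come from evaluating the ClassAd expression against each machine.

```python
def select_machines_to_wake(ranked_machines, max_unhibernate=0):
    """Return machine names in the order they would be woken.

    ranked_machines is a list of (name, rank) tuples. max_unhibernate
    mirrors ROOSTER_MAX_UNHIBERNATE: 0 means no limit per cycle.
    """
    # Higher ranked machines are woken first.
    ordered = sorted(ranked_machines, key=lambda m: m[1], reverse=True)
    # Apply the per-cycle limit only when one is set.
    if max_unhibernate > 0:
        ordered = ordered[:max_unhibernate]
    return [name for name, _rank in ordered]
```

For example, with ranks 1.0, 3.0, and 2.0 and a limit of 2, only the two highest ranked machines are selected in that cycle.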
Bugs Fixed:
- Configuration variables with a default value of 0
that were not defined in the configuration file
were treated as though they were undefined by condor_config_val.
Now Condor treats this case like any other:
the default value is displayed.
- Starting in Condor version 7.5.1,
using literals with a logical operator
in a ClassAd expression (for example, 1 || 0) caused the expression
to evaluate to the value ERROR. The previous behavior has been
restored: zero values are treated as False,
and non-zero values are treated as True.
- Starting in Condor version 7.5.0,
the condor_schedd no longer supported queue
management commands when security negotiation was disabled,
for example, if SEC_DEFAULT_NEGOTIATION = NEVER.
- Fixed a bug introduced in Condor version 7.5.1.
ClassAd string literals containing
characters with negative ASCII values were not accepted.
- Fixed a bug introduced in Condor version 7.5.0,
which caused Condor to not renew
job leases for CREAM grid jobs in most situations.
- Question marks occurring in a ClassAd string are no longer preceded
by a backslash when the ClassAd is printed.
Known Bugs:
Additions and Changes to the Manual:
Version 7.5.2
Release Notes:
- Condor version 7.5.2 released on April 26, 2010.
- Condor no longer supports SuSE 8 Linux on the Itanium 64 architecture.
- The following submit description file commands are no longer recognized.
Their functionality is replaced by the command grid_resource.
- grid_type
- globusscheduler
- jobmanager_type
- remote_schedd
- remote_pool
- unicore_u_site
- unicore_v_site
New Features:
- The condor_schedd daemon uses less disk bandwidth when logging
updates to job ClassAds from running jobs and also when removing jobs
from the queue and handling job eviction and condor_shadow exceptions.
This should improve performance in situations where
disk bandwidth is a limiting factor.
Some cases of updates to the job user log
have also been optimized to be less disk intensive.
- The condor_schedd daemon uses less CPU when scheduling
some types of job queues. Most likely to benefit from this improvement is
a large queue of short-running, non-local, and non-scheduler universe jobs,
with at least one idle local or scheduler universe job.
- The condor_schedd automatically grants the condor_startd
authority to renew leases on claims and to evict claims.
Previously, this required that the condor_startd be trusted for
general DAEMON-level command access. Now this only
requires READ-level command access. The specific commands
that the condor_startd sends to the condor_schedd can
effectively only operate on the claims associated with that condor_startd,
so this change does not open up these operations to access by anyone
with READ access. It reduces the level of trust that
the condor_schedd must have in the condor_startd.
- The condor_procd's log now rotates if logging is activated.
The default maximum size is 10 Mbytes. To change the default,
use the configuration variable MAX_PROCD_LOG.
- For Unix systems only,
user job log and global job event log lock files can now optionally
be created in a directory on a
local drive by setting NEW_LOCKING to True.
See section 3.3.4 for
the details of this configuration variable.
- condor_dagman and condor_submit_dag now default to lazy
creation of the .condor.sub files for nested DAGs.
condor_submit_dag no longer creates them, and condor_dagman
itself creates the files as the DAG is run.
The previous "eager" behavior can
be obtained with a combination of command-line and configuration settings.
There are several advantages to the "lazy" submit file creation:
- The DAG file for a nested DAG does not have to exist until that node
is ready to run, so the DAG file can be dynamically created by earlier
parts of the top-level DAG (including by the PRE script of the nested
DAG node).
- It is now possible to have nested DAGs within splices, which is not
possible with "eager" submit file creation.
Configuration Variable and ClassAd Attribute Additions and Changes:
- The new configuration variable
DAGMAN_GENERATE_SUBDAG_SUBMITS controls whether
condor_dagman itself generates the .condor.sub files for
nested DAGs, rather than relying on condor_submit_dag "eagerly"
creating them. See section 3.3.25 for
more information.
- The new configuration variable NEW_LOCKING can specify that
lock files for job user logs and the global job event log be created on a local drive,
avoiding locking problems with NFS.
See section 3.3.4 for
the details of this configuration variable.
Bugs Fixed:
- The condor_job_router failed to work on SLES 9 PowerPC,
AIX 5.2 PowerPC,
and YDL 5 PowerPC due to a problem in how it detected EOF in the job queue log.
- When jobs are removed, the condor_schedd sometimes did not
quickly reschedule a different job to run on the slot to which the
removed job had been matched. Instead, it would take up to
SCHEDD_INTERVAL seconds to do so.
- Fixed a bug introduced in Condor version 7.5.1 that caused the
gahp_server to crash when
first communicating with most gt2 or gt5 GRAM servers.
Known Bugs:
Additions and Changes to the Manual:
Version 7.5.1
Release Notes:
- Condor version 7.5.1 released on March 2, 2010.
- Some, but not all, of the bug fixes and features in
Condor version 7.4.2 are in this 7.5.1 release.
- The Condor release is now available as a proper RPM or Debian
package.
- Condor now internally uses the version of New ClassAds provided
as a stand-alone library (http://www.cs.wisc.edu/condor/classad/).
Previously, Condor
used an older version of ClassAds that was heavily tied to the Condor
development libraries. This change should be transparent in the
current development series. In the next development series (7.7.x),
Condor will begin to use features of New ClassAds that were unavailable in
Old ClassAds.
Section 4.1.1 details the differences.
- HPUX 11.00 is no longer a supported platform.
New Features:
- A port number defined within CONDOR_VIEW_HOST may now use
a shared port.
- The condor_master no longer pauses for 3 seconds after starting
the condor_collector. However, if the configuration variable
COLLECTOR_ADDRESS_FILE defines a file,
the condor_master will wait for that file to be created
before starting other daemons.
- In the grid universe, Condor can now automatically distinguish
between GRAM2 and GRAM5 servers, that is, grid types gt2 and
gt5.
Users can submit jobs using a grid type of gt2 or gt5
for either type of server.
- Grid universe jobs using the CREAM grid system now batch up
common requests into larger single requests. This
reduces network traffic, increases the number of parallel tasks
that Condor can handle at once, and reduces the load on the remote
gatekeeper.
- The new submit description file command cream_attributes
sets additional attribute/value pairs for the CREAM job description
that Condor creates when submitting a grid universe job
destined for the CREAM grid system.
- The condor_q command with option -analyze now performs
the same analysis as previously occurred with the -better-analyze option.
Therefore, the output of condor_q with the -analyze option
differs from that of earlier versions.
The -better-analyze option is still recognized and behaves the same
as before, though it may be removed from a future version.
- Security sessions that are not used for longer than an hour are
now removed from the security session cache to limit memory usage.
- The number of security sessions in the cache is now advertised in
the daemon ClassAd as MonitorSelfSecuritySessions.
- condor_dagman now has the capability to run DAGs containing nodes
that are declared to be NOOPs - for these nodes, a job is never actually
submitted. See section 2.10.2 for information.
- The submit file attribute vm_macaddr can now be used to set
the MAC address for vm universe jobs that use VMware. The range of valid
MAC addresses is constrained by limits imposed by VMware.
- The condor_q command with option -globus
is now much more efficient in its communication with the condor_schedd.
Configuration Variable and ClassAd Attribute Additions and Changes:
- The new configuration variable STRICT_CLASSAD_EVALUATION
controls whether new or old ClassAd expression evaluation semantics are
used. In new ClassAd semantics, an unscoped attribute reference is only
looked up in the local ad. The default is False (use old ClassAd semantics).
- The configuration variable
DELEGATE_FULL_JOB_GSI_CREDENTIALS now applies to all proxy
delegations done between Condor daemons and tools.
The value is a boolean and defaults to False,
which means that when doing delegation Condor will now create a limited proxy
instead of a full proxy.
- The new configuration variable
SEC_<access-level>_SESSION_LEASE specifies the maximum
number of seconds an unused security session will be kept in a daemon's
session cache before being removed to save memory. The default is 3600.
If the server and client have different configurations, the smaller
one will be used.
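The session lease rule above (the smaller of the two configured values wins, and sessions idle for longer than the lease are dropped) can be sketched in Python. This is a model of the description only, not Condor's session cache; the function names and parameters are hypothetical, and the defaults of 3600 simply repeat the documented default.

```python
def effective_session_lease(server_lease=3600, client_lease=3600):
    """Return the lease, in seconds, that both sides end up using.

    Each argument stands for one side's SEC_<access-level>_SESSION_LEASE.
    """
    # If the two sides differ, the smaller value is used.
    return min(server_lease, client_lease)


def session_expired(idle_seconds, server_lease=3600, client_lease=3600):
    """Report whether an unused session has outlived its lease."""
    return idle_seconds > effective_session_lease(server_lease, client_lease)
```

A session idle for 2000 seconds survives when both sides use the default, but is removed if either side sets the lease to 1800.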
Bugs Fixed:
Known Bugs:
Additions and Changes to the Manual:
Version 7.5.0
Release Notes:
- All bug fixes and features which are in 7.4.1 are in this 7.5.0 release.
New Features:
- Added the new daemon condor_shared_port for Unix platforms
(except for HPUX).
It allows Condor daemons to share a
single network port. This makes opening access to Condor through a
firewall easier and safer. It also increases the scalability of a
submit node by decreasing port usage. See
section 3.3.34 for more information.
- Improved CCB's handling of rude NAT/firewalls that silently drop
TCP connections.
- Simplified the publication of daemon addresses.
PublicNetworkIpAddr and PrivateNetworkIpAddr have been removed.
MyAddress contains both public and private addresses. For now,
<Subsys>IpAddr contains the same information. In a future release,
the latter may be removed.
- Changes to TCP_FORWARDING_HOST,
PRIVATE_NETWORK_ADDRESS, and
PRIVATE_NETWORK_NAME can now be made without requiring a
full restart. It may take up to one condor_collector update interval
for the changes to become visible.
- Network compatibility with Condor prior to 6.3.3 is no longer
supported unless SEC_CLIENT_NEGOTIATION is set to
NEVER. This change removes the risk of communication errors
causing performance problems resulting from automatic fall-back to the
old protocol.
- For efficiency, authentication between the condor_shadow and
condor_schedd daemons is now able to be cached and reused in more
cases. Previously, authentication for updating job information was
only cached if read access was configured to require authentication.
- condor_config_val will now report the default value for
configuration variables that are not set in the configuration files.
- The condor_gridmanager now uses a single status call to obtain
the status of all CREAM grid universe jobs from the remote server.
- The condor_gridmanager will now retry CREAM commands that time out.
- Forwarding a renewed proxy for CREAM grid universe jobs to the
remote server is now much more efficient.
Configuration Variable and ClassAd Attribute Additions and Changes:
- Removed the configuration variable
COLLECTOR_SOCKET_CACHE_SIZE.
Configuration of this parameter used to be mandatory to enable TCP updates
to the condor_collector. Now no special configuration of the
condor_collector is required to allow TCP updates, but it is
important to ensure that there are sufficient file descriptors for
efficient operation. See section 3.7.4 for
more information.
- The new configuration variable USE_SHARED_PORT
is a boolean value that specifies
whether a Condor process should rely on the condor_shared_port daemon for
receiving incoming connections. Write access to
DAEMON_SOCKET_DIR is required for this to take effect.
The default is False. If set to True, SHARED_PORT
should be added to DAEMON_LIST. See
section 3.3.34 for more information.
- Added the new configuration variable CCB_HEARTBEAT_INTERVAL.
It is the maximum
number of seconds of silence on a daemon's connection to the CCB server
after which it will ping the server to verify that the connection still
works.
The default value is 1200 (20 minutes).
This feature serves to both speed
up detection of dead connections and to generate a guaranteed minimum
frequency of activity to attempt to prevent the connection from being
dropped.
Bugs Fixed:
- Fixed a problem with a ClassAd debug function,
so it now properly emits debug information for ClassAd IfThenElse
clauses.
Known Bugs:
Additions and Changes to the Manual: