This is an outdated version of the HTCondor Manual. You can find current documentation at http://htcondor.org/manual.


9.6 Stable Release Series 7.4

This is a stable release series of Condor. As usual, only bug fixes (and potentially, ports to new platforms) will be provided in future 7.4.x releases. New features will be added in the 7.5.x development series.

The details of each version are described below.

Version 7.4.5

Release Notes:

Condor version 7.4.5 not yet released.

New Features:

condor_dagman now prints a message in the dagman.out file whenever it truncates a node job user log file.
condor_dagman now prints additional diagnostic information in the case of certain log file errors.

Configuration Variable and ClassAd Attribute Additions and Changes:

None.

Bugs Fixed:

A network disconnect between the submit machine and execute machine during the transfer of output files caused the condor_starter daemon to immediately give up, rather than waiting for the condor_shadow to reconnect. This problem was introduced in Condor version 7.4.4.
If condor_ssh_to_job attempted to connect to a job while the job's input files were being transferred, this caused the file transfer to fail, which resulted in the job returning to the idle state in the queue.
In privsep mode, the transfer of output failed if a job's execute directory contained symbolic links to non-existent paths.

Known Bugs:

None.

Additions and Changes to the Manual:

None.

Version 7.4.4

Release Notes:

Condor version 7.4.4 released on October 18, 2010.
Security Item: This release fixes a bug in which Amazon EC2 jobs (jobs with universe = grid and grid_resource = amazon) that use the amazon_keypair_file command may expose the private SSH key to other users. The created file had insecure permissions, allowing other users on the submit host to read the file. Any other user who could see the file could learn about these EC2 jobs using condor_q, and the other user could then connect to them with the private SSH key.
To work around the bug without installing this release, do one or both of the following:
- Do not use the submit description file command amazon_keypair_file.
- Ensure that the directory holding the private SSH key has suitably restrictive permissions, such that other users cannot read files inside the directory.
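The second workaround can be sketched in Python. This is a minimal illustration, assuming a Unix submit host; the helper names restrict_key_directory and others_can_access are hypothetical and are not part of Condor.

```python
import os
import stat


def restrict_key_directory(path):
    """Limit a directory to its owner (mode 0700) and return the resulting mode bits."""
    os.chmod(path, stat.S_IRWXU)  # rwx for the owner, nothing for group or other
    return stat.S_IMODE(os.stat(path).st_mode)


def others_can_access(path):
    """Return True if group or other have any permission bits set on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))
```

Calling others_can_access on the directory that holds the key file before and after restrict_key_directory gives a quick check that the workaround took effect.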
Condor can now be built on Mac OS X 10.6.
The condor_master shutdown program, which is configured via the MASTER_SHUTDOWN_<Name> configuration variable, is now run with root (Unix) or administrator (Windows) privileges. The administrator must ensure that this cannot be used in such a way as to violate system integrity.

New Features:

load_profile is now supported by the Unix version of condor_submit when submitting jobs to Windows. Previously, this command was only supported by the Windows version of condor_submit.
Added an example Mac OS X launchd configuration file for starting Condor.

Configuration Variable and ClassAd Attribute Additions and Changes:

None.

Bugs Fixed:

Fixed bad behavior in condor_quill where, under certain error conditions, many copies of the schedd_sql.log file would be inserted into the database, filling up the disk volume used by the database. As a consequence of this bug fix, the LogBody column for each row in the Error_SqlLogs table will be empty. Please consult the condor_quill daemon log file for the error instead.
Fixed a bug with how the standard universe remote system call getrlimit() functioned. Under certain conditions with 32-bit and 64-bit standard universe binaries, getrlimit() would erroneously report 2147483647 bytes as a limit, when RLIM_INFINITY should have been the correct response.
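The difference between the two values can be seen with Python's standard resource module on a Unix machine. This is only a sketch of how a caller distinguishes an unlimited resource from a numeric limit; it does not involve standard universe remote system calls, and the choice of RLIMIT_CORE and the helper name describe_limit are assumptions made for illustration.

```python
import resource

# 2147483647 is the largest signed 32-bit integer, the value the bug reported.
INT32_MAX = 2**31 - 1


def describe_limit(value):
    """Render a resource limit, treating RLIM_INFINITY as 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core file size:", describe_limit(soft), "/", describe_limit(hard))
```

A program that compares against RLIM_INFINITY would treat a reported 2147483647 as a real limit of about 2 GB, which is why the erroneous value mattered.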
Fixed a misleading error message issued by condor_run, which stated
```
The DAGMan job was aborted by the user.
```
when the job submitted by condor_run was aborted by the user. It now states
```
The job was aborted by the user.
```
When the condor_startd daemon was running with an execute directory on a very large file system, with more than 32 bits worth of free blocks on a 32-bit system, it would incorrectly report 0 free bytes. This has been fixed.
For spooled jobs, input files were sometimes transferred twice from the submit machine to the execute machine. This happened if the input files were specified without any path information, as with a file name with no directory specified. This problem has existed since before Condor version 7.0.0.
The configuration variable NETWORK_INTERFACE did not work in some situations, because of Condor's attempts to automatically rewrite published addresses to match the IP address of the network interface used to make the publication.
Fixed a bug in which the default unit of configuration variable STARTD_CRON_TEST_PERIOD should have been seconds, but instead was Undefined.
Fixed a bug in which condor_submit checked for bad condor_schedd cron arguments incorrectly within a submit description file. Now condor_submit will detect the problem and print out an error message.
With some versions of ssh, condor_ssh_to_job failed if the SHELL environment variable was set to /bin/csh.
Submission of vm universe jobs via Globus was not possible, because the Globus Condor jobmanager explicitly set the input, output, and error files to /dev/null, and condor_submit refused any setting of these files for vm universe jobs. Now, /dev/null is an allowed setting for the input, output, and error files for vm universe jobs.
Fixed a bug that caused a vm universe job's output files to be incorrectly transferred back to the submit machine, when the submit description file command vm_no_output_vm was set to true, indicating that no files should be transferred.
String literals within $$([]) expressions within a submit description file failed to be evaluated and resulted in the job going on hold. This problem has existed since before Condor 7.0.0.
condor_preen was not able to clean up files in the EXECUTE directory when in privsep mode.
A problem was fixed that could cause a Condor daemon that connects to other daemons via CCB to permanently run out of space for more registered sockets until restarted. This problem appeared in the logs as the following message:
```
file descriptor safety level exceeded
```
Fixed a problem that could cause the condor_collector to crash when receiving updated matchmaking information for offline ClassAds that do not exist.
condor_ssh_to_job did not work when SEC_DEFAULT_NEGOTIATION was set to OPTIONAL.
The vm universe now works properly on machines that have Condor's Privilege Separation mechanism enabled.
condor_submit no longer pads a vm universe job's disk usage estimation by 100MB.
Fixed a bug with the vm_cdrom_files submit file command, that caused VMware vm universe jobs to fail if the virtual machine already had a CD-ROM image associated with it.
Configuration variables SOAP_SSL_CA_DIR and SOAP_SSL_CA_FILE are now properly used when authenticating with Amazon EC2 servers.
Fixed a bug with the <subsys>_LOCK configuration variable. It could let daemons writing to the same daemon log overwrite each other's entries and cause daemons to exit when the log is rotated.
Fixed a bug that caused nordugrid jobs to fail if the grid_resource attribute included a port as part of the server host name.
Fixed a confusing error message mentioning LocalUserLog::logStarterError() when the condor_starter fails to communicate with the condor_shadow before the job has started.
Fixed the event log and shadow log for standard universe jobs to identify the checkpoint server on which a job might have failed to store its checkpoint or from which it might have failed to restore its checkpoint.
Fixed a bug in the condor_gridmanager that could cause it to crash while handling grid-type cream jobs.
Improved the condor_gridmanager's handling of grid-type cream jobs that are held or removed by the user. Canceling the cream job is much less likely to fail and jobs can no longer get stuck in the cream state of CANCELED.
Fixed the web server feature controlled by ENABLE_WEB_SERVER. Previously, all HTTP GET requests would fail on non-Linux Unix machines.

Known Bugs:

None.

Additions and Changes to the Manual:

The Windows platform installation instructions have been updated.
Section 2.5.4 on Condor's File Transfer Mechanism has been revised and updated.
The section providing examples of utilizing ClassAd expressions within the -constraint option of condor_q or condor_status commands has been expanded to clarify both Unix and Windows platform specifics.

Version 7.4.3

Release Notes:

Condor version 7.4.3 released on August 16, 2010.

New Features:

None.

Configuration Variable and ClassAd Attribute Additions and Changes:

The new configuration variable ENABLE_CHIRP defaults to True. An administrator may set it to False, which disables Chirp remote file access from execute machines.
The new configuration variable ENABLE_ADDRESS_REWRITING defaults to True. It may be set to False to disable Condor's dynamic algorithm for choosing which IP address to publish in multi-homed environments. The dynamic algorithm chooses the IP address associated with the network interface used to make the publication, for example, the network interface used to communicate with the condor_collector.
Configuration variable VM_BRIDGE_SCRIPT has been removed and is no longer valid.
The new configuration variable VM_NETWORKING_BRIDGE_INTERFACE specifies the networking interface that Xen or KVM vm universe jobs will use. See section 3.3.28 for documentation.
Allowed the configuration file entries GSI_DAEMON_TRUSTED_CA_DIR and GSI_DAEMON_DIRECTORY to be set with environment variables, as the rest of Condor configuration variables can be.

Bugs Fixed:

When using file transfer, if the job exited without producing all output files specified in transfer_output_files, the set of files transferred back was potentially incorrect. Now, all existing files are transferred back, and files that cannot be transferred because they do not exist appear as zero-length files. An example of when this occurred is a job dumping core and then being placed on hold.
Fetch work hooks to prepare are now invoked as the condor user, instead of as the job user.
Improved the file extension detection on Windows platforms.
condor_wait could occasionally get stuck in an infinite loop, if it missed the execution event of the job it is waiting for. This is now fixed.
Fixed a bug within the condor_startd cron capabilities, that only occurred on Windows platforms. condor_startd cron scripts failed to run if an initial directory was left unspecified.
Fixed a bug in which a job would be incorrectly placed on hold, with a confusing error message that appeared similar to
```
Condor failed to start your job 9090.-1 because job attribute Args contains $$(VirtualMachineID).
```
This occurred if the submit command copy_to_spool was true, the submit description file for the job contained $$ macros, and condor_preen ran after the job was submitted and before it started.
Added the jobs_vertical_history table to the list of tables that condor_quill periodically re-indexes.
Fixed bug in condor_preen in which it would delete condor_startd daemon history files.
Fixed a bug affecting jobs that use file transfer with transfer_output_files and with when_to_transfer_output set to ON_EXIT_OR_EVICT. If such a job exited without producing all of the specified files, as when dumping core due to a fault, then the stdout, stderr, and core file of the job were not transferred back to the submitting machine.
Fixed numerous, small, rare memory leaks.
Fixed a bug in which processor affinity settings were ignored if privilege separation was enabled.
Network connections for Condor file transfers were ignoring private network settings. The connection from the execute node to the submit node always attempted to use the public network address of the submit machine.
The configuration variable TCP_FORWARDING_HOST did not work in some situations because of Condor's attempts to automatically rewrite published addresses to match the IP address of the network interface used to make the publication.
A single job could match multiple offline slots in a single negotiation cycle. This problem could cause condor_rooster to wake up too many offline machines for the number of jobs available to run on them. The fix for this problem requires updating both the condor_negotiator and the condor_schedd.
Fixed a problem that caused the condor_startd daemon to crash in some cases when STARTD_SENDS_ALIVES was True. This setting is False by default.
Fixed a problem where the condor_kbdd had a chance of entering an infinite loop on platforms that use X-Windows. Microsoft Windows and Mac OS X platforms were not affected. This bug is present in all earlier 7.4.x Condor releases.
The default path to sftp-server has been improved, so that condor_ssh_to_job can use sftp out-of-the-box on RedHat Enterprise Linux 5 platforms.
A nordugrid_gahp binary built on RedHat Enterprise Linux 3 no longer crashes when run on a RedHat Enterprise Linux 4 or Scientific Linux 4 machine.
Fixed a bug in condor_rm that caused it to misinterpret user names that begin with a digit, such as 4abc. It incorrectly used them as cluster numbers.
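The distinction between a cluster number and a user name that merely starts with a digit can be sketched as follows. This is not Condor's parsing code; the function names are hypothetical, and the prefix-only parse is shown purely as an assumed example of how such a misreading can arise.

```python
import re

# Whole-string patterns: a cluster is all digits, a job is cluster.proc.
_CLUSTER = re.compile(r"\d+")
_JOB = re.compile(r"\d+\.\d+")


def classify_target(arg):
    """Classify a command-line target as 'cluster', 'job', or 'user'."""
    if _CLUSTER.fullmatch(arg):
        return "cluster"
    if _JOB.fullmatch(arg):
        return "job"
    return "user"


def leading_digits(arg):
    """Prefix-only parse (like C's atoi): returns the leading number, or None."""
    match = re.match(r"\d+", arg)
    return int(match.group()) if match else None
```

Matching the whole argument keeps 4abc as a user name, while a prefix-only parse would read it as the number 4.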
Fixed a bug that caused the condor_startd to invoke the power_state plug-in as the condor user. This caused hibernation to fail because power_state requires root privileges to function properly.
Fixed a bug that could cause the condor_schedd to crash if there were any idle scheduler universe jobs when files were staged into the condor_schedd for a new job.
Fixed a bug in the nordugrid_gahp that could cause it to exit when connecting to a misconfigured LDAP server.
Fixed a bug that prevented the log file defined with the configuration variable NEGOTIATOR_MATCH_LOG from rotating. See section 3.3.4 for the definition of this variable.
Fixed a bug that caused startd_cron jobs to fail on Windows. This bug is present in all earlier 7.4.x Condor releases.
The submit description file command vm_cdrom_files now works properly with Windows execute machines. Previously, creation of the ISO file would fail, causing job execution to be aborted.

Known Bugs:

None.

Additions and Changes to the Manual:

Searching the PDF version of the manual for items containing underscore characters, such as many configuration variable names, now works correctly.
The new subsection 4.1.3 provides examples of evaluation results when using the operators ==, =?=, !=, and =!=.
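The behavior of these four operators can be modeled in a few lines of Python. This is a simplified sketch, not the ClassAd implementation: it covers only integers and a stand-in UNDEFINED value, and it leaves out the ERROR value and string comparison rules. All names in it are hypothetical.

```python
class _Undefined:
    """Stand-in for the ClassAd UNDEFINED value."""

    def __repr__(self):
        return "UNDEFINED"


UNDEFINED = _Undefined()


def eq(a, b):
    """Model of ==: UNDEFINED if either operand is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b


def ne(a, b):
    """Model of !=: UNDEFINED if either operand is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a != b


def meta_eq(a, b):
    """Model of =?=: always True or False; True only for same type and value."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return type(a) is type(b) and a == b


def meta_ne(a, b):
    """Model of =!=: the negation of =?=."""
    return not meta_eq(a, b)
```

In this model, eq(10, UNDEFINED) yields UNDEFINED, whereas meta_eq(10, UNDEFINED) yields False, which is the practical reason to prefer =?= and =!= when an attribute may be missing.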
Section 2.11 with specifics on vm universe jobs has been updated to contain more details about both checkpoints and vm universe jobs in general.

Version 7.4.2

Release Notes:

Condor version 7.4.2 released on April 6, 2010.

New Features:

None.

Configuration Variable and ClassAd Attribute Additions and Changes:

When WANT_SUSPEND is defined and evaluates to anything other than the value True, it is now utilized as if it were False. Previously, if it evaluated to Undefined, this was treated as an error, which caused the condor_startd to exit with an error message. If WANT_SUSPEND is not explicitly defined, the condor_startd exits with an error message.
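The rule can be expressed as a small Python sketch. The dictionary stands in for already-evaluated configuration values, None stands in for Undefined, and the function name is hypothetical; the real condor_startd exits rather than raising an exception.

```python
def effective_want_suspend(config):
    """Return the value the rule above assigns to WANT_SUSPEND.

    config maps variable names to evaluated values (a hypothetical stand-in).
    """
    if "WANT_SUSPEND" not in config:
        # Not explicitly defined: the daemon reports an error and exits.
        raise KeyError("WANT_SUSPEND is not defined")
    # Anything other than the value True is used as if it were False.
    return config["WANT_SUSPEND"] is True
```

For example, effective_want_suspend({"WANT_SUSPEND": None}) returns False instead of signaling an error.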

Bugs Fixed:

Fixed a bug in which the condor_schedd would sometimes negotiate for and try to run more jobs than specified by MAX_JOBS_RUNNING. Once the jobs started running, it would then kill them off to get back below the limit. This was more likely to happen with slow preemption caused by MaxJobRetirementTime or by a large timeout imposed by KILL. This problem has existed since before Condor 6.5. When this problem happened, the following message appeared in the condor_schedd log:
```
Preempting X jobs due to MAX_JOBS_RUNNING change
```
Fixed a problem that caused condor_ssh_to_job to fail to connect to a job running on a slot with multiple '@' signs in its name. This bug has existed since the introduction of condor_ssh_to_job in 7.3.2.
In all previous versions of Condor, condor_status refused to accept -long, -xml, and -format when followed by an argument such as -master that specified which type of daemon to look at. The order of the arguments had to be reversed or it would produce a message such as the following:
```
Error:  arg 4 (-master) contradicts arg 1 (-format)
```
Fixed a bug which caused the condor_master to crash if VIEW_SERVER was included in DAEMON_LIST and CONDOR_VIEW_HOST was unset.
Fixed a bug that caused configuration parameter LOCAL_CONFIG_DIR to be ignored if it was set in a local configuration file, as opposed to the top-level configuration file.
Fixed a bug that could cause the condor_schedd to behave incorrectly when reading an invalid job queue log on startup.
Fixed a bug that could corrupt the job queue log if the condor_schedd daemon's attempt to compact it fails.
Fixed a problem that in rare cases caused the condor_schedd to crash shortly after the condor_gridmanager exited. This bug has existed since before Condor version 6.8.

Fixed a problem that was resulting in messages such as the following:

ERROR: receiving new UDP message but found a long message still waiting
to be closed (consumed=0). Closing it now.

The file extension specified to condor_fetch_log can no longer contain a path delimiter.
When in graceful shutdown mode, the condor_schedd was sometimes starting idle scheduler universe jobs. With a large enough number of scheduler universe jobs, this could lead to a cycle of stopping and restarting jobs until the graceful shutdown time expired.
Fixed multiple bugs that prevented Condor from building on or running correctly on OpenSolaris X86/64 version 2009.06.
Fixed a bug which caused the condor_startd to incorrectly count the number of processors on some machines with Hyper-threading enabled. This bug was introduced in Condor version 7.3.2, and exists in 7.4.0 and 7.4.1.
Fixed a problem with GSI authentication in Condor that would cause daemons to consume more and more memory over time. The biggest source of trouble was introduced in Condor version 7.3.2. However, a smaller memory leak that existed in all previous versions of Condor has also been fixed.
Fixed a bug where if condor_compile is invoked in a manner such as:
```
  condor_compile gcc -print-prog-name=ld
```
an error would be emitted, and condor_compile would exit with a bad exit code.
The sort order of condor_status output accidentally changed in Condor version 7.3, so that the output was sorted on the slot name first, then machine name. The behavior is now restored to the original sorting: first on machine name, then slot name.
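The two orderings can be compared with a short Python sketch. It assumes slot names of the form slotN@machine and uses plain string comparison; the host names are made up, and the comparison that condor_status actually performs may differ in detail.

```python
def machine_then_slot(name):
    """Sort key: machine name first, then slot name."""
    slot, _, machine = name.partition("@")
    if not machine:  # no '@': the whole name is the machine
        return (slot, "")
    return (machine, slot)


names = [
    "slot2@b.example.org",
    "slot1@b.example.org",
    "slot2@a.example.org",
    "slot1@a.example.org",
]
print(sorted(names, key=machine_then_slot))
```

Sorting the same list without the key orders it by slot name first, which interleaves the machines in the way the changed behavior did.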
If one machine running a parallel job crashed, and job leases are enabled (which they are by default), the job would not exit until the job lease duration expired. As the condor_starter will not get respawned, there is no need to wait. Many sites set long job lease durations, to prevent jobs from being killed when the machine running the condor_schedd daemon reboots. Now, if one node goes away, the whole computation is shut down immediately.
Fixed the verbosity level of some condor_dagman messages written to the dagman.out file.
Fixed a bug introduced in Condor version 7.3.2 that resulted in messages such as the following even in cases where no problem in communicating with the condor_collector had been encountered:
```
Collector <X> is still being avoided if an alternative succeeds.
```
This problem was believed to be fixed in Condor 7.4.1, but some cases of the problem remained in that version.
Fixed a bug from Condor version 6.1.14, that resulted in the condor_schedd performing the operation scheduled via WALL_CLOCK_CKPT_INTERVAL at the specified frequency (default time of 1 hour), multiplied by the number of times the condor_schedd daemon had been reconfigured during its lifetime. This could lead to degraded performance, especially prior to Condor version 7.4.1, when this operation was more disk-intensive.
32-bit Linux versions of Condor running in a 64-bit environment would sometimes not detect the existence of some processes and sometimes wrongly detect that a tracked process belonged to root when it actually belonged to some other user. This could lead to failure to run jobs or failure to properly monitor and clean up after them. When the wrong process ownership problem happened, the following message appeared in the condor_master and/or condor_procd logs:
```
ProcAPI: fstat failed in /proc! (errno=75)
```
If condor_procd failed to detect the existence of its own parent process, it would exit with the following message in its log:
```
ERROR: master has exited
```
Fixed a problem in the condor_job_router daemon, introduced in Condor version 7.2.2, that could cause the daemon to crash when failing to carry out the change of state dictated by a job's periodic policy expressions, for example, the failure to put a job on hold when periodic_hold becomes True.
Fixed a bug introduced in Condor 7.3.2 that caused Grid Monitor jobs to receive a full X.509 proxy. Now, they always receive a limited proxy, which was the previous behavior.
Fixed a bug that could cause the nordugrid_gahp to crash.
Fixed a problem introduced in 7.4.0 that could cause two condor_schedd daemons with a match to the same slot to both fail to claim it, rather than letting the first one to claim it succeed. This sort of situation can happen when the condor_negotiator has a stale view of the pool, either because the gap between negotiation cycles is configured to be shorter than usual, or because updates from the condor_startd to the condor_collector are not reliably delivered and processed.
The condor_kbdd is no longer ignored by the condor_startd when the configuration variable CONSOLE_DEVICES is defined.
When using the -d command line argument with a daemon, the values of LOG, SPOOL, and EXECUTE no longer change every time a condor_reconfig command is received.

Known Bugs:

The condor_kbdd has a chance of entering an infinite loop on platforms that use X-Windows. Microsoft Windows and Mac OS X are not affected. Removing KBDD from DAEMON_LIST is a workaround, although this impairs Condor's ability to detect console usage. This bug is fixed in Condor version 7.4.3.

Additions and Changes to the Manual:

Descriptions of all the commands that may be placed into a submit description file are now located within the condor_submit manual page, instead of within Chapter 2, the Users' Manual.
An initial, but not yet complete set of configuration variables that require a restart when changed, is listed in section 3.3.1. Using condor_reconfig to change these variables' values is not sufficient.

Version 7.4.1

Release Notes:

Security Item: A flaw was found that could allow a user who already is authorized to submit jobs into Condor, to queue a job under the guise of a different user. In this way, someone who has access to a Condor submission service and is allowed to submit jobs into Condor could gain access to another non-root or non-administrator account on the system. This flaw was discovered during the development process; no incidents have been reported. Details of the problem will be made available on Feb 1st, 2010.
The default value of JOB_ROUTER_NAME has changed from an empty string to jobrouter in order to address problems caused by the previous default. Without special handling, this means that jobs being managed by condor_job_router before upgrading will not be adopted by the new version of condor_job_router if the default JOB_ROUTER_NAME was being used. To correct this, follow the instructions given in the description of JOB_ROUTER_NAME.

New Features:

Condor allows submit files to specify an IwdFlushNFSCache expression, to control whether or not Condor tries to flush the NFS cache for a job's initial working directory on job completion.
The new -attributes option to condor_status explicitly specifies the attributes to be listed when using the -xml or -long options.

Configuration Variable and ClassAd Attribute Additions and Changes:

New VOMS attributes have been introduced into the job ad to keep them separate from the X509UserProxySubjectName.
The default for JOB_ROUTER_NAME has changed from an empty string to jobrouter. See the release notes for more information about upgrading from an old version.
The configuration variable TCP_FORWARDING_HOST has existed in Condor since version 7.0.0, but was not documented. See section 3.3.6 for documentation.
The new configuration variable STARTD_PER_JOB_HISTORY_DIR allows ClassAds of completed jobs to be stored in a directory separate from the existing one specified with PER_JOB_HISTORY_DIR.

Bugs Fixed:

Condor no longer creates the job sandbox in its SPOOL directory if it is not needed.
Fixed a problem introduced in Condor version 7.4.0 that caused GSI authentication between Condor processes to fail when using a non-legacy format X.509 proxy.
Fixed a problem with CCB under Windows platforms that has existed since Condor version 7.3.0. This problem caused CCB-enabled daemons to become unresponsive after the exit of a child process.
Improved the handling of previously-submitted gt2 grid jobs upon release from hold, when there is no Globus job manager for the job running on the remote resource.
Fixed a problem with job leases for jobs that use a condor_shadow. Previously, while these jobs were running, lease renewals from the submitter would not be noticed, and the job would be aborted when the original lease expired.
Fixed a bug that only allowed approximately 50 splices to be included into a DAG input file. There is now no limit to the number of splices one may include into a DAG input file except, of course, for the implicit memory allocation limit of the condor_dagman process.
Removed attempted limiting of swap space via ulimit -v using the VirtualMemory machine ClassAd attribute in the script condor_limits_wrapper.sh.
Fixed a bug that caused ALLOW_CONFIG and HOSTALLOW_CONFIG, as well as the corresponding DENY configuration variables to incorrectly handle a setting consisting of a single * or the equivalent */*. This also fixes a bug that caused incorrect merging of ALLOW and HOSTALLOW settings when one, but not both, consisted of a single * or the equivalent */*. These bugs have existed since before Condor version 6.8.
Fixed a bug introduced in Condor version 7.3.0 that could cause Condor daemons to crash when reading malformed network addresses.
Removed a check for root ownership of a script specified by the configuration variable VM_SCRIPT.
Fixed a bug in writing the header of the file identified by the configuration variable EVENT_LOG.
Fixed a bug that could cause the condor_startd to segfault on shutdown when using dynamic slots.
Fixed a problem introduced in Condor version 7.3.2 that changed the behavior of an undocumented method for selecting attributes to be displayed in condor_q -xml. Prior to this bug, the following command would produce XML output with the attributes A and B, plus a few other attributes that were always shown.
```
condor_q -xml -format "%s" A+B
```
In Condor versions 7.3.2 and 7.4.0, this same command produced an empty XML ClassAd. The workaround was to use multiple -format options, each listing just one desired attribute, rather than a single one with an expression of all desired attributes. Although this is now fixed, the more straightforward way to select attributes since Condor version 7.3.2 is to use the -attributes option.
Fixed a bug introduced in Condor version 7.3.2 that resulted in messages such as the following even in cases where no problem in communicating with the condor_collector had been encountered:
```
Collector <X> is still being avoided if an alternative succeeds.
```
Fixed a bug that has been in the condor_startd since before Condor version 6.8. If the condor_startd ever failed to send signals to the condor_starter process, it could fail to properly clean up the machine ClassAd, leaving attributes from STARTD_JOB_EXPRS in the ClassAd but not making them visible in condor_status queries. One possible problem resulting from this could be matches being made by the condor_negotiator that are then rejected by the condor_startd. Repeated messages such as the following would then result in the condor_startd log:
```
slot1: Request to claim resource refused.
```
Fixed a problem that resulted in the following message in the condor_startd log:
```
Timer -1 not found
```
Fixed a problem in which security sessions were not cached correctly when using CCB. This resulted in re-authentication in some cases where a cached security session could have been used.
Fixed multiple problems with the handling of VOMS attributes in GSI proxies.
Fixed a bug that caused condor_dagman to hang when running a DAG with POST scripts, if the global event log is turned on.
Improved how the private network address is published when using the configuration variables PRIVATE_NETWORK_NAME and PRIVATE_NETWORK_INTERFACE. In some cases, this information was not being used, and therefore connections were made to the public address when they could have been made to the private address.
Fixed a bug exhibited under Windows XP, where using USE_VISIBLE_DESKTOP would cause strange behavior after a job completed.
CCB now works with TCP_FORWARDING_HOST. Previously, the reverse connection was made to the private address rather than to the host defined by TCP_FORWARDING_HOST.
Removed a bad optimization that caused some information about job execution to be lost during job completion or removal, if a history file was not configured.
Condor now checks whether the configuration variable GRIDFTP_URL_BASE is set before submitting cream grid jobs, as that variable is required for cream jobs to function properly. If the variable is not set, cream jobs are put on hold with an appropriate message.
Fixed a bug that allowed running virtual machines to be leaked if the condor_startd crashed.
Fixed a bug in cream_gahp which could cause crashes when there were more than 500 cream jobs queued.
Improved recovery when Condor crashes during the submission of a cream grid job. Before, affected jobs would remain in REGISTERED state on the cream server, but never run.
Improved the HoldReason message when cream grid jobs are held by the condor_gridmanager.
When naming a resource for a cream grid job, Condor now properly recognizes the format used by the standard cream client UI: https://foo.edu:8443/cream-pbs-cream_queue.
The configuration variable SOAP_SSL_CA_FILE is now consulted in addition to SOAP_SSL_CA_DIR when authenticating an https proxy for Amazon EC2, when AMAZON_HTTP_PROXY is defined.
Previously, if condor_rm and friends were given both a constraint and a user name or cluster id, they would act on all jobs matching the constraint and all jobs associated with the user or cluster. Now, this combination of arguments results in an error.
Failure to purge a cream grid universe job from the remote server because it was previously purged no longer results in the job being held.
The condor_gridmanager now recognizes VOMS attributes in X.509 proxies and will handle them appropriately. For example, it recognizes that two proxies with the same identity but different VOMS attributes may be mapped to different accounts on a remote machine.
Fixed a bug in condor_dagman, introduced in 7.3.2, that caused condor_dagman running on Windows to hang on any DAG using more than one log file for the node jobs.
Fixed a bug in condor_dagman, introduced in 7.3.2, that could cause condor_dagman to fail on a DAG using node job log files on multiple devices, if log files on different devices happened to have the same inode number.
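The underlying point, that an inode number is unique only within one device, can be sketched in Python. This is not condor_dagman's code; the function names are hypothetical, and the standard library's os.path.samefile offers a comparable check.

```python
import os


def file_identity(path):
    """Identify a file by (device, inode); inode numbers alone repeat across devices."""
    info = os.stat(path)
    return (info.st_dev, info.st_ino)


def same_file(path_a, path_b):
    """True if both paths refer to the same underlying file."""
    return file_identity(path_a) == file_identity(path_b)
```

Comparing the pair rather than the inode number alone keeps two log files on different devices distinct even when their inode numbers coincide.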
Fixed a bug that caused the condor_schedd daemon to segfault when spooling more than 9 files.
Fixed a bug that caused the condor_startd daemon to crash on Debian Stable.
Fixed keyboard activity detection on the Windows XP platform.
Fixed a bug in the condor_had daemon that caused it to not start the controlled daemon if CCB was enabled.
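The multi-device log file fix listed above rests on a general point: an inode number is unique only within a single device, so a file has to be identified by its device and inode together. The following Python sketch illustrates that idea only. It is not condor_dagman's code, and the helper names file_identity and same_file are hypothetical.

```python
import os


def file_identity(path):
    """Return a (device, inode) pair for the file behind path.

    Hypothetical helper for illustration; not part of Condor.
    """
    st = os.stat(path)
    # The inode alone is ambiguous across devices; pair it with the device id.
    return (st.st_dev, st.st_ino)


def same_file(path_a, path_b):
    """True only if both paths refer to the same file on the same device."""
    return file_identity(path_a) == file_identity(path_b)
```

Comparing st_ino values alone would treat two unrelated log files on different devices as one file whenever their inode numbers happened to match.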

Known Bugs:

The condor_kbdd has a chance of entering an infinite loop on platforms that use X-Windows. Microsoft Windows and Mac OS X are not affected. Removing KBDD from DAEMON_LIST is a workaround, although this impairs Condor's ability to detect console usage. This bug is fixed in Condor version 7.4.3.
condor_dagman may fail on Windows if the set of node job log file names includes multiple paths that are hard links (not symbolic links) to the same file.
condor_dagman PRE and POST script arguments (and the names of the scripts themselves) cannot contain spaces.
condor_dagman VARS values cannot contain single quotes.

Additions and Changes to the Manual:

Added documentation about how to include spaces (and other special characters) in condor_dagman VARS values.

Version 7.4.0

Release Notes:

The default configuration file within the release now uses ALLOW/DENY in place of HOSTALLOW/HOSTDENY for security related settings. We recommend making this same change throughout all configuration files. That way, a policy that depends on the default policy should continue to work as it did before. The behavior of these configuration variables remains unchanged. The ALLOW/DENY lists are added to the HOSTALLOW/HOSTDENY lists to form the authorization policy. Both styles support the same syntax. This change permits an anticipated phasing out of the HOSTALLOW/HOSTDENY configuration variables, in order to simplify configuration.
As of Condor version 7.3.2, condor_q -xml output no longer begins with non-XML text consisting of two blank lines followed by a line of the following form:
```
-- Submitter: schedd-name : <IP> : hostname
```
All Stork data placement is now supported by the Stork project at the LSU Center for Computation and Technology (http://www.cct.lsu.edu). Please see the home page of the Stork project at http://www.cct.lsu.edu/~kosar/stork/index.php for details and software.
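Scripts that parse condor_q -xml output captured from both older and newer versions may want to tolerate the old non-XML preamble described in the release note above. A minimal Python sketch, assuming the XML portion begins with an XML declaration ("<?xml"); the function name strip_condor_q_preamble is hypothetical.

```python
def strip_condor_q_preamble(output):
    """Drop any text that precedes the XML declaration.

    Assumes the XML part starts with "<?xml"; if no declaration is
    found, the text is returned unchanged.
    """
    start = output.find("<?xml")
    return output if start < 0 else output[start:]
```

Output that already starts with the XML declaration passes through unchanged, so the same function can be applied regardless of the Condor version that produced the text.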

New Features:

Condor is now integrated with the Hadoop Distributed File System (HDFS). See documentation in section 8.2 and section 8.2.1.
condor_q with the options -analyze and -better-analyze now provides analysis for scheduler and local universe jobs. Specifically, the START_SCHEDULER_UNIVERSE and START_LOCAL_UNIVERSE expressions are checked.
Added the ClassAd attributes TotalLocalRunningJobs, TotalLocalIdleJobs, TotalSchedulerRunningJobs, and TotalSchedulerIdleJobs to the published ClassAd for the condor_schedd. This means that condor_q -analyze can still give helpful information about why local or scheduler universe jobs are idle when the configuration variables START_LOCAL_UNIVERSE or START_SCHEDULER_UNIVERSE refer to these attributes. These attributes were already present internally within the condor_schedd daemon, just not published.
The condor_vm-gahp now supports KVM and links with libvirt, rather than calling virsh command-line tools.
Greatly improved the condor_gridmanager's scalability when handling many grid type gt2 grid universe jobs. Improvements include more quickly processing updated X.509 certificates, not checking jobs for status updates if they have not been submitted to the remote site, and eliminating unnecessary updates to the condor_schedd daemon.
Latency in the submission and cleaning up of Condor-C jobs has been improved by changing the default value of C_GAHP_CONTACT_SCHEDD_DELAY from 20 to 5.
The eval() ClassAd function added in Condor version 7.3.2 is now also understood by the condor_job_router and condor_q using the -better-analyze option.
The submit command run_as_owner is now supported for Unix platforms. Previously, it was only supported on Windows platforms.
When setting AMAZON_HTTP_PROXY, a username and password can now be given as part of the proxy URL. The value of SOAP_SSL_CA_DIR is now consulted when authenticating an https proxy for Amazon EC2, when AMAZON_HTTP_PROXY is defined.
The condor_collector daemon now advertises to itself, and will appear in the output of condor_status -collector.
Optimizations in core Condor systems should provide minor speed improvements.
Updated the maximum log size to the operating system's maximum file size.

Configuration Variable and ClassAd Attribute Additions and Changes:

The undocumented configuration variable TOOLS_PROVIDE_OLD_MESSAGES is no longer recognized by Condor.
The new configuration variable SCHEDD_JOB_QUEUE_LOG_FLUSH_DELAY sets an upper bound in seconds on how long it takes for changes to the job ClassAd to be visible to the Condor Job Router and to Quill. The default value is 5 seconds. Previously, there was no upper limit. Typically, other activity in the job queue, such as jobs being submitted or completed, would cause buffered data to be flushed to disk, such that the effective upper bound was a function of how busy the job queue was.
The default configuration file now uses ALLOW/DENY in place of HOSTALLOW/HOSTDENY. See the release notes above for more information.
The default value for MAX_JOBS_RUNNING has changed. Previously, it was 200. Now it is defined by an expression that depends on the total amount of memory and the operating system. The default expression requires 1 MByte of RAM per running job on the submit machine. In some environments and configurations, this is overly generous and can be cut by as much as 50%. Under Windows, the number of running jobs is still capped at 200. A 64-bit version of Windows is recommended in order to raise the value above the default. Under Unix, the maximum default is now 10,000. To scale higher, we recommend that the system ephemeral port range be extended so that there are at least 2.1 ports per running job.
The default value of RESERVED_SWAP has changed to the value 0, which disables the condor_schedd daemon's check for sufficient swap space before starting more jobs. The new expression defined with MAX_JOBS_RUNNING has a more appropriate memory check, so the configuration variable RESERVED_SWAP will no longer be used in the near future. For cases where RESERVED_SWAP is not set to 0, the default value of SHADOW_SIZE_ESTIMATE has changed to 800 Kbytes. Previously, it was 200 if not set, but it was set to 1800 in the example configuration file.
The default values of START_LOCAL_UNIVERSE and START_SCHEDULER_UNIVERSE have changed. Previously, these were set to True. Now, they are set using an expression that allows up to 200 local universe and 200 scheduler universe jobs to run.
The default value of GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE has changed from 100 to 1000.
The default value of NEGOTIATOR_INTERVAL has changed from 300 to 60.
The default value of ENABLE_GRID_MONITOR has been changed from False to True. This variable was assigned to True in the example configuration file, so the change in default value now matches the value set in the example configuration.
The configuration variable VM_VERSION has been removed, as has the machine ClassAd attribute of the same name. When the virtual machine version information is needed in the machine ClassAd, the configuration variable STARTD_ATTRS can be used to add it.
The default configuration now uses VM_BRIDGE_SCRIPT and VM_SCRIPT in place of XEN_BRIDGE_SCRIPT and XEN_SCRIPT due to the support of KVM. Submit description file commands have also been added, and they include: kvm_disk, kvm_transfer_files, and kvm_cd_rom_device.
The configuration variables XEN_DEFAULT_KERNEL and XEN_DEFAULT_INITRD have been removed. Corresponding to this, the submit description file command xen_kernel = any is no longer valid.
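The arithmetic behind the new MAX_JOBS_RUNNING default can be sketched in Python. This is only a rough model of the rules stated above (1 MByte of RAM per running job, a cap of 200 on Windows and 10,000 on Unix, and 2.1 ephemeral ports per running job); the actual default expression may differ in detail, and the function names are hypothetical.

```python
UNIX_CAP = 10_000      # maximum default under Unix, per the text above
WINDOWS_CAP = 200      # cap under Windows, per the text above
MBYTES_PER_JOB = 1     # RAM required per running job on the submit machine


def approx_default_max_jobs_running(ram_mbytes, windows=False):
    """Approximate the default limit from total RAM in MBytes (assumed model)."""
    cap = WINDOWS_CAP if windows else UNIX_CAP
    return min(ram_mbytes // MBYTES_PER_JOB, cap)


def min_ephemeral_ports(running_jobs):
    """Ports needed at 2.1 per running job, rounded up.

    Integer arithmetic, ceil(21 * n / 10), avoids floating-point rounding.
    """
    return (21 * running_jobs + 9) // 10
```

Under this model, a Unix submit machine with 16384 MBytes of RAM reaches the 10,000 cap, and running 10,000 jobs calls for an ephemeral port range of at least 21,000 ports.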

Bugs Fixed:

Fixed a bug that prevented parallel universe jobs from running on condor_startd daemons with dynamic slots enabled.
Fixed a race condition bug in the condor_startd which allowed it to send Unix signals, intended for condor_starter processes, as root to non-Condor related processes.
A Windows platform bug has been fixed. The bug caused a 20-second interval in which the condor_shadow, condor_startd, and condor_starter daemons appeared as deadlocked. The bug was visible if a job ClassAd update from the condor_starter caused the job's periodic hold or remove policy to become True.
Fixed a bug that could cause condor_dagman to generate an illegal rescue DAG, if it read events incorrectly in recovery mode. condor_dagman now checks for events that violate DAG semantics when reading events in recovery mode, and it exits without creating a rescue DAG if it reads such an event.
Fixed a bug that could cause condor_dagman to abort if it saw the combination of a terminated event and an aborted event on a node with retries.
Changed some logged warnings in condor_dagman to not be printed at the default verbosity setting.
The version compatibility checking between a .condor.sub file and the condor_dagman binary, which is done at DAG startup, is now much more permissive. Currently, .condor.sub files with Condor versions of 7.1.2 and later are accepted by condor_dagman.
Fixed a bug introduced with the new condor_dagman lazy log file evaluation code in Condor version 7.3.2. The bug sometimes caused failure when running rescue DAGs.
Fixed a bug originating in Condor version 7.1.4. When a user submitted a job with an executable that did not have execute permission enabled, Condor was running as root, and file transfer was specified in the job, Condor failed to automatically turn on execute permission after transferring the file.
Fixed a bug that appeared in Condor version 7.3.2. The configuration variable COUNT_HYPERTHREAD_CPUS was ignored and was effectively treated as False in all cases.
Fixed a bug in which the Condor Job Router was not able to see matchmaking diagnostic attributes such as LastRejMatchTime. Therefore, when evaluating policy expressions that referred to these attributes, they were effectively treated as though Undefined. Quill was also not able to see these attributes.
Fixed a bug introduced in Condor version 7.3.2 that could cause the condor_gridmanager to crash repeatedly on startup, if the job queue contained grid type gt2 jobs that had been previously submitted.
Fixed two bugs introduced in Condor version 7.3.2, and related to VOMS. The first bug prevented jobs with X.509 proxies from being submitted on platforms on which Condor does not support VOMS. The second bug prevented submission of jobs with VOMS proxies, if the authenticity of the VOMS extensions could not be verified. At the same time, improved memory usage when VOMS extensions are not used.
Fixed a bad default in the file batch_gahp.config, that prevented Condor from observing job state changes for grid universe jobs with a grid type of pbs or lsf.
Fixed a bug that caused Condor-C jobs to fail if the submit description file command transfer_executable was set to False.
Fixed a bug that caused Condor-C jobs to fail if the executable or one of the stdin, stdout, or stderr file names contained a comma.
File transfer for grid type gt4 jobs requires an empty directory within /tmp, which the condor_gridmanager creates. If this directory is deleted, the condor_gridmanager will now recreate it.
Fixed a bug that could cause the user job log to become corrupted on Windows platforms. This bug would manifest itself only if the same log file was specified with different paths. For example, the following submit file could have triggered this bug:
```
...
initialdir = /data/job1
log = ../JobLog
queue

initialdir = /data/job2
log = ../JobLog
queue
```
Fixed a memory leak in the condor_collector daemon that was introduced in Condor version 7.3.2.
Fixed a bug introduced in Condor version 7.3.2 that resulted in the condor_negotiator daemon refusing to run, if the configuration variable GROUP_QUOTA for any group was set to 0.
Fixed a bug that caused the ctime in the event log header to always be zero.
Fixed the output of condor_status when used with the command-line options -java or -vm.
Fixed a problem in the condor_schedd daemon introduced in 7.3.2. For condor_schedd daemons with lots of jobs having periodic release expressions, this bug could result in the condor_schedd taking a long time while evaluating periodic expressions, causing it to become unresponsive to queries and other tasks. With a job queue of 30,000 jobs, a period of unresponsiveness of an hour was observed, whereas the evaluation of periodic expressions in this same environment normally takes less than 5 seconds.
Potential bugs and memory leaks were identified and fixed throughout Condor. The Condor Team is not aware of anyone having encountered these bugs.
The condor_starter cleans up working directories in more situations. Previously during some error conditions, the working directory within $(EXECUTE) might be left behind.
Previously, if the user log could not be accessed when a local universe job started, the job would fail and immediately be retried. Now the job is placed on hold.
Fixed a bug in the condor_startd in which vacating jobs would not respect the value of JobLeaseDuration.
Updated the detection of HasVM within the condor_startd to publish an update to the condor_collector, when the configuration variable VM_RECHECK_INTERVAL is specified.
Fixed a bug in which the condor_gridmanager could, in rare cases, waste a small amount of memory and processor time checking for proxy files no longer being used by any active jobs.
The setting CREAM_GAHP was added to the default configuration file with a value of $(SBIN)/cream_gahp. Existing installations desiring to submit jobs to CREAM should add this setting.
Fixed a bug where condor_restart would fail on a condor_collector daemon configured for high availability with multiple condor_collector daemons.
Fixed a bug in which multiple entries of output from the command condor_status -negotiator would be on a single line. They are now listed one per line.
Fixed a bug in which the command condor_submit -dump would crash if multiple jobs were queued from within a single submit file.
Fixed a bug in which a slot whose associated job disappeared could remain in the Claimed/Idle state until the claim lease expired. The slot should now promptly return to the Unclaimed/Idle state.
Fixed a bug in which a condor_startd using dynamic slots could crash on shutdown or reconfiguration.
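In the user job log example above, the two log commands name the same file because each relative path is taken relative to its own initialdir. The Python sketch below shows that resolution. It is purely lexical (it does not follow symbolic links), it is not how Condor itself compares log files, and the function name resolve_log_path is hypothetical.

```python
import posixpath


def resolve_log_path(initialdir, log):
    """Join a log path onto its initialdir and collapse any '..' parts."""
    return posixpath.normpath(posixpath.join(initialdir, log))


first = resolve_log_path("/data/job1", "../JobLog")
second = resolve_log_path("/data/job2", "../JobLog")
print(first == second)
```

Both calls yield /data/JobLog, so although the two queue statements use different initial directories, their events land in one shared log file.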

Known Bugs:

The condor_kbdd has a chance of entering an infinite loop on platforms that use X-Windows. Microsoft Windows and Mac OS X are not affected. Removing KBDD from DAEMON_LIST is a workaround, although this impairs Condor's ability to detect console usage. This bug is fixed in Condor version 7.4.3.
There are multiple bugs related to using VOMS attributes. In Condor version 7.4.0, VOMS support should be disabled by setting the configuration variable USE_VOMS_ATTRIBUTES = FALSE.
Setting the configuration variable USE_VISIBLE_DESKTOP to True will corrupt the visible desktop. This bug is present back through Condor version 7.2.4. This configuration variable did not work at all in 7.2 releases prior to 7.2.4. This bug will be fixed in Condor version 7.4.1.
If the global event log (see section 3.3.4) is turned on, condor_dagman will hang when running any DAG that has POST scripts.
condor_dagman will hang on Windows when running any DAG that uses more than one log file for the node jobs.

Additions and Changes to the Manual:

See section 8.2 and section 8.2.1 for preliminary documentation of Condor's integration with the Hadoop Distributed File System (HDFS).

condor-admin@cs.wisc.edu