3.10 Pool Management

HTCondor provides administrative tools to help with pool management. This section describes some common pool management tasks and the tools used to carry them out.

All of the commands described in this section are subject to the security policy chosen for the HTCondor pool. As such, the commands must either be run from a machine that has the proper authorization, or be run by a user who is authorized to issue the commands. Section 3.6 details the implementation of security in HTCondor.


3.10.1 Upgrading - Installing a New Version on an Existing Pool

An upgrade changes the running version of HTCondor from the current installation to a newer version. The safe method to install and start running a newer version of HTCondor in essence is: shut down the current installation of HTCondor, install the newer version, and then restart HTCondor using the newer version. To allow for falling back to the current version, place the new version in a separate directory. Copy the existing configuration files, and modify the copy to point to and use the new version, as well as incorporate any configuration variables that are new or changed in the new version. Set the CONDOR_CONFIG environment variable to point to the new copy of the configuration, so the new version of HTCondor will use the new configuration when restarted.
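The configuration-copy step described above might be sketched as follows. This is a minimal illustration, not an official procedure: the function name stage_config and all paths shown are hypothetical, and the copied configuration still has to be edited by hand to point at the new version.

```shell
#!/bin/sh
# Sketch: copy the existing configuration directory so that the original
# stays untouched and remains available for falling back.
# stage_config and every path below are hypothetical examples.

# stage_config OLD_CONFIG_DIR NEW_CONFIG_DIR
stage_config() {
    old_dir=$1
    new_dir=$2
    if [ ! -d "$old_dir" ]; then
        echo "no such directory: $old_dir" >&2
        return 1
    fi
    if [ -e "$new_dir" ]; then
        echo "refusing to overwrite: $new_dir" >&2
        return 1
    fi
    # -R copies recursively, -p preserves modes and timestamps
    cp -Rp "$old_dir" "$new_dir"
}

# Example use (commented out; adjust the assumed paths first):
#   stage_config /etc/condor /etc/condor-new
#   ... edit /etc/condor-new/condor_config to use the new version ...
#   CONDOR_CONFIG=/etc/condor-new/condor_config
#   export CONDOR_CONFIG
#   condor_master
```

Because the old configuration directory is never modified, falling back amounts to pointing CONDOR_CONFIG at the original file again and restarting the old version.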

When upgrading from a version of HTCondor earlier than 6.8 to a more recent version, note that the configuration settings must be modified for security reasons. Specifically, the HOSTALLOW_WRITE configuration variable must be explicitly changed; otherwise, no jobs may be submitted, and HTCondor tools will issue error messages.

Another way to upgrade leaves HTCondor running. HTCondor will automatically restart itself if the condor_master binary is updated, and this method takes advantage of that behavior. Download the newer version, placing it such that it does not overwrite the currently running version. The download includes a new set of configuration files; update this new set with any customizations made in the currently running version of HTCondor. Then, modify the currently running installation by changing its configuration such that the path to binaries points instead to the new binaries. One way to do that (under Unix) is to use a symbolic link that points to the current HTCondor installation directory (for example, /opt/condor). Change the symbolic link to point to the new directory. If HTCondor is configured to locate its binaries via the symbolic link, then after the symbolic link changes, the condor_master daemon notices the new binaries and restarts itself. How frequently it checks is controlled by the configuration variable MASTER_CHECK_NEW_EXEC_INTERVAL, which defaults to 5 minutes.
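As a rough sketch of the symbolic link approach, the helper below repoints a link at a new installation directory. The function name switch_release and the directory names are hypothetical; /opt/condor is the example link from the text. Note that ln -sfn replaces the link in two steps rather than atomically, which is assumed to be acceptable here.

```shell
#!/bin/sh
# Sketch: repoint the symbolic link through which HTCondor locates its binaries.
# switch_release and the versioned directory names are hypothetical.

# switch_release LINK NEW_DIR
switch_release() {
    link=$1
    new_dir=$2
    if [ ! -d "$new_dir" ]; then
        echo "new installation directory not found: $new_dir" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace an existing link,
    # -n: treat LINK itself as the target instead of following it
    ln -sfn "$new_dir" "$link"
}

# Example use (commented out; the versioned directory is an assumption):
#   switch_release /opt/condor /opt/condor-new
```

After the link changes, no further command is needed: the condor_master picks up the new binaries at its next check.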

When the condor_master notices new binaries, it begins a graceful restart. On an execute machine, a graceful restart means that running jobs are preempted. Standard universe jobs will attempt to take a checkpoint. This could be a bottleneck if all machines in a large pool attempt to checkpoint at the same time. If the checkpoints do not complete within the cutoff time specified by the KILL policy expression (10 minutes by default), then the jobs are killed without producing a checkpoint. It may be appropriate to increase this cutoff time, but a better approach may be to upgrade the pool in stages rather than all at once.

For universes other than the standard universe, jobs are preempted. If jobs have been guaranteed a certain amount of uninterrupted run time with MaxJobRetirementTime, then the job is not killed until the specified amount of retirement time has been exceeded (which is 0 by default). The first step of killing the job is a soft kill signal, which can be intercepted by the job so that it can exit gracefully, perhaps saving its state. If the job has not gone away once the KILL expression fires (10 minutes by default), then the job is forcibly hard-killed. Since the graceful shutdown of jobs may rely on shared resources such as disks where state is saved, the same reasoning applies as for the standard universe: it may be appropriate to increase the cutoff time for large pools, and a better approach may be to upgrade the pool in stages to avoid jobs running out of time.
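One way to upgrade in stages is to process the machines in small batches with a pause between batches. The sketch below assumes a plain text list of host names and a per-host upgrade command; the function run_in_stages, the file hosts.txt, and the script upgrade_host.sh are all hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: run a per-host command over a list of hosts in batches.
# run_in_stages is a hypothetical helper, not an HTCondor tool.

# run_in_stages BATCH_SIZE PAUSE_SECONDS CMD [ARGS...]
# Reads host names from standard input, one per line, and runs
# "CMD ARGS... HOST" for each, sleeping after every BATCH_SIZE hosts.
# CMD must not read standard input itself, or it will consume the host list.
run_in_stages() {
    batch_size=$1
    pause=$2
    shift 2
    count=0
    while IFS= read -r host; do
        # skip blank lines in the host list
        if [ -z "$host" ]; then
            continue
        fi
        "$@" "$host"
        count=$((count + 1))
        if [ $((count % batch_size)) -eq 0 ]; then
            sleep "$pause"
        fi
    done
}

# Example use (commented out): ten hosts at a time, 15 minutes apart.
#   run_in_stages 10 900 ./upgrade_host.sh < hosts.txt
```

The pause should be chosen with the preemption cutoff in mind, so that one batch of jobs has finished checkpointing or exiting before the next batch begins.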

Another time limit to be aware of is the configuration variable SHUTDOWN_GRACEFUL_TIMEOUT. This defaults to 30 minutes. If the graceful restart is not completed within this time, a fast restart ensues. This causes jobs to be hard-killed.
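The time limits discussed in this section are ordinary configuration variables. The fragment below is only an illustration of how they might be set; the values are examples expressed in seconds, not recommendations.

```
# How often the condor_master checks for new binaries (5 minutes).
MASTER_CHECK_NEW_EXEC_INTERVAL = 300

# Illustrative: guarantee jobs up to one hour of retirement time
# before they are preempted.
MAXJOBRETIREMENTTIME = 3600

# Illustrative: allow a graceful restart up to two hours
# before a fast restart ensues.
SHUTDOWN_GRACEFUL_TIMEOUT = 7200
```

Comments are placed on their own lines, since text following a value on the same line may be taken as part of the value.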


3.10.2 Shutting Down and Restarting an HTCondor Pool

Shutting Down HTCondor
There are a variety of ways to shut down all or parts of an HTCondor pool. All utilize the condor_off tool.

To stop a single execute machine from running jobs, the condor_off command specifies the machine by host name.

  condor_off -startd <hostname>

A running standard universe job will be allowed to take a checkpoint before the job is killed. A running job under another universe will be killed. If it is instead desired that the machine stops running jobs only after the currently executing job completes, the command is

  condor_off -startd -peaceful <hostname>

Note that this waits indefinitely for the running job to finish, before the condor_startd daemon exits.

To shut down all execute machines within the pool,

  condor_off -all -startd

To shut down each execute machine in the pool only after its current HTCondor job finishes, waiting indefinitely if necessary,

  condor_off -all -startd -peaceful

To shut down HTCondor on a machine from which jobs are submitted,

  condor_off -schedd <hostname>

If it is instead desired that the submit machine shuts down only after all jobs that are currently in the queue are finished, first disable new submissions to the queue by setting the configuration variable

  MAX_JOBS_SUBMITTED = 0

See instructions below in section 3.10.3 for how to reconfigure a pool. After the reconfiguration, the command to wait for all jobs to complete and shut down the submission of jobs is

  condor_off -schedd -peaceful <hostname>

Substitute the option -all for the host name, if all submit machines in the pool are to be shut down.
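The peaceful shutdown of a submit machine involves three steps in order: disable new submissions, reconfigure, and then issue the peaceful condor_off. A minimal sketch of that sequence follows. It assumes the script runs on the submit machine itself and that the path given is a configuration file HTCondor reads on that machine; the helper names run and drain_submit_host, the DRY_RUN switch, and the example host and path are hypothetical.

```shell
#!/bin/sh
# Sketch: drain a submit machine before shutting down its condor_schedd.
# run, drain_submit_host, and DRY_RUN are hypothetical names.

# With DRY_RUN=1, print each command instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# drain_submit_host HOST CONFIG_FILE
drain_submit_host() {
    host=$1
    config_file=$2
    # 1. Disable new submissions to the queue.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "append 'MAX_JOBS_SUBMITTED = 0' to $config_file"
    else
        printf 'MAX_JOBS_SUBMITTED = 0\n' >> "$config_file"
    fi
    # 2. Have the running daemons re-read the configuration.
    run condor_reconfig "$host"
    # 3. Wait for queued jobs to finish, then shut down the condor_schedd.
    run condor_off -schedd -peaceful "$host"
}

# Example use (commented out; host name and path are assumptions):
#   DRY_RUN=1 drain_submit_host submit.example.org /etc/condor/condor_config.local
```

Running with DRY_RUN=1 first shows the commands that would be issued without touching the configuration or the daemons.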

Restarting HTCondor, If HTCondor Daemons Are Not Running
If HTCondor is not running, perhaps because one of the condor_off commands was used, then starting HTCondor daemons back up depends on which part of HTCondor is currently not running.

If no HTCondor daemons are running, then starting HTCondor is a matter of executing the condor_master daemon. The condor_master daemon will then invoke all other specified daemons on that machine. The condor_master daemon executes on every machine that is to run HTCondor.

If a specific daemon needs to be started up, and the condor_master daemon is already running, then issue the command on the specific machine with

  condor_on -subsystem <subsystemname>

where <subsystemname> is replaced by the daemon's subsystem name. Alternatively, this command may be issued from another machine in the pool (one which has administrative authority) with

  condor_on <hostname> -subsystem <subsystemname>

where <hostname> is replaced by the host name of the machine to which this condor_on command is directed.

Restarting HTCondor, If HTCondor Daemons Are Running
If HTCondor daemons are currently running, but need to be killed and newly invoked, the condor_restart tool does this. This would be the case for a new value of a configuration variable for which using condor_reconfig is inadequate.

To restart all daemons on all machines in the pool,

  condor_restart -all

To restart all daemons on a single machine in the pool,

  condor_restart <hostname>

where <hostname> is replaced by the host name of the machine to be restarted.


3.10.3 Reconfiguring an HTCondor Pool

To change a global configuration variable and have all the machines start to use the new setting, change the value within the global configuration file, and send a condor_reconfig command to each host. This can be done with a single command,

  condor_reconfig -all

If the global configuration file is not shared among all the machines (as it would be when using a shared file system), the change must be made to each copy of the global configuration file before issuing the condor_reconfig command.

Issuing a condor_reconfig command is inadequate for some configuration variables. For those, a restart of HTCondor is required. The configuration variables that require a restart are listed in section 3.3.1. The manual page for condor_restart is in section 11.


3.10.4 Absent ClassAds

By default, HTCondor assumes that resources are transient: the condor_collector will discard ClassAds older than CLASSAD_LIFETIME seconds. The default value of CLASSAD_LIFETIME is 15 minutes, so the default value of UPDATE_INTERVAL will elapse three times before HTCondor forgets about a resource. In some pools, especially those with dedicated resources, this approach may make it unnecessarily difficult to determine what the composition of the pool ought to be, in the sense of knowing which machines would be in the pool if HTCondor were functioning properly on all of them.

This assumption of transient machines can be modified by the use of absent ClassAds. When a machine ClassAd would otherwise expire, the condor_collector evaluates the configuration variable ABSENT_REQUIREMENTS against the machine ClassAd. If True, the machine ClassAd will be saved in a persistent manner and be marked as absent; this causes the machine to appear in the output of condor_status -absent. When the machine returns to the pool, its first update to the condor_collector will invalidate the absent machine ClassAd.

Absent ClassAds, like offline ClassAds, are stored to disk to ensure that they are remembered, even across condor_collector crashes. The configuration variable COLLECTOR_PERSISTENT_AD_LOG defines the file in which the ClassAds are stored, and replaces the no longer used variable OFFLINE_LOG. Absent ClassAds are retained on disk, as maintained by the condor_collector, for a length of time in seconds defined by the configuration variable ABSENT_EXPIRE_ADS_AFTER. A value of 0 for this variable means that the ClassAds are never discarded, and the default value is thirty days.

Absent ClassAds are only returned by the condor_collector and displayed when the -absent option to condor_status is specified, or when the absent machine ClassAd attribute is mentioned on the condor_status command line. This renders absent ClassAds invisible to the rest of the HTCondor infrastructure.

A daemon may inform the condor_collector that the daemon's ClassAd should not expire, but should be removed right away; the daemon asks for its ClassAd to be invalidated. It may be useful to place an invalidated ClassAd in the absent state, instead of having it removed as an invalidated ClassAd. An example of a ClassAd that could benefit from being absent is a system with an uninterruptible power supply that shuts down cleanly but unexpectedly as a result of a power outage. To cause all invalidated ClassAds to become absent instead of invalidated, set EXPIRE_INVALIDATED_ADS to True. Invalidated ClassAds will instead be treated as if they expired, including when evaluating ABSENT_REQUIREMENTS.
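Taken together, the variables in this section might appear in the configuration of the condor_collector as in the fragment below. This is an illustrative sketch only: the expression and retention time are example values, and the log file name is a hypothetical choice.

```
# Mark every expiring machine ClassAd as absent (example expression).
ABSENT_REQUIREMENTS = True

# File in which absent ClassAds are persisted; the file name is hypothetical.
COLLECTOR_PERSISTENT_AD_LOG = $(SPOOL)/PersistentAdLog

# Keep absent ClassAds for thirty days: 30 * 24 * 60 * 60 = 2592000 seconds.
ABSENT_EXPIRE_ADS_AFTER = 2592000

# Treat invalidated ClassAds as if they had expired, so that they
# are also evaluated against ABSENT_REQUIREMENTS.
EXPIRE_INVALIDATED_ADS = True
```

With such a configuration in place, the machines that have dropped out of the pool can be listed with condor_status -absent.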

