This is an outdated version of the HTCondor Manual. You can find current documentation at http://htcondor.org/manual.

Next: 10.3 Stable Release Series Up: 10. Version History and Previous: 10.1 Introduction to HTCondor Contents Index

10.2 Upgrading from the 7.8 series to the 8.0 series of HTCondor

While upgrading from the 7.8 series of HTCondor to the 8.0 series will bring many new features and improvements introduced in the 7.9 series of HTCondor, it will also introduce changes that administrators of sites running from an older HTCondor version should be aware of when planning an upgrade. Here is a list of items that administrators should be aware of.

There is an issue with DAGMan jobs upon upgrade from HTCondor version 7.8.x or an earlier version to version 8.0.0. Without administrative intervention, queued DAGMan jobs will restart from the beginning of the DAG after the upgrade. There will be no issue if the upgrade is from HTCondor version 7.8.x or an earlier version to HTCondor version 8.0.1 or later versions.
To avoid starting DAGMan jobs from the beginning after the upgrade, the administrator should ensure that no condor_dagman jobs are queued. Do a condor_rm on all condor_dagman jobs and wait for Rescue DAGs to be written before shutting down HTCondor to perform the upgrade. Any condor_dagman jobs that are on hold should be released before being removed. After the upgrade is complete and HTCondor has restarted, all of these DAGMan jobs should be re-submitted. This will cause them to read the appropriate Rescue DAGs and continue on.
To avoid losing work within partially-completed node jobs, an alternative is to use the halt file feature, as described in section 2.10.6. This will cause all condor_dagman jobs to eventually drain from the queue(s). This will take longer than doing a condor_rm on those jobs. condor_dagman jobs drained via the halt file method will also have to be re-submitted after the upgrade.
The upgrade will change the machine ClassAd attribute CheckpointPlatform for all machines. This implies that any standard universe job with a checkpoint from before the upgrade will not resume after the upgrade. To work around this potential difficulty, either change the attribute CheckpointPlatform on all machines to their previous value by setting the CHECKPOINT_PLATFORM configuration variable, or change the LastCheckpointPlatform attribute for all jobs that have produced a checkpoint. Make the change by using condor_qedit.
For example, if machine attribute CheckpointPlatform changed from LINUX INTEL 2.6.x normal N/A to LINUX INTEL 2.6.x normal N/A ssse3 sse4_1 sse4_2, use the following command:
```
condor_qedit -constraint 'LastCheckpointPlatform == "LINUX INTEL 2.6.x normal N/A"'
    LastCheckpointPlatform "LINUX INTEL 2.6.x normal N/A ssse3 sse4_1 sse4_2"
```

Next: 10.3 Stable Release Series Up: 10. Version History and Previous: 10.1 Introduction to HTCondor Contents Index

htcondor-admin@cs.wisc.edu