HTCondor Checkpointing
Overview
The ability to cause a job to
vacate a non-idle
workstation and
migrate to an idle workstation is central to
HTCondor's job scheduling.
To allow these migrating jobs to continue their progress,
HTCondor must be able to continue a vacated job without
a complete loss of computation time.
HTCondor does this by using a
checkpoint
of a job's state.
A checkpoint is generated just before a job vacates
a machine.
The checkpoint is then used
as a starting point for the job after migration.
HTCondor gives a program the
ability to checkpoint itself by providing a checkpointing library
which contains a signal handler for SIGTSTP, the HTCondor checkpoint
signal. When a HTCondor job receives a checkpoint signal, it writes a
checkpoint file, which contains the process's data and stack segments,
as well as information about open files, pending signals, and CPU
state. When the job is restarted, the state contained in the
checkpoint file is restored (by startup code in the checkpoint
library) and the process resumes the computation at the point
where the checkpoint was generated.
Programs submitted to be run by the HTCondor system are re-linked (but not
re-compiled) to include this library.
A technical description of the checkpointing and process migration
mechanisms implemented in HTCondor is given in this
technical report.
Standalone Checkpointing
The HTCondor distribution includes a checkpointing library,
to provide checkpoints for Unix
processes that are not submitted for execution under HTCondor.
Information on using standalone checkpointing is available
in Chapter 4, section 2, of the
6.4 manual.
Periodic Checkpointing
In addition to writing a checkpoint at vacate time, HTCondor can be
configured to write checkpoints periodically while the job is
executing, to provide additional reliability.
Checkpoint Server
By default, a checkpoint is written to a file on the local disk of the
machine where the job was submitted. A checkpoint server has been
developed to serve as a repository for checkpoints. When a host is
configured to use a checkpoint server, jobs submitted on that machine
write and read checkpoints to and from the server rather than the
local disk of the submitting machine, taking the burden of storing
checkpoint files off of the submitting machines and placing it instead
on server machines (with disk space dedicated to the purpose of
storing checkpoints).
Other Checkpointing Resources