HTCondor
Overview
Download
Documentation
Release Highlights
Release Schedule

HTCondor-CE
Overview
Download
Documentation
Release Highlights

General
News
Security
Documentation
Download
License

Documentation
HTCondor
HTCondor-CE

Support
HTCSS Contract Support
Community Mailing List
External Support Organizations

Email Lists
Users / Help
World / Announcements

Workshops
Throughput Computing 2026
Materials from past Workshops

HTCSS
What is HTCSS?
Logos

CHTC
Website ⬏
Staff ⬏
Jobs ⬏
Research & Publications ⬏

Checkpointing

Overview

The ability to cause a job to vacate a non-idle workstation and migrate to an idle workstation is central to HTCondor’s job scheduling. To allow these migrating jobs to continue their progress, HTCondor must be able to continue a vacated job without a complete loss of computation time. HTCondor does this by using a checkpoint of a job’s state. A checkpoint is generated just before a job vacates a machine. The checkpoint is then used as a starting point for the job after migration.

HTCondor gives a program the ability to checkpoint itself by providing a checkpointing library which contains a signal handler for SIGTSTP, the HTCondor checkpoint signal. When a HTCondor job receives a checkpoint signal, it writes a checkpoint file, which contains the process’s data and stack segments, as well as information about open files, pending signals, and CPU state. When the job is restarted, the state contained in the checkpoint file is restored (by startup code in the checkpoint library) and the process resumes the computation at the point where the checkpoint was generated. Programs submitted to be run by the HTCondor system are re-linked (but not re-compiled) to include this library.

A technical description of the checkpointing and process migration mechanisms implemented in HTCondor is given in this technical report.

Standalone Checkpointing

The HTCondor distribution includes a checkpointing library, to provide checkpoints for Unix processes that are not submitted for execution under HTCondor. Information on using standalone checkpointing is available in the manual.

Periodic Checkpointing

In addition to writing a checkpoint at vacate time, HTCondor can be configured to write checkpoints periodically while the job is executing, to provide additional reliability.

Checkpoint Server

By default, a checkpoint is written to a file on the local disk of the machine where the job was submitted. A checkpoint server has been developed to serve as a repository for checkpoints. When a host is configured to use a checkpoint server, jobs submitted on that machine write and read checkpoints to and from the server rather than the local disk of the submitting machine, taking the burden of storing checkpoint files off of the submitting machines and placing it instead on server machines (with disk space dedicated to the purpose of storing checkpoints).

Other Checkpointing Resources

Checkpointing Research at the University of Tennessee

OSPool and Its Computing Capacity Helps Researchers Design New Proteins

Researcher Receives David Swanson Award for Work Powered by High Throughput Computing

One Researcher’s Leap into Throughput Computing: Bringing Machine Learning to Dairy Farm Management