Title: Condor Practical
Subtitle: Handling a DAG that fails
Tutor: Alain Roy
Authors: Alain Roy

8.0 Handling a DAG that fails

DAGMan can handle the situation where some of the nodes in a DAG fail. It will run as many nodes as possible, then create a rescue DAG that makes it easy to continue the run once the problem is fixed.

Let's create an alternate program that fails. Copy simple.c to simplefail.c and change the last line from "return failure;" to "return 1;". In real life, of course, we wouldn't have a job coded to always fail; failures would just happen occasionally. It will look like this:
 
#include <stdio.h>
#include <stdlib.h>     /* atoi() */
#include <unistd.h>     /* sleep() */

int main(int argc, char **argv)
{
    int sleep_time;
    int input;
    int failure;

    if (argc != 3) {
        printf("Usage: simple <sleep-time> <integer>\n");
        failure = 1;
    } else {
        sleep_time = atoi(argv[1]);
        input      = atoi(argv[2]);

        printf("Thinking really hard for %d seconds...\n",
        sleep_time);
        sleep(sleep_time);
        printf("We calculated: %d\n", input * 2);
        failure = 0;
    }
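    /* The one change from simple.c: always exit with a failure status. */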
    return 1;
}

% gcc -o simplefail simplefail.c
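
You can confirm the failure exit status by hand before handing the job to DAGMan (a quick sketch; $? is the shell's exit-status variable, and the output follows directly from the code above):

% ./simplefail 1 2
Thinking really hard for 1 seconds...
We calculated: 4
% echo $?
1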

Modify job.work2.submit to run simplefail instead of simple:

Universe = vanilla
Executable = simplefail
Arguments = 4 12
Log = simple.log
Output = results.work2.out
Error  = results.work2.err
Queue
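
The DAG file itself is unchanged from the earlier section; for reference, it should still look something like this, with the same node names and dependencies that show up in the rescue DAG later on:

JOB Setup job.setup.submit
JOB Work1 job.work1.submit
JOB Work2 job.work2.submit
JOB Final job.finalize.submit
PARENT Setup CHILD Work1 Work2
PARENT Work1 CHILD Final
PARENT Work2 CHILD Final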

Submit the DAG again:

% condor_submit_dag simple.dag

Checking all your submit files for log file names.
This might take a while... 
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : simple.dag.condor.sub
Log of DAGMan debugging messages                 : simple.dag.dagman.out
Log of Condor library debug messages             : simple.dag.lib.out
Log of the life of condor_dagman itself          : simple.dag.dagman.log

Condor Log file for all jobs of this DAG         : /home/users/roy/condor-test/simple.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 34.
-----------------------------------------------------------------------

Use watch_condor_q to watch the jobs until they finish.
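
If watch_condor_q is not on your path, repeatedly polling condor_q works just as well (a sketch using the standard watch utility):

% watch -n 10 condor_q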

In a separate window, use tail --lines=500 -f simple.dag.dagman.out to watch what DAGMan does.

% tail --lines=500 -f simple.dag.dagman.out
2/4 23:40:27 ******************************************************
2/4 23:40:27 ** condor_scheduniv_exec.25.0 (CONDOR_DAGMAN) STARTING UP
2/4 23:40:27 ** /usr/local/condor/bin/condor_dagman
2/4 23:40:27 ** $CondorVersion: 6.8.6 Sep 13 2007 $
2/4 23:40:27 ** $CondorPlatform: I386-LINUX_RH9 $
2/4 23:40:27 ** PID = 857
2/4 23:40:27 ** Log last touched time unavailable (No such file or directory)
2/4 23:40:27 ******************************************************
2/4 23:40:27 Using config source: /usr/local/condor/etc/condor_config
2/4 23:40:27 Using local config sources: 
2/4 23:40:27    /var/local/condor/condor_config.local
2/4 23:40:27 DaemonCore: Command Socket at <193.206.208.141:9665>
2/4 23:40:27 DAGMAN_SUBMIT_DELAY setting: 0
2/4 23:40:27 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
2/4 23:40:27 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
2/4 23:40:27 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
2/4 23:40:27 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION,DAGMAN_ALLOW_EVENTS) setting: 114
2/4 23:40:27 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
2/4 23:40:27 DAGMAN_RETRY_NODE_FIRST setting: 0
2/4 23:40:27 DAGMAN_MAX_JOBS_IDLE setting: 0
2/4 23:40:27 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
2/4 23:40:27 DAGMAN_MUNGE_NODE_NAMES setting: 1
2/4 23:40:27 DAGMAN_DELETE_OLD_LOGS setting: 1
2/4 23:40:27 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
2/4 23:40:27 DAGMAN_ABORT_DUPLICATES setting: 0
2/4 23:40:27 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
2/4 23:40:27 argv[0] == "condor_scheduniv_exec.25.0"
2/4 23:40:27 argv[1] == "-Debug"
2/4 23:40:27 argv[2] == "3"
2/4 23:40:27 argv[3] == "-Lockfile"
2/4 23:40:27 argv[4] == "simple.dag.lock"
2/4 23:40:27 argv[5] == "-Condorlog"
2/4 23:40:27 argv[6] == "/condor/aroy/condor-test/simple.log"
2/4 23:40:27 argv[7] == "-Dag"
2/4 23:40:27 argv[8] == "simple.dag"
2/4 23:40:27 argv[9] == "-Rescue"
2/4 23:40:27 argv[10] == "simple.dag.rescue"
2/4 23:40:27 DAG Lockfile will be written to simple.dag.lock
2/4 23:40:27 DAG Input file is simple.dag
2/4 23:40:27 Rescue DAG will be written to simple.dag.rescue
2/4 23:40:27 All DAG node user log files:
2/4 23:40:27   /condor/aroy/condor-test/simple.log (Condor)
2/4 23:40:27 Parsing simple.dag ...
2/4 23:40:27 Dag contains 4 total jobs
2/4 23:40:27 Truncating any older versions of log files...
2/4 23:40:27 Bootstrapping...
2/4 23:40:27 Number of pre-completed nodes: 0
2/4 23:40:27 Registering condor_event_timer...
2/4 23:40:28 Got node Setup from the ready queue
2/4 23:40:28 Submitting Condor Node Setup job(s)...
2/4 23:40:28 submitting: condor_submit 
                         -a dag_node_name' '=' 'Setup 
                         -a +DAGManJobId' '=' '25 
                         -a DAGManJobId' '=' '25 
                         -a submit_event_notes' '=' 'DAG' 'Node:' 'Setup 
                         -a +DAGParentNodeNames' '=' '""
                        job.setup.submit
2/4 23:40:28 From submit: Submitting job(s).
2/4 23:40:28 From submit: Logging submit event(s).
2/4 23:40:28 From submit: 1 job(s) submitted to cluster 26.
2/4 23:40:28 assigned Condor ID (26.0)
2/4 23:40:28 Just submitted 1 job this cycle...
2/4 23:40:28 Event: ULOG_SUBMIT for Condor Node Setup (26.0)
2/4 23:40:28 Number of idle job procs: 1
2/4 23:40:28 Of 4 nodes total:
2/4 23:40:28  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:40:28   ===     ===      ===     ===     ===        ===      ===
2/4 23:40:28     0       0        1       0       0          3        0
2/4 23:40:53 Event: ULOG_EXECUTE for Condor Node Setup (26.0)
2/4 23:40:53 Number of idle job procs: 0
2/4 23:40:58 Event: ULOG_JOB_TERMINATED for Condor Node Setup (26.0)
2/4 23:40:58 Node Setup job proc (26.0) completed successfully.
2/4 23:40:58 Node Setup job completed
2/4 23:40:58 Number of idle job procs: 0
2/4 23:40:58 Of 4 nodes total:
2/4 23:40:58  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:40:58   ===     ===      ===     ===     ===        ===      ===
2/4 23:40:58     1       0        0       0       2          1        0
2/4 23:41:03 Got node Work1 from the ready queue
2/4 23:41:03 Submitting Condor Node Work1 job(s)...
2/4 23:41:03 submitting: condor_submit 
                         -a dag_node_name' '=' 'Work1 
                         -a +DAGManJobId' '=' '25 
                         -a DAGManJobId' '=' '25 
                         -a submit_event_notes' '=' 'DAG' 'Node:' 'Work1 
                         -a +DAGParentNodeNames' '=' '"Setup" 
                         job.work1.submit
2/4 23:41:04 From submit: Submitting job(s).
2/4 23:41:04 From submit: Logging submit event(s).
2/4 23:41:04 From submit: 1 job(s) submitted to cluster 27.
2/4 23:41:04 assigned Condor ID (27.0)
2/4 23:41:04 Got node Work2 from the ready queue
2/4 23:41:04 Submitting Condor Node Work2 job(s)...
2/4 23:41:04 submitting: condor_submit 
                         -a dag_node_name' '=' 'Work2 
                         -a +DAGManJobId' '=' '25 
                         -a DAGManJobId' '=' '25 
                         -a submit_event_notes' '=' 'DAG' 'Node:' 'Work2 
                         -a +DAGParentNodeNames' '=' '"Setup" 
                         job.work2.submit
2/4 23:41:04 From submit: Submitting job(s).
2/4 23:41:04 From submit: Logging submit event(s).
2/4 23:41:04 From submit: 1 job(s) submitted to cluster 28.
2/4 23:41:04 assigned Condor ID (28.0)
2/4 23:41:04 Just submitted 2 jobs this cycle...
2/4 23:41:04 Event: ULOG_SUBMIT for Condor Node Work1 (27.0)
2/4 23:41:04 Number of idle job procs: 1
2/4 23:41:04 Event: ULOG_SUBMIT for Condor Node Work2 (28.0)
2/4 23:41:04 Number of idle job procs: 2
2/4 23:41:04 Of 4 nodes total:
2/4 23:41:04  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:41:04   ===     ===      ===     ===     ===        ===      ===
2/4 23:41:04     1       0        2       0       0          1        0
2/4 23:41:14 Event: ULOG_EXECUTE for Condor Node Work1 (27.0)
2/4 23:41:14 Number of idle job procs: 1
2/4 23:41:19 Event: ULOG_EXECUTE for Condor Node Work2 (28.0)
2/4 23:41:19 Number of idle job procs: 0
2/4 23:41:19 Event: ULOG_JOB_TERMINATED for Condor Node Work1 (27.0)
2/4 23:41:19 Node Work1 job proc (27.0) completed successfully.
2/4 23:41:19 Node Work1 job completed
2/4 23:41:19 Number of idle job procs: 0
2/4 23:41:19 Event: ULOG_JOB_TERMINATED for Condor Node Work2 (28.0)
2/4 23:41:19 Node Work2 job proc (28.0) failed with status 1.
2/4 23:41:19 Number of idle job procs: 0
2/4 23:41:19 Of 4 nodes total:
2/4 23:41:19  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:41:19   ===     ===      ===     ===     ===        ===      ===
2/4 23:41:19     2       0        0       0       0          1        1
2/4 23:41:19 ERROR: the following job(s) failed:
2/4 23:41:19 ---------------------- Job ----------------------
2/4 23:41:19       Node Name: Work2
2/4 23:41:19          NodeID: 2
2/4 23:41:19     Node Status: STATUS_ERROR    
2/4 23:41:19 Node return val: 1
2/4 23:41:19           Error: Job proc (28.0) failed with status 1
2/4 23:41:19 Job Submit File: job.work2.submit
2/4 23:41:19   Condor Job ID: (28)
2/4 23:41:19       Q_PARENTS: 0, 
2/4 23:41:19       Q_WAITING: 
2/4 23:41:19      Q_CHILDREN: 3, 
2/4 23:41:19 ---------------------------------------
2/4 23:41:19 Aborting DAG...
2/4 23:41:19 Writing Rescue DAG to simple.dag.rescue...
2/4 23:41:19 Note: 0 total job deferrals because of -MaxJobs limit (0)
2/4 23:41:19 Note: 0 total job deferrals because of -MaxIdle limit (0)
2/4 23:41:19 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
2/4 23:41:19 Note: 0 total POST script deferrals because of -MaxPost limit (0)
2/4 23:41:19 **** condor_scheduniv_exec.25.0 (condor_DAGMAN) EXITING WITH STATUS 1

DAGMan notices that one of the jobs failed because its exit code was non-zero. DAGMan ran as much of the DAG as possible and logged enough information to continue the run once the problem is fixed.
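
As an aside, if a node only fails intermittently, you can ask DAGMan to retry it automatically before it gives up and writes a rescue DAG. A single RETRY line in the DAG file does it (the retry count of three here is just an illustrative value):

RETRY Work2 3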

Look at the rescue DAG. Structurally it is the same as your original DAG, but the nodes that finished are marked DONE. (DAGMan also reorganized the file.) When you submit the rescue DAG, the DONE nodes will be skipped.

% cat simple.dag.rescue 
# Rescue DAG file, created after running
#   the simple.dag DAG file
#
# Total number of Nodes: 4
# Nodes premarked DONE: 2
# Nodes that failed: 1
#   Work2,

JOB Setup job.setup.submit DONE

JOB Work1 job.work1.submit DONE

JOB Work2 job.work2.submit 

JOB Final job.finalize.submit 


PARENT Setup CHILD Work1 Work2
PARENT Work1 CHILD Final
PARENT Work2 CHILD Final

From the comment near the top, we know that the Work2 node failed. Let's "fix" it.

% rm simplefail
% cp simple simplefail

Now we can submit our rescue DAG and DAGMan will pick up where it left off. (If you hadn't fixed the problem, DAGMan would simply generate yet another rescue DAG, this time called "simple.dag.rescue.rescue".)

% condor_submit_dag simple.dag.rescue 
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : simple.dag.rescue.condor.sub
Log of DAGMan debugging messages                 : simple.dag.rescue.dagman.out
Log of Condor library debug messages             : simple.dag.rescue.lib.out
Log of the life of condor_dagman itself          : simple.dag.rescue.dagman.log

Condor Log file for all Condor jobs of this DAG  : simple.dag.rescue.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 6110.
-----------------------------------------------------------------------

Watch what DAGMan does:

% tail --lines=500 -f simple.dag.rescue.dagman.out
2/4 23:49:41 ******************************************************
2/4 23:49:41 ** condor_scheduniv_exec.33.0 (CONDOR_DAGMAN) STARTING UP
2/4 23:49:41 ** /usr/local/condor/bin/condor_dagman
2/4 23:49:41 ** $CondorVersion: 6.8.6 Sep 13 2007 $
2/4 23:49:41 ** $CondorPlatform: I386-LINUX_RH9 $
2/4 23:49:41 ** PID = 973
2/4 23:49:41 ** Log last touched time unavailable (No such file or directory)
2/4 23:49:41 ******************************************************
2/4 23:49:41 Using config source: /usr/local/condor/etc/condor_config
2/4 23:49:41 Using local config sources: 
2/4 23:49:41    /var/local/condor/condor_config.local
2/4 23:49:41 DaemonCore: Command Socket at <193.206.208.141:9604>
2/4 23:49:41 DAGMAN_SUBMIT_DELAY setting: 0
2/4 23:49:41 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
2/4 23:49:41 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
2/4 23:49:41 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
2/4 23:49:41 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
2/4 23:49:41 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
2/4 23:49:41 DAGMAN_RETRY_NODE_FIRST setting: 0
2/4 23:49:41 DAGMAN_MAX_JOBS_IDLE setting: 0
2/4 23:49:41 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
2/4 23:49:41 DAGMAN_MUNGE_NODE_NAMES setting: 1
2/4 23:49:41 DAGMAN_DELETE_OLD_LOGS setting: 1
2/4 23:49:41 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
2/4 23:49:41 DAGMAN_ABORT_DUPLICATES setting: 0
2/4 23:49:41 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
2/4 23:49:41 argv[0] == "condor_scheduniv_exec.33.0"
2/4 23:49:41 argv[1] == "-Debug"
2/4 23:49:41 argv[2] == "3"
2/4 23:49:41 argv[3] == "-Lockfile"
2/4 23:49:41 argv[4] == "simple.dag.rescue.lock"
2/4 23:49:41 argv[5] == "-Condorlog"
2/4 23:49:41 argv[6] == "/condor/aroy/condor-test/simple.log"
2/4 23:49:41 argv[7] == "-Dag"
2/4 23:49:41 argv[8] == "simple.dag.rescue"
2/4 23:49:41 argv[9] == "-Rescue"
2/4 23:49:41 argv[10] == "simple.dag.rescue.rescue"
2/4 23:49:41 DAG Lockfile will be written to simple.dag.rescue.lock
2/4 23:49:41 DAG Input file is simple.dag.rescue
2/4 23:49:41 Rescue DAG will be written to simple.dag.rescue.rescue
2/4 23:49:41 All DAG node user log files:
2/4 23:49:41   /condor/aroy/condor-test/simple.log (Condor)
2/4 23:49:41 Parsing simple.dag.rescue ...
2/4 23:49:41 Dag contains 4 total jobs
2/4 23:49:41 Truncating any older versions of log files...
2/4 23:49:41 MultiLogFiles: truncating older version of /condor/aroy/condor-test/simple.log
2/4 23:49:41 Bootstrapping...
DAGMan notices that some nodes are already finished:
2/4 23:49:41 Number of pre-completed nodes: 2
2/4 23:49:41 Registering condor_event_timer...
DAGMan starts with the previously failed node:
2/4 23:49:42 Got node Work2 from the ready queue
2/4 23:49:42 Submitting Condor Node Work2 job(s)...
2/4 23:49:42 submitting: condor_submit 
                         -a dag_node_name' '=' 'Work2 
                         -a +DAGManJobId' '=' '33 
                         -a DAGManJobId' '=' '33 
                         -a submit_event_notes' '=' 'DAG' 'Node:' 'Work2 
                         -a +DAGParentNodeNames' '=' '"Setup" 
                         job.work2.submit
2/4 23:49:42 From submit: Submitting job(s).
2/4 23:49:42 From submit: Logging submit event(s).
2/4 23:49:42 From submit: 1 job(s) submitted to cluster 34.
2/4 23:49:42 assigned Condor ID (34.0)
2/4 23:49:42 Just submitted 1 job this cycle...
2/4 23:49:42 Event: ULOG_SUBMIT for Condor Node Work2 (34.0)
2/4 23:49:42 Number of idle job procs: 1
2/4 23:49:42 Of 4 nodes total:
2/4 23:49:42  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:49:42   ===     ===      ===     ===     ===        ===      ===
2/4 23:49:42     2       0        1       0       0          1        0
2/4 23:50:07 Event: ULOG_EXECUTE for Condor Node Work2 (34.0)
2/4 23:50:07 Number of idle job procs: 0
2/4 23:50:12 Event: ULOG_JOB_TERMINATED for Condor Node Work2 (34.0)
2/4 23:50:12 Node Work2 job proc (34.0) completed successfully.
2/4 23:50:12 Node Work2 job completed
2/4 23:50:12 Number of idle job procs: 0
2/4 23:50:12 Of 4 nodes total:
2/4 23:50:12  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:50:12   ===     ===      ===     ===     ===        ===      ===
2/4 23:50:12     3       0        0       0       1          0        0
When it finishes, the final node is submitted:
2/4 23:50:17 Got node Final from the ready queue
2/4 23:50:17 Submitting Condor Node Final job(s)...
2/4 23:50:17 submitting: condor_submit 
                         -a dag_node_name' '=' 'Final 
                         -a +DAGManJobId' '=' '33 
                         -a DAGManJobId' '=' '33 
                         -a submit_event_notes' '=' 'DAG' 'Node:' 'Final 
                         -a +DAGParentNodeNames' '=' '"Work1,Work2" 
                         job.finalize.submit
2/4 23:50:17 From submit: Submitting job(s).
2/4 23:50:17 From submit: Logging submit event(s).
2/4 23:50:17 From submit: 1 job(s) submitted to cluster 35.
2/4 23:50:17 assigned Condor ID (35.0)
2/4 23:50:17 Just submitted 1 job this cycle...
2/4 23:50:17 Event: ULOG_SUBMIT for Condor Node Final (35.0)
2/4 23:50:17 Number of idle job procs: 1
2/4 23:50:17 Of 4 nodes total:
2/4 23:50:17  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:50:17   ===     ===      ===     ===     ===        ===      ===
2/4 23:50:17     3       0        1       0       0          0        0
2/4 23:50:27 Event: ULOG_EXECUTE for Condor Node Final (35.0)
2/4 23:50:27 Number of idle job procs: 0
2/4 23:50:32 Event: ULOG_JOB_TERMINATED for Condor Node Final (35.0)
2/4 23:50:32 Node Final job proc (35.0) completed successfully.
2/4 23:50:32 Node Final job completed
2/4 23:50:32 Number of idle job procs: 0
2/4 23:50:32 Of 4 nodes total:
2/4 23:50:32  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:50:32   ===     ===      ===     ===     ===        ===      ===
2/4 23:50:32     4       0        0       0       0          0        0
2/4 23:50:32 All jobs Completed!
2/4 23:50:32 Note: 0 total job deferrals because of -MaxJobs limit (0)
2/4 23:50:32 Note: 0 total job deferrals because of -MaxIdle limit (0)
2/4 23:50:32 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
2/4 23:50:32 Note: 0 total POST script deferrals because of -MaxPost limit (0)
2/4 23:50:32 **** condor_scheduniv_exec.33.0 (condor_DAGMAN) EXITING WITH STATUS 0

Success! Now go ahead and clean up.

% rm simple.dag.*
% rm results.*

Extra Credit: Go back to the original simplefail program, the one that fails. Let's pretend that an exit code of 0 or 1 is considered correct, and the job really fails only if it exits with any other value. Write a POST script that checks the return value. Check the Condor manual to see how to attach a POST script to a node. Make sure your POST script works by having simplefail return 0, 1, or 2. One possible starting point is sketched below.
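
One possible shape for the solution (a sketch, not the only answer): a SCRIPT POST line in the DAG file runs a small shell script after the node's job finishes, and DAGMan passes the job's exit code to the script through the $RETURN macro. The script's own exit code then decides whether the node counts as a success. The script name check_work2.sh is just an example.

In simple.dag:

SCRIPT POST Work2 check_work2.sh $RETURN

check_work2.sh:

#!/bin/sh
# Treat exit codes 0 and 1 as success, anything else as a real failure.
if [ "$1" -eq 0 ] || [ "$1" -eq 1 ]; then
    exit 0
fi
exit 1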


Next: Finishing up
