Handling a DAG that fails

DAGMan can handle a situation where some of the nodes in a DAG fail. DAGMan will run as many nodes as possible, then create a rescue DAG, making it easy to continue once the problem is fixed.

Let's create an alternate program that fails. Copy simple.c to simplefail.c, and change the last line from "return failure" to "return 1". In real life, of course, we wouldn't have a job that is coded to always fail; failures would just happen on occasion. It will look like this:
 
#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <unistd.h>   /* sleep */

int main(int argc, char **argv)
{
    int sleep_time;
    int input;
    int failure;

    if (argc != 3) {
        printf("Usage: simple <sleep-time> <integer>\n");
        failure = 1;
    } else {
        sleep_time = atoi(argv[1]);
        input      = atoi(argv[2]);

        printf("Thinking really hard for %d seconds...\n", sleep_time);
        sleep(sleep_time);
        printf("We calculated: %d\n", input * 2);
        failure = 0;
    }
    /* Always report failure, regardless of the value of "failure". */
    return 1;
}

nova 1% gcc -o simplefail simplefail.c
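
Before submitting, it can help to confirm from the shell that a program like this really reports failure through its exit status, since that is what DAGMan looks at. The following is a minimal POSIX sh sketch, not part of the original tutorial: fake_simplefail is a hypothetical stand-in for the compiled simplefail binary (it skips the sleep), and rc is just an illustrative variable name.

```shell
# Hypothetical stand-in for ./simplefail: prints the result, then always
# returns 1, mirroring the "return 1" at the end of simplefail.c.
fake_simplefail() {
    echo "We calculated: $(( $2 * 2 ))"
    return 1
}

# Capture the exit status; rc stays 0 only if the command succeeds.
rc=0
fake_simplefail 4 12 || rc=$?
echo "exit status: $rc"
```

With the real binary you would run ./simplefail 4 12 and then inspect $? (or $status in a csh-style shell).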

Modify job.work2.submit to run simplefail instead of simple:

Universe = vanilla
Executable = simplefail
Arguments = 4 12
Log = simple.log
Output = results.work2.out
Error  = results.work2.err
Queue

Submit the DAG again:

nova 2% condor_submit_dag simple.dag
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : simple.dag.condor.sub
Log of DAGMan debugging messages                 : simple.dag.dagman.out
Log of Condor library debug messages             : simple.dag.lib.out
Log of the life of condor_dagman itself          : simple.dag.dagman.log

Condor Log file for all Condor jobs of this DAG: simple.dag.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 6106.
-----------------------------------------------------------------------

Use watch_condor_q to watch the jobs until they finish.

In a separate window, use tail --lines=500 -f simple.dag.dagman.out to watch what DAGMan does.

nova 3% tail --lines=500 -f simple.dag.dagman.out
12/21 05:26:03 ******************************************************
12/21 05:26:03 ** condor_scheduniv_exec.6106.0 (CONDOR_DAGMAN) STARTING UP
12/21 05:26:03 ** /specific/a/home/cc/cs/condor/hosts/nova/spool/cluster6106.ickpt.subproc0
12/21 05:26:03 ** $CondorVersion: 6.6.7 Oct 11 2004 $
12/21 05:26:03 ** $CondorPlatform: I386-LINUX_RH72 $
12/21 05:26:03 ** PID = 4036
12/21 05:26:03 ******************************************************
12/21 05:26:03 Using config file: /usr/local/lib/condor/etc/condor_config
12/21 05:26:03 Using local config files: /usr/local/lib/condor/etc/nodes/condor_config.nova
12/21 05:26:03 DaemonCore: Command Socket at <132.67.192.133:40085>
12/21 05:26:03 argv[0] == "condor_scheduniv_exec.6106.0"
12/21 05:26:03 argv[1] == "-Debug"
12/21 05:26:03 argv[2] == "3"
12/21 05:26:03 argv[3] == "-Lockfile"
12/21 05:26:03 argv[4] == "simple.dag.lock"
12/21 05:26:03 argv[5] == "-Dag"
12/21 05:26:03 argv[6] == "simple.dag"
12/21 05:26:03 argv[7] == "-Rescue"
12/21 05:26:03 argv[8] == "simple.dag.rescue"
12/21 05:26:03 argv[9] == "-Condorlog"
12/21 05:26:03 argv[10] == "simple.dag.dummy_log"
12/21 05:26:03 DAG Lockfile will be written to simple.dag.lock
12/21 05:26:03 DAG Input file is simple.dag
12/21 05:26:03 Rescue DAG will be written to simple.dag.rescue
12/21 05:26:03 All DAG node user log files:
12/21 05:26:03   /specific/a/home/cc/cs/alainroy/simple.log
12/21 05:26:03 Parsing simple.dag ...
12/21 05:26:03 Dag contains 4 total jobs
12/21 05:26:03 Deleting any older versions of log files...
12/21 05:26:03 Deleting older version of /specific/a/home/cc/cs/alainroy/simple.log
12/21 05:26:03 Bootstrapping...
12/21 05:26:03 Number of pre-completed jobs: 0
12/21 05:26:03 Registering condor_event_timer...
12/21 05:26:04 Submitting Condor Job Setup ...
12/21 05:26:04 submitting: condor_submit  -a 'dag_node_name = Setup' -a '+DAGManJobID = 6106.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.setup.submit 2>&1
12/21 05:26:05  assigned Condor ID (6107.0.0)
12/21 05:26:05 Just submitted 1 job this cycle...
12/21 05:26:05 Event: ULOG_SUBMIT for Condor Job Setup (6107.0.0)
12/21 05:26:05 Of 4 nodes total:
12/21 05:26:05  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:26:05   ===     ===      ===     ===     ===        ===      ===
12/21 05:26:05     0       0        1       0       0          3        0
12/21 05:26:45 Event: ULOG_EXECUTE for Condor Job Setup (6107.0.0)
12/21 05:26:45 Event: ULOG_JOB_TERMINATED for Condor Job Setup (6107.0.0)
12/21 05:26:45 Job Setup completed successfully.
12/21 05:26:45 Of 4 nodes total:
12/21 05:26:45  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:26:45   ===     ===      ===     ===     ===        ===      ===
12/21 05:26:45     1       0        0       0       2          1        0
12/21 05:26:51 Submitting Condor Job Work1 ...
12/21 05:26:51 submitting: condor_submit  -a 'dag_node_name = Work1' -a '+DAGManJobID = 6106.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work1.submit 2>&1
12/21 05:26:52  assigned Condor ID (6108.0.0)
12/21 05:26:53 Submitting Condor Job Work2 ...
12/21 05:26:53 submitting: condor_submit  -a 'dag_node_name = Work2' -a '+DAGManJobID = 6106.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work2.submit 2>&1
12/21 05:26:54  assigned Condor ID (6109.0.0)
12/21 05:26:54 Just submitted 2 jobs this cycle...
12/21 05:26:54 Event: ULOG_SUBMIT for Condor Job Work1 (6108.0.0)
12/21 05:26:54 Event: ULOG_SUBMIT for Condor Job Work2 (6109.0.0)
12/21 05:26:54 Of 4 nodes total:
12/21 05:26:54  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:26:54   ===     ===      ===     ===     ===        ===      ===
12/21 05:26:54     1       0        2       0       0          1        0
12/21 05:27:34 Event: ULOG_EXECUTE for Condor Job Work1 (6108.0.0)
12/21 05:27:39 Event: ULOG_EXECUTE for Condor Job Work2 (6109.0.0)
12/21 05:27:39 Event: ULOG_JOB_TERMINATED for Condor Job Work1 (6108.0.0)
12/21 05:27:39 Job Work1 completed successfully.
12/21 05:27:39 Event: ULOG_JOB_TERMINATED for Condor Job Work2 (6109.0.0)
12/21 05:27:39 Job Work2 failed with status 1.
12/21 05:27:39 Of 4 nodes total:
12/21 05:27:39  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:27:39   ===     ===      ===     ===     ===        ===      ===
12/21 05:27:39     2       0        0       0       0          1        1
12/21 05:27:39 ERROR: the following job(s) failed:
12/21 05:27:39 ---------------------- Job ----------------------
12/21 05:27:39       Node Name: Work2
12/21 05:27:39          NodeID: 2
12/21 05:27:39     Node Status: STATUS_ERROR    
12/21 05:27:39           Error: Job failed with status 1
12/21 05:27:39 Job Submit File: job.work2.submit
12/21 05:27:39   Condor Job ID: (6109.0.0)
12/21 05:27:39       Q_PARENTS: 0, 
12/21 05:27:39       Q_WAITING: 
12/21 05:27:39      Q_CHILDREN: 3, 
12/21 05:27:39 ---------------------------------------  
12/21 05:27:39 Aborting DAG...
12/21 05:27:39 Writing Rescue DAG to simple.dag.rescue...
12/21 05:27:39 **** condor_scheduniv_exec.6106.0 (condor_DAGMAN) EXITING WITH STATUS 1
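
The log is long, so it can be handy to pull out only the per-node outcome lines. The sketch below is an illustration rather than a DAGMan feature: it copies a few lines from the log above into a temporary file and filters them with grep. The pattern assumes the message wording shown in this log, which may differ in other Condor versions; the variable names log and outcomes are arbitrary.

```shell
# Sample lines copied from the dagman.out shown above.
log=$(mktemp)
cat > "$log" <<'EOF'
12/21 05:27:39 Job Work1 completed successfully.
12/21 05:27:39 Job Work2 failed with status 1.
12/21 05:27:39 Aborting DAG...
12/21 05:27:39 Writing Rescue DAG to simple.dag.rescue...
EOF

# Keep only the lines that report how each node finished.
outcomes=$(grep -E 'Job [A-Za-z0-9_]+ (completed successfully|failed with status [0-9]+)' "$log")
echo "$outcomes"
rm -f "$log"
```

Against the real file you would pass simple.dag.dagman.out to grep instead of the temporary file.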

DAGMan notices that one of the jobs failed because its exit code was non-zero. DAGMan ran as much of the DAG as possible and logged enough information to continue the run when the situation is resolved.

Look at the rescue DAG. It's structurally the same as your original DAG, but nodes that finished are marked DONE. (DAGMan also reorganized the file.) When you submit the rescue DAG, DONE nodes will be skipped.

nova 4% cat simple.dag.rescue
# Rescue DAG file, created after running
#   the simple.dag DAG file
#
# Total number of Nodes: 4
# Nodes premarked DONE: 2
# Nodes that failed: 1
#   Work2,

JOB Setup job.setup.submit DONE

JOB Work1 job.work1.submit DONE

JOB Work2 job.work2.submit 

JOB Final job.finalize.submit 

PARENT Setup CHILD Work1 Work2
PARENT Work1 CHILD Final
PARENT Work2 CHILD Final
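
To see at a glance which nodes a rescue DAG will still run, you can list the JOB lines that are not marked DONE. This is a small sketch under the assumption that the file follows the layout shown above (JOB, node name, submit file, optional DONE); it recreates that content in a temporary file so it runs on its own. The variable names rescue and pending are illustrative.

```shell
# Recreate the rescue DAG shown above in a temporary file.
rescue=$(mktemp)
cat > "$rescue" <<'EOF'
JOB Setup job.setup.submit DONE
JOB Work1 job.work1.submit DONE
JOB Work2 job.work2.submit
JOB Final job.finalize.submit
PARENT Setup CHILD Work1 Work2
PARENT Work1 CHILD Final
PARENT Work2 CHILD Final
EOF

# Print the node name of every JOB line whose last field is not DONE.
pending=$(awk '$1 == "JOB" && $NF != "DONE" { print $2 }' "$rescue")
echo "$pending"
rm -f "$rescue"
```

For this file the list is Work2 and Final: the node that failed, plus the node that was waiting on it.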

From the comment near the top, we know that the Work2 node failed. Let's "fix" it.

nova 5% rm simplefail 
nova 6% cp simple simplefail

Now we can submit our rescue DAG and DAGMan will pick up where it left off. (If you hadn't fixed the problem, DAGMan would generate another rescue DAG, this time "simple.dag.rescue.rescue".)

nova 7% condor_submit_dag simple.dag.rescue 
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : simple.dag.rescue.condor.sub
Log of DAGMan debugging messages                 : simple.dag.rescue.dagman.out
Log of Condor library debug messages             : simple.dag.rescue.lib.out
Log of the life of condor_dagman itself          : simple.dag.rescue.dagman.log

Condor Log file for all Condor jobs of this DAG: simple.dag.rescue.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 6110.
-----------------------------------------------------------------------

Watch what DAGMan does:

nova 8% tail --lines=500 -f simple.dag.rescue.dagman.out
12/21 05:54:01 ******************************************************
12/21 05:54:01 ** condor_scheduniv_exec.6110.0 (CONDOR_DAGMAN) STARTING UP
12/21 05:54:01 ** /specific/a/home/cc/cs/condor/hosts/nova/spool/cluster6110.ickpt.subproc0
12/21 05:54:01 ** $CondorVersion: 6.6.7 Oct 11 2004 $
12/21 05:54:01 ** $CondorPlatform: I386-LINUX_RH72 $
12/21 05:54:01 ** PID = 4847
12/21 05:54:01 ******************************************************
12/21 05:54:01 Using config file: /usr/local/lib/condor/etc/condor_config
12/21 05:54:01 Using local config files: /usr/local/lib/condor/etc/nodes/condor_config.nova
12/21 05:54:01 DaemonCore: Command Socket at <132.67.192.133:40638>
12/21 05:54:01 argv[0] == "condor_scheduniv_exec.6110.0"
12/21 05:54:01 argv[1] == "-Debug"
12/21 05:54:01 argv[2] == "3"
12/21 05:54:01 argv[3] == "-Lockfile"
12/21 05:54:01 argv[4] == "simple.dag.rescue.lock"
12/21 05:54:01 argv[5] == "-Dag"
12/21 05:54:01 argv[6] == "simple.dag.rescue"
12/21 05:54:01 argv[7] == "-Rescue"
12/21 05:54:01 argv[8] == "simple.dag.rescue.rescue"
12/21 05:54:01 argv[9] == "-Condorlog"
12/21 05:54:01 argv[10] == "simple.dag.rescue.dummy_log"
12/21 05:54:01 DAG Lockfile will be written to simple.dag.rescue.lock
12/21 05:54:01 DAG Input file is simple.dag.rescue
12/21 05:54:01 Rescue DAG will be written to simple.dag.rescue.rescue
12/21 05:54:01 All DAG node user log files:
12/21 05:54:01   /specific/a/home/cc/cs/alainroy/simple.log
12/21 05:54:01 Parsing simple.dag.rescue ...
12/21 05:54:01 Dag contains 4 total jobs
12/21 05:54:01 Deleting any older versions of log files...
12/21 05:54:01 Deleting older version of /specific/a/home/cc/cs/alainroy/simple.log
12/21 05:54:01 Bootstrapping...
12/21 05:54:01 Number of pre-completed jobs: 2
12/21 05:54:01 Registering condor_event_timer...
12/21 05:54:03 Submitting Condor Job Work2 ...
12/21 05:54:03 submitting: condor_submit  -a 'dag_node_name = Work2' -a '+DAGManJobID = 6110.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work2.submit 2>&1
12/21 05:54:04  assigned Condor ID (6111.0.0)
12/21 05:54:04 Just submitted 1 job this cycle...
12/21 05:54:04 Event: ULOG_SUBMIT for Condor Job Work2 (6111.0.0)
12/21 05:54:04 Of 4 nodes total:
12/21 05:54:04  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:54:04   ===     ===      ===     ===     ===        ===      ===
12/21 05:54:04     2       0        1       0       0          1        0
12/21 05:54:39 Event: ULOG_EXECUTE for Condor Job Work2 (6111.0.0)
12/21 05:54:44 Event: ULOG_JOB_TERMINATED for Condor Job Work2 (6111.0.0)
12/21 05:54:44 Job Work2 completed successfully.
12/21 05:54:44 Of 4 nodes total:
12/21 05:54:44  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:54:44   ===     ===      ===     ===     ===        ===      ===
12/21 05:54:44     3       0        0       0       1          0        0
12/21 05:54:50 Submitting Condor Job Final ...
12/21 05:54:50 submitting: condor_submit  -a 'dag_node_name = Final' -a '+DAGManJobID = 6110.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.finalize.submit 2>&1
12/21 05:54:51  assigned Condor ID (6112.0.0)
12/21 05:54:51 Just submitted 1 job this cycle...
12/21 05:54:51 Event: ULOG_SUBMIT for Condor Job Final (6112.0.0)
12/21 05:54:51 Of 4 nodes total:
12/21 05:54:51  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:54:51   ===     ===      ===     ===     ===        ===      ===
12/21 05:54:51     3       0        1       0       0          0        0
12/21 05:55:31 Event: ULOG_EXECUTE for Condor Job Final (6112.0.0)
12/21 05:55:36 Event: ULOG_JOB_TERMINATED for Condor Job Final (6112.0.0)
12/21 05:55:36 Job Final completed successfully.
12/21 05:55:36 Of 4 nodes total:
12/21 05:55:36  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:55:36   ===     ===      ===     ===     ===        ===      ===
12/21 05:55:36     4       0        0       0       0          0        0
12/21 05:55:36 All jobs Completed!
12/21 05:55:36 **** condor_scheduniv_exec.6110.0 (condor_DAGMAN) EXITING WITH STATUS 0

Success! Now go ahead and clean up.

nova 9% rm simple.dag.*
nova 10% rm results.*
