DAGMan can handle the situation where some of the nodes in a DAG fail: it runs as many nodes as possible, then creates a rescue DAG that makes it easy to continue the run once the problem is fixed.
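As a reminder, the DAG we are working with (simple.dag, as built in the previous section) has four nodes; the node names and submit files here match what you will see in the rescue DAG later on:

JOB Setup job.setup.submit
JOB Work1 job.work1.submit
JOB Work2 job.work2.submit
JOB Final job.finalize.submit
PARENT Setup CHILD Work1 Work2
PARENT Work1 CHILD Final
PARENT Work2 CHILD Final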
Let's create an alternate program that fails. Copy simple.c to simplefail.c, and change the last line from "return failure" to "return 1". In real life, of course, we wouldn't have a job that is coded to always fail; it would just fail on occasion. The program will look like this:

#include <stdio.h>
#include <stdlib.h>  /* atoi() */
#include <unistd.h>  /* sleep() */

int main(int argc, char **argv)
{
    int sleep_time;
    int input;
    int failure;

    if (argc != 3) {
        printf("Usage: simple <sleep-time> <integer>\n");
        failure = 1;
    } else {
        sleep_time = atoi(argv[1]);
        input = atoi(argv[2]);
        printf("Thinking really hard for %d seconds...\n", sleep_time);
        sleep(sleep_time);
        printf("We calculated: %d\n", input * 2);
        failure = 0;
    }
    return 1;  /* always report failure, no matter what we computed */
}

Compile it:

nova 1% gcc -o simplefail simplefail.c
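Before handing the new program to DAGMan, you can confirm by hand that it exits with a non-zero status. A quick check might look like this (use echo $status instead of echo $? if your shell is csh/tcsh):

% ./simplefail 4 12
Thinking really hard for 4 seconds...
We calculated: 24
% echo $?
1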
Modify job.work2.submit to run simplefail instead of simple:
Universe   = vanilla
Executable = simplefail
Arguments  = 4 12
Log        = simple.log
Output     = results.work2.out
Error      = results.work2.err
Queue
Submit the DAG again:
nova 2% condor_submit_dag simple.dag
-----------------------------------------------------------------------
File for submitting this DAG to Condor          : simple.dag.condor.sub
Log of DAGMan debugging messages                : simple.dag.dagman.out
Log of Condor library debug messages            : simple.dag.lib.out
Log of the life of condor_dagman itself         : simple.dag.dagman.log
Condor Log file for all Condor jobs of this DAG : simple.dag.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 6106.
-----------------------------------------------------------------------
Use watch_condor_q to watch the jobs until they finish.
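(watch_condor_q is the little helper from the earlier sections; if you don't have it handy, something like watch -n 10 condor_q gives a similar self-refreshing view, assuming GNU watch is installed.)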
In a separate window, use tail --lines=500 -f simple.dag.dagman.out to watch what DAGMan does.
nova 3% tail --lines=500 -f simple.dag.dagman.out
12/21 05:26:03 ******************************************************
12/21 05:26:03 ** condor_scheduniv_exec.6106.0 (CONDOR_DAGMAN) STARTING UP
12/21 05:26:03 ** /specific/a/home/cc/cs/condor/hosts/nova/spool/cluster6106.ickpt.subproc0
12/21 05:26:03 ** $CondorVersion: 6.6.7 Oct 11 2004 $
12/21 05:26:03 ** $CondorPlatform: I386-LINUX_RH72 $
12/21 05:26:03 ** PID = 4036
12/21 05:26:03 ******************************************************
12/21 05:26:03 Using config file: /usr/local/lib/condor/etc/condor_config
12/21 05:26:03 Using local config files: /usr/local/lib/condor/etc/nodes/condor_config.nova
12/21 05:26:03 DaemonCore: Command Socket at <132.67.192.133:40085>
12/21 05:26:03 argv[0] == "condor_scheduniv_exec.6106.0"
12/21 05:26:03 argv[1] == "-Debug"
12/21 05:26:03 argv[2] == "3"
12/21 05:26:03 argv[3] == "-Lockfile"
12/21 05:26:03 argv[4] == "simple.dag.lock"
12/21 05:26:03 argv[5] == "-Dag"
12/21 05:26:03 argv[6] == "simple.dag"
12/21 05:26:03 argv[7] == "-Rescue"
12/21 05:26:03 argv[8] == "simple.dag.rescue"
12/21 05:26:03 argv[9] == "-Condorlog"
12/21 05:26:03 argv[10] == "simple.dag.dummy_log"
12/21 05:26:03 DAG Lockfile will be written to simple.dag.lock
12/21 05:26:03 DAG Input file is simple.dag
12/21 05:26:03 Rescue DAG will be written to simple.dag.rescue
12/21 05:26:03 All DAG node user log files:
12/21 05:26:03 /specific/a/home/cc/cs/alainroy/simple.log
12/21 05:26:03 Parsing simple.dag ...
12/21 05:26:03 Dag contains 4 total jobs
12/21 05:26:03 Deleting any older versions of log files...
12/21 05:26:03 Deleting older version of /specific/a/home/cc/cs/alainroy/simple.log
12/21 05:26:03 Bootstrapping...
12/21 05:26:03 Number of pre-completed jobs: 0
12/21 05:26:03 Registering condor_event_timer...
12/21 05:26:04 Submitting Condor Job Setup ...
12/21 05:26:04 submitting: condor_submit -a 'dag_node_name = Setup' -a '+DAGManJobID = 6106.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.setup.submit 2>&1
12/21 05:26:05 assigned Condor ID (6107.0.0)
12/21 05:26:05 Just submitted 1 job this cycle...
12/21 05:26:05 Event: ULOG_SUBMIT for Condor Job Setup (6107.0.0)
12/21 05:26:05 Of 4 nodes total:
12/21 05:26:05  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:26:05   ===     ===      ===     ===     ===        ===      ===
12/21 05:26:05     0       0        1       0       0          3        0
12/21 05:26:45 Event: ULOG_EXECUTE for Condor Job Setup (6107.0.0)
12/21 05:26:45 Event: ULOG_JOB_TERMINATED for Condor Job Setup (6107.0.0)
12/21 05:26:45 Job Setup completed successfully.
12/21 05:26:45 Of 4 nodes total:
12/21 05:26:45  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:26:45   ===     ===      ===     ===     ===        ===      ===
12/21 05:26:45     1       0        0       0       2          1        0
12/21 05:26:51 Submitting Condor Job Work1 ...
12/21 05:26:51 submitting: condor_submit -a 'dag_node_name = Work1' -a '+DAGManJobID = 6106.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work1.submit 2>&1
12/21 05:26:52 assigned Condor ID (6108.0.0)
12/21 05:26:53 Submitting Condor Job Work2 ...
12/21 05:26:53 submitting: condor_submit -a 'dag_node_name = Work2' -a '+DAGManJobID = 6106.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work2.submit 2>&1
12/21 05:26:54 assigned Condor ID (6109.0.0)
12/21 05:26:54 Just submitted 2 jobs this cycle...
12/21 05:26:54 Event: ULOG_SUBMIT for Condor Job Work1 (6108.0.0)
12/21 05:26:54 Event: ULOG_SUBMIT for Condor Job Work2 (6109.0.0)
12/21 05:26:54 Of 4 nodes total:
12/21 05:26:54  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:26:54   ===     ===      ===     ===     ===        ===      ===
12/21 05:26:54     1       0        2       0       0          1        0
12/21 05:27:34 Event: ULOG_EXECUTE for Condor Job Work1 (6108.0.0)
12/21 05:27:39 Event: ULOG_EXECUTE for Condor Job Work2 (6109.0.0)
12/21 05:27:39 Event: ULOG_JOB_TERMINATED for Condor Job Work1 (6108.0.0)
12/21 05:27:39 Job Work1 completed successfully.
12/21 05:27:39 Event: ULOG_JOB_TERMINATED for Condor Job Work2 (6109.0.0)
12/21 05:27:39 Job Work2 failed with status 1.
12/21 05:27:39 Of 4 nodes total:
12/21 05:27:39  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:27:39   ===     ===      ===     ===     ===        ===      ===
12/21 05:27:39     2       0        0       0       0          1        1
12/21 05:27:39 ERROR: the following job(s) failed:
12/21 05:27:39 ---------------------- Job ----------------------
12/21 05:27:39 Node Name: Work2
12/21 05:27:39 NodeID: 2
12/21 05:27:39 Node Status: STATUS_ERROR
12/21 05:27:39 Error: Job failed with status 1
12/21 05:27:39 Job Submit File: job.work2.submit
12/21 05:27:39 Condor Job ID: (6109.0.0)
12/21 05:27:39 Q_PARENTS: 0,
12/21 05:27:39 Q_WAITING:
12/21 05:27:39 Q_CHILDREN: 3,
12/21 05:27:39 ---------------------------------------
12/21 05:27:39 Aborting DAG...
12/21 05:27:39 Writing Rescue DAG to simple.dag.rescue...
12/21 05:27:39 **** condor_scheduniv_exec.6106.0 (condor_DAGMAN) EXITING WITH STATUS 1
DAGMan notices that one of the jobs failed because its exit code was non-zero. DAGMan ran as much of the DAG as possible and logged enough information to continue the run once the problem is resolved.
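When the dagman.out file gets long, a plain grep for "failed" is a quick way to pull out the relevant lines; the matches below are the ones from the output above:

% grep failed simple.dag.dagman.out
12/21 05:27:39 Job Work2 failed with status 1.
12/21 05:27:39 ERROR: the following job(s) failed:
12/21 05:27:39 Error: Job failed with status 1

Notice in the full log that DAGMan learns the exit status from the job's ULOG_JOB_TERMINATED event in the user log, which is why returning 1 from main() is all it takes for the node to be marked as failed.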
Look at the rescue DAG. It is structurally the same as your original DAG, but nodes that finished are marked DONE. (DAGMan also reorganized the file.) When you submit the rescue DAG, the DONE nodes will be skipped.
nova 4% cat simple.dag.rescue
# Rescue DAG file, created after running
#   the simple.dag DAG file
#
# Total number of Nodes: 4
# Nodes premarked DONE: 2
# Nodes that failed: 1
#   Work2,

JOB Setup job.setup.submit DONE
JOB Work1 job.work1.submit DONE
JOB Work2 job.work2.submit
JOB Final job.finalize.submit

PARENT Setup CHILD Work1 Work2
PARENT Work1 CHILD Final
PARENT Work2 CHILD Final
From the comment near the top, we know that the Work2 node failed. Let's "fix" it.
nova 5% rm simplefail
nova 6% cp simple simplefail
Now we can submit our rescue DAG and DAGMan will pick up where it left off. (If you don't fix the problem, DAGMan will just generate another rescue DAG, this time "simple.dag.rescue.rescue".)
nova 7% condor_submit_dag simple.dag.rescue
-----------------------------------------------------------------------
File for submitting this DAG to Condor          : simple.dag.rescue.condor.sub
Log of DAGMan debugging messages                : simple.dag.rescue.dagman.out
Log of Condor library debug messages            : simple.dag.rescue.lib.out
Log of the life of condor_dagman itself         : simple.dag.rescue.dagman.log
Condor Log file for all Condor jobs of this DAG : simple.dag.rescue.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 6110.
-----------------------------------------------------------------------
Watch what DAGMan does:
nova 8% tail --lines=500 -f simple.dag.rescue.dagman.out
12/21 05:54:01 ******************************************************
12/21 05:54:01 ** condor_scheduniv_exec.6110.0 (CONDOR_DAGMAN) STARTING UP
12/21 05:54:01 ** /specific/a/home/cc/cs/condor/hosts/nova/spool/cluster6110.ickpt.subproc0
12/21 05:54:01 ** $CondorVersion: 6.6.7 Oct 11 2004 $
12/21 05:54:01 ** $CondorPlatform: I386-LINUX_RH72 $
12/21 05:54:01 ** PID = 4847
12/21 05:54:01 ******************************************************
12/21 05:54:01 Using config file: /usr/local/lib/condor/etc/condor_config
12/21 05:54:01 Using local config files: /usr/local/lib/condor/etc/nodes/condor_config.nova
12/21 05:54:01 DaemonCore: Command Socket at <132.67.192.133:40638>
12/21 05:54:01 argv[0] == "condor_scheduniv_exec.6110.0"
12/21 05:54:01 argv[1] == "-Debug"
12/21 05:54:01 argv[2] == "3"
12/21 05:54:01 argv[3] == "-Lockfile"
12/21 05:54:01 argv[4] == "simple.dag.rescue.lock"
12/21 05:54:01 argv[5] == "-Dag"
12/21 05:54:01 argv[6] == "simple.dag.rescue"
12/21 05:54:01 argv[7] == "-Rescue"
12/21 05:54:01 argv[8] == "simple.dag.rescue.rescue"
12/21 05:54:01 argv[9] == "-Condorlog"
12/21 05:54:01 argv[10] == "simple.dag.rescue.dummy_log"
12/21 05:54:01 DAG Lockfile will be written to simple.dag.rescue.lock
12/21 05:54:01 DAG Input file is simple.dag.rescue
12/21 05:54:01 Rescue DAG will be written to simple.dag.rescue.rescue
12/21 05:54:01 All DAG node user log files:
12/21 05:54:01 /specific/a/home/cc/cs/alainroy/simple.log
12/21 05:54:01 Parsing simple.dag.rescue ...
12/21 05:54:01 Dag contains 4 total jobs
12/21 05:54:01 Deleting any older versions of log files...
12/21 05:54:01 Deleting older version of /specific/a/home/cc/cs/alainroy/simple.log
12/21 05:54:01 Bootstrapping...
12/21 05:54:01 Number of pre-completed jobs: 2
12/21 05:54:01 Registering condor_event_timer...
12/21 05:54:03 Submitting Condor Job Work2 ...
12/21 05:54:03 submitting: condor_submit -a 'dag_node_name = Work2' -a '+DAGManJobID = 6110.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work2.submit 2>&1
12/21 05:54:04 assigned Condor ID (6111.0.0)
12/21 05:54:04 Just submitted 1 job this cycle...
12/21 05:54:04 Event: ULOG_SUBMIT for Condor Job Work2 (6111.0.0)
12/21 05:54:04 Of 4 nodes total:
12/21 05:54:04  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:54:04   ===     ===      ===     ===     ===        ===      ===
12/21 05:54:04     2       0        1       0       0          1        0
12/21 05:54:39 Event: ULOG_EXECUTE for Condor Job Work2 (6111.0.0)
12/21 05:54:44 Event: ULOG_JOB_TERMINATED for Condor Job Work2 (6111.0.0)
12/21 05:54:44 Job Work2 completed successfully.
12/21 05:54:44 Of 4 nodes total:
12/21 05:54:44  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:54:44   ===     ===      ===     ===     ===        ===      ===
12/21 05:54:44     3       0        0       0       1          0        0
12/21 05:54:50 Submitting Condor Job Final ...
12/21 05:54:50 submitting: condor_submit -a 'dag_node_name = Final' -a '+DAGManJobID = 6110.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.finalize.submit 2>&1
12/21 05:54:51 assigned Condor ID (6112.0.0)
12/21 05:54:51 Just submitted 1 job this cycle...
12/21 05:54:51 Event: ULOG_SUBMIT for Condor Job Final (6112.0.0)
12/21 05:54:51 Of 4 nodes total:
12/21 05:54:51  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:54:51   ===     ===      ===     ===     ===        ===      ===
12/21 05:54:51     3       0        1       0       0          0        0
12/21 05:55:31 Event: ULOG_EXECUTE for Condor Job Final (6112.0.0)
12/21 05:55:36 Event: ULOG_JOB_TERMINATED for Condor Job Final (6112.0.0)
12/21 05:55:36 Job Final completed successfully.
12/21 05:55:36 Of 4 nodes total:
12/21 05:55:36  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 05:55:36   ===     ===      ===     ===     ===        ===      ===
12/21 05:55:36     4       0        0       0       0          0        0
12/21 05:55:36 All jobs Completed!
12/21 05:55:36 **** condor_scheduniv_exec.6110.0 (condor_DAGMAN) EXITING WITH STATUS 0
Success! Now go ahead and clean up.
nova 9% rm simple.dag.*
nova 10% rm results.*