Title: Condor Practical
Subtitle: Handling a DAG that fails
Tutors: Alain Roy and Todd Tannenbaum
Authors: Alain Roy and Ben Burnett

8.0 Handling a DAG that fails

DAGMan can handle the situation where some of the nodes in a DAG fail. DAGMan will run as many nodes as possible, then create a rescue DAG, making it easy to continue when the problem is fixed.
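
As a reminder, here is simple.dag from the previous exercise; you will see the same structure echoed in the rescue DAG below. Setup runs first, Work1 and Work2 run in parallel, and Final runs last:

JOB Setup job.setup.sub
JOB Work1 job.work1.sub
JOB Work2 job.work2.sub
JOB Final job.finalize.sub
PARENT Setup CHILD Work1 Work2
PARENT Work1 CHILD Final
PARENT Work2 CHILD Final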

Let's create an alternate program that fails. Name it simplefail.bat and have it exit with an error. In real life, of course, we wouldn't have a job coded to always fail; failures would just happen on occasion. It will look like this:

@echo off
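rem Exit with a non-zero status so Condor treats this job as failed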
exit 1

Modify job.work2.sub to run simplefail.bat instead of simple.bat:

Universe   = vanilla
Executable = simplefail.bat
Arguments  = 4 12
Log        = simple.log.txt
Output     = results.work2.out.txt
Error      = results.work2.err.txt
Queue

Submit the DAG again:

C:\condor-test> condor_submit_dag -force simple.dag

Checking all your submit files for log file names.
This might take a while... 
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : simple.dag.condor.sub
Log of DAGMan debugging messages                 : simple.dag.dagman.out
Log of Condor library debug messages             : simple.dag.lib.out
Log of the life of condor_dagman itself          : simple.dag.dagman.log

Condor Log file for all jobs of this DAG         : C:\condor-test\simple.log.txt
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 31.
-----------------------------------------------------------------------

Use watch_condor_q.bat to watch the jobs until they finish.
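
For example:

C:\condor-test> watch_condor_q.bat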

In a separate window, use more simple.dag.dagman.out to see what DAGMan does.

11/20 12:15:59 ******************************************************
11/20 12:15:59 ** condor_scheduniv_exec.31.0 (CONDOR_DAGMAN) STARTING UP
11/20 12:15:59 ** C:\condor\bin\condor_dagman.exe
11/20 12:15:59 ** $CondorVersion: 6.9.5 Nov 18 2007 $
11/20 12:15:59 ** $CondorPlatform: INTEL-WINNT50 $
11/20 12:15:59 ** PID = 1412
11/20 12:15:59 ** Log last touched time unavailable (No such file or directory)
11/20 12:15:59 ******************************************************
11/20 12:15:59 Using config source: C:\condor\condor_config
11/20 12:15:59 Using local config sources: 
11/20 12:15:59    C:\condor/condor_config.local
11/20 12:15:59 DaemonCore: Command Socket at <128.105.48.96:65328>
11/20 12:15:59 DAGMAN_SUBMIT_DELAY setting: 0
11/20 12:15:59 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
11/20 12:15:59 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
11/20 12:15:59 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
11/20 12:15:59 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
11/20 12:15:59 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
11/20 12:15:59 DAGMAN_RETRY_NODE_FIRST setting: 0
11/20 12:15:59 DAGMAN_MAX_JOBS_IDLE setting: 0
11/20 12:15:59 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
11/20 12:15:59 DAGMAN_MUNGE_NODE_NAMES setting: 1
11/20 12:15:59 DAGMAN_DELETE_OLD_LOGS setting: 1
11/20 12:15:59 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
11/20 12:15:59 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
11/20 12:15:59 DAGMAN_ABORT_DUPLICATES setting: 1
11/20 12:15:59 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
11/20 12:15:59 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
11/20 12:15:59 argv[0] == "condor_scheduniv_exec.31.0"
11/20 12:15:59 argv[1] == "-Debug"
11/20 12:15:59 argv[2] == "3"
11/20 12:15:59 argv[3] == "-Lockfile"
11/20 12:15:59 argv[4] == "simple.dag.lock"
11/20 12:15:59 argv[5] == "-Condorlog"
11/20 12:15:59 argv[6] == "C:\condor-test\simple.log.txt"
11/20 12:15:59 argv[7] == "-Dag"
11/20 12:15:59 argv[8] == "simple.dag"
11/20 12:15:59 argv[9] == "-Rescue"
11/20 12:15:59 argv[10] == "simple.dag.rescue"
11/20 12:15:59 DAG Lockfile will be written to simple.dag.lock
11/20 12:15:59 DAG Input file is simple.dag
11/20 12:15:59 Rescue DAG will be written to simple.dag.rescue
11/20 12:15:59 All DAG node user log files:
11/20 12:15:59   C:\condor-test\simple.log.txt (Condor)
11/20 12:15:59 Parsing simple.dag ...
11/20 12:15:59 Dag contains 4 total jobs
11/20 12:15:59 Truncating any older versions of log files...
11/20 12:15:59 MultiLogFiles: truncating older version of C:\condor-test\simple.log.txt
11/20 12:15:59 Sleeping for 12 seconds to ensure ProcessId uniqueness
11/20 12:16:11 WARNING: ProcessId not confirmed unique
11/20 12:16:11 Bootstrapping...
11/20 12:16:11 Number of pre-completed nodes: 0
11/20 12:16:11 Registering condor_event_timer...
11/20 12:16:12 Submitting Condor Node Setup job(s)...
11/20 12:16:12 submitting: condor_submit -a dag_node_name' '=' 'Setup -a +DAGManJobId' '=' \
               '31 -a DAGManJobId' '=' '31 -a submit_event_notes' '=' 'DAG' 'Node:' 'Setup \
               -a +DAGParentNodeNames' '=' '"" job.setup.sub
11/20 12:16:12 From submit: Submitting job(s).
11/20 12:16:12 From submit: Logging submit event(s).
11/20 12:16:12 From submit: 1 job(s) submitted to cluster 32.
11/20 12:16:12 	assigned Condor ID (32.0)
11/20 12:16:12 Just submitted 1 job this cycle...
11/20 12:16:12 Event: ULOG_SUBMIT for Condor Node Setup (32.0)
11/20 12:16:12 Number of idle job procs: 1
11/20 12:16:12 Of 4 nodes total:
11/20 12:16:12  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 12:16:12   ===     ===      ===     ===     ===        ===      ===
11/20 12:16:12     0       0        1       0       0          3        0
11/20 12:21:07 Event: ULOG_EXECUTE for Condor Node Setup (32.0)
11/20 12:21:07 Number of idle job procs: 0
11/20 12:21:07 Event: ULOG_JOB_TERMINATED for Condor Node Setup (32.0)
11/20 12:21:07 Node Setup job proc (32.0) completed successfully.
11/20 12:21:07 Node Setup job completed
11/20 12:21:07 Number of idle job procs: 0
11/20 12:21:07 Of 4 nodes total:
11/20 12:21:07  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 12:21:07   ===     ===      ===     ===     ===        ===      ===
11/20 12:21:07     1       0        0       0       2          1        0
11/20 12:21:12 Submitting Condor Node Work1 job(s)...
11/20 12:21:12 submitting: condor_submit -a dag_node_name' '=' 'Work1 -a +DAGManJobId' '=' \
               '31 -a DAGManJobId' '=' '31 -a submit_event_notes' '=' 'DAG' 'Node:' 'Work1 \
               -a +DAGParentNodeNames' '=' '"Setup" job.work1.sub
11/20 12:21:12 From submit: Submitting job(s).
11/20 12:21:12 From submit: Logging submit event(s).
11/20 12:21:12 From submit: 1 job(s) submitted to cluster 33.
11/20 12:21:12 	assigned Condor ID (33.0)
11/20 12:21:12 Submitting Condor Node Work2 job(s)...
11/20 12:21:12 submitting: condor_submit -a dag_node_name' '=' 'Work2 -a +DAGManJobId' '=' \
               '31 -a DAGManJobId' '=' '31 -a submit_event_notes' '=' 'DAG' 'Node:' 'Work2 -a \
               +DAGParentNodeNames' '=' '"Setup" job.work2.sub
11/20 12:21:13 From submit: Submitting job(s).
11/20 12:21:13 From submit: Logging submit event(s).
11/20 12:21:13 From submit: 1 job(s) submitted to cluster 34.
11/20 12:21:13 	assigned Condor ID (34.0)
11/20 12:21:13 Just submitted 2 jobs this cycle...
11/20 12:21:13 Event: ULOG_SUBMIT for Condor Node Work1 (33.0)
11/20 12:21:13 Number of idle job procs: 1
11/20 12:21:13 Event: ULOG_SUBMIT for Condor Node Work2 (34.0)
11/20 12:21:13 Number of idle job procs: 2
11/20 12:21:13 Of 4 nodes total:
11/20 12:21:13  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 12:21:13   ===     ===      ===     ===     ===        ===      ===
11/20 12:21:13     1       0        2       0       0          1        0
11/20 12:21:28 Event: ULOG_EXECUTE for Condor Node Work2 (34.0)
11/20 12:21:28 Number of idle job procs: 1
11/20 12:21:28 Event: ULOG_EXECUTE for Condor Node Work1 (33.0)
11/20 12:21:28 Number of idle job procs: 0
11/20 12:21:28 Event: ULOG_JOB_TERMINATED for Condor Node Work2 (34.0)
11/20 12:21:28 Node Work2 job proc (34.0) failed with status 1.
11/20 12:21:28 Number of idle job procs: 0
11/20 12:21:28 Event: ULOG_JOB_TERMINATED for Condor Node Work1 (33.0)
11/20 12:21:28 Node Work1 job proc (33.0) completed successfully.
11/20 12:21:28 Node Work1 job completed
11/20 12:21:28 Number of idle job procs: 0
11/20 12:21:28 Of 4 nodes total:
11/20 12:21:28  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 12:21:28   ===     ===      ===     ===     ===        ===      ===
11/20 12:21:28     2       0        0       0       0          1        1
11/20 12:21:28 ERROR: the following job(s) failed:
11/20 12:21:28 ---------------------- Job ----------------------
11/20 12:21:28       Node Name: Work2
11/20 12:21:28          NodeID: 2
11/20 12:21:28     Node Status: STATUS_ERROR    
11/20 12:21:28 Node return val: 1
11/20 12:21:28           Error: Job proc (34.0) failed with status 1
11/20 12:21:28 Job Submit File: job.work2.sub
11/20 12:21:28   Condor Job ID: (34)
11/20 12:21:28       Q_PARENTS: 0, 
11/20 12:21:28       Q_WAITING: 
11/20 12:21:28      Q_CHILDREN: 3, 
11/20 12:21:28 ---------------------------------------	
11/20 12:21:28 Aborting DAG...
11/20 12:21:28 Writing Rescue DAG to simple.dag.rescue...
11/20 12:21:28 Note: 0 total job deferrals because of -MaxJobs limit (0)
11/20 12:21:28 Note: 0 total job deferrals because of -MaxIdle limit (0)
11/20 12:21:28 Note: 0 total job deferrals because of node category throttles
11/20 12:21:28 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
11/20 12:21:28 Note: 0 total POST script deferrals because of -MaxPost limit (0)
11/20 12:21:28 **** condor_scheduniv_exec.31.0 (condor_DAGMAN) EXITING WITH STATUS 1

DAGMan notices that one of the jobs failed because its exit code was non-zero. DAGMan ran as much of the DAG as possible and logged enough information to continue the run once the problem is resolved.
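
You can confirm this exit status yourself at the command prompt. Note that exit 1 in a batch file terminates the shell that runs it, so run the script in a child cmd with /c; the parent shell then sees the exit code in %ERRORLEVEL%:

C:\condor-test> cmd /c simplefail.bat
C:\condor-test> echo %ERRORLEVEL%
1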

Look at the rescue DAG. Structurally it is the same as your original DAG, but the nodes that finished are marked DONE. (DAGMan also reorganized the file.) When you submit the rescue DAG, the DONE nodes will be skipped.

C:\condor-test> more simple.dag.rescue
# Rescue DAG file, created after running
#   the simple.dag DAG file
#
# Total number of Nodes: 4
# Nodes premarked DONE: 2
# Nodes that failed: 1
#   Work2,

JOB Setup job.setup.sub DONE

JOB Work1 job.work1.sub DONE

JOB Work2 job.work2.sub

JOB Final job.finalize.sub


PARENT Setup CHILD Work1 Work2
PARENT Work1 CHILD Final
PARENT Work2 CHILD Final

From the comment near the top, we know that the Work2 node failed. Let's "fix" it.

C:\condor-test> move simplefail.bat simplefail.bat-
C:\condor-test> copy simple.bat simplefail.bat

Now we can submit our rescue DAG and DAGMan will pick up where it left off. (If you hadn't fixed the problem, DAGMan would generate yet another rescue DAG, this time named simple.dag.rescue.rescue.)

C:\condor-test> condor_submit_dag simple.dag.rescue

Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : simple.dag.rescue.condor.sub
Log of DAGMan debugging messages                 : simple.dag.rescue.dagman.out
Log of Condor library output                     : simple.dag.rescue.lib.out
Log of Condor library error messages             : simple.dag.rescue.lib.err
Log of the life of condor_dagman itself          : simple.dag.rescue.dagman.log

Condor Log file for all jobs of this DAG         : C:\condor-test\simple.log.txt

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 37.
-----------------------------------------------------------------------

Watch what DAGMan does:

C:\condor-test> more simple.dag.rescue.dagman.out
11/20 13:31:57 ******************************************************
11/20 13:31:57 ** condor_scheduniv_exec.37.0 (CONDOR_DAGMAN) STARTING UP
11/20 13:31:57 ** C:\condor\bin\condor_dagman.exe
11/20 13:31:57 ** $CondorVersion: 6.9.5 Nov 18 2007 $
11/20 13:31:57 ** $CondorPlatform: INTEL-WINNT50 $
11/20 13:31:57 ** PID = 3068
11/20 13:31:57 ** Log last touched 11/20 13:23:25
11/20 13:31:57 ******************************************************
11/20 13:31:57 Using config source: C:\condor\condor_config
11/20 13:31:57 Using local config sources:
11/20 13:31:57    C:\condor/condor_config.local
11/20 13:31:57 DaemonCore: Command Socket at <128.105.48.96:50084>
11/20 13:31:57 DAGMAN_SUBMIT_DELAY setting: 0
11/20 13:31:57 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
11/20 13:31:57 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
11/20 13:31:57 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
11/20 13:31:57 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
11/20 13:31:57 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
11/20 13:31:57 DAGMAN_RETRY_NODE_FIRST setting: 0
11/20 13:31:57 DAGMAN_MAX_JOBS_IDLE setting: 0
11/20 13:31:57 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
11/20 13:31:57 DAGMAN_MUNGE_NODE_NAMES setting: 1
11/20 13:31:57 DAGMAN_DELETE_OLD_LOGS setting: 1
11/20 13:31:57 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
11/20 13:31:57 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
11/20 13:31:57 DAGMAN_ABORT_DUPLICATES setting: 1
11/20 13:31:57 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
11/20 13:31:57 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
11/20 13:31:57 argv[0] == "condor_scheduniv_exec.37.0"
11/20 13:31:57 argv[1] == "-Debug"
11/20 13:31:57 argv[2] == "3"
11/20 13:31:57 argv[3] == "-Lockfile"
11/20 13:31:57 argv[4] == "simple.dag.rescue.lock"
11/20 13:31:57 argv[5] == "-Condorlog"
11/20 13:31:57 argv[6] == "C:\condor-test\simple.log.txt"
11/20 13:31:57 argv[7] == "-Dag"
11/20 13:31:57 argv[8] == "simple.dag.rescue"
11/20 13:31:57 argv[9] == "-Rescue"
11/20 13:31:57 argv[10] == "simple.dag.rescue.rescue"
11/20 13:31:57 DAG Lockfile will be written to simple.dag.rescue.lock
11/20 13:31:57 DAG Input file is simple.dag.rescue
11/20 13:31:57 Rescue DAG will be written to simple.dag.rescue.rescue
11/20 13:31:57 All DAG node user log files:
11/20 13:31:57   C:\condor-test\simple.log.txt (Condor)
11/20 13:31:57 Parsing simple.dag.rescue ...
11/20 13:31:57 Dag contains 4 total jobs
11/20 13:31:57 Truncating any older versions of log files...
11/20 13:31:57 MultiLogFiles: truncating older version of C:\condor-test\simple.log.txt
11/20 13:31:57 Sleeping for 12 seconds to ensure ProcessId uniqueness
11/20 13:32:09 WARNING: ProcessId not confirmed unique
11/20 13:32:09 Bootstrapping...
11/20 13:32:09 Number of pre-completed nodes: 2
11/20 13:32:09 Registering condor_event_timer...
11/20 13:32:10 Submitting Condor Node Work2 job(s)...
11/20 13:32:10 submitting: condor_submit -a dag_node_name' '=' 'Work2 -a +DAGMan
               JobId' '=' '37 -a DAGManJobId' '=' '37 -a submit_event_notes' '=' 'DAG' 'Node:'
               'Work2 -a +DAGParentNodeNames' '=' '"Setup" job.work2.sub
11/20 13:32:11 From submit: Submitting job(s).
11/20 13:32:11 From submit: Logging submit event(s).
11/20 13:32:11 From submit: 1 job(s) submitted to cluster 38.
11/20 13:32:11  assigned Condor ID (38.0)
11/20 13:32:11 Just submitted 1 job this cycle...
11/20 13:32:11 Event: ULOG_SUBMIT for Condor Node Work2 (38.0)
11/20 13:32:11 Number of idle job procs: 1
11/20 13:32:11 Of 4 nodes total:
11/20 13:32:11  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 13:32:11   ===     ===      ===     ===     ===        ===      ===
11/20 13:32:11     2       0        1       0       0          1        0
11/20 13:32:36 Event: ULOG_EXECUTE for Condor Node Work2 (38.0)
11/20 13:32:36 Number of idle job procs: 0
11/20 13:32:36 Event: ULOG_JOB_TERMINATED for Condor Node Work2 (38.0)
11/20 13:32:36 Node Work2 job proc (38.0) completed successfully.
11/20 13:32:36 Node Work2 job completed
11/20 13:32:36 Number of idle job procs: 0
11/20 13:32:36 Of 4 nodes total:
11/20 13:32:36  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 13:32:36   ===     ===      ===     ===     ===        ===      ===
11/20 13:32:36     3       0        0       0       1          0        0
11/20 13:32:41 Submitting Condor Node Final job(s)...
11/20 13:32:41 submitting: condor_submit -a dag_node_name' '=' 'Final -a +DAGMan
               JobId' '=' '37 -a DAGManJobId' '=' '37 -a submit_event_notes' '=' 'DAG' 'Node:'
               'Final -a +DAGParentNodeNames' '=' '"Work1,Work2" job.finalize.sub
11/20 13:32:41 From submit: Submitting job(s).
11/20 13:32:41 From submit: Logging submit event(s).
11/20 13:32:41 From submit: 1 job(s) submitted to cluster 39.
11/20 13:32:41  assigned Condor ID (39.0)
11/20 13:32:41 Just submitted 1 job this cycle...
11/20 13:32:41 Event: ULOG_SUBMIT for Condor Node Final (39.0)
11/20 13:32:41 Number of idle job procs: 1
11/20 13:32:41 Of 4 nodes total:
11/20 13:32:41  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 13:32:41   ===     ===      ===     ===     ===        ===      ===
11/20 13:32:41     3       0        1       0       0          0        0
11/20 13:32:56 Event: ULOG_EXECUTE for Condor Node Final (39.0)
11/20 13:32:56 Number of idle job procs: 0
11/20 13:32:56 Event: ULOG_JOB_TERMINATED for Condor Node Final (39.0)
11/20 13:32:56 Node Final job proc (39.0) completed successfully.
11/20 13:32:56 Node Final job completed
11/20 13:32:56 Number of idle job procs: 0
11/20 13:32:56 Of 4 nodes total:
11/20 13:32:56  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 13:32:56   ===     ===      ===     ===     ===        ===      ===
11/20 13:32:56     4       0        0       0       0          0        0
11/20 13:32:56 All jobs Completed!
11/20 13:32:56 Note: 0 total job deferrals because of -MaxJobs limit (0)
11/20 13:32:56 Note: 0 total job deferrals because of -MaxIdle limit (0)
11/20 13:32:56 Note: 0 total job deferrals because of node category throttles
11/20 13:32:56 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
11/20 13:32:56 Note: 0 total POST script deferrals because of -MaxPost limit (0)
11/20 13:32:56 **** condor_scheduniv_exec.37.0 (condor_DAGMAN) EXITING WITH STATUS 0

Success! Now go ahead and clean up.

C:\condor-test> del *.txt

Extra Credit: Go back to the original simplefail.bat program, the one that fails. Let's pretend that an exit code of 0 or 1 is considered success, and that the job has really failed only when it exits with some other value. Write a POST script that checks the return value. Check the Condor manual to see how to declare your POST script. Make sure your POST script works by having simplefail.bat return 0, 1, or 2.
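
If you get stuck, here is one possible sketch, not the only solution. The name checkstatus.bat is our own invention; $RETURN is the DAGMan macro that expands to the exit code of the node's job, and DAGMan considers the node failed if the POST script itself exits with a non-zero value. In simple.dag you would add a line like:

SCRIPT POST Work2 checkstatus.bat $RETURN

And checkstatus.bat might look like this:

@echo off
rem %1 is the job's exit code, passed in as $RETURN by DAGMan.
rem Treat 0 and 1 as success; anything else is a real failure.
if "%1"=="0" exit 0
if "%1"=="1" exit 0
exit 1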

Next: Submitting a VM job