Condor-G and DAGMan Hands-On Lab

Part VI: Handling Jobs That Fail with DAGMan

DAGMan can handle the situation where some of the nodes in a DAG fail. DAGMan will run as many nodes as possible, then create a rescue DAG that makes it easy to continue once the problem is fixed.

Let's modify B.sh so that it fails. We will accomplish this by writing to /dev/stderr. Note that simply returning a non-zero exit code is not enough for Condor-G to notice an error: Globus does not retain the return values of programs, so Condor-G will not see the failure. We need to do a little more work here. We will write to /dev/stderr and have a POST script check the contents of the resulting error file, reporting a failure by returning a non-zero exit code. (POST scripts do not run remotely, so in this case the exit code will be noticed and interpreted correctly by DAGMan.)

First let's create a new executable for node B:

$ cat > B-fail.sh
#!/bin/sh
A=`cat A.output`
echo $((A+10)) > B.output

echo "ERROR: I'm going to fail on purpose!" >> /dev/stderr
exit 1
<Ctrl-D>

$ cat B-fail.sh
#!/bin/sh
A=`cat A.output`
echo $((A+10)) > B.output

echo "ERROR: I'm going to fail on purpose!" >> /dev/stderr
exit 1

$ chmod +x B-fail.sh
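
Before wiring the new script into the DAG, you can sanity-check it locally (a quick check; it assumes the A.output produced in the earlier part is still in the current directory):

$ ./B-fail.sh; echo "exit code: $?"
ERROR: I'm going to fail on purpose!
exit code: 1

Locally the message goes to your terminal's stderr; when the job runs under Condor-G, the error=B.error line we add to the submit file below captures it in B.error instead.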

Modify B.submit to do two things differently: use the new executable, B-fail.sh, and send the job's standard error to its own file, B.error, so the POST script can inspect it:

$ cat > B.submit
executable=B-fail.sh
universe=globus
globusscheduler=my-gatekeeper.cs.wisc.edu:/jobmanager-fork
log=results.log
error=B.error
notification=never
transfer_input_files=A.output
transfer_output_files=B.output
WhenToTransferOutput=ALWAYS
queue
<Ctrl-D>
$ cat B.submit
executable=B-fail.sh
universe=globus
globusscheduler=my-gatekeeper.cs.wisc.edu:/jobmanager-fork
log=results.log
error=B.error
notification=never
transfer_input_files=A.output
transfer_output_files=B.output
WhenToTransferOutput=ALWAYS
queue

Now create a script to check an error file. It returns a non-zero (error) exit code if the specified error file exists and is non-empty.

$ cat > postscript_checker
#!/bin/sh
test ! -e "$1" || test -z "`cat $1`"
<Ctrl-D>

$ cat postscript_checker
#!/bin/sh
test ! -e "$1" || test -z "`cat $1`"

$ chmod a+x postscript_checker
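
You can convince yourself the test logic is right by exercising the checker by hand (a quick sketch; the file names here are only for illustration):

$ ./postscript_checker no-such-file; echo $?
0
$ touch empty.err; ./postscript_checker empty.err; echo $?
0
$ echo "something went wrong" > full.err; ./postscript_checker full.err; echo $?
1
$ rm -f empty.err full.err

A missing or empty error file counts as success (exit code 0); any error output makes the check fail (exit code 1), which is exactly the signal DAGMan expects from a POST script.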

Modify your ABCD.dag to add a line that specifies the new script as the POST script for node B. The POST script will check B.error and return an error code if it is non-empty (i.e., if node B wrote out error messages).

$ cat >> ABCD.dag 
Script POST B postscript_checker B.error
<Ctrl-D>
$ cat ABCD.dag
Job A A.submit
Job B B.submit
Job C C.submit
Job D D.submit
Parent A Child B C
Parent B C Child D
Script POST B postscript_checker B.error
$

 

Submit the DAG again.

$ condor_submit_dag ABCD.dag

Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : ABCD.dag.condor.sub
Log of DAGMan debugging messages : ABCD.dag.dagman.out
Log of Condor library debug messages : ABCD.dag.lib.out
Log of the life of condor_dagman itself : ABCD.dag.dagman.log

Condor Log file for all Condor jobs of this DAG: results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 24.
-----------------------------------------------------------------------

Use watch_condor_q to watch the jobs until they finish.
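
If watch_condor_q is not available on your machine, a plain periodic condor_q gives much the same view (a minimal substitute using the standard watch utility):

$ watch -n 10 condor_q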

In separate windows run "tail -f --lines=500 results.log" and "tail -f --lines=500 ABCD.dag.dagman.out" to monitor the jobs' progress.

$ tail -f ABCD.dag.dagman.out

Check your results:

$ cat ABCD.dag.dagman.out
4/12 22:34:51 Registering condor_event_timer...
4/12 22:34:52 Submitting Condor Job A ...
4/12 22:34:52 submitting: condor_submit -a 'dag_node_name = A' -a '+DAGManJobID = 36.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' A.submit 2>&1
4/12 22:34:52 assigned Condor ID (37.0.0)
4/12 22:34:52 Event: ULOG_SUBMIT for Condor Job A (37.0.0)
A little later...

4/12 22:37:35 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job B (38.0.0)
4/12 22:37:35 POST Script of Job B failed with status 1
4/12 22:37:35 Of 4 nodes total:
4/12 22:37:35   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
4/12 22:37:35    ===     ===      ===     ===     ===        ===      ===
4/12 22:37:35      2       0        0       0       0          1        1
4/12 22:37:35 ERROR: the following job(s) failed:
4/12 22:37:35 ---------------------- Job ----------------------
4/12 22:37:35 Node Name: B
4/12 22:37:35 NodeID: 1
4/12 22:37:35 Node Status: STATUS_ERROR
4/12 22:37:35 Error: POST Script failed with status 1
4/12 22:37:35 Job Submit File: B.submit
4/12 22:37:35 POST Script: postscript_checker B.error
4/12 22:37:35 Condor Job ID: (38.0.0)
4/12 22:37:35 Q_PARENTS: 0, <END>
4/12 22:37:35 Q_WAITING: <END>
4/12 22:37:35 Q_CHILDREN: 3, <END>
4/12 22:37:35 --------------------------------------- <END>
4/12 22:37:35 Aborting DAG...
4/12 22:37:35 Writing Rescue DAG to ABCD.dag.rescue...
4/12 22:37:35 **** condor_scheduniv_exec.36.0 (condor_DAGMAN) EXITING WITH STATUS 1

DAGMan noticed that one of the jobs failed. It ran as much of the DAG as possible and logged enough information to continue the run once the problem is resolved.

Look at the rescue DAG. It is structurally the same as your original DAG, but nodes that finished are marked DONE. (DAGMan also reorganized the file.) When you submit the rescue DAG, the DONE nodes will be skipped.

$ cat ABCD.dag.rescue 
# Rescue DAG file, created after running
# the ABCD.dag DAG file
#
# Total number of Nodes: 4
# Nodes premarked DONE: 2
# Nodes that failed: 1
# B,<ENDLIST>
JOB A A.submit DONE

JOB B B.submit
SCRIPT POST B postscript_checker B.error

JOB C C.submit DONE

JOB D D.submit


PARENT A CHILD B C
PARENT B CHILD D
PARENT C CHILD D

So we know there is a problem with the B step. Let's "fix" it by copying the non-failing script onto B-fail.sh.

$ rm B-fail.sh
$ cp B.sh B-fail.sh

Also delete B.error so that the POST script doesn't trigger on the message from the last run.

$ rm B.error 

Submitting the Rescue DAG

Now we can submit our rescue DAG. (If you hadn't fixed the problem, DAGMan would generate yet another rescue DAG, this time "ABCD.dag.rescue.rescue".) In separate windows run "tail -f --lines=500 results.log" and "tail -f --lines=500 ABCD.dag.rescue.dagman.out" to monitor the jobs' progress.

$ condor_submit_dag ABCD.dag.rescue 

Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : ABCD.dag.rescue.condor.sub
Log of DAGMan debugging messages : ABCD.dag.rescue.dagman.out
Log of Condor library debug messages : ABCD.dag.rescue.lib.out
Log of the life of condor_dagman itself : ABCD.dag.rescue.dagman.log

Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 27.
-----------------------------------------------------------------------

Continue to watch the DAGMan output in ABCD.dag.rescue.dagman.out. Eventually it should say:

4/12 23:06:10 Of 4 nodes total:
4/12 23:06:10   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
4/12 23:06:10    ===     ===      ===     ===     ===        ===      ===
4/12 23:06:10      4       0        0       0       0          0        0
4/12 23:06:10 All jobs Completed!
4/12 23:06:10 **** condor_scheduniv_exec.40.0 (condor_DAGMAN) EXITING WITH STATUS 0

Note that D.output now exists.

$ cat D.output
180
Congratulations! You're now ready for the Grid.
