DAGMan can handle a situation where some of the nodes in a DAG fail. DAGMan will run as many nodes as possible, then create a rescue DAG that makes it easy to continue once the problem is fixed.
Let's modify B.sh so that it fails. We will accomplish this by writing into /dev/stderr. Note that simply returning a non-zero exit code is not enough for Condor-G to notice an error: Globus does not retain the return values of programs, so Condor-G will not notice the failure. We therefore need to do a little more work. We will write into /dev/stderr and have a POST script check its contents, reporting an error by returning a non-zero exit code. (POST scripts do not run remotely, so in this case the exit code will be noticed and interpreted by DAGMan correctly.)
First let's create a new executable for node B:
$ cat > B-fail.sh
#!/bin/sh
A=`cat A.output`
echo $((A+10)) > B.output
echo "ERROR: I'm going to fail on purpose!" >> /dev/stderr
exit 1
<Ctrl-D>
$ cat B-fail.sh
#!/bin/sh
A=`cat A.output`
echo $((A+10)) > B.output
echo "ERROR: I'm going to fail on purpose!" >> /dev/stderr
exit 1
$ chmod +x B-fail.sh
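Before involving Condor-G at all, you can sanity-check the failing script locally in a scratch directory. This sketch recreates B-fail.sh so it stands alone; the value 100 for A.output is just a stand-in for node A's real output:

```shell
# Recreate B-fail.sh here so the check is self-contained.
cat > B-fail.sh <<'EOF'
#!/bin/sh
A=`cat A.output`
echo $((A+10)) > B.output
echo "ERROR: I'm going to fail on purpose!" >> /dev/stderr
exit 1
EOF
chmod +x B-fail.sh

echo 100 > A.output                        # stand-in for node A's output
sh B-fail.sh 2> B.error || status=$?       # redirect stderr as the submit file will
echo "node B exited with status $status"   # status is 1
cat B.error                                # the message the POST script will see
```

Note that redirecting stderr to B.error is exactly what the `error=B.error` line in the submit file does on the submit machine.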
Modify B.submit to do two things differently: use the new executable, and write standard error to B.error so the POST script can examine it:
$ cat > B.submit
executable=B-fail.sh
universe=globus
globusscheduler=my-gatekeeper.cs.wisc.edu:/jobmanager-fork
log=results.log
error=B.error
notification=never
transfer_input_files=A.output
transfer_output_files=B.output
WhenToTransferOutput=ALWAYS
queue
<Ctrl-D>
$ cat B.submit
executable=B-fail.sh
universe=globus
globusscheduler=my-gatekeeper.cs.wisc.edu:/jobmanager-fork
log=results.log
error=B.error
notification=never
transfer_input_files=A.output
transfer_output_files=B.output
WhenToTransferOutput=ALWAYS
queue
Now create a script to check an error file. It will return a non-zero (error) exit code if the specified error file exists and is non-empty.
$ cat > postscript_checker
#!/bin/sh
test ! -e "$1" || test -z "`cat $1`"
<Ctrl-D>
$ cat postscript_checker
#!/bin/sh
test ! -e "$1" || test -z "`cat $1`"
$ chmod a+x postscript_checker
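You can verify the checker's logic by hand before wiring it into the DAG. This quick exercise (the file names here are arbitrary) covers all three cases:

```shell
# Recreate the checker so this exercise is self-contained.
cat > postscript_checker <<'EOF'
#!/bin/sh
test ! -e "$1" || test -z "`cat $1`"
EOF
chmod a+x postscript_checker

rm -f missing.error                       # case 1: file does not exist
./postscript_checker missing.error && echo "missing file: success (exit 0)"

: > empty.error                           # case 2: file exists but is empty
./postscript_checker empty.error && echo "empty file: success (exit 0)"

echo "ERROR: oops" > full.error           # case 3: file contains error text
./postscript_checker full.error || echo "non-empty file: failure (exit 1)"
```

Only the third case makes DAGMan consider the node failed.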
Modify your ABCD.dag to add a line that specifies the new script as the POST script for node B. This script will check the error file and return an error code if the error file is non-empty (i.e. if node B wrote out error messages).
$ cat >> ABCD.dag
Script POST B postscript_checker B.error
<Ctrl-D>
$ cat ABCD.dag
Job A A.submit
Job B B.submit
Job C C.submit
Job D D.submit
Parent A Child B C
Parent B C Child D
Script POST B postscript_checker B.error
$
Submit the DAG again.
$ condor_submit_dag ABCD.dag
Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : ABCD.dag.condor.sub
Log of DAGMan debugging messages : ABCD.dag.dagman.out
Log of Condor library debug messages : ABCD.dag.lib.out
Log of the life of condor_dagman itself : ABCD.dag.dagman.log
Condor Log file for all Condor jobs of this DAG: results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 24.
-----------------------------------------------------------------------
Use watch_condor_q to watch the jobs until they finish.
In separate windows run "tail -f --lines=500 results.log" and "tail -f --lines=500 ABCD.dag.dagman.out" to monitor the job's progress.
$ tail -f ABCD.dag.dagman.out
Check your results:
$ cat ABCD.dag.dagman.out
A little later...
4/12 22:34:51 Registering condor_event_timer...
4/12 22:34:52 Submitting Condor Job A ...
4/12 22:34:52 submitting: condor_submit -a 'dag_node_name = A' -a '+DAGManJobID = 36.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' A.submit 2>&1
4/12 22:34:52 assigned Condor ID (37.0.0)
4/12 22:34:52 Event: ULOG_SUBMIT for Condor Job A (37.0.0)
4/12 22:37:35 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job B (38.0.0)
4/12 22:37:35 POST Script of Job B failed with status 1
4/12 22:37:35 Of 4 nodes total:
4/12 22:37:35   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
4/12 22:37:35    ===     ===      ===     ===     ===        ===      ===
4/12 22:37:35      2       0        0       0       0          1        1
4/12 22:37:35 ERROR: the following job(s) failed:
4/12 22:37:35 ---------------------- Job ----------------------
4/12 22:37:35 Node Name: B
4/12 22:37:35 NodeID: 1
4/12 22:37:35 Node Status: STATUS_ERROR
4/12 22:37:35 Error: POST Script failed with status 1
4/12 22:37:35 Job Submit File: B.submit
4/12 22:37:35 POST Script: postscript_checker B.error
4/12 22:37:35 Condor Job ID: (38.0.0)
4/12 22:37:35 Q_PARENTS: 0, <END>
4/12 22:37:35 Q_WAITING: <END>
4/12 22:37:35 Q_CHILDREN: 3, <END>
4/12 22:37:35 --------------------------------------- <END>
4/12 22:37:35 Aborting DAG...
4/12 22:37:35 Writing Rescue DAG to ABCD.dag.rescue...
4/12 22:37:35 **** condor_scheduniv_exec.36.0 (condor_DAGMAN) EXITING WITH STATUS 1
DAGMan notices that one of the jobs failed. DAGMan ran as much of the DAG as possible and logged enough information to continue the run when the situation is resolved.
Look at the rescue DAG. It is structurally the same as your original DAG, but nodes that finished are marked DONE. (DAGMan also reorganized the file.) When you submit the rescue DAG, DONE nodes will be skipped.
$ cat ABCD.dag.rescue
# Rescue DAG file, created after running
# the ABCD.dag DAG file
#
# Total number of Nodes: 4
# Nodes premarked DONE: 2
# Nodes that failed: 1
# B,<ENDLIST>
JOB A A.submit DONE
JOB B B.submit
SCRIPT POST B postscript_checker B.error
JOB C C.submit DONE
JOB D D.submit
PARENT A CHILD B C
PARENT B CHILD D
PARENT C CHILD D
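To see at a glance which nodes a rescue DAG will skip, you can grep for the DONE keyword. A minimal sketch (the rescue file contents from above are reproduced here so the example stands alone):

```shell
# Reproduce the rescue DAG from above so this example is self-contained.
cat > ABCD.dag.rescue <<'EOF'
JOB A A.submit DONE
JOB B B.submit
SCRIPT POST B postscript_checker B.error
JOB C C.submit DONE
JOB D D.submit
PARENT A CHILD B C
PARENT B CHILD D
PARENT C CHILD D
EOF

grep '^JOB .* DONE$' ABCD.dag.rescue          # nodes DAGMan will skip
grep '^JOB ' ABCD.dag.rescue | grep -cv DONE  # how many nodes still need to run
```

Here the first grep lists A and C, and the count of remaining nodes is 2 (B and D).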
So we know there is a problem with the B step. Let's "fix" it (by copying the non-failing script onto B-fail.sh).
$ rm B-fail.sh
$ cp B.sh B-fail.sh
Also delete B.error so that the POST script doesn't trigger on the message from last run.
$ rm B.error
Submitting the Rescue DAG
Now we can submit our rescue DAG. (If you didn't fix the problem, DAGMan would generate another rescue DAG, this time "ABCD.dag.rescue.rescue".) In separate windows run "tail -f --lines=500 results.log" and "tail -f --lines=500 ABCD.dag.rescue.dagman.out" to monitor the job's progress.
$ condor_submit_dag ABCD.dag.rescue
Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : ABCD.dag.rescue.condor.sub
Log of DAGMan debugging messages : ABCD.dag.rescue.dagman.out
Log of Condor library debug messages : ABCD.dag.rescue.lib.out
Log of the life of condor_dagman itself : ABCD.dag.rescue.dagman.log
Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 27.
-----------------------------------------------------------------------
Continue to watch the DAGMan output ABCD.dag.rescue.dagman.out.
Eventually it should say:
4/12 23:06:10 Of 4 nodes total:
4/12 23:06:10   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
4/12 23:06:10    ===     ===      ===     ===     ===        ===      ===
4/12 23:06:10      4       0        0       0       0          0        0
4/12 23:06:10 All jobs Completed!
4/12 23:06:10 **** condor_scheduniv_exec.40.0 (condor_DAGMAN) EXITING WITH STATUS 0
Note there is now D.output.
$ cat D.output
180
Congratulations! You're now ready for the Grid.