Condor-G and DAGMan Hands-On Lab

 

Part IV: A Simple DAG

In this section we will learn to use DAGMan. DAGMan is a tool in Condor which allows you to submit sets of jobs with inter-dependencies specified between them (e.g. certain jobs may depend on the output of other jobs). DAGMan then submits the jobs in an order that satisfies their interdependencies (i.e. it only submits a job once all of that job's "parents" (the jobs it depends on) have completed). DAGMan has many other powerful features, such as PRE and POST scripts, rescue DAGs, and more.

The conceptual graph of dependencies between jobs is called a DAG (Directed Acyclic Graph), hence the name DAGMan - "DAG Manager". DAGs can be as simple as one node (no dependencies), linear (job "A" depends on the output of job "B", job "B" depends on job "C", etc.), or complex dependency trees where some jobs depend on the outcomes of several other jobs.
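
For example, a small "diamond" DAG, in which jobs B and C each depend on job A and job D depends on both B and C, could be described like this (all node names, submit files, and scripts below are hypothetical):

# B and C wait for A; D waits for both B and C
Job A a.submit
Job B b.submit
Job C c.submit
Job D d.submit
PARENT A CHILD B C
PARENT B C CHILD D
# Optional PRE/POST scripts run before and after a node's job
SCRIPT PRE A prepare_input.sh
SCRIPT POST D check_output.sh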

Creating a DAG Description File

Next we will create a file that describes our DAG. Create a minimal DAG for DAGMan. This DAG will have a single node. Note that the DAG description file associates symbolic names of jobs (e.g. HelloWorld) with their corresponding submit files (myjob.submit).

$ cat > mydag.dag
Job HelloWorld myjob.submit
Ctrl-D
$ cat mydag.dag
Job HelloWorld myjob.submit

Because our DAG only has one node, we could've just as easily submitted myjob.submit via condor_submit directly, but we'll do it as a DAG for practice.
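
For reference, that direct submission would be the one-liner below. There is no need to run it now, since we are about to submit through DAGMan instead.

$ condor_submit myjob.submit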

Submitting a DAG

Submit the DAG description file to Condor with condor_submit_dag, then watch the run.

$ condor_submit_dag mydag.dag

Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : mydag.dag.condor.sub
Log of DAGMan debugging messages : mydag.dag.dagman.out
Log of Condor library debug messages : mydag.dag.lib.out
Log of the life of condor_dagman itself : mydag.dag.dagman.log

Condor Log file for all Condor jobs of this DAG: results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 3.
-----------------------------------------------------------------------
$ condor_q 

-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   user??         7/10 17:33   0+00:00:03 R  0   2.6  condor_dagman -f -
   3.0   user??         7/10 17:33   0+00:00:00 I  0   0.0  myscript.sh TestJo

2 jobs; 1 idle, 1 running, 0 held

$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   3.0   user??       UNSUBMITTED fork     my-gatekeeper.cs.wisc.edu   /tmp/username-cond

$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   user??         7/10 17:33   0+00:00:33 R  0   2.6  condor_dagman -f -
   3.0   user??         7/10 17:33   0+00:00:15 R  0   0.0  myscript.sh TestJo

2 jobs; 0 idle, 2 running, 0 held

$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   3.0   user??       ACTIVE      fork     my-gatekeeper.cs.wisc.edu   /tmp/username-cond

$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   user??         7/10 17:33   0+00:01:03 R  0   2.6  condor_dagman -f -
   3.0   user??         7/10 17:33   0+00:00:45 R  0   0.0  myscript.sh TestJo

2 jobs; 0 idle, 2 running, 0 held

$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   3.0   user??       ACTIVE      fork     my-gatekeeper.cs.wisc.edu   /tmp/username-cond

$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        

Ctrl-C

Notice that condor_dagman is running as a job itself, and that condor_dagman submits your real job without your direct intervention. You might happen to catch the "C" (completed) state as your job finishes, but that often goes by too quickly to notice.
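
Because the whole DAG is driven by the condor_dagman job, removing that one job takes down the entire DAG: DAGMan cleans up by removing any node jobs it has already submitted. For example, in the run above (where condor_dagman ran as cluster 2), the removal command would be the following - illustrative only, don't run it against a DAG you want to finish:

$ condor_rm 2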

Again, you may want to run "tail -f --lines=500 results.log" in a second window to watch the job log file as your job runs. You might also want to watch DAGMan's own log file with "tail -f --lines=500 mydag.dag.dagman.out" in a third window. For the remainder of this tutorial, we suggest you re-run these commands whenever you submit a DAG; this will let you see how typical DAGs progress. Use "Ctrl-C" to stop watching the files.

Third window:

$ tail -f --lines=500 mydag.dag.dagman.out
7/10 10:36:43 ******************************************************
7/10 10:36:43 ** condor_scheduniv_exec.3.0 (CONDOR_DAGMAN) STARTING UP
7/10 10:36:43 ** $CondorVersion: 6.6.1 Feb 5 2004 $
7/10 10:36:43 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 10:36:43 ** PID = 26844
7/10 10:36:43 ******************************************************
7/10 10:36:44 DaemonCore: Command Socket at <128.105.185.14:34571>
7/10 10:36:44 argv[0] == "condor_scheduniv_exec.3.0"
7/10 10:36:44 argv[1] == "-Debug"
7/10 10:36:44 argv[2] == "3"
7/10 10:36:44 argv[3] == "-Lockfile"
7/10 10:36:44 argv[4] == "mydag.dag.lock"
7/10 10:36:44 argv[5] == "-Condorlog"
7/10 10:36:44 argv[6] == "results.log"
7/10 10:36:44 argv[7] == "-Dag"
7/10 10:36:44 argv[8] == "mydag.dag"
7/10 10:36:44 argv[9] == "-Rescue"
7/10 10:36:44 argv[10] == "mydag.dag.rescue"
7/10 10:36:44 Condor log will be written to results.log
7/10 10:36:44 DAG Lockfile will be written to mydag.dag.lock
7/10 10:36:44 DAG Input file is mydag.dag
7/10 10:36:44 Rescue DAG will be written to mydag.dag.rescue
7/10 10:36:44 Parsing mydag.dag ...
7/10 10:36:44 Dag contains 1 total jobs
7/10 10:36:44 Bootstrapping...
7/10 10:36:44 Number of pre-completed jobs: 0
7/10 10:36:44 Submitting Job HelloWorld ...
7/10 10:36:44 	assigned Condor ID (7.0.0)
7/10 10:36:45 Event: ULOG_SUBMIT for Job HelloWorld (7.0.0)
7/10 10:36:45 0/1 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 10:37:05 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (7.0.0)
7/10 10:37:05 Event: ULOG_EXECUTE for Job HelloWorld (7.0.0)
7/10 10:38:10 Event: ULOG_JOB_TERMINATED for Job HelloWorld (7.0.0)
7/10 10:38:10 Job HelloWorld completed successfully.
7/10 10:38:10 1/1 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 10:38:10 All jobs Completed!
7/10 10:38:10 **** condor_scheduniv_exec.6.0 (condor_DAGMAN) EXITING WITH STATUS 0

Verify your results:

$ ls -l
total 12
-rw-r--r--    1 user??  user??        28 Jul 10 10:35 mydag.dag
-rw-r--r--    1 user??  user??       523 Jul 10 10:36 mydag.dag.condor.sub
-rw-r--r--    1 user??  user??       608 Jul 10 10:38 mydag.dag.dagman.log
-rw-r--r--    1 user??  user??      1860 Jul 10 10:38 mydag.dag.dagman.out
-rw-r--r--    1 user??  user??        29 Jul 10 10:38 mydag.dag.lib.out
-rw-------    1 user??  user??         0 Jul 10 10:36 mydag.dag.lock
-rw-r--r--    1 user??  user??       175 Jul  9 18:13 myjob.submit
-rwxr-xr-x    1 user??  user??       194 Jul 10 10:36 myscript.sh
-rw-r--r--    1 user??  user??        31 Jul 10 10:37 results.error
-rw-------    1 user??  user??       833 Jul 10 10:38 results.log
-rw-r--r--    1 user??  user??       261 Jul 10 10:37 results.output
-rwxr-xr-x    1 user??  user??        81 Jul 10 10:35 watch_condor_q
$ cat results.error 
This is sent to standard error
$ cat results.output 
I'm process id 29149 on pc-26
This is sent to standard error
Thu Jul 10 10:38:44 CDT 2003
Running as binary /home/user??/.globus/.gass_cache/local/md5/aa/ceb9e04077256aaa2acf4dff670897/md5/27/2f50da149fc049d07b1c27f30b67df/data TEST 1
My name (argument 1) is TEST
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.  Exiting
RESULT: 0 SUCCESS

Looking at DAGMan's various files, we see that DAGMan itself ran as a Condor job (specifically, a Scheduler universe job). You can even see its submit and log files:

$ ls
mydag.dag	      mydag.dag.dagman.log  mydag.dag.lib.out  myjob.submit  results.error  results.output
mydag.dag.condor.sub  mydag.dag.dagman.out  mydag.dag.lock     myscript.sh   results.log    watch_condor_q
$ cat mydag.dag.condor.sub
# Filename: mydag.dag.condor.sub
# Generated by condor_submit_dag mydag.dag
universe	= scheduler
executable	= /opt/vdt1.1.13/condor/bin/condor_dagman
getenv		= True
output		= mydag.dag.lib.out
error		= mydag.dag.lib.out
log		= mydag.dag.dagman.log
remove_kill_sig	= SIGUSR1
arguments	= -f -l . -Debug 3 -Lockfile mydag.dag.lock -Condorlog results.log -Dag mydag.dag -Rescue mydag.dag.rescue
environment	= _CONDOR_DAGMAN_LOG=mydag.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
queue
$ cat mydag.dag.dagman.log
000 (006.000.000) 07/10 10:36:43 Job submitted from host: <128.105.185.14:33785>
...
001 (006.000.000) 07/10 10:36:44 Job executing on host: <128.105.185.14:33785>
...
005 (006.000.000) 07/10 10:38:10 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job
...
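
Also notice the -Rescue mydag.dag.rescue argument in DAGMan's submit file above. Had a node failed, DAGMan would have written that rescue DAG (a copy of your DAG with the already-completed nodes marked DONE), and you could resume from the point of failure by submitting the rescue file itself. No rescue file exists after our successful run, so the command below is only illustrative:

$ condor_submit_dag mydag.dag.rescue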

If you weren't watching the DAGMan output file with tail -f, you can examine the file with the following command:

$ cat mydag.dag.dagman.out

Clean up your results. Be careful when deleting the mydag.dag.* files: you do not want to delete the mydag.dag file itself, just mydag.dag.* (the trailing dot in the pattern keeps mydag.dag safe).
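
As a sanity check, you can first list exactly what those globs match:

$ ls mydag.dag.* results.*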

$ rm mydag.dag.* results.*