Condor-G and DAGMan Hands-On Lab

 

Part IV: A Simple DAG

In this section we will learn to use DAGMan. DAGMan is a tool in Condor which allows you to submit sets of jobs with inter-dependencies specified between them (e.g. certain jobs may depend on the output of other jobs). DAGMan then submits the jobs in an order that satisfies their interdependencies (i.e. it only submits a job once all of that job's "parents" (the jobs it depends on) have completed). DAGMan has many other powerful features, such as PRE and POST scripts, rescue DAGs, and more.

The conceptual graph of dependencies between jobs is called a DAG (Directed Acyclic Graph), hence the name DAGMan - "DAG Manager". DAGs can be as simple as one node (no dependencies), linear (job "A" depends on the output of job "B", job "B" depends on job "C", etc.), or complex dependency trees where some jobs depend on the outcomes of several other jobs.
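
For example, a small "diamond" DAG, in which jobs B and C each depend on job A and job D depends on both B and C, could be described like this (all node names, submit files, and scripts below are hypothetical):

# B and C wait for A; D waits for both B and C
Job A a.submit
Job B b.submit
Job C c.submit
Job D d.submit
PARENT A CHILD B C
PARENT B C CHILD D
# Optional PRE/POST scripts run before and after a node's job
SCRIPT PRE A prepare_input.sh
SCRIPT POST D check_output.sh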

Creating a DAG Description File

Next we will create a file that describes our DAG. Create a minimal DAG for DAGMan. This DAG will have a single node. Note that the DAG description file associates symbolic names of jobs (e.g. HelloWorld) with their corresponding submit files (myjob.submit).

$ cat > mydag.dag
Job HelloWorld myjob.submit
Ctrl-D
$ cat mydag.dag
Job HelloWorld myjob.submit

Because our DAG only has one node, we could've just as easily submitted myjob.submit via condor_submit directly, but we'll do it as a DAG for practice.
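
For reference, that direct submission would be the one-liner below. There is no need to run it now, since we are about to submit through DAGMan instead.

$ condor_submit myjob.submit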

Submitting a DAG

Submit the DAG description file to Condor with condor_submit_dag, then watch the run.

$ condor_submit_dag mydag.dag

Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : mydag.dag.condor.sub
Log of DAGMan debugging messages : mydag.dag.dagman.out
Log of Condor library debug messages : mydag.dag.lib.out
Log of the life of condor_dagman itself : mydag.dag.dagman.log

Condor Log file for all Condor jobs of this DAG: results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 3.
-----------------------------------------------------------------------
$ condor_q 

-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   user??         7/10 17:33   0+00:00:03 R  0   2.6  condor_dagman -f -
   3.0   user??         7/10 17:33   0+00:00:00 I  0   0.0  myscript.sh TestJo

2 jobs; 1 idle, 1 running, 0 held

$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   3.0   user??       UNSUBMITTED fork     my-gatekeeper.cs.wisc.edu   /tmp/username-cond

$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   user??         7/10 17:33   0+00:00:33 R  0   2.6  condor_dagman -f -
   3.0   user??         7/10 17:33   0+00:00:15 R  0   0.0  myscript.sh TestJo

2 jobs; 0 idle, 2 running, 0 held

$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   3.0   user??       ACTIVE      fork     my-gatekeeper.cs.wisc.edu   /tmp/username-cond

$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   user??         7/10 17:33   0+00:01:03 R  0   2.6  condor_dagman -f -
   3.0   user??         7/10 17:33   0+00:00:45 R  0   0.0  myscript.sh TestJo

2 jobs; 0 idle, 2 running, 0 held

$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   3.0   user??       ACTIVE      fork     my-gatekeeper.cs.wisc.edu   /tmp/username-cond

$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        

Ctrl-C

Notice that condor_dagman is running as a job itself, and that condor_dagman submits your real job without your direct intervention. You might happen to catch the "C" (completed) state as your job finishes, but that often goes by too quickly to notice.
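
Because the whole DAG is driven by the condor_dagman job, removing that one job takes down the entire DAG: DAGMan cleans up by removing any node jobs it has already submitted. For example, in the run above (where condor_dagman ran as cluster 2), the removal command would be the following - illustrative only, don't run it against a DAG you want to finish:

$ condor_rm 2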

Again, you may want to run "tail -f --lines=500 results.log" in a second window to watch the job log file as your job runs. You might also want to watch DAGMan's own log file with "tail -f --lines=500 mydag.dag.dagman.out" in a third window. For the remainder of this tutorial, we suggest you re-run these commands whenever you submit a DAG; this will let you see how typical DAGs progress. Use "Ctrl-C" to stop watching the files.

Third window:

$ tail -f --lines=500 mydag.dag.dagman.out
7/10 10:36:43 ******************************************************
7/10 10:36:43 ** condor_scheduniv_exec.3.0 (CONDOR_DAGMAN) STARTING UP
7/10 10:36:43 ** $CondorVersion: 6.6.1 Feb 5 2004 $
7/10 10:36:43 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 10:36:43 ** PID = 26844
7/10 10:36:43 ******************************************************
7/10 10:36:44 DaemonCore: Command Socket at <128.105.185.14:34571>
7/10 10:36:44 argv[0] == "condor_scheduniv_exec.3.0"
7/10 10:36:44 argv[1] == "-Debug"
7/10 10:36:44 argv[2] == "3"
7/10 10:36:44 argv[3] == "-Lockfile"
7/10 10:36:44 argv[4] == "mydag.dag.lock"
7/10 10:36:44 argv[5] == "-Condorlog"
7/10 10:36:44 argv[6] == "results.log"
7/10 10:36:44 argv[7] == "-Dag"
7/10 10:36:44 argv[8] == "mydag.dag"
7/10 10:36:44 argv[9] == "-Rescue"
7/10 10:36:44 argv[10] == "mydag.dag.rescue"
7/10 10:36:44 Condor log will be written to results.log
7/10 10:36:44 DAG Lockfile will be written to mydag.dag.lock
7/10 10:36:44 DAG Input file is mydag.dag
7/10 10:36:44 Rescue DAG will be written to mydag.dag.rescue
7/10 10:36:44 Parsing mydag.dag ...
7/10 10:36:44 Dag contains 1 total jobs
7/10 10:36:44 Bootstrapping...
7/10 10:36:44 Number of pre-completed jobs: 0
7/10 10:36:44 Submitting Job HelloWorld ...
7/10 10:36:44 	assigned Condor ID (7.0.0)
7/10 10:36:45 Event: ULOG_SUBMIT for Job HelloWorld (7.0.0)
7/10 10:36:45 0/1 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 10:37:05 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (7.0.0)
7/10 10:37:05 Event: ULOG_EXECUTE for Job HelloWorld (7.0.0)
7/10 10:38:10 Event: ULOG_JOB_TERMINATED for Job HelloWorld (7.0.0)
7/10 10:38:10 Job HelloWorld completed successfully.
7/10 10:38:10 1/1 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 10:38:10 All jobs Completed!
7/10 10:38:10 **** condor_scheduniv_exec.6.0 (condor_DAGMAN) EXITING WITH STATUS 0

Verify your results:

$ ls -l
total 12
-rw-r--r--    1 user??  user??        28 Jul 10 10:35 mydag.dag
-rw-r--r--    1 user??  user??       523 Jul 10 10:36 mydag.dag.condor.sub
-rw-r--r--    1 user??  user??       608 Jul 10 10:38 mydag.dag.dagman.log
-rw-r--r--    1 user??  user??      1860 Jul 10 10:38 mydag.dag.dagman.out
-rw-r--r--    1 user??  user??        29 Jul 10 10:38 mydag.dag.lib.out
-rw-------    1 user??  user??         0 Jul 10 10:36 mydag.dag.lock
-rw-r--r--    1 user??  user??       175 Jul  9 18:13 myjob.submit
-rwxr-xr-x    1 user??  user??       194 Jul 10 10:36 myscript.sh
-rw-r--r--    1 user??  user??        31 Jul 10 10:37 results.error
-rw-------    1 user??  user??       833 Jul 10 10:38 results.log
-rw-r--r--    1 user??  user??       261 Jul 10 10:37 results.output
-rwxr-xr-x    1 user??  user??        81 Jul 10 10:35 watch_condor_q
$ cat results.error 
This is sent to standard error
$ cat results.output 
I'm process id 29149 on pc-26
This is sent to standard error
Thu Jul 10 10:38:44 CDT 2003
Running as binary /home/user??/.globus/.gass_cache/local/md5/aa/ceb9e04077256aaa2acf4dff670897/md5/27/2f50da149fc049d07b1c27f30b67df/data TEST 1
My name (argument 1) is TEST
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.  Exiting
RESULT: 0 SUCCESS

Looking at DAGMan's various files, we see that DAGMan itself ran as a Condor job (specifically, a Scheduler universe job). You can even see its submit and log files:

$ ls
mydag.dag	      mydag.dag.dagman.log  mydag.dag.lib.out  myjob.submit  results.error  results.output
mydag.dag.condor.sub  mydag.dag.dagman.out  mydag.dag.lock     myscript.sh   results.log    watch_condor_q
$ cat mydag.dag.condor.sub
# Filename: mydag.dag.condor.sub
# Generated by condor_submit_dag mydag.dag
universe	= scheduler
executable	= /opt/vdt1.1.13/condor/bin/condor_dagman
getenv		= True
output		= mydag.dag.lib.out
error		= mydag.dag.lib.out
log		= mydag.dag.dagman.log
remove_kill_sig	= SIGUSR1
arguments	= -f -l . -Debug 3 -Lockfile mydag.dag.lock -Condorlog results.log -Dag mydag.dag -Rescue mydag.dag.rescue
environment	= _CONDOR_DAGMAN_LOG=mydag.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
queue
$ cat mydag.dag.dagman.log
000 (006.000.000) 07/10 10:36:43 Job submitted from host: <128.105.185.14:33785>
...
001 (006.000.000) 07/10 10:36:44 Job executing on host: <128.105.185.14:33785>
...
005 (006.000.000) 07/10 10:38:10 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job
...
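
Also notice the -Rescue mydag.dag.rescue argument in DAGMan's submit file above. Had a node failed, DAGMan would have written that rescue DAG (a copy of your DAG with the already-completed nodes marked DONE), and you could resume from the point of failure by submitting the rescue file itself. No rescue file exists after our successful run, so the command below is only illustrative:

$ condor_submit_dag mydag.dag.rescue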

If you weren't watching the DAGMan output file with tail -f, you can examine the file with the following command:

$ cat mydag.dag.dagman.out

Clean up your results. Be careful when deleting the mydag.dag.* files: you do not want to delete the mydag.dag file itself, just mydag.dag.* (the trailing dot in the pattern keeps mydag.dag safe).
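
As a sanity check, you can first list exactly what those globs match:

$ ls mydag.dag.* results.*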

$ rm mydag.dag.* results.*