In this section we will learn to use DAGMan. DAGMan is a tool in Condor which allows you to submit sets of jobs with interdependencies specified between them (e.g. certain jobs may depend on the output of other jobs). DAGMan then submits the jobs in an order which satisfies those interdependencies (i.e. it submits a job only when all of that job's "parents" (the jobs it depends on) have completed). DAGMan has many other powerful features, such as PRE and POST scripts, rescue DAGs, and more.
The conceptual graph of dependencies between jobs is called a DAG (Directed Acyclic Graph), hence the name DAGMan, short for "DAG Manager". A DAG can be as simple as a single node (no dependencies), linear (job "A" depends on the outcome of job "B", job "B" depends on job "C", etc.), or a complex dependency tree in which some jobs depend on the outcomes of several other jobs.
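For example, a diamond-shaped DAG, in which nodes B and C both wait for A, and D waits for both B and C, could be described like this. (This fragment is purely illustrative; the node names and submit files are hypothetical and are not part of this tutorial's files.)

```
# diamond.dag (hypothetical): four nodes with diamond-shaped dependencies
Job A a.submit
Job B b.submit
Job C c.submit
Job D d.submit
PARENT A CHILD B C
PARENT B C CHILD D
```

Given this file, DAGMan would submit A first, submit B and C once A completes, and submit D only after both B and C complete.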
Creating a DAG Description File
Next we will create a file that describes our DAG. Create a minimal DAG for DAGMan. This DAG will have a single node. Note that the DAG description file associates a symbolic name for each job (e.g. HelloWorld) with its corresponding submit file (myjob.submit).
$ cat > mydag.dag
Job HelloWorld myjob.submit
Ctrl-D
$ cat mydag.dag
Job HelloWorld myjob.submit
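If you prefer not to end interactive input with Ctrl-D, a here-document produces the same kind of file non-interactively. A sketch (the two-node variant below is hypothetical and simply reuses myjob.submit for both nodes):

```shell
# Write a two-node DAG non-interactively; node B runs only after node A succeeds.
cat > mydag2.dag <<'EOF'
Job A myjob.submit
Job B myjob.submit
PARENT A CHILD B
EOF
cat mydag2.dag
```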
Because our DAG only has one node, we could've just as easily submitted myjob.submit via condor_submit directly, but we'll do it as a DAG for practice.
Submitting a DAG
Submit the DAG description file to Condor with condor_submit_dag, then watch the run.
$ condor_submit_dag mydag.dag
Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : mydag.dag.condor.sub
Log of DAGMan debugging messages : mydag.dag.dagman.out
Log of Condor library debug messages : mydag.dag.lib.out
Log of the life of condor_dagman itself : mydag.dag.dagman.log
Condor Log file for all Condor jobs of this DAG: results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 3.
-----------------------------------------------------------------------
$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 user?? 7/10 17:33 0+00:00:03 R 0 2.6 condor_dagman -f -
3.0 user?? 7/10 17:33 0+00:00:00 I 0 0.0 myscript.sh TestJo
2 jobs; 1 idle, 1 running, 0 held
$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
3.0 user?? UNSUBMITTED fork my-gatekeeper.cs.wisc.edu /tmp/username-cond
$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 user?? 7/10 17:33 0+00:00:33 R 0 2.6 condor_dagman -f -
3.0 user?? 7/10 17:33 0+00:00:15 R 0 0.0 myscript.sh TestJo
2 jobs; 0 idle, 2 running, 0 held
$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
3.0 user?? ACTIVE fork my-gatekeeper.cs.wisc.edu /tmp/username-cond
$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 user?? 7/10 17:33 0+00:01:03 R 0 2.6 condor_dagman -f -
3.0 user?? 7/10 17:33 0+00:00:45 R 0 0.0 myscript.sh TestJo
2 jobs; 0 idle, 2 running, 0 held
$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
3.0 user?? ACTIVE fork my-gatekeeper.cs.wisc.edu /tmp/username-cond
$ condor_q
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
$ condor_q -globus
-- Submitter: uml1.cs.wisc.edu : <128.105.185.14:35688> : uml1.cs.wisc.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
Ctrl-C
Notice that condor_dagman is running as a job itself, and that condor_dagman submits your real job without your direct intervention. You might happen to catch the "C" (completed) state as your job finishes, but that often goes by too quickly to notice.
Again, you may want to run "tail -f --lines=500 results.log" in a second window to watch the job log file as your job runs. You might also want to watch DAGMan's log file (mydag.dag.dagman.out) with "tail -f --lines=500 mydag.dag.dagman.out" in a third window. For the remainder of this tutorial, we suggest you re-run these commands whenever you submit a DAG. This will allow you to see how typical DAGs progress. Use "Ctrl-C" to stop watching a file.
Third window:
$ tail -f --lines=500 mydag.dag.dagman.out
7/10 10:36:43 ******************************************************
7/10 10:36:43 ** condor_scheduniv_exec.3.0 (CONDOR_DAGMAN) STARTING UP
7/10 10:36:43 ** $CondorVersion: 6.6.1 Feb 5 2004 $
7/10 10:36:43 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 10:36:43 ** PID = 26844
7/10 10:36:43 ******************************************************
7/10 10:36:44 DaemonCore: Command Socket at <128.105.185.14:34571>
7/10 10:36:44 argv[0] == "condor_scheduniv_exec.3.0"
7/10 10:36:44 argv[1] == "-Debug"
7/10 10:36:44 argv[2] == "3"
7/10 10:36:44 argv[3] == "-Lockfile"
7/10 10:36:44 argv[4] == "mydag.dag.lock"
7/10 10:36:44 argv[5] == "-Condorlog"
7/10 10:36:44 argv[6] == "results.log"
7/10 10:36:44 argv[7] == "-Dag"
7/10 10:36:44 argv[8] == "mydag.dag"
7/10 10:36:44 argv[9] == "-Rescue"
7/10 10:36:44 argv[10] == "mydag.dag.rescue"
7/10 10:36:44 Condor log will be written to results.log
7/10 10:36:44 DAG Lockfile will be written to mydag.dag.lock
7/10 10:36:44 DAG Input file is mydag.dag
7/10 10:36:44 Rescue DAG will be written to mydag.dag.rescue
7/10 10:36:44 Parsing mydag.dag ...
7/10 10:36:44 Dag contains 1 total jobs
7/10 10:36:44 Bootstrapping...
7/10 10:36:44 Number of pre-completed jobs: 0
7/10 10:36:44 Submitting Job HelloWorld ...
7/10 10:36:44 assigned Condor ID (7.0.0)
7/10 10:36:45 Event: ULOG_SUBMIT for Job HelloWorld (7.0.0)
7/10 10:36:45 0/1 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 10:37:05 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (7.0.0)
7/10 10:37:05 Event: ULOG_EXECUTE for Job HelloWorld (7.0.0)
7/10 10:38:10 Event: ULOG_JOB_TERMINATED for Job HelloWorld (7.0.0)
7/10 10:38:10 Job HelloWorld completed successfully.
7/10 10:38:10 1/1 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 10:38:10 All jobs Completed!
7/10 10:38:10 **** condor_scheduniv_exec.6.0 (condor_DAGMAN) EXITING WITH STATUS 0
Verify your results:
$ ls -l
total 12
-rw-r--r-- 1 user?? user??   28 Jul 10 10:35 mydag.dag
-rw-r--r-- 1 user?? user??  523 Jul 10 10:36 mydag.dag.condor.sub
-rw-r--r-- 1 user?? user??  608 Jul 10 10:38 mydag.dag.dagman.log
-rw-r--r-- 1 user?? user?? 1860 Jul 10 10:38 mydag.dag.dagman.out
-rw-r--r-- 1 user?? user??   29 Jul 10 10:38 mydag.dag.lib.out
-rw------- 1 user?? user??    0 Jul 10 10:36 mydag.dag.lock
-rw-r--r-- 1 user?? user??  175 Jul  9 18:13 myjob.submit
-rwxr-xr-x 1 user?? user??  194 Jul 10 10:36 myscript.sh
-rw-r--r-- 1 user?? user??   31 Jul 10 10:37 results.error
-rw------- 1 user?? user??  833 Jul 10 10:38 results.log
-rw-r--r-- 1 user?? user??  261 Jul 10 10:37 results.output
-rwxr-xr-x 1 user?? user??   81 Jul 10 10:35 watch_condor_q
$ cat results.error
This is sent to standard error
$ cat results.output
I'm process id 29149 on pc-26
This is sent to standard error
Thu Jul 10 10:38:44 CDT 2003
Running as binary /home/user??/.globus/.gass_cache/local/md5/aa/ceb9e04077256aaa2acf4dff670897/md5/27/2f50da149fc049d07b1c27f30b67df/data
TEST 1
My name (argument 1) is TEST
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished. Exiting
RESULT: 0 SUCCESS
Looking at DAGMan's various files, we see that DAGMan itself ran as a Condor job (specifically, a Scheduler universe job). You can even see its submit and log files:
$ ls
mydag.dag             mydag.dag.dagman.log  mydag.dag.lib.out  myjob.submit  results.error  results.output
mydag.dag.condor.sub  mydag.dag.dagman.out  mydag.dag.lock     myscript.sh   results.log    watch_condor_q
$ cat mydag.dag.condor.sub
# Filename: mydag.dag.condor.sub
# Generated by condor_submit_dag mydag.dag
universe        = scheduler
executable      = /opt/vdt1.1.13/condor/bin/condor_dagman
getenv          = True
output          = mydag.dag.lib.out
error           = mydag.dag.lib.out
log             = mydag.dag.dagman.log
remove_kill_sig = SIGUSR1
arguments       = -f -l . -Debug 3 -Lockfile mydag.dag.lock -Condorlog results.log -Dag mydag.dag -Rescue mydag.dag.rescue
environment     = _CONDOR_DAGMAN_LOG=mydag.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
queue
$ cat mydag.dag.dagman.log
000 (006.000.000) 07/10 10:36:43 Job submitted from host: <128.105.185.14:33785>
...
001 (006.000.000) 07/10 10:36:44 Job executing on host: <128.105.185.14:33785>
...
005 (006.000.000) 07/10 10:38:10 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
If you weren't watching the DAGMan output file with tail -f, you can examine the file with the following command:
$ cat mydag.dag.dagman.out
Clean up your results. Be careful with the glob: you want to delete the generated mydag.dag.* files (mydag.dag.condor.sub, mydag.dag.lock, and so on), not the mydag.dag file itself.
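Shell globbing is why this deletion is safe: mydag.dag.* requires at least one character after the second dot, so it matches the generated files but never mydag.dag itself. A quick sketch you can try in a scratch directory (the file names here are illustrative stand-ins for the real generated files):

```shell
# Create a scratch directory containing a DAG file plus two "generated" files,
# then show that the glob expands to the generated files only.
mkdir -p /tmp/dag-glob-demo
cd /tmp/dag-glob-demo
touch mydag.dag mydag.dag.condor.sub mydag.dag.lock
echo mydag.dag.*   # expands to the generated files only, not mydag.dag
```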
$ rm mydag.dag.* results.*