A Simple DAG

What is DAGMan?

Your tutorial leader will introduce you to DAGMan and DAGs. In short, DAGMan lets you submit complex sequences of jobs, as long as they can be expressed as a directed acyclic graph. For example, you may wish to run a large parameter sweep, but before the sweep runs you need to prepare your data, and after it runs you need to collate the results. For a sweep over five parameters, the DAG has one preparation node, five sweep nodes that each depend on it, and one collation node that depends on all five.
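
As a rough sketch, the parameter-sweep shape described above could be written as a DAG file like the one below. The node names and submit file names (prepare.submit, sweep1.submit, and so on) are hypothetical and are not part of this tutorial; only the Job and PARENT ... CHILD line formats are DAGMan syntax.

```shell
# Write a hypothetical DAG file for a five-way parameter sweep.
# Each "Job" line gives a node name followed by that node's submit file.
# Each "PARENT ... CHILD ..." line says the children wait for the parents.
cat > sweep.dag <<'EOF'
Job Prepare prepare.submit
Job Sweep1 sweep1.submit
Job Sweep2 sweep2.submit
Job Sweep3 sweep3.submit
Job Sweep4 sweep4.submit
Job Sweep5 sweep5.submit
Job Collate collate.submit
PARENT Prepare CHILD Sweep1 Sweep2 Sweep3 Sweep4 Sweep5
PARENT Sweep1 Sweep2 Sweep3 Sweep4 Sweep5 CHILD Collate
EOF
```

With this layout, the five sweep nodes become eligible to run only after Prepare finishes, and Collate waits for all five.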

DAGMan has many other abilities, such as throttling jobs and recovering from failures. More information about DAGMan can be found in the Condor manual.

Submitting a simple DAG

Make sure that your submit file has only one queue command in it, as it did when we first wrote it. For now we will run only vanilla universe jobs, though we could equally well run standard universe jobs.

Universe   = vanilla
Executable = simple
Arguments  = 4 10
Log        = simple.log
Output     = simple.out
Error      = simple.error
Queue

We are going to get a bit more sophisticated in submitting our jobs now. Open three windows on nova. In the first window, you will submit the job. In the second, you will watch the queue, and in the third you will watch what DAGMan does. To prepare for this, we'll create a script to help watch the queue. Name it watch_condor_q. (Where it says Ctrl-D, press the Ctrl and D keys together rather than typing the word. This ends the input for cat.)

nova 1% cat > watch_condor_q
#! /bin/sh
while true; do
     condor_q -dag
     sleep 10
done
Ctrl-D
nova 2% chmod a+x watch_condor_q 

If you like, modify watch_condor_q so it just watches your jobs, not everyone's.
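
One possible modification is sketched below. It assumes that the condor_q on your system accepts an owner name as an argument to restrict the listing, and that whoami prints your login name; check both before relying on it.

```shell
# Recreate watch_condor_q so that it lists only the current user's jobs.
# Assumption: condor_q takes an owner name as a plain argument.
cat > watch_condor_q <<'EOF'
#! /bin/sh
# Look up the login name once, then poll the queue every 10 seconds.
me=$(whoami)
while true; do
     condor_q -dag "$me"
     sleep 10
done
EOF
chmod a+x watch_condor_q
```

The quoted 'EOF' keeps the shell from expanding $(whoami) and $me while the file is being written, so they are evaluated when the script runs.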

Now we will create the smallest possible DAG: one with a single node. The Job line names the node (Simple) and the submit file that describes it (here, the file named submit).

nova 3% cat > simple.dag
Job Simple submit
Ctrl-D

In your first window, submit the job:

% rm -f simple.log simple.out 

% condor_submit_dag -force simple.dag

-----------------------------------------------------------------------
File for submitting this DAG to Condor           : simple.dag.condor.sub
Log of DAGMan debugging messages                 : simple.dag.dagman.out
Log of Condor library debug messages             : simple.dag.lib.out
Log of the life of condor_dagman itself          : simple.dag.dagman.log

Condor Log file for all Condor jobs of this DAG: simple.dag.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 6098.
-----------------------------------------------------------------------

% condor_reschedule

In the second window, watch the queue.

% ./watch_condor_q

-- Submitter: nova.cs.tau.ac.il : <132.67.192.133:49346> : nova.cs.tau.ac.il
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
4589.0   doronn          3/30 18:07  19+09:26:01 I  0   0.0  go1               
...
6098.0   alainroy       12/20 18:18   0+00:00:05 R  0   2.4  condor_dagman -f -
6099.0    |-Simple      12/20 18:18   0+00:00:00 I  0   0.0  simple 4 10       

33 jobs; 13 idle, 20 running, 0 held

-- Submitter: nova.cs.tau.ac.il : <132.67.192.133:49346> : nova.cs.tau.ac.il
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
4589.0   doronn          3/30 18:07  19+09:26:01 I  0   0.0  go1               
...
6098.0   alainroy       12/20 18:18   0+00:00:36 R  0   2.4  condor_dagman -f -
6099.0    |-Simple      12/20 18:18   0+00:00:00 R  0   0.0  simple 4 10

-- Submitter: nova.cs.tau.ac.il : <132.67.192.133:49346> : nova.cs.tau.ac.il
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
4589.0   doronn          3/30 18:07  19+09:26:01 I  0   0.0  go1               
...
6098.0   alainroy       12/20 18:18   0+00:00:47 R  0   2.4  condor_dagman -f -
Ctrl-C

In the third window, watch what DAGMan does:

% tail -f --lines=500 simple.dag.dagman.out
12/20 18:18:35 ******************************************************
12/20 18:18:35 ** condor_scheduniv_exec.6098.0 (CONDOR_DAGMAN) STARTING UP
12/20 18:18:35 ** /specific/a/home/cc/cs/condor/hosts/nova/spool/cluster6098.ickpt.subproc0
12/20 18:18:35 ** $CondorVersion: 6.6.7 Oct 11 2004 $
12/20 18:18:35 ** $CondorPlatform: I386-LINUX_RH72 $
12/20 18:18:35 ** PID = 18327
12/20 18:18:35 ******************************************************
12/20 18:18:35 Using config file: /usr/local/lib/condor/etc/condor_config
12/20 18:18:35 Using local config files: /usr/local/lib/condor/etc/nodes/condor_config.nova
12/20 18:18:35 DaemonCore: Command Socket at <132.67.192.133:55227>
12/20 18:18:35 argv[0] == "condor_scheduniv_exec.6098.0"
12/20 18:18:35 argv[1] == "-Debug"
12/20 18:18:35 argv[2] == "3"
12/20 18:18:35 argv[3] == "-Lockfile"
12/20 18:18:35 argv[4] == "simple.dag.lock"
12/20 18:18:35 argv[5] == "-Dag"
12/20 18:18:35 argv[6] == "simple.dag"
12/20 18:18:35 argv[7] == "-Rescue"
12/20 18:18:35 argv[8] == "simple.dag.rescue"
12/20 18:18:35 argv[9] == "-Condorlog"
12/20 18:18:35 argv[10] == "simple.dag.dummy_log"
12/20 18:18:35 DAG Lockfile will be written to simple.dag.lock
12/20 18:18:35 DAG Input file is simple.dag
12/20 18:18:35 Rescue DAG will be written to simple.dag.rescue
12/20 18:18:35 All DAG node user log files:
12/20 18:18:35   /specific/a/home/cc/cs/alainroy/simple.log
12/20 18:18:35 Parsing simple.dag ...
12/20 18:18:35 Dag contains 1 total jobs
12/20 18:18:35 Deleting any older versions of log files...
12/20 18:18:35 Deleting older version of /specific/a/home/cc/cs/alainroy/simple.log
12/20 18:18:36 Bootstrapping...
12/20 18:18:36 Number of pre-completed jobs: 0
12/20 18:18:36 Registering condor_event_timer...
12/20 18:18:37 Submitting Condor Job Simple ...
12/20 18:18:37 submitting: condor_submit  -a 'dag_node_name = Simple'
               -a '+DAGManJobID = 6098.0' 
               -a 'submit_event_notes = DAG Node: $(dag_node_name)' submit 2>&1
12/20 18:18:41  assigned Condor ID (6099.0.0)
12/20 18:18:41 Just submitted 1 job this cycle...
12/20 18:18:41 Event: ULOG_SUBMIT for Condor Job Simple (6099.0.0)
12/20 18:18:41 Of 1 nodes total:
12/20 18:18:41  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/20 18:18:41   ===     ===      ===     ===     ===        ===      ===
12/20 18:18:41     0       0        1       0       0          0        0
12/20 18:19:21 Event: ULOG_EXECUTE for Condor Job Simple (6099.0.0)
12/20 18:19:26 Event: ULOG_JOB_TERMINATED for Condor Job Simple (6099.0.0)
12/20 18:19:26 Job Simple completed successfully.
12/20 18:19:26 Of 1 nodes total:
12/20 18:19:26  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/20 18:19:26   ===     ===      ===     ===     ===        ===      ===
12/20 18:19:26     1       0        0       0       0          0        0
12/20 18:19:26 All jobs Completed!
12/20 18:19:26 **** condor_scheduniv_exec.6098.0 (condor_DAGMAN) EXITING WITH STATUS 0

Now verify your results:

% cat simple.log 
000 (6099.000.000) 12/20 18:18:40 Job submitted from host: <132.67.192.133:49346>
    DAG Node: Simple
...
001 (6099.000.000) 12/20 18:19:17 Job executing on host: <132.67.105.226:33875>
...
005 (6099.000.000) 12/20 18:19:21 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...

% cat simple.out 
Thinking really hard for 120 seconds...
We calculated: 20
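
Besides reading the files by eye, you could check DAGMan's own log for the final status line it printed above. The following is only a sketch: the function name dag_succeeded is made up for this example, and it assumes the "EXITING WITH STATUS 0" wording shown in simple.dag.dagman.out is what your version of DAGMan writes.

```shell
# Succeed if the DAGMan debug log named in $1 records a clean exit.
# dag_succeeded is a hypothetical helper, not a Condor command.
dag_succeeded() {
    grep -q 'EXITING WITH STATUS 0' "$1"
}

if [ -f simple.dag.dagman.out ] && dag_succeeded simple.dag.dagman.out; then
    echo "DAG finished successfully"
else
    echo "DAG has not finished, or it failed"
fi
```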

Looking at DAGMan's various files, we see that DAGMan itself ran as a Condor job (specifically, a scheduler universe job).

% ls simple.dag.*
simple.dag.condor.sub  simple.dag.dagman.log  simple.dag.dagman.out  
simple.dag.dummy_log    simple.dag.lib.out

% cat simple.dag.condor.sub
# Filename: simple.dag.condor.sub
# Generated by condor_submit_dag simple.dag
universe        = scheduler
executable      = /usr/local/bin/condor_dagman
getenv          = True
output          = simple.dag.lib.out
error           = simple.dag.lib.out
log             = simple.dag.dagman.log
remove_kill_sig = SIGUSR1
arguments       = -f -l . -Debug 3 -Lockfile simple.dag.lock -Dag simple.dag -Rescue simple.dag.rescue -Condorlog simple.dag.dummy_log
environment     = _CONDOR_DAGMAN_LOG=simple.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
queue

% cat simple.dag.dagman.log
000 (6098.000.000) 12/20 18:18:34 Job submitted from host: <132.67.192.133:49346>
...
001 (6098.000.000) 12/20 18:18:35 Job executing on host: <132.67.192.133:49346>
...
005 (6098.000.000) 12/20 18:19:26 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...

Clean up some of these files:

% rm simple.dag.*

Question: Why does DAGMan run as a Condor job?
