Title:	Condor Practical
Subtitle:	A More Complex DAG
Tutor:	Alain Roy
Authors:	Alain Roy

7.0 A More Complex DAG

Each job in a DAGMan DAG must have exactly one queue command in its submit file, so a DAG with multiple jobs has one submit file per job. In theory you can reuse a submit file across nodes if you are careful and use the $(Cluster) macro, but that is rarely desirable. We will now make a DAG with four nodes: a setup node, two analysis nodes, and a cleanup node. For now all of these nodes will do the same thing, but hopefully the principle will be clear.

First, make sure that your submit file has only one queue command in it, as when we first wrote it:

Universe                = vanilla
Executable              = simple
Arguments               = 4 10
Log                     = simple.log
Output                  = simple.out
Error                   = simple.error
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
Queue

Now make four copies of the submit file, one per node:

% cp submit job.setup.submit
% cp submit job.work1.submit
% cp submit job.work2.submit
% cp submit job.finalize.submit

Edit each of the new submit files. Change the output and error entries (but not the log entry) to point to results.NODE.output and results.NODE.error, where NODE is the middle word of the submit file name (job.NODE.submit). So job.finalize.submit would include:

Output = results.finalize.output
Error  = results.finalize.error

Here is one possible set of settings for the output entries:

% grep -i  '^output' job.*.submit
job.finalize.submit:Output = results.finalize.output
job.setup.submit:Output = results.setup.output
job.work1.submit:Output = results.work1.output
job.work2.submit:Output = results.work2.output

This is important so that the various nodes don't overwrite each other's output.

Leave the log entries alone. Older versions of DAGMan require that all nodes write their logs to the same file; Condor ensures that the different jobs do not overwrite each other's entries in the log. Newer versions of DAGMan lift this requirement and allow each job to use its own log file, but you may want to use one common log file anyway because it is convenient to have all of your job status information in a single place.

Log = simple.log

Also change the arguments entries so that the second argument is unique to each node. This way each of our jobs will calculate something different and we can tell their outputs apart.

For job work1, change the second argument to 11 so that it looks something like:

Arguments = 4 11
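
The copy-and-edit steps above can also be scripted. The following is a minimal sketch, not part of the original exercise: it assumes the base submit file is named submit, as in the cp commands above, and writes the one shown earlier if it is missing. The second arguments 12 and 13 for work2 and finalize are an arbitrary choice; any distinct values will do.

```shell
# Recreate the base submit file from above only if it is not already here.
if [ ! -f submit ]; then
cat > submit <<'EOF'
Universe                = vanilla
Executable              = simple
Arguments               = 4 10
Log                     = simple.log
Output                  = simple.out
Error                   = simple.error
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
Queue
EOF
fi

# node:second-argument pairs. 12 and 13 are assumed values, not from the text.
for pair in setup:10 work1:11 work2:12 finalize:13; do
    node=${pair%%:*}    # text before the colon
    arg=${pair##*:}     # text after the colon
    # Rewrite Output, Error and Arguments; every other line (including Log)
    # is copied unchanged.
    sed -e "s/^Output[[:space:]]*=.*/Output = results.$node.output/" \
        -e "s/^Error[[:space:]]*=.*/Error  = results.$node.error/" \
        -e "s/^Arguments[[:space:]]*=.*/Arguments = 4 $arg/" \
        submit > "job.$node.submit"
done

# Show the entries that matter in each generated file.
grep -i -E '^(output|error|arguments|log)' job.*.submit
```

The final grep is the same kind of check as the one shown above, extended to the error, arguments, and log entries.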

Now construct your DAG in a file called simple.dag:

Job  Setup job.setup.submit
Job  Work1 job.work1.submit
Job  Work2 job.work2.submit
Job  Final job.finalize.submit

PARENT Setup CHILD Work1 Work2
PARENT Work1 Work2 CHILD Final
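
Each PARENT ... CHILD ... line links every listed parent to every listed child, so the two lines above describe four dependencies in a diamond shape. As an illustration only, the sketch below writes the same simple.dag and uses awk to print one line per dependency. It is a reading aid, not a DAGMan tool, and it handles only the Job and PARENT/CHILD lines used here.

```shell
# Write the DAG file exactly as shown above.
cat > simple.dag <<'EOF'
Job  Setup job.setup.submit
Job  Work1 job.work1.submit
Job  Work2 job.work2.submit
Job  Final job.finalize.submit

PARENT Setup CHILD Work1 Work2
PARENT Work1 Work2 CHILD Final
EOF

# Expand each PARENT ... CHILD ... line into one "parent -> child" pair.
edges=$(awk '
toupper($1) == "PARENT" {
    n = 0; seen_child = 0
    for (i = 2; i <= NF; i++) {
        if (toupper($i) == "CHILD") { seen_child = 1; continue }
        if (!seen_child) { parents[++n] = $i; continue }
        for (p = 1; p <= n; p++) print parents[p] " -> " $i
    }
}' simple.dag)
printf '%s\n' "$edges"
```

Final appears as a child of both Work1 and Work2, which is why DAGMan waits for both of them before submitting it.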

Submit your new DAG and monitor it.

% condor_submit_dag simple.dag

Checking all your submit files for log file names.
This might take a while... 
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : simple.dag.condor.sub
Log of DAGMan debugging messages                 : simple.dag.dagman.out
Log of Condor library debug messages             : simple.dag.lib.out
Log of the life of condor_dagman itself          : simple.dag.dagman.log

Condor Log file for all jobs of this DAG         : /home/users/roy/condor-test/simple.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 29.
-----------------------------------------------------------------------

%  ./watch_condor_q 

DAGMan runs and submits the first job:
-- Submitter: osg-edu.cs.wisc.edu : <193.206.208.141:9603> : osg-edu.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  20.0   aroy            2/4  23:20   0+00:00:01 R  0   9.8  condor_dagman -f -
  21.0    |-Setup        2/4  23:20   0+00:00:00 I  0   9.8  simple 4 10       

2 jobs; 1 idle, 1 running, 0 held


That first job starts:
-- Submitter: osg-edu.cs.wisc.edu : <193.206.208.141:9603> : osg-edu.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  20.0   aroy            2/4  23:20   0+00:00:01 R  0   9.8  condor_dagman -f -
  21.0    |-Setup        2/4  23:20   0+00:00:02 R  0   9.8  simple 4 10       

2 jobs; 0 idle, 2 running, 0 held


The first job finishes, but DAGMan hasn't reacted yet:
-- Submitter: osg-edu.cs.wisc.edu : <193.206.208.141:9603> : osg-edu.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  20.0   aroy            2/4  23:20   0+00:00:31 R  0   9.8  condor_dagman -f -

1 jobs; 0 idle, 1 running, 0 held

The next two are submitted.
-- Submitter: osg-edu.cs.wisc.edu : <193.206.208.141:9603> : osg-edu.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  20.0   aroy            2/4  23:20   0+00:00:41 R  0   9.8  condor_dagman -f -
  22.0    |-Work1        2/4  23:21   0+00:00:00 I  0   9.8  simple 4 11       
  23.0    |-Work2        2/4  23:21   0+00:00:00 I  0   9.8  simple 4 10       

3 jobs; 2 idle, 1 running, 0 held


The next two jobs start up.
-- Submitter: osg-edu.cs.wisc.edu : <193.206.208.141:9603> : osg-edu.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  20.0   aroy            2/4  23:20   0+00:00:41 R  0   9.8  condor_dagman -f -
  22.0    |-Work1        2/4  23:21   0+00:00:01 R  0   9.8  simple 4 11       
  23.0    |-Work2        2/4  23:21   0+00:00:01 R  0   9.8  simple 4 10       

3 jobs; 0 idle, 3 running, 0 held


Those jobs have finished, and DAGMan will notice soon:
-- Submitter: osg-edu.cs.wisc.edu : <193.206.208.141:9603> : osg-edu.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  20.0   aroy            2/4  23:20   0+00:01:02 R  0   9.8  condor_dagman -f -

1 jobs; 0 idle, 1 running, 0 held


Now our final node is running:
-- Submitter: osg-edu.cs.wisc.edu : <193.206.208.141:9603> : osg-edu.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  20.0   aroy            2/4  23:20   0+00:01:32 R  0   9.8  condor_dagman -f -
  24.0    |-Final        2/4  23:21   0+00:00:04 R  0   9.8  simple 4 10       

2 jobs; 0 idle, 2 running, 0 held

All finished!
-- Submitter: osg-edu.cs.wisc.edu : <193.206.208.141:9603> : osg-edu.cs.wisc.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

You can see that the Final node wasn't run until after the Work nodes, which were not run until after the Setup node.

Examine your results:

% tail --lines=500 results.*.output
==> results.finalize.output <==
Thinking really hard for 4 seconds...
We calculated: 26

==> results.setup.output <==
Thinking really hard for 4 seconds...
We calculated: 20

==> results.work1.output <==
Thinking really hard for 4 seconds...
We calculated: 22

==> results.work2.output <==
Thinking really hard for 4 seconds...
We calculated: 24

Examine your log:

% cat simple.log
000 (021.000.000) 02/04 23:20:28 Job submitted from host: <193.206.208.141:9603>
    DAG Node: Setup
...
001 (021.000.000) 02/04 23:20:53 Job executing on host: <193.206.208.205:9674>
...
005 (021.000.000) 02/04 23:20:57 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        56  -  Run Bytes Sent By Job
        5052  -  Run Bytes Received By Job
        56  -  Total Bytes Sent By Job
        5052  -  Total Bytes Received By Job
...
000 (022.000.000) 02/04 23:21:03 Job submitted from host: <193.206.208.141:9603>
    DAG Node: Work1
...
000 (023.000.000) 02/04 23:21:03 Job submitted from host: <193.206.208.141:9603>
    DAG Node: Work2
...
001 (022.000.000) 02/04 23:21:13 Job executing on host: <193.206.208.205:9674>
...
001 (023.000.000) 02/04 23:21:15 Job executing on host: <193.206.208.214:9638>
...
005 (022.000.000) 02/04 23:21:17 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        56  -  Run Bytes Sent By Job
        5052  -  Run Bytes Received By Job
        56  -  Total Bytes Sent By Job
        5052  -  Total Bytes Received By Job
...
005 (023.000.000) 02/04 23:21:19 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        56  -  Run Bytes Sent By Job
        5052  -  Run Bytes Received By Job
        56  -  Total Bytes Sent By Job
        5052  -  Total Bytes Received By Job
...
000 (024.000.000) 02/04 23:21:28 Job submitted from host: <193.206.208.141:9603>
    DAG Node: Final
...
001 (024.000.000) 02/04 23:21:56 Job executing on host: <193.206.208.205:9674>
...
005 (024.000.000) 02/04 23:22:00 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        56  -  Run Bytes Sent By Job
        5052  -  Run Bytes Received By Job
        56  -  Total Bytes Sent By Job
        5052  -  Total Bytes Received By Job
...
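
The log interleaves events from all four jobs, so a one-line-per-event timeline makes the ordering easier to check. The sketch below is an illustration under stated assumptions: it works on an excerpt of the log above saved under the hypothetical name excerpt.log, and it labels only the three event codes that appear here (000, 001, 005).

```shell
# Excerpt of simple.log from above: event headers and DAG Node lines only.
cat > excerpt.log <<'EOF'
000 (021.000.000) 02/04 23:20:28 Job submitted from host: <193.206.208.141:9603>
    DAG Node: Setup
...
001 (021.000.000) 02/04 23:20:53 Job executing on host: <193.206.208.205:9674>
...
005 (021.000.000) 02/04 23:20:57 Job terminated.
...
000 (022.000.000) 02/04 23:21:03 Job submitted from host: <193.206.208.141:9603>
    DAG Node: Work1
...
000 (023.000.000) 02/04 23:21:03 Job submitted from host: <193.206.208.141:9603>
    DAG Node: Work2
...
001 (022.000.000) 02/04 23:21:13 Job executing on host: <193.206.208.205:9674>
...
001 (023.000.000) 02/04 23:21:15 Job executing on host: <193.206.208.214:9638>
...
005 (022.000.000) 02/04 23:21:17 Job terminated.
...
005 (023.000.000) 02/04 23:21:19 Job terminated.
...
000 (024.000.000) 02/04 23:21:28 Job submitted from host: <193.206.208.141:9603>
    DAG Node: Final
...
001 (024.000.000) 02/04 23:21:56 Job executing on host: <193.206.208.205:9674>
...
005 (024.000.000) 02/04 23:22:00 Job terminated.
...
EOF

# Print "time job node event" for every event header in the log.
timeline=$(awk '
/^[0-9][0-9][0-9] \(/ {
    id = $2
    gsub(/[()]/, "", id)                    # strip the parentheses
    split(id, part, ".")
    job = (part[1] + 0) "." (part[2] + 0)   # 021.000.000 becomes 21.0
    if ($1 == "000") what = "submitted"
    else if ($1 == "001") what = "executing"
    else if ($1 == "005") what = "terminated"
    else what = "event " $1
    n++; when[n] = $4; who[n] = job; did[n] = what
    last = job
}
# The node name follows the submit event of the job it belongs to.
$1 == "DAG" && $2 == "Node:" { name[last] = $3 }
END {
    for (i = 1; i <= n; i++) print when[i], who[i], name[who[i]], did[i]
}' excerpt.log)
printf '%s\n' "$timeline"
```

To run it on your own log, replace excerpt.log in the awk command with simple.log. In the excerpt, the last Work node terminates at 23:21:19 and Final is not submitted until 23:21:28.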

Examine the DAGMan log:

% cat simple.dag.dagman.out
2/4 23:20:27 ******************************************************
2/4 23:20:27 ** condor_scheduniv_exec.20.0 (CONDOR_DAGMAN) STARTING UP
2/4 23:20:27 ** /usr/local/condor/bin/condor_dagman
2/4 23:20:27 ** $CondorVersion: 6.8.6 Sep 13 2007 $
2/4 23:20:27 ** $CondorPlatform: I386-LINUX_RH9 $
2/4 23:20:27 ** PID = 627
2/4 23:20:27 ** Log last touched time unavailable (No such file or directory)
2/4 23:20:27 ******************************************************
2/4 23:20:27 Using config source: /usr/local/condor/etc/condor_config
2/4 23:20:27 Using local config sources: 
2/4 23:20:27    /var/local/condor/condor_config.local
2/4 23:20:27 DaemonCore: Command Socket at <193.206.208.141:9609>
2/4 23:20:27 DAGMAN_SUBMIT_DELAY setting: 0
2/4 23:20:27 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
2/4 23:20:27 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
2/4 23:20:27 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
2/4 23:20:27 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
2/4 23:20:27 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
2/4 23:20:27 DAGMAN_RETRY_NODE_FIRST setting: 0
2/4 23:20:27 DAGMAN_MAX_JOBS_IDLE setting: 0
2/4 23:20:27 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
2/4 23:20:27 DAGMAN_MUNGE_NODE_NAMES setting: 1
2/4 23:20:27 DAGMAN_DELETE_OLD_LOGS setting: 1
2/4 23:20:27 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
2/4 23:20:27 DAGMAN_ABORT_DUPLICATES setting: 0
2/4 23:20:27 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
2/4 23:20:27 argv[0] == "condor_scheduniv_exec.20.0"
2/4 23:20:27 argv[1] == "-Debug"
2/4 23:20:27 argv[2] == "3"
2/4 23:20:27 argv[3] == "-Lockfile"
2/4 23:20:27 argv[4] == "simple.dag.lock"
2/4 23:20:27 argv[5] == "-Condorlog"
2/4 23:20:27 argv[6] == "/condor/aroy/condor-test/simple.log"
2/4 23:20:27 argv[7] == "-Dag"
2/4 23:20:27 argv[8] == "simple.dag"
2/4 23:20:27 argv[9] == "-Rescue"
2/4 23:20:27 argv[10] == "simple.dag.rescue"
2/4 23:20:27 DAG Lockfile will be written to simple.dag.lock
2/4 23:20:27 DAG Input file is simple.dag
2/4 23:20:27 Rescue DAG will be written to simple.dag.rescue
2/4 23:20:27 All DAG node user log files:
2/4 23:20:27   /condor/aroy/condor-test/simple.log (Condor)
2/4 23:20:27 Parsing simple.dag ...
2/4 23:20:27 Dag contains 4 total jobs
2/4 23:20:27 Truncating any older versions of log files...
2/4 23:20:27 MultiLogFiles: truncating older version of /condor/aroy/condor-test/simple.log
2/4 23:20:27 Bootstrapping...
2/4 23:20:27 Number of pre-completed nodes: 0
2/4 23:20:27 Registering condor_event_timer...
2/4 23:20:28 Got node Setup from the ready queue
2/4 23:20:28 Submitting Condor Node Setup job(s)...
2/4 23:20:28 submitting: condor_submit 
                         -a dag_node_name' '=' 'Setup 
                         -a +DAGManJobId' '=' '20 
                         -a DAGManJobId' '=' '20 
                         -a submit_event_notes' '=' 'DAG' 'Node:' 'Setup 
                         -a +DAGParentNodeNames' '=' '"" 
                         job.setup.submit
2/4 23:20:28 From submit: Submitting job(s).
2/4 23:20:28 From submit: Logging submit event(s).
2/4 23:20:28 From submit: 1 job(s) submitted to cluster 21.
2/4 23:20:28 assigned Condor ID (21.0)
2/4 23:20:28 Just submitted 1 job this cycle...
2/4 23:20:28 Event: ULOG_SUBMIT for Condor Node Setup (21.0)
2/4 23:20:28 Number of idle job procs: 1
2/4 23:20:28 Of 4 nodes total:
2/4 23:20:28  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:20:28   ===     ===      ===     ===     ===        ===      ===
2/4 23:20:28     0       0        1       0       0          3        0
2/4 23:20:53 Event: ULOG_EXECUTE for Condor Node Setup (21.0)
2/4 23:20:53 Number of idle job procs: 0
2/4 23:20:58 Event: ULOG_JOB_TERMINATED for Condor Node Setup (21.0)
2/4 23:20:58 Node Setup job proc (21.0) completed successfully.
2/4 23:20:58 Node Setup job completed
2/4 23:20:58 Number of idle job procs: 0
2/4 23:20:58 Of 4 nodes total:
2/4 23:20:58  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:20:58   ===     ===      ===     ===     ===        ===      ===
2/4 23:20:58     1       0        0       0       2          1        0
2/4 23:21:03 Got node Work1 from the ready queue
2/4 23:21:03 Submitting Condor Node Work1 job(s)...
2/4 23:21:03 submitting: condor_submit 
                         -a dag_node_name' '=' 'Work1 
                         -a +DAGManJobId' '=' '20 
                         -a DAGManJobId' '=' '20 
                         -a submit_event_notes' '=' 'DAG' 'Node:' 'Work1 
                         -a +DAGParentNodeNames' '=' '"Setup" 
                         job.work1.submit
2/4 23:21:03 From submit: Submitting job(s).
2/4 23:21:03 From submit: Logging submit event(s).
2/4 23:21:03 From submit: 1 job(s) submitted to cluster 22.
2/4 23:21:03 assigned Condor ID (22.0)
2/4 23:21:03 Got node Work2 from the ready queue
2/4 23:21:03 Submitting Condor Node Work2 job(s)...
2/4 23:21:03 submitting: condor_submit 
                         -a dag_node_name' '=' 'Work2 
                         -a +DAGManJobId' '=' '20 
                         -a DAGManJobId' '=' '20 
                         -a submit_event_notes' '=' 'DAG' 'Node:' 'Work2 
                         -a +DAGParentNodeNames' '=' '"Setup" 
                         job.work2.submit
2/4 23:21:03 From submit: Submitting job(s).
2/4 23:21:03 From submit: Logging submit event(s).
2/4 23:21:03 From submit: 1 job(s) submitted to cluster 23.
2/4 23:21:03 assigned Condor ID (23.0)
2/4 23:21:03 Just submitted 2 jobs this cycle...
2/4 23:21:03 Event: ULOG_SUBMIT for Condor Node Work1 (22.0)
2/4 23:21:03 Number of idle job procs: 1
2/4 23:21:03 Event: ULOG_SUBMIT for Condor Node Work2 (23.0)
2/4 23:21:03 Number of idle job procs: 2
2/4 23:21:03 Of 4 nodes total:
2/4 23:21:03  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:21:03   ===     ===      ===     ===     ===        ===      ===
2/4 23:21:03     1       0        2       0       0          1        0
2/4 23:21:13 Event: ULOG_EXECUTE for Condor Node Work1 (22.0)
2/4 23:21:13 Number of idle job procs: 1
2/4 23:21:18 Event: ULOG_EXECUTE for Condor Node Work2 (23.0)
2/4 23:21:18 Number of idle job procs: 0
2/4 23:21:18 Event: ULOG_JOB_TERMINATED for Condor Node Work1 (22.0)
2/4 23:21:18 Node Work1 job proc (22.0) completed successfully.
2/4 23:21:18 Node Work1 job completed
2/4 23:21:18 Number of idle job procs: 0
2/4 23:21:18 Of 4 nodes total:
2/4 23:21:18  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:21:18   ===     ===      ===     ===     ===        ===      ===
2/4 23:21:18     2       0        1       0       0          1        0
2/4 23:21:23 Event: ULOG_JOB_TERMINATED for Condor Node Work2 (23.0)
2/4 23:21:23 Node Work2 job proc (23.0) completed successfully.
2/4 23:21:23 Node Work2 job completed
2/4 23:21:23 Number of idle job procs: 0
2/4 23:21:23 Of 4 nodes total:
2/4 23:21:23  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:21:23   ===     ===      ===     ===     ===        ===      ===
2/4 23:21:23     3       0        0       0       1          0        0
2/4 23:21:28 Got node Final from the ready queue
2/4 23:21:28 Submitting Condor Node Final job(s)...
2/4 23:21:28 submitting: condor_submit 
                         -a dag_node_name' '=' 'Final 
                         -a +DAGManJobId' '=' '20 
                         -a DAGManJobId' '=' '20 
                         -a submit_event_notes' '=' 'DAG' 'Node:' 'Final 
                         -a +DAGParentNodeNames' '=' '"Work1,Work2" 
                         job.finalize.submit
2/4 23:21:28 From submit: Submitting job(s).
2/4 23:21:28 From submit: Logging submit event(s).
2/4 23:21:28 From submit: 1 job(s) submitted to cluster 24.
2/4 23:21:28 assigned Condor ID (24.0)
2/4 23:21:28 Just submitted 1 job this cycle...
2/4 23:21:28 Event: ULOG_SUBMIT for Condor Node Final (24.0)
2/4 23:21:28 Number of idle job procs: 1
2/4 23:21:28 Of 4 nodes total:
2/4 23:21:28  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:21:28   ===     ===      ===     ===     ===        ===      ===
2/4 23:21:28     3       0        1       0       0          0        0
2/4 23:21:58 Event: ULOG_EXECUTE for Condor Node Final (24.0)
2/4 23:21:58 Number of idle job procs: 0
2/4 23:22:03 Event: ULOG_JOB_TERMINATED for Condor Node Final (24.0)
2/4 23:22:03 Node Final job proc (24.0) completed successfully.
2/4 23:22:03 Node Final job completed
2/4 23:22:03 Number of idle job procs: 0
2/4 23:22:03 Of 4 nodes total:
2/4 23:22:03  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:22:03   ===     ===      ===     ===     ===        ===      ===
2/4 23:22:03     4       0        0       0       0          0        0
2/4 23:22:03 All jobs Completed!
2/4 23:22:03 Note: 0 total job deferrals because of -MaxJobs limit (0)
2/4 23:22:03 Note: 0 total job deferrals because of -MaxIdle limit (0)
2/4 23:22:03 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
2/4 23:22:03 Note: 0 total POST script deferrals because of -MaxPost limit (0)
2/4 23:22:03 **** condor_scheduniv_exec.20.0 (condor_DAGMAN) EXITING WITH STATUS 0
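
The node status tables are the quickest way to follow the DAG's progress in this file. As a rough sketch, the awk command below reduces each table to one line. It assumes the table layout shown above (a header line, a line of === marks, then the counts) and runs on three of the tables copied into a file with the hypothetical name excerpt.dagman.out.

```shell
# Three of the status tables from the simple.dag.dagman.out listing above.
cat > excerpt.dagman.out <<'EOF'
2/4 23:20:28 Of 4 nodes total:
2/4 23:20:28  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:20:28   ===     ===      ===     ===     ===        ===      ===
2/4 23:20:28     0       0        1       0       0          3        0
2/4 23:21:03 Of 4 nodes total:
2/4 23:21:03  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:21:03   ===     ===      ===     ===     ===        ===      ===
2/4 23:21:03     1       0        2       0       0          1        0
2/4 23:22:03 Of 4 nodes total:
2/4 23:22:03  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/4 23:22:03   ===     ===      ===     ===     ===        ===      ===
2/4 23:22:03     4       0        0       0       0          0        0
EOF

# The counts sit two lines below each header line; fields 1 and 2 are the
# date and time, fields 3 to 9 are the seven columns of the table.
progress=$(awk '
/Done +Pre +Queued/ { want = NR + 2; next }
NR == want {
    printf "%s done=%s queued=%s ready=%s unready=%s failed=%s\n", $2, $3, $5, $7, $8, $9
}' excerpt.dagman.out)
printf '%s\n' "$progress"
```

To follow a real run, use simple.dag.dagman.out in place of excerpt.dagman.out.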

Clean up your results. Be careful when deleting the simple.dag.* files: you want to delete only the files matching simple.dag.*, not the simple.dag file itself.

% rm simple.dag.*
% rm results.*

Next: Handling jobs that fail with DAGMan