Each job in a DAGMan DAG must have only one queue command in it, so a DAG with multiple jobs has one submit file per job. Theoretically you can reuse submit files if you are careful and use the $(Cluster) macro, but that is rarely desirable. We will now make a DAG with four nodes in it: a setup node, two nodes that do analysis, and a cleanup node. For now, of course, all of these nodes will do the same thing, but hopefully the principle will be clear.
First, make sure that your submit file has only one queue command in it, as when we first wrote it:
Universe   = vanilla
Executable = simple
Arguments  = 4 10
Log        = simple.log
Output     = simple.out
Error      = simple.error
Queue
Now copy this submit file once for each node:
% cp submit job.setup.submit
% cp submit job.work1.submit
% cp submit job.work2.submit
% cp submit job.finalize.submit
Edit each of the new submit files. Change the output and error entries to point to results.NODE.out and results.NODE.error, where NODE is the middle word of the submit file's name (job.NODE.submit). So job.finalize.submit would include:
Output = results.finalize.out
Error  = results.finalize.error
Here is one possible set of settings for the output entries:
% grep -i '^output' job.*.submit
job.finalize.submit:Output = results.finalize.out
job.setup.submit:Output = results.setup.out
job.work1.submit:Output = results.work1.out
job.work2.submit:Output = results.work2.out
This is important so that the various nodes don't overwrite each other's output.
Leave the log entries alone. Old versions of DAGMan require that all nodes write their logs to the same file. Condor will ensure that the different jobs do not overwrite each other's entries in the log. Newer versions of DAGMan lift this requirement and allow each job to use its own log file, but you may want to use one common log file anyway because it's convenient to have all of your job status information in a single place.
Log = simple.log
Also change the arguments entries so that the second argument is unique to each node. This way each of our jobs will calculate something different and we can tell their outputs apart.
For job work1, change the second argument to 11 (the listings below use 12 for work2 and 13 for finalize), so that it looks something like:
Arguments = 4 11
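If you would rather generate the four submit files than edit them by hand, the edits described above can be sketched in a few lines of Python. This is only an illustration, not part of Condor: the names TEMPLATE, NODE_ARGS, render_submit, render_all and write_all are made up for this example. It assumes the .out and .error file names from the grep listing, and that work2 and finalize use 12 and 13 as their second argument, as in the queue listings later in this section.

```python
from pathlib import Path

# The submit file from this tutorial; only Arguments, Output and Error vary per node.
TEMPLATE = """\
Universe   = vanilla
Executable = simple
Arguments  = 4 {arg}
Log        = simple.log
Output     = results.{node}.out
Error      = results.{node}.error
Queue
"""

# Second argument for each node (hypothetical table; 12 and 13 are taken
# from the condor_q listings shown later in the section).
NODE_ARGS = {"setup": 10, "work1": 11, "work2": 12, "finalize": 13}


def render_submit(node, arg):
    """Return the text of job.NODE.submit for one node."""
    return TEMPLATE.format(node=node, arg=arg)


def render_all(node_args=NODE_ARGS):
    """Map each submit file name to its contents."""
    return {
        f"job.{node}.submit": render_submit(node, arg)
        for node, arg in node_args.items()
    }


def write_all(directory="."):
    """Write every generated submit file into the given directory."""
    for filename, text in render_all().items():
        Path(directory, filename).write_text(text)
```

Calling write_all() in the tutorial directory would replace the four copies made with cp. Every generated file keeps the common Log = simple.log entry and exactly one Queue command.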
Now construct your DAG in a file called simple.dag:
Job Setup job.setup.submit
Job Work1 job.work1.submit
Job Work2 job.work2.submit
Job Final job.finalize.submit
PARENT Setup CHILD Work1 Work2
PARENT Work1 Work2 CHILD Final
Submit your new DAG and monitor it.
% condor_submit_dag simple.dag

-----------------------------------------------------------------------
File for submitting this DAG to Condor           : simple.dag.condor.sub
Log of DAGMan debugging messages                 : simple.dag.dagman.out
Log of Condor library debug messages             : simple.dag.lib.out
Log of the life of condor_dagman itself          : simple.dag.dagman.log

Condor Log file for all Condor jobs of this DAG: simple.dag.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 6101.
-----------------------------------------------------------------------

% ./watch_condor_q

-- Submitter: roy@fnal.gov : <132.67.192.133:49346> : hal.fnal.gov
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
6101.0   roy            12/20 23:59   0+00:00:06 R  0   2.4  condor_dagman -f -
6102.0    |-Setup       12/20 23:59   0+00:00:00 I  0   0.0  simple 4 10

2 jobs; 1 idle, 1 running, 0 held

-- Submitter: roy@fnal.gov : <132.67.192.133:49346> : hal.fnal.gov
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
6101.0   roy            12/20 23:59   0+00:00:17 R  0   2.4  condor_dagman -f -
6102.0    |-Setup       12/20 23:59   0+00:00:00 I  0   0.0  simple 4 10

2 jobs; 1 idle, 1 running, 0 held

-- Submitter: roy@fnal.gov : <132.67.192.133:49346> : hal.fnal.gov
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
6101.0   roy            12/20 23:59   0+00:00:27 R  0   2.4  condor_dagman -f -
6102.0    |-Setup       12/20 23:59   0+00:00:00 I  0   0.0  simple 4 10

2 jobs; 1 idle, 1 running, 0 held

-- Submitter: roy@fnal.gov : <132.67.192.133:49346> : hal.fnal.gov
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
6101.0   roy            12/20 23:59   0+00:00:38 R  0   2.4  condor_dagman -f -

1 jobs; 0 idle, 1 running, 0 held

-- Submitter: roy@fnal.gov : <132.67.192.133:49346> : hal.fnal.gov
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
6101.0   roy            12/20 23:59   0+00:00:50 R  0   2.4  condor_dagman -f -
6103.0    |-Work1       12/20 23:59   0+00:00:00 I  0   0.0  simple 4 11
6104.0    |-Work2       12/20 23:59   0+00:00:00 I  0   0.0  simple 4 12

3 jobs; 2 idle, 1 running, 0 held

-- Submitter: roy@fnal.gov : <132.67.192.133:49346> : hal.fnal.gov
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
6101.0   roy            12/20 23:59   0+00:01:01 R  0   2.4  condor_dagman -f -
6103.0    |-Work1       12/20 23:59   0+00:00:04 R  0   0.0  simple 4 11
6104.0    |-Work2       12/20 23:59   0+00:00:00 R  0   0.0  simple 4 12

3 jobs; 0 idle, 3 running, 0 held

-- Submitter: roy@fnal.gov : <132.67.192.133:49346> : hal.fnal.gov
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
6101.0   roy            12/20 23:59   0+00:01:12 R  0   2.4  condor_dagman -f -

1 jobs; 0 idle, 1 running, 0 held

-- Submitter: roy@fnal.gov : <132.67.192.133:49346> : hal.fnal.gov
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
6101.0   roy            12/20 23:59   0+00:01:22 R  0   2.4  condor_dagman -f -
6105.0    |-Final       12/21 00:00   0+00:00:00 I  0   0.0  simple 4 13

2 jobs; 1 idle, 1 running, 0 held

-- Submitter: roy@fnal.gov : <132.67.192.133:49346> : hal.fnal.gov
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
6101.0   roy            12/20 23:59   0+00:01:33 R  0   2.4  condor_dagman -f -
6105.0    |-Final       12/21 00:00   0+00:00:00 I  0   0.0  simple 4 13

2 jobs; 1 idle, 1 running, 0 held

-- Submitter: roy@fnal.gov : <132.67.192.133:49346> : hal.fnal.gov
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
6101.0   roy            12/20 23:59   0+00:01:43 R  0   2.4  condor_dagman -f -
6105.0    |-Final       12/21 00:00   0+00:00:00 I  0   0.0  simple 4 13

2 jobs; 1 idle, 1 running, 0 held

-- Submitter: roy@fnal.gov : <132.67.192.133:49346> : hal.fnal.gov
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
6101.0   roy            12/20 23:59   0+00:01:54 R  0   2.4  condor_dagman -f -
6105.0    |-Final       12/21 00:00   0+00:00:03 R  0   0.0  simple 4 13

2 jobs; 0 idle, 2 running, 0 held

-- Submitter: roy@fnal.gov : <132.67.192.133:49346> : hal.fnal.gov
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
6101.0   roy            12/20 23:59   0+00:02:05 R  0   2.4  condor_dagman -f -

1 jobs; 0 idle, 1 running, 0 held

-- Submitter: roy@fnal.gov : <132.67.192.133:49346> : hal.fnal.gov
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
You can see that the Final node wasn't run until after the Work nodes, which were not run until after the Setup node.
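To see why the nodes ran in that order, it can help to work out the ordering implied by the PARENT/CHILD lines yourself. The sketch below is not how DAGMan is implemented; it is a minimal illustration that understands only the Job and PARENT ... CHILD lines used in this tutorial, and the names parse_dag and waves are invented for the example. It groups nodes into "waves", where a node is ready once all of its parents are done.

```python
# The DAG file from this tutorial.
DAG_TEXT = """\
Job Setup job.setup.submit
Job Work1 job.work1.submit
Job Work2 job.work2.submit
Job Final job.finalize.submit
PARENT Setup CHILD Work1 Work2
PARENT Work1 Work2 CHILD Final
"""


def parse_dag(text):
    """Return (jobs, parents): node -> submit file, and node -> set of parents."""
    jobs, parents = {}, {}
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs[words[1]] = words[2]
            parents.setdefault(words[1], set())
        elif keyword == "PARENT":
            # Everything between PARENT and CHILD is a parent; the rest are children.
            split = [w.upper() for w in words].index("CHILD")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return jobs, parents


def waves(parents):
    """Group nodes into successive sets whose parents have all completed."""
    done, order = set(), []
    while len(done) < len(parents):
        ready = sorted(
            node for node, deps in parents.items()
            if node not in done and deps <= done
        )
        if not ready:
            # Nothing can run: a cycle, or a parent that was never declared as a Job.
            raise ValueError("no runnable node; this is not a valid DAG")
        order.append(ready)
        done.update(ready)
    return order


if __name__ == "__main__":
    _, parents = parse_dag(DAG_TEXT)
    for number, nodes in enumerate(waves(parents), start=1):
        print(number, " ".join(nodes))
```

For simple.dag this gives three waves: Setup alone, then Work1 and Work2 together, then Final. That matches the order in which the jobs appeared in the queue.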
Examine your results:
% tail --lines=500 results.*.out
==> results.finalize.out <==
Thinking really hard for 4 seconds...
We calculated: 26

==> results.setup.out <==
Thinking really hard for 4 seconds...
We calculated: 20

==> results.work1.out <==
Thinking really hard for 4 seconds...
We calculated: 22

==> results.work2.out <==
Thinking really hard for 4 seconds...
We calculated: 24
Examine your log:
% cat simple.log
000 (6102.000.000) 12/20 23:59:14 Job submitted from host: <132.67.192.133:49346>
    DAG Node: Setup
...
001 (6102.000.000) 12/20 23:59:44 Job executing on host: <132.67.105.227:33607>
...
005 (6102.000.000) 12/20 23:59:48 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
000 (6103.000.000) 12/20 23:59:56 Job submitted from host: <132.67.192.133:49346>
    DAG Node: Work1
...
000 (6104.000.000) 12/20 23:59:59 Job submitted from host: <132.67.192.133:49346>
    DAG Node: Work2
...
001 (6103.000.000) 12/21 00:00:13 Job executing on host: <132.67.105.227:33607>
...
005 (6103.000.000) 12/21 00:00:17 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
001 (6104.000.000) 12/21 00:00:18 Job executing on host: <132.67.105.226:33875>
...
005 (6104.000.000) 12/21 00:00:23 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
000 (6105.000.000) 12/21 00:00:31 Job submitted from host: <132.67.192.133:49346>
    DAG Node: Final
...
001 (6105.000.000) 12/21 00:01:10 Job executing on host: <132.67.105.210:33755>
...
005 (6105.000.000) 12/21 00:01:14 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
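Because all four nodes share simple.log, their events are interleaved, and it can be handy to pull out just the event header lines. The following is a rough sketch, not a Condor tool: parse_events and EVENT_NAMES are hypothetical names, the regular expression is fitted only to the header lines shown in this log, and the only event codes it names are the three that appear above (000, 001 and 005).

```python
import re

# Event codes as they appear in the log above.
EVENT_NAMES = {"000": "submitted", "001": "executing", "005": "terminated"}

# Matches header lines such as:
#   000 (6102.000.000) 12/20 23:59:14 Job submitted from host: ...
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) "
)

# A short excerpt copied from simple.log.
SAMPLE = """\
000 (6102.000.000) 12/20 23:59:14 Job submitted from host: <132.67.192.133:49346>
    DAG Node: Setup
...
001 (6102.000.000) 12/20 23:59:44 Job executing on host: <132.67.105.227:33607>
...
005 (6102.000.000) 12/20 23:59:48 Job terminated.
        (1) Normal termination (return value 0)
...
"""


def parse_events(text):
    """Return (cluster, event name, timestamp) for every event header line."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            code, cluster, _proc, _subproc, stamp = match.groups()
            events.append((int(cluster), EVENT_NAMES.get(code, code), stamp))
    return events


if __name__ == "__main__":
    for cluster, name, stamp in parse_events(SAMPLE):
        print(cluster, name, stamp)
```

Run over the whole of simple.log, this would list each cluster's submitted, executing and terminated events in the order Condor wrote them, which makes the Setup, Work, Final sequence easy to check.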
Examine the DAGMan log:
% cat simple.dag.dagman.out
12/20 23:59:12 ******************************************************
12/20 23:59:12 ** condor_scheduniv_exec.6101.0 (CONDOR_DAGMAN) STARTING UP
12/20 23:59:12 ** /opt/condor-6.7.4/local/spool/cluster6101.ickpt.subproc0
12/20 23:59:12 ** $CondorVersion: 6.7.3 Dec 28 2004 $
12/20 23:59:12 ** $CondorPlatform: I386-LINUX_RH9 $
12/20 23:59:12 ** PID = 18662
12/20 23:59:12 ******************************************************
12/20 23:59:12 Using config file: /opt/condor-6.7.3/etc/condor_config
12/20 23:59:12 Using local config files: /opt/condor-6.7.3/local/condor_config.local
12/20 23:59:12 DaemonCore: Command Socket at <132.67.192.133:34522>
12/20 23:59:12 argv[0] == "condor_scheduniv_exec.6101.0"
12/20 23:59:12 argv[1] == "-Debug"
12/20 23:59:12 argv[2] == "3"
12/20 23:59:12 argv[3] == "-Lockfile"
12/20 23:59:12 argv[4] == "simple.dag.lock"
12/20 23:59:12 argv[5] == "-Dag"
12/20 23:59:12 argv[6] == "simple.dag"
12/20 23:59:12 argv[7] == "-Rescue"
12/20 23:59:12 argv[8] == "simple.dag.rescue"
12/20 23:59:12 argv[9] == "-Condorlog"
12/20 23:59:12 argv[10] == "simple.dag.dummy_log"
12/20 23:59:12 DAG Lockfile will be written to simple.dag.lock
12/20 23:59:12 DAG Input file is simple.dag
12/20 23:59:12 Rescue DAG will be written to simple.dag.rescue
12/20 23:59:12 All DAG node user log files:
12/20 23:59:12   /specific/a/home/cc/cs/alainroy/simple.log
12/20 23:59:12 Parsing simple.dag ...
12/20 23:59:12 Dag contains 4 total jobs
12/20 23:59:12 Deleting any older versions of log files...
12/20 23:59:12 Deleting older version of /specific/a/home/cc/cs/alainroy/simple.log
12/20 23:59:12 Bootstrapping...
12/20 23:59:12 Number of pre-completed jobs: 0
12/20 23:59:12 Registering condor_event_timer...
12/20 23:59:13 Submitting Condor Job Setup ...
12/20 23:59:13 submitting: condor_submit -a 'dag_node_name = Setup' -a '+DAGManJobID = 6101.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.setup.submit 2>&1
12/20 23:59:14  assigned Condor ID (6102.0.0)
12/20 23:59:14 Just submitted 1 job this cycle...
12/20 23:59:14 Event: ULOG_SUBMIT for Condor Job Setup (6102.0.0)
12/20 23:59:14 Of 4 nodes total:
12/20 23:59:14  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/20 23:59:14   ===     ===      ===     ===     ===        ===      ===
12/20 23:59:14     0       0        1       0       0          3        0
12/20 23:59:44 Event: ULOG_EXECUTE for Condor Job Setup (6102.0.0)
12/20 23:59:49 Event: ULOG_JOB_TERMINATED for Condor Job Setup (6102.0.0)
12/20 23:59:49 Job Setup completed successfully.
12/20 23:59:49 Of 4 nodes total:
12/20 23:59:49  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/20 23:59:49   ===     ===      ===     ===     ===        ===      ===
12/20 23:59:49     1       0        0       0       2          1        0
12/20 23:59:55 Submitting Condor Job Work1 ...
12/20 23:59:55 submitting: condor_submit -a 'dag_node_name = Work1' -a '+DAGManJobID = 6101.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work1.submit 2>&1
12/20 23:59:57  assigned Condor ID (6103.0.0)
12/20 23:59:58 Submitting Condor Job Work2 ...
12/20 23:59:58 submitting: condor_submit -a 'dag_node_name = Work2' -a '+DAGManJobID = 6101.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work2.submit 2>&1
12/21 00:00:00  assigned Condor ID (6104.0.0)
12/21 00:00:00 Just submitted 2 jobs this cycle...
12/21 00:00:00 Event: ULOG_SUBMIT for Condor Job Work1 (6103.0.0)
12/21 00:00:00 Event: ULOG_SUBMIT for Condor Job Work2 (6104.0.0)
12/21 00:00:00 Of 4 nodes total:
12/21 00:00:00  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 00:00:00   ===     ===      ===     ===     ===        ===      ===
12/21 00:00:00     1       0        2       0       0          1        0
12/21 00:00:15 Event: ULOG_EXECUTE for Condor Job Work1 (6103.0.0)
12/21 00:00:20 Event: ULOG_JOB_TERMINATED for Condor Job Work1 (6103.0.0)
12/21 00:00:20 Job Work1 completed successfully.
12/21 00:00:20 Event: ULOG_EXECUTE for Condor Job Work2 (6104.0.0)
12/21 00:00:20 Of 4 nodes total:
12/21 00:00:20  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 00:00:20   ===     ===      ===     ===     ===        ===      ===
12/21 00:00:20     2       0        1       0       0          1        0
12/21 00:00:25 Event: ULOG_JOB_TERMINATED for Condor Job Work2 (6104.0.0)
12/21 00:00:25 Job Work2 completed successfully.
12/21 00:00:25 Of 4 nodes total:
12/21 00:00:25  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 00:00:25   ===     ===      ===     ===     ===        ===      ===
12/21 00:00:25     3       0        0       0       1          0        0
12/21 00:00:31 Submitting Condor Job Final ...
12/21 00:00:31 submitting: condor_submit -a 'dag_node_name = Final' -a '+DAGManJobID = 6101.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.finalize.submit 2>&1
12/21 00:00:32  assigned Condor ID (6105.0.0)
12/21 00:00:32 Just submitted 1 job this cycle...
12/21 00:00:32 Event: ULOG_SUBMIT for Condor Job Final (6105.0.0)
12/21 00:00:32 Of 4 nodes total:
12/21 00:00:32  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 00:00:32   ===     ===      ===     ===     ===        ===      ===
12/21 00:00:32     3       0        1       0       0          0        0
12/21 00:01:12 Event: ULOG_EXECUTE for Condor Job Final (6105.0.0)
12/21 00:01:17 Event: ULOG_JOB_TERMINATED for Condor Job Final (6105.0.0)
12/21 00:01:17 Job Final completed successfully.
12/21 00:01:17 Of 4 nodes total:
12/21 00:01:17  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 00:01:17   ===     ===      ===     ===     ===        ===      ===
12/21 00:01:17     4       0        0       0       0          0        0
12/21 00:01:17 All jobs Completed!
12/21 00:01:17 **** condor_scheduniv_exec.6101.0 (condor_DAGMAN) EXITING WITH STATUS 0
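The recurring "Of 4 nodes total" tables are the quickest way to follow a DAG's progress in this file. As an illustration only, here is a small sketch that collects those tables into dictionaries. The function name status_rows is made up, and the parsing assumes the layout printed by the DAGMan version shown here: a header line with the seven column names, a line of "===", then one line of counts.

```python
# Column names as printed in the status tables above.
COLUMNS = ["Done", "Pre", "Queued", "Post", "Ready", "Un-Ready", "Failed"]

# Two status tables copied from simple.dag.dagman.out.
SAMPLE = """\
12/20 23:59:14 Of 4 nodes total:
12/20 23:59:14  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/20 23:59:14   ===     ===      ===     ===     ===        ===      ===
12/20 23:59:14     0       0        1       0       0          3        0
12/20 23:59:44 Event: ULOG_EXECUTE for Condor Job Setup (6102.0.0)
12/21 00:01:17 Of 4 nodes total:
12/21 00:01:17  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
12/21 00:01:17   ===     ===      ===     ===     ===        ===      ===
12/21 00:01:17     4       0        0       0       0          0        0
12/21 00:01:17 All jobs Completed!
"""


def status_rows(text):
    """Return one {column: count} dict per status table found in the text."""
    rows = []
    expecting_counts = False
    for line in text.splitlines():
        fields = line.split()[2:]  # drop the date and time stamps
        if fields == COLUMNS:
            expecting_counts = True
        elif (expecting_counts
              and len(fields) == len(COLUMNS)
              and all(field.isdigit() for field in fields)):
            rows.append(dict(zip(COLUMNS, map(int, fields))))
            expecting_counts = False
    return rows


if __name__ == "__main__":
    for row in status_rows(SAMPLE):
        print(row["Done"], "done,", row["Queued"], "queued,", row["Failed"], "failed")
```

In the full log the Done column climbs from 0 to 4 while Failed stays at 0, which is what a successful run of this DAG looks like.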
Clean up your results. Be careful when deleting the simple.dag.* files: you want to delete only the files that match simple.dag.*, not the simple.dag file itself.
% rm simple.dag.*
% rm results.*
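If you want to convince yourself that the pattern simple.dag.* leaves simple.dag alone, you can check it without deleting anything. The sketch below uses Python's fnmatch module, whose wildcard rules agree with the shell's for file names like these. The list FILES is just a sample of names that appear in this section, and matching is a helper invented for the example.

```python
from fnmatch import fnmatchcase

# A sample of file names from this section, not a full directory listing.
FILES = [
    "simple.dag",
    "simple.dag.condor.sub",
    "simple.dag.dagman.out",
    "simple.dag.lib.out",
    "simple.dag.dagman.log",
    "simple.log",
    "results.setup.out",
]


def matching(pattern, names):
    """Return the names that the wildcard pattern would select."""
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    print(matching("simple.dag.*", FILES))  # files that rm simple.dag.* would remove
    print(matching("simple.dag*", FILES))   # without the dot, simple.dag matches too
```

The pattern needs a literal dot after "simple.dag", so the DAG file itself does not match. Dropping that dot (simple.dag*) would select simple.dag as well, which is the mistake to avoid.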