7.0 A More Complex DAG

Each node in a DAGMan DAG must have its own submit file with only one queue command in it, so a DAG with multiple jobs has one submit file per node. Theoretically you can reuse a submit file across nodes if you are careful and use the $(Cluster) macro, but that is rarely desirable (a sketch of a cleaner alternative, DAGMan's VARS command, follows the submit file below). We will now make a DAG with four nodes in it: a setup node, two nodes that do analysis, and a cleanup node. For now, of course, all of these nodes will do the same thing, but hopefully the principle will be clear. First, make sure that your submit file has only one queue command in it, as when we first wrote it:
Universe = vanilla
Executable = simple.bat
Arguments = 4 10
Log = simple.log.txt
Output = simple.out.txt
Error = simple.err.txt
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Queue
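As an aside: if you really did want to share one submit file among several nodes, DAGMan's VARS command passes per-node values into submit-file macros. The sketch below is only an illustration of the idea and is not used in this exercise; the file name shared.sub and the macro names name and count are invented here. In the DAG file you would write something like:

Job Setup shared.sub
VARS Setup name="setup" count="10"

and the shared submit file would reference the macros:

Output = results.$(name).out.txt
Error = results.$(name).err.txt
Arguments = 4 $(count)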
Now copy these files:

C:\condor-test> copy simple.sub job.setup.sub
        1 file(s) copied.
C:\condor-test> copy simple.sub job.work1.sub
        1 file(s) copied.
C:\condor-test> copy simple.sub job.work2.sub
        1 file(s) copied.
C:\condor-test> copy simple.sub job.finalize.sub
        1 file(s) copied.
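Incidentally, cmd can do all four copies in one line. This for loop, equivalent to the commands above, works when typed at the prompt (where the loop variable is written %N rather than the %%N used inside batch files):

C:\condor-test> for %N in (setup work1 work2 finalize) do copy simple.sub job.%N.sub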
Edit the various submit files. Change the Output and Error entries (but not the Log entry) to point to results.NODE.out.txt and results.NODE.err.txt files, where NODE is the middle word in the submit file's name (job.NODE.sub). So job.finalize.sub would include:

Output = results.finalize.out.txt
Error = results.finalize.err.txt

This is important so that the various nodes don't overwrite each other's output. You can verify your changes with the find command. For example:

C:\condor-test> find "Output" job.*.sub

---------- JOB.FINALIZE.SUB
Output = results.finalize.out.txt

---------- JOB.SETUP.SUB
Output = results.setup.out.txt

---------- JOB.WORK1.SUB
Output = results.work1.out.txt

---------- JOB.WORK2.SUB
Output = results.work2.out.txt

C:\condor-test> find "Error" job.*.sub

---------- JOB.FINALIZE.SUB
Error = results.finalize.err.txt

---------- JOB.SETUP.SUB
Error = results.setup.err.txt

---------- JOB.WORK1.SUB
Error = results.work1.err.txt

---------- JOB.WORK2.SUB
Error = results.work2.err.txt

Leave the Log entries alone. Old versions of DAGMan required that all nodes write their events to the same log file; Condor ensures that the different jobs will not overwrite each other's entries in it. Newer versions of DAGMan lift this requirement and allow each job to use its own log file -- but you may want to use one common log file anyway, because it is convenient to have all of your job status information in a single place:

Log = simple.log.txt
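If you are on a newer DAGMan and do prefer per-node logs, each submit file could instead name its own file following the same pattern (a hypothetical variation, not used in the runs below), e.g. in job.finalize.sub:

Log = results.finalize.log.txt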
Also change the Arguments entries so that the second argument is unique to each node. This way each of our jobs will calculate something different and we can tell their outputs apart. (In the runs shown below, Work1 uses 11, Work2 uses 12, and Final uses 13, while Setup keeps the original 10.) For job Work1, change the second argument to 11 so that the line looks like:
Arguments = 4 11
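As with Output and Error, you can double-check the Arguments lines across all four files with find; with the values suggested above, the second arguments shown should be 10, 11, 12, and 13:

C:\condor-test> find "Arguments" job.*.sub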
Now construct your DAG in a file called simple.dag:

Job Setup job.setup.sub
Job Work1 job.work1.sub
Job Work2 job.work2.sub
Job Final job.finalize.sub
PARENT Setup CHILD Work1 Work2
PARENT Work1 Work2 CHILD Final
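The two PARENT/CHILD lines give the DAG its shape: Setup must finish before either Work node may start, and both Work nodes must finish before Final may start. Drawn out, the dependencies form a diamond:

            Setup
           /     \
       Work1     Work2
           \     /
            Final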
Submit your new DAG and monitor it:

C:\condor-test> condor_submit_dag simple.dag

Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor    : simple.dag.condor.sub
Log of DAGMan debugging messages          : simple.dag.dagman.out
Log of Condor library output              : simple.dag.lib.out
Log of Condor library error messages      : simple.dag.lib.err
Log of the life of condor_dagman itself   : simple.dag.dagman.log
Condor Log file for all jobs of this DAG  : C:\condor-test\simple.log.txt
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 11.
-----------------------------------------------------------------------

C:\condor-test> watch_condor_q.bat

DAGMan runs and submits the first job:

-- Submitter: lab-21 : <129.215.30.181:2207> : lab-21
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  11.0   Administrator   11/27 12:24   0+00:00:19 R  0   1.2  condor_dagman.exe
  12.0    |-Setup        11/27 12:24   0+00:00:00 I  0   0.0  simple.bat 4 10

2 jobs; 1 idle, 1 running, 0 held

That first job starts:

-- Submitter: lab-21 : <129.215.30.181:2207> : lab-21
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  11.0   Administrator   11/27 12:24   0+00:00:37 R  0   1.2  condor_dagman.exe
  12.0    |-Setup        11/27 12:24   0+00:00:03 R  0   0.0  simple.bat 4 10

2 jobs; 0 idle, 2 running, 0 held

The first job finishes, but DAGMan hasn't reacted yet:

-- Submitter: lab-21 : <129.215.30.181:2207> : lab-21
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  11.0   Administrator   11/27 12:24   0+00:00:46 R  0   1.2  condor_dagman.exe

1 jobs; 0 idle, 1 running, 0 held

The next two jobs start up:

-- Submitter: lab-21 : <129.215.30.181:2207> : lab-21
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  11.0   Administrator   11/27 12:24   0+00:00:56 R  0   1.2  condor_dagman.exe
  13.0    |-Work1        11/27 12:24   0+00:00:01 R  0   0.0  simple.bat 4 11
  14.0    |-Work2        11/27 12:24   0+00:00:01 R  0   0.0  simple.bat 4 12

3 jobs; 0 idle, 3 running, 0 held

Those jobs have finished, and DAGMan will notice soon:

-- Submitter: lab-21 : <129.215.30.181:2207> : lab-21
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  11.0   Administrator   11/27 12:24   0+00:01:05 R  0   1.2  condor_dagman.exe

1 jobs; 0 idle, 1 running, 0 held

Now our final node is running:

-- Submitter: lab-21 : <129.215.30.181:2207> : lab-21
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
  11.0   Administrator   11/27 12:24   0+00:01:14 R  0   1.2  condor_dagman.exe
  15.0    |-Final        11/27 12:25   0+00:00:03 R  0   0.0  simple.bat 4 13

2 jobs; 1 idle, 1 running, 0 held

All finished!

-- Submitter: lab-21 : <129.215.30.181:2207> : lab-21
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

Ctrl + C
Terminate batch job (Y/N)? y

You can see that the Final node wasn't run until after the Work nodes, which in turn were not run until after the Setup node.
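watch_condor_q.bat simply re-runs condor_q for you. For a one-off snapshot that groups each node job under its condor_dagman job, as in the indented listings above, condor_q also accepts a -dag option (assuming your Condor version provides it):

C:\condor-test> condor_q -dag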
Examine your results:

C:\condor-test> more results.*

results.setup.out.txt

Thinking really hard for 4 seconds...
1 2 3 4 5 6 7 8 9 10

results.work1.out.txt

Thinking really hard for 4 seconds...
1 2 3 4 5 6 7 8 9 10 11

results.work2.out.txt

Thinking really hard for 4 seconds...
1 2 3 4 5 6 7 8 9 10 11 12

results.finalize.out.txt

Thinking really hard for 4 seconds...
1 2 3 4 5 6 7 8 9 10 11 12 13
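Each file counts up to its job's second argument, confirming that every node ran with its own submit file: Setup to 10, Work1 to 11, Work2 to 12, and Final to 13. Note also that the four nodes wrote to four different files, thanks to the per-node Output entries you set earlier.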
Examine your log:

C:\condor-test> more simple.log.txt

000 (012.000.000) 11/27 12:24:23 Job submitted from host: <129.215.30.181:2207>
    DAG Node: Setup
...
001 (012.000.000) 11/27 12:24:47 Job executing on host: <129.215.30.173:2217>
...
005 (012.000.000) 11/27 12:24:51 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        84  -  Run Bytes Sent By Job
        431  -  Run Bytes Received By Job
        84  -  Total Bytes Sent By Job
        431  -  Total Bytes Received By Job
...
000 (013.000.000) 11/27 12:24:59 Job submitted from host: <129.215.30.181:2207>
    DAG Node: Work1
...
000 (014.000.000) 11/27 12:24:59 Job submitted from host: <129.215.30.181:2207>
    DAG Node: Work2
...
001 (014.000.000) 11/27 12:25:07 Job executing on host: <129.215.30.164:2192>
...
001 (013.000.000) 11/27 12:25:08 Job executing on host: <129.215.30.173:2217>
...
005 (014.000.000) 11/27 12:25:11 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        84  -  Run Bytes Sent By Job
        431  -  Run Bytes Received By Job
        84  -  Total Bytes Sent By Job
        431  -  Total Bytes Received By Job
...
005 (013.000.000) 11/27 12:25:12 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        88  -  Run Bytes Sent By Job
        431  -  Run Bytes Received By Job
        88  -  Total Bytes Sent By Job
        431  -  Total Bytes Received By Job
...
000 (015.000.000) 11/27 12:25:19 Job submitted from host: <129.215.30.181:2207>
    DAG Node: Final
...
001 (015.000.000) 11/27 12:25:29 Job executing on host: <129.215.30.173:2217>
...
005 (015.000.000) 11/27 12:25:33 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        84  -  Run Bytes Sent By Job
        431  -  Run Bytes Received By Job
        84  -  Total Bytes Sent By Job
        431  -  Total Bytes Received By Job
...
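A quick key for reading this log: the leading three-digit numbers are Condor's user log event codes (000 = job submitted, 001 = job executing, 005 = job terminated), and the parenthesized field is the cluster.proc.subproc ID, so these events map back to the clusters 12 through 15 you watched in the queue. The "DAG Node:" annotations are added by DAGMan at submit time, via the submit_event_notes attribute you can see being passed to condor_submit in the DAGMan log below.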
Examine the DAGMan log:

C:\condor-test> more simple.dag.dagman.out

11/20 10:52:29 ******************************************************
11/20 10:52:29 ** condor_scheduniv_exec.11.0 (CONDOR_DAGMAN) STARTING UP
11/20 10:52:29 ** C:\condor\bin\condor_dagman.exe
11/20 10:52:29 ** $CondorVersion: 6.9.5 Nov 18 2007 $
11/20 10:52:29 ** $CondorPlatform: INTEL-WINNT50 $
11/20 10:52:29 ** PID = 5224
11/20 10:52:29 ** Log last touched 11/20 10:20:38
11/20 10:52:29 ******************************************************
11/20 10:52:29 Using config source: C:\condor\condor_config
11/20 10:52:29 Using local config sources:
11/20 10:52:29    C:\condor/condor_config.local
11/20 10:52:29 DaemonCore: Command Socket at <128.105.48.96:64256>
11/20 10:52:29 DAGMAN_SUBMIT_DELAY setting: 0
11/20 10:52:29 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
11/20 10:52:29 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
11/20 10:52:29 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
11/20 10:52:29 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
11/20 10:52:29 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
11/20 10:52:29 DAGMAN_RETRY_NODE_FIRST setting: 0
11/20 10:52:29 DAGMAN_MAX_JOBS_IDLE setting: 0
11/20 10:52:29 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
11/20 10:52:29 DAGMAN_MUNGE_NODE_NAMES setting: 1
11/20 10:52:29 DAGMAN_DELETE_OLD_LOGS setting: 1
11/20 10:52:29 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
11/20 10:52:29 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
11/20 10:52:29 DAGMAN_ABORT_DUPLICATES setting: 1
11/20 10:52:29 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
11/20 10:52:29 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
11/20 10:52:29 argv[0] == "condor_scheduniv_exec.11.0"
11/20 10:52:29 argv[1] == "-Debug"
11/20 10:52:29 argv[2] == "3"
11/20 10:52:29 argv[3] == "-Lockfile"
11/20 10:52:29 argv[4] == "simple.dag.lock"
11/20 10:52:29 argv[5] == "-Condorlog"
11/20 10:52:29 argv[6] == "C:\condor-test\simple.log.txt"
11/20 10:52:29 argv[7] == "-Dag"
11/20 10:52:29 argv[8] == "simple.dag"
11/20 10:52:29 argv[9] == "-Rescue"
11/20 10:52:29 argv[10] == "simple.dag.rescue"
11/20 10:52:29 DAG Lockfile will be written to simple.dag.lock
11/20 10:52:29 DAG Input file is simple.dag
11/20 10:52:29 Rescue DAG will be written to simple.dag.rescue
11/20 10:52:29 All DAG node user log files:
11/20 10:52:29   C:\condor-test\simple.log.txt (Condor)
11/20 10:52:29 Parsing simple.dag ...
11/20 10:52:29 Dag contains 4 total jobs
11/20 10:52:29 Truncating any older versions of log files...
11/20 10:52:29 MultiLogFiles: truncating older version of C:\condor-test\simple.log.txt
11/20 10:52:29 Sleeping for 12 seconds to ensure ProcessId uniqueness
11/20 10:52:41 WARNING: ProcessId not confirmed unique
11/20 10:52:41 Bootstrapping...
11/20 10:52:41 Number of pre-completed nodes: 0
11/20 10:52:41 Registering condor_event_timer...
11/20 10:52:42 Submitting Condor Node Setup job(s)...
11/20 10:52:42 submitting: condor_submit -a dag_node_name' '=' 'Setup -a +DAGManJobId' '=' \
  '11 -a DAGManJobId' '=' '11 -a submit_event_notes' '=' 'DAG' 'Node:' 'Setup \
  -a +DAGParentNodeNames' '=' '"" job.setup.sub
11/20 10:52:42 From submit: Submitting job(s).
11/20 10:52:42 From submit: Logging submit event(s).
11/20 10:52:42 From submit: 1 job(s) submitted to cluster 12.
11/20 10:52:42   assigned Condor ID (12.0)
11/20 10:52:42 Just submitted 1 job this cycle...
11/20 10:52:42 Event: ULOG_SUBMIT for Condor Node Setup (12.0)
11/20 10:52:42 Number of idle job procs: 1
11/20 10:52:42 Of 4 nodes total:
11/20 10:52:42  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 10:52:42   ===     ===      ===     ===     ===        ===      ===
11/20 10:52:42     0       0        1       0       0          3        0
11/20 10:53:22 Event: ULOG_EXECUTE for Condor Node Setup (12.0)
11/20 10:53:22 Number of idle job procs: 0
11/20 10:53:27 Event: ULOG_JOB_TERMINATED for Condor Node Setup (12.0)
11/20 10:53:27 Node Setup job proc (12.0) completed successfully.
11/20 10:53:27 Node Setup job completed
11/20 10:53:27 Number of idle job procs: 0
11/20 10:53:27 Of 4 nodes total:
11/20 10:53:27  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 10:53:27   ===     ===      ===     ===     ===        ===      ===
11/20 10:53:27     1       0        0       0       2          1        0
11/20 10:53:32 Submitting Condor Node Work1 job(s)...
11/20 10:53:32 submitting: condor_submit -a dag_node_name' '=' 'Work1 -a +DAGManJobId' '=' \
  '11 -a DAGManJobId' '=' '11 -a submit_event_notes' '=' 'DAG' 'Node:' 'Work1 \
  -a +DAGParentNodeNames' '=' '"Setup" job.work1.sub
11/20 10:53:32 From submit: Submitting job(s).
11/20 10:53:32 From submit: Logging submit event(s).
11/20 10:53:32 From submit: 1 job(s) submitted to cluster 13.
11/20 10:53:32   assigned Condor ID (13.0)
11/20 10:53:32 Submitting Condor Node Work2 job(s)...
11/20 10:53:32 submitting: condor_submit -a dag_node_name' '=' 'Work2 -a +DAGManJobId' '=' \
  '11 -a DAGManJobId' '=' '11 -a submit_event_notes' '=' 'DAG' 'Node:' 'Work2 \
  -a +DAGParentNodeNames' '=' '"Setup" job.work2.sub
11/20 10:53:33 From submit: Submitting job(s).
11/20 10:53:33 From submit: Logging submit event(s).
11/20 10:53:33 From submit: 1 job(s) submitted to cluster 14.
11/20 10:53:33   assigned Condor ID (14.0)
11/20 10:53:33 Just submitted 2 jobs this cycle...
11/20 10:53:33 Event: ULOG_SUBMIT for Condor Node Work1 (13.0)
11/20 10:53:33 Number of idle job procs: 1
11/20 10:53:33 Event: ULOG_SUBMIT for Condor Node Work2 (14.0)
11/20 10:53:33 Number of idle job procs: 2
11/20 10:53:33 Of 4 nodes total:
11/20 10:53:33  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 10:53:33   ===     ===      ===     ===     ===        ===      ===
11/20 10:53:33     1       0        2       0       0          1        0
11/20 10:53:43 Event: ULOG_EXECUTE for Condor Node Work2 (14.0)
11/20 10:53:43 Number of idle job procs: 1
11/20 10:53:43 Event: ULOG_EXECUTE for Condor Node Work1 (13.0)
11/20 10:53:43 Number of idle job procs: 0
11/20 10:53:48 Event: ULOG_JOB_TERMINATED for Condor Node Work2 (14.0)
11/20 10:53:48 Node Work2 job proc (14.0) completed successfully.
11/20 10:53:48 Node Work2 job completed
11/20 10:53:48 Number of idle job procs: 0
11/20 10:53:48 Event: ULOG_JOB_TERMINATED for Condor Node Work1 (13.0)
11/20 10:53:48 Node Work1 job proc (13.0) completed successfully.
11/20 10:53:48 Node Work1 job completed
11/20 10:53:48 Number of idle job procs: 0
11/20 10:53:48 Of 4 nodes total:
11/20 10:53:48  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 10:53:48   ===     ===      ===     ===     ===        ===      ===
11/20 10:53:48     3       0        0       0       1          0        0
11/20 10:53:53 Submitting Condor Node Final job(s)...
11/20 10:53:53 submitting: condor_submit -a dag_node_name' '=' 'Final -a +DAGManJobId' '=' \
  '11 -a DAGManJobId' '=' '11 -a submit_event_notes' '=' 'DAG' 'Node:' 'Final \
  -a +DAGParentNodeNames' '=' '"Work1,Work2" job.finalize.sub
11/20 10:53:53 From submit: Submitting job(s).
11/20 10:53:53 From submit: Logging submit event(s).
11/20 10:53:53 From submit: 1 job(s) submitted to cluster 15.
11/20 10:53:53   assigned Condor ID (15.0)
11/20 10:53:53 Just submitted 1 job this cycle...
11/20 10:53:53 Event: ULOG_SUBMIT for Condor Node Final (15.0)
11/20 10:53:53 Number of idle job procs: 1
11/20 10:53:53 Of 4 nodes total:
11/20 10:53:53  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 10:53:53   ===     ===      ===     ===     ===        ===      ===
11/20 10:53:53     3       0        1       0       0          0        0
11/20 10:54:03 Event: ULOG_EXECUTE for Condor Node Final (15.0)
11/20 10:54:03 Number of idle job procs: 0
11/20 10:54:08 Event: ULOG_JOB_TERMINATED for Condor Node Final (15.0)
11/20 10:54:08 Node Final job proc (15.0) completed successfully.
11/20 10:54:08 Node Final job completed
11/20 10:54:08 Number of idle job procs: 0
11/20 10:54:08 Of 4 nodes total:
11/20 10:54:08  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
11/20 10:54:08   ===     ===      ===     ===     ===        ===      ===
11/20 10:54:08     4       0        0       0       0          0        0
11/20 10:54:08 All jobs Completed!
11/20 10:54:08 Note: 0 total job deferrals because of -MaxJobs limit (0)
11/20 10:54:08 Note: 0 total job deferrals because of -MaxIdle limit (0)
11/20 10:54:08 Note: 0 total job deferrals because of node category throttles
11/20 10:54:08 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
11/20 10:54:08 Note: 0 total POST script deferrals because of -MaxPost limit (0)
11/20 10:54:08 **** condor_scheduniv_exec.11.0 (condor_DAGMAN) EXITING WITH STATUS 0

Clean up your results. Be careful when deleting the simple.dag.* files: you want to delete the generated simple.dag.* output files, not simple.dag itself. (Or just delete all the .txt files, since they are all output.)
C:\condor-test> del *.txt
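One feature you did not need this time: the DAGMan log above noted that a rescue DAG would be written to simple.dag.rescue. If any node had failed, DAGMan would have written that file before exiting, recording which nodes already completed; with DAGMan versions of this vintage you would resume the workflow by submitting the rescue file rather than the original:

C:\condor-test> condor_submit_dag simple.dag.rescue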